Communication-Efficient HPC
Cross-Cutting Optimization for Distributed Computing
The Communication Bottleneck
Across sparse solvers, graph algorithms, and tensor computations, the same fundamental challenge emerges: communication dominates runtime at scale. Modern supercomputers have abundant floating-point capability, but moving data between nodes, between CPU and GPU, or even between cache levels remains expensive.
This research thread develops cross-cutting techniques for reducing communication in distributed HPC applications.
```mermaid
graph TB
    subgraph Optimizations["Performance Optimization Contributions"]
        direction TB
        CA["<b>Communication Avoidance</b><br/>3D process grids that<br/>trade memory for reduced<br/>per-process communication"]
        HET["<b>Heterogeneous Acceleration</b><br/>HALO algorithm for<br/>CPU–GPU co-processing<br/>Multi-GPU support<br/>(NVIDIA, AMD, Intel)"]
        TRI["<b>Sparse Triangular Solve</b><br/>Communication-avoiding 3D<br/>triangular solver using<br/>one-sided MPI & message-<br/>driven GPU parallelism"]
        MXP["<b>Mixed Precision</b><br/>Single-precision factorization<br/>with double-precision<br/>iterative refinement"]
        MPI["<b>MPI Communication</b><br/>Optimized 2D grid-based<br/>MPI patterns at exascale"]
        SRI["<b>Semiring Kernels</b><br/>Dense semiring LA<br/>on CUDA hardware"]
    end
    subgraph Impact["Impact Areas"]
        SCI["Scientific<br/>Simulations"]
        KGA["Knowledge Graph<br/>Analytics"]
        ML["Machine Learning<br/>& Data Analytics"]
    end
    Optimizations --> Impact
    style Optimizations fill:#f0f7ff,stroke:#1a73e8,stroke-width:2px
    style CA fill:#e6f4ea,stroke:#137333
    style HET fill:#fef7e0,stroke:#e37400
    style TRI fill:#f3e8fd,stroke:#7627bb
    style MXP fill:#fce8e6,stroke:#d93025
    style MPI fill:#e8f5e9,stroke:#2e7d32
    style SRI fill:#fff3e0,stroke:#e65100
```
Communication-Avoiding 3D Algorithms
The central theme: arranging processes in a 3D grid and selectively replicating data yields asymptotic communication reductions.
- 3D Sparse LU Factorization: O(√log n) volume and O(log n) latency reduction for planar graphs
- 3D Sparse Triangular Solve: One-sided MPI with message-driven GPU parallelism
- 3D APSP: Extended to distributed graph analytics
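The memory-for-communication tradeoff behind these 3D formulations can be illustrated with the classic 2.5D dense bound (a simplification for illustration only; the sparse-LU bounds above depend on the graph structure of the matrix): replicating the data c times cuts per-process communication volume by a factor of √c while multiplying per-process memory by c.

```python
import math

def words_moved_per_process(n, p, c):
    """Per-process communication volume for a 2.5D/3D matrix algorithm:
    the classic dense bound O(n^2 / sqrt(c * p)) words, where c is the
    replication factor (c = 1 recovers the 2D algorithm)."""
    return n * n / math.sqrt(c * p)

def memory_per_process(n, p, c):
    """Replicating the matrix c times costs c * n^2 / p words per process."""
    return c * n * n / p

n, p = 4096, 64
base_comm = words_moved_per_process(n, p, c=1)
repl_comm = words_moved_per_process(n, p, c=4)

# Quadrupling the replication halves the per-process communication volume...
print(base_comm / repl_comm)  # 2.0
# ...at the price of a 4x larger memory footprint.
print(memory_per_process(n, p, 4) / memory_per_process(n, p, 1))  # 4.0
```

The same memory-for-communication exchange, adapted to sparse structure, is what the 3D sparse LU and triangular-solve algorithms exploit.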
Heterogeneous Architecture Support
HALO: Highly Asynchronous Lazy Offload
HALO achieves efficient CPU–GPU co-processing through data replication strategies that reduce host–device communication, mirroring the philosophy of 3D factorization at the level of a single heterogeneous node.
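The lazy-offload idea can be sketched with a simple cost model (an illustrative heuristic, not the published HALO scheduler; all rates are placeholder values): defer a task's host-to-device transfer unless the arithmetic it enables outweighs the transfer cost, keeping small tasks on the CPU.

```python
def should_offload(flops, bytes_to_move, gpu_flops_per_s=1e13,
                   cpu_flops_per_s=1e11, link_bytes_per_s=2.5e10):
    """Offload a task only when GPU compute time plus host-device
    transfer time beats doing the work on the CPU.
    All machine rates here are illustrative placeholders."""
    gpu_time = flops / gpu_flops_per_s + bytes_to_move / link_bytes_per_s
    cpu_time = flops / cpu_flops_per_s
    return gpu_time < cpu_time

# A tiny update: the transfer dominates, so keep it on the CPU.
print(should_offload(flops=1e6, bytes_to_move=8e6))   # False
# A large update: arithmetic dominates, so offloading pays off.
print(should_offload(flops=1e12, bytes_to_move=8e8))  # True
```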
Multi-Vendor GPU Support
Developed unified code paths supporting:
- NVIDIA GPUs (CUDA)
- AMD GPUs (HIP/ROCm)
- Intel GPUs (SYCL/oneAPI)
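One common way to realize such unified code paths (a hypothetical sketch in Python for brevity; the actual implementations target CUDA, HIP, and SYCL from C++) is to route every kernel call through a per-vendor backend registry, so the algorithm layer never names a vendor API directly.

```python
# Hypothetical backend registry: each vendor supplies the same kernel set.
BACKENDS = {}

def register_backend(name):
    def wrap(cls):
        BACKENDS[name] = cls()
        return cls
    return wrap

@register_backend("cuda")
class CudaBackend:
    def spmv(self, A, x):
        return f"cusparse_spmv({A}, {x})"   # stand-in for a CUDA library call

@register_backend("hip")
class HipBackend:
    def spmv(self, A, x):
        return f"rocsparse_spmv({A}, {x})"  # stand-in for a ROCm library call

@register_backend("sycl")
class SyclBackend:
    def spmv(self, A, x):
        return f"onemkl_spmv({A}, {x})"     # stand-in for a oneAPI library call

def solve(backend_name, A, x):
    """Algorithm layer: vendor-agnostic, dispatches through the registry."""
    return BACKENDS[backend_name].spmv(A, x)

print(solve("hip", "L", "b"))  # rocsparse_spmv(L, b)
```

The design choice is that adding a vendor means adding one backend class, with no changes to the numerical code.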
Exascale MPI Optimization
Work on optimizing 2D grid-based MPI communication at exascale (EuroMPI 2023) addresses bottlenecks in data-exchange patterns common to both linear algebra and graph algorithms.
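What makes these 2D patterns attractive is that collectives run along grid rows and columns rather than over all p ranks, so each rank communicates with R + C - 2 partners instead of p - 1. A self-contained sketch of the rank-to-grid mapping and the resulting peer sets (plain Python standing in for MPI_Cart_create / MPI_Comm_split):

```python
def grid_coords(rank, cols):
    """Row-major mapping of a linear rank onto an R x C process grid."""
    return rank // cols, rank % cols

def row_peers(rank, rows, cols):
    """Ranks sharing this rank's grid row (a row-broadcast group)."""
    r, _ = grid_coords(rank, cols)
    return [r * cols + c for c in range(cols)]

def col_peers(rank, rows, cols):
    """Ranks sharing this rank's grid column (a column-reduce group)."""
    _, c = grid_coords(rank, cols)
    return [rr * cols + c for rr in range(rows)]

# 12 ranks as a 3 x 4 grid: rank 6 sits at row 1, column 2.
print(grid_coords(6, cols=4))        # (1, 2)
print(row_peers(6, rows=3, cols=4))  # [4, 5, 6, 7]
print(col_peers(6, rows=3, cols=4))  # [2, 6, 10]
```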
Community Standards
Co-authored the Interface for Sparse Linear Algebra Operations (2024), a community standard defining portable APIs for sparse LA across vendors and implementations. This 26-author effort establishes common interfaces that enable algorithm-library interoperability.
Key Publications
- P. Sao, R. Kannan, X.S. Li, R. Vuduc. A communication-avoiding 3D sparse triangular solver. ICS 2019.
- Y. Liu, N. Ding, P. Sao, S. Williams, X.S. Li. Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters. SC 2023.
- H. Lu, P. Sao, M. Matheson, R. Kannan, F. Wang, T. Potok. Optimizing Communication in 2D Grid-Based MPI Applications at Exascale. EuroMPI 2023.
- V. Thakkar, R. Kannan, P. Sao, et al. Dense semiring linear algebra on modern CUDA hardware. SIAM CSE 2021.
- A. Abdelfattah, …, P. Sao, et al. Interface for sparse linear algebra operations. arXiv:2411.13259, 2024.
Connection to Other Projects
These optimization techniques are applied across:
- SuperLU_DIST: 3D factorization and triangular solve
- Knowledge Graphs: DSNAPSHOT and COAST communication patterns
- ArborX: Distributed spatial queries
- Sparsitute: Communication lower-bound theory