Communication-Efficient HPC

Cross-Cutting Optimization for Distributed Computing

The Communication Bottleneck

Across sparse solvers, graph algorithms, and tensor computations, the same fundamental challenge emerges: communication dominates runtime at scale. Modern supercomputers have abundant floating-point capability, but moving data between nodes, between CPU and GPU, or even between cache levels remains expensive.

This research thread develops cross-cutting techniques for reducing communication in distributed HPC applications.

graph TB
    subgraph Optimizations[Performance Optimizations]
        direction TB
        CA[Comm Avoidance]
        HET[Heterogeneous Accel]
        TRI[Sparse Triangular]
        MXP[Mixed Precision]
        MPI[MPI Optimization]
        SRI[Semiring Kernels]
    end

    subgraph Impact[Impact Areas]
        SCI[Scientific Simulations]
        KGA[Knowledge Graphs]
        ML[Machine Learning]
    end

    Optimizations --> Impact

    style Optimizations fill:#f0f7ff,stroke:#1a73e8,stroke-width:2px
    style CA fill:#e6f4ea,stroke:#137333
    style HET fill:#fef7e0,stroke:#e37400
    style TRI fill:#f3e8fd,stroke:#7627bb
    style MXP fill:#fce8e6,stroke:#d93025
    style MPI fill:#e8f5e9,stroke:#2e7d32
    style SRI fill:#fff3e0,stroke:#e65100

Communication-Avoiding 3D Algorithms

The central theme: arranging processes in a 3D grid and selectively replicating data yields asymptotic communication reductions.

  • 3D Sparse LU Factorization: O(√log n) volume and O(log n) latency reduction for planar graphs
  • 3D Sparse Triangular Solve: One-sided MPI with message-driven GPU parallelism
  • 3D APSP: Extended to distributed graph analytics
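To make the 3D layout concrete, here is a small sketch (with hypothetical helper names, not the SuperLU_DIST implementation) of mapping ranks into a Px × Py × Pz grid, where the third dimension holds replicated 2D layers: independent subproblems proceed within a layer, and only a final reduction crosses layers.

```python
# Sketch: mapping ranks into a 3D process grid (illustrative helpers,
# not the SuperLU_DIST code). P = Px*Py*Pz ranks are viewed as Pz
# replicated 2D layers; only a cross-layer reduction at the end moves
# data between layers, which is the source of the communication savings.

def rank_to_coords(rank, px, py, pz):
    """Decompose a linear rank into (x, y, z) grid coordinates."""
    z, rem = divmod(rank, px * py)
    y, x = divmod(rem, px)
    return x, y, z

def coords_to_rank(x, y, z, px, py, pz):
    """Inverse mapping: grid coordinates back to a linear rank."""
    return z * (px * py) + y * px + x

def layer_peers(rank, px, py, pz):
    """Ranks holding the same 2D position in every replicated layer
    (the group that participates in the cross-layer reduction)."""
    x, y, _ = rank_to_coords(rank, px, py, pz)
    return [coords_to_rank(x, y, z, px, py, pz) for z in range(pz)]

px, py, pz = 4, 4, 2                     # 32 ranks, replication factor 2
print(rank_to_coords(21, px, py, pz))    # (1, 1, 1)
print(layer_peers(5, px, py, pz))        # [5, 21]
```

The replication factor `pz` trades memory (each layer holds a copy of shared data) for asymptotically less inter-process communication.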

Heterogeneous Architecture Support

HALO: Highly Asynchronous Lazy Offload

HALO enables efficient CPU–GPU co-processing through data replication strategies that reduce host–device communication, mirroring the philosophy of 3D factorization applied within heterogeneous nodes.
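The "lazy" part of the idea can be caricatured as follows (a toy sketch under assumed behavior; HALO's actual scheduler is more sophisticated): offload work in batches rather than per task, so transfer cost is amortized over many tasks.

```python
# Toy sketch of a lazy-offload policy (hypothetical; not the HALO code).
# Tasks accumulate on the host and are flushed to the "device" in one
# batch once enough work has queued, so one bulk host->device transfer
# serves many tasks instead of one transfer per task.

class LazyOffloader:
    def __init__(self, batch_threshold):
        self.batch_threshold = batch_threshold
        self.pending = []
        self.transfers = 0           # host->device transfers issued

    def submit(self, task):
        self.pending.append(task)
        if len(self.pending) >= self.batch_threshold:
            self.flush()

    def flush(self):
        if self.pending:
            self.transfers += 1      # one bulk transfer for the batch
            self.pending.clear()

off = LazyOffloader(batch_threshold=8)
for t in range(20):
    off.submit(t)
off.flush()                          # drain the remainder
print(off.transfers)                 # 3 transfers instead of 20
```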

Multi-Vendor GPU Support

Developed unified code paths supporting:

  • NVIDIA GPUs (CUDA)
  • AMD GPUs (HIP/ROCm)
  • Intel GPUs (SYCL/oneAPI)
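A unified code path typically routes one user-facing API to a vendor backend selected at build or load time. The backend names below are real toolchains, but the dispatch table itself is an illustrative sketch, not the project's actual code:

```python
# Sketch of the backend-dispatch pattern behind multi-vendor GPU
# support (illustrative only). Each vendor toolchain gets one entry;
# a portable CPU fallback keeps the code path unified when no GPU
# stack is available.

def axpy_cpu(alpha, x, y):
    """Portable fallback: elementwise y := alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

# In a real build these entries would wrap CUDA, HIP/ROCm, and
# SYCL/oneAPI kernels; here they all point at the CPU stand-in.
BACKENDS = {
    "cuda": axpy_cpu,
    "hip": axpy_cpu,
    "sycl": axpy_cpu,
    "cpu": axpy_cpu,
}

def axpy(alpha, x, y, backend="cpu"):
    """Single vendor-neutral entry point."""
    return BACKENDS[backend](alpha, x, y)

print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))   # [12.0, 24.0]
```

The design choice is that callers never branch on vendor; only the dispatch table changes per platform.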

Exascale MPI Optimization

Work on optimizing 2D grid-based MPI communication patterns at exascale (EuroMPI 2023), addressing bottlenecks in data exchange patterns common to both linear algebra and graph algorithms.
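In 2D grid-based patterns, each rank exchanges data only within its process row and column. A sketch of the rank ↔ (row, col) bookkeeping underlying such patterns (plain Python, no MPI; the EuroMPI 2023 work concerns how these exchanges are optimized at scale, which this sketch does not capture):

```python
# Sketch of 2D process-grid bookkeeping (plain Python, no mpi4py).
# Grid-based applications split the global communicator into row and
# column communicators; each rank's collectives then involve only its
# row or column, roughly sqrt(P) ranks instead of all P.

def grid_position(rank, cols):
    """Row-major placement of a rank on a rows x cols grid."""
    return divmod(rank, cols)           # (row, col)

def row_members(rank, rows, cols):
    """Ranks sharing this rank's process row (a row communicator)."""
    r, _ = grid_position(rank, cols)
    return [r * cols + c for c in range(cols)]

def col_members(rank, rows, cols):
    """Ranks sharing this rank's process column (a column communicator)."""
    _, c = grid_position(rank, cols)
    return [rr * cols + c for rr in range(rows)]

rows, cols = 4, 4                        # 16 ranks on a 4x4 grid
print(grid_position(6, cols))            # (1, 2)
print(row_members(6, rows, cols))        # [4, 5, 6, 7]
print(col_members(6, rows, cols))        # [2, 6, 10, 14]
```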

Community Standards

Co-authored the Interface for Sparse Linear Algebra Operations (2024), a community standard defining portable APIs for sparse linear algebra across vendors and implementations. This 26-author effort establishes common interfaces that enable interoperability between algorithms and libraries.
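To make the interoperability goal concrete, here is the kind of operation such an interface standardizes: a sparse matrix–vector multiply over a CSR matrix. The function name and signature are illustrative only; the standard defines its own vendor-neutral handles and descriptors.

```python
# Illustrative CSR sparse matrix-vector multiply, y = A @ x (sketch;
# not the API defined by the community standard). CSR stores, per row,
# a slice of column indices and values delimited by row_ptr.

def spmv_csr(row_ptr, col_idx, vals, x):
    """y[i] = sum of vals[k] * x[col_idx[k]] over row i's nonzeros."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# 3x3 matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))   # [3.0, 3.0, 9.0]
```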

Key Publications

  • P. Sao, R. Kannan, X.S. Li, R. Vuduc. A communication-avoiding 3D sparse triangular solver. ICS 2019.
  • Y. Liu, N. Ding, P. Sao, S. Williams, X.S. Li. Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters. SC 2023.
  • H. Lu, P. Sao, M. Matheson, R. Kannan, F. Wang, T. Potok. Optimizing Communication in 2D Grid-Based MPI Applications at Exascale. EuroMPI 2023.
  • V. Thakkar, R. Kannan, P. Sao, et al. Dense semiring linear algebra on modern CUDA hardware. SIAM CSE 2021.
  • A. Abdelfattah, …, P. Sao, et al. Interface for sparse linear algebra operations. arXiv:2411.13259, 2024.

Connection to Other Projects

These optimization techniques are applied across:

  • SuperLU_DIST: 3D factorization and triangular solve
  • Knowledge Graphs: DSNAPSHOT and COAST communication patterns
  • ArborX: Distributed spatial queries
  • Sparsitute: Communication lower-bound theory
