SuperLU_DIST
Communication-Avoiding 3D Sparse Direct Solver
The Sparse Direct Solver Challenge
Sparse direct solvers are the workhorses of scientific computing—solving the linear systems that arise in structural mechanics, circuit simulation, and computational fluid dynamics. But as systems scale to thousands of processors, communication becomes the bottleneck: traditional 2D algorithms spend more time exchanging data than computing.
SuperLU_DIST is a widely used, open-source library for solving large sparse linear systems A·X = B using Gaussian elimination with static pivoting (GESP). It combines the numerical stability of partial pivoting with the scalability of no-pivoting approaches.
The 3D Communication-Avoiding Algorithm
My key contribution is the communication-avoiding 3D factorization algorithm, which fundamentally restructures how sparse LU factorization distributes work across processors.
graph LR
subgraph Input["Input"]
A["Sparse Matrix A"]
end
subgraph Reorder["Phase 1: Ordering & Analysis"]
ORD["Fill-reducing<br/>Ordering<br/>(ParMETIS)"]
SYM["Symbolic<br/>Factorization"]
ETREE["Elimination<br/>Tree Construction"]
end
subgraph Factor["Phase 2: Numerical Factorization"]
direction TB
P2D["Baseline 2D<br/>Block-Cyclic<br/>Distribution"]
P3D["<b>3D Communication-<br/>Avoiding Algorithm</b><br/>(Sao et al.)"]
GPU["Multi-GPU<br/>Offload<br/>(NVIDIA/AMD/Intel)"]
P3D --- GPU
end
subgraph Solve["Phase 3: Triangular Solve"]
TRI["Sparse Triangular<br/>Solve L·U·X = B"]
end
A --> ORD --> SYM --> ETREE --> Factor --> TRI
style P3D fill:#d4edda,stroke:#155724,stroke-width:2px
style Factor fill:#f0f7ff,stroke:#1a73e8
The algorithm uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades increased memory for dramatically reduced per-process communication. Released as the recommended code path starting with Version 9.0.0.
Performance Results
| Configuration | Planar Graphs | Non-Planar Graphs |
|---|---|---|
| 24,000 cores (Cray XC30) | 27× speedup | 3.3× speedup |
| 4,096 nodes + GPUs (Cray XK7) | 24× speedup | 3.5× speedup |
For a planar graph with n vertices, the 3D algorithm achieves:
- Communication volume reduction by O(√log n)
- Latency reduction by O(log n)
Additional Contributions
- HALO (Highly Asynchronous Lazy Offload) — CPU–GPU co-processing scheme that reduces host-device communication
- Multi-GPU support — NVIDIA, AMD, and Intel GPUs
- Mixed-precision routines — Single-precision factorization with double-precision iterative refinement
- 3D Sparse Triangular Solve — Communication-avoiding triangular solver using one-sided MPI
Key Publications
- P. Sao, X.S. Li, R. Vuduc. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems. JPDC, 2019.
- P. Sao, X.S. Li, R. Vuduc. A communication-avoiding 3D LU factorization algorithm for sparse matrices. IPDPS 2018.
- X.S. Li, P. Lin, Y. Liu, P. Sao. Newly released capabilities in the distributed-memory SuperLU sparse direct solver. ACM TOMS, 2023.