Fault-Tolerant Algorithms
Self-Stabilizing Algorithms for Resilient HPC
The Exascale Reliability Problem
As HPC systems scale toward exascale, the probability of hardware faults—particularly transient soft faults like bit flips—increases significantly. A system with millions of components running for hours will experience faults. Traditional checkpoint-restart is expensive; can algorithms themselves be designed to tolerate faults?
This research develops self-stabilizing algorithms: algorithms that, regardless of whether they start from a valid or corrupted state, are guaranteed to converge to a correct state after a finite number of steps.
Self-Stabilization Framework
stateDiagram-v2
direction LR
state "Valid State" as VS
state "Invalid State<br/>(fault-corrupted)" as IS
state "Self-Stabilizing<br/>Verification &<br/>Correction" as SS
state "Recovered<br/>Valid State" as RV
[*] --> VS : Normal execution
VS --> IS : Soft fault<br/>(bit flip)
IS --> SS : Detect invalid state
SS --> RV : Correct labels
RV --> VS : Continue execution
VS --> VS : No fault
The key insight: by carefully analyzing what constitutes a valid state during algorithm execution, we can design verification and correction procedures that detect and fix corruption without full restart.
Self-Stabilizing Connected Components
For the fundamental problem of computing connected components:
- Developed fault-tolerant variant of label propagation
- Comprehensive analysis of valid vs. invalid states
- Adds only O(V log V) computation and O(V) storage over conventional algorithm
- More resilient than triple modular redundancy (TMR) in 80% of test cases
Self-Stabilizing Iterative Solvers
Extended self-stabilization to numerical linear algebra:
- Iterative solvers that detect soft faults during computation
- Recovery without losing convergence progress
- Applicable to CG, GMRES, and other Krylov methods
Resilience Design Patterns
Co-authored a comprehensive technical report documenting resilience design patterns for extreme-scale systems—providing a structured taxonomy for building fault-tolerant HPC applications.
Why This Matters
| Approach | Overhead | Recovery Time | Coverage |
|---|---|---|---|
| Checkpoint-Restart | High (I/O) | Minutes | Complete |
| Triple Modular Redundancy | 3× compute | Immediate | Hardware faults |
| Self-Stabilization | Low (O(V)) | Algorithm-dependent | Soft faults |
Self-stabilization provides a middle ground: low overhead with automatic recovery for the most common fault types at scale.
Key Publications
- P. Sao, C. Engelmann, S. Eswar, O. Green, R. Vuduc. Self-stabilizing Connected Components. FTXS 2019.
- P. Sao, R. Vuduc. Self-stabilizing Iterative Solvers. ScalA Workshop, SC 2013.
- P. Sao, O. Green, C. Jain, R. Vuduc. A Self-Correcting Connected Components Algorithm. FTXS 2016.
- C. Engelmann, R. Ashraf, S. Hukerikar, M. Kumar, P. Sao. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale. ORNL Tech Report, 2022.