Which activation functions introduce complex singularities that shrink the safe step size during training — and which don't? An interactive companion to Ghosts of Softmax (Appendices 2–3).
Every activation function φ has a set of complex singularities Σφ. A neuron with preactivation hⱼ evolving as hⱼ(t) = hⱼ + tḣⱼ along the update direction hits its nearest singularity at distance:

ρⱼ = min_{z ∈ Σφ} |z − hⱼ| / |ḣⱼ|
The overall convergence radius cannot exceed minⱼ ρⱼ. Combined with the softmax ghost radius ρₐ = π/Δₐ, the conservative bound becomes min(minⱼ ρⱼ, ρₐ).
So the question reduces to: where is Σφ?
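The per-neuron bound above is mechanical to compute once Σφ is known. A minimal sketch, with illustrative values only (the pole lattice is truncated, and the preactivation, update direction, and logit gap Δₐ are made-up numbers, not from the paper):

```python
import math

def step_radius(h, h_dot, singularities):
    """Distance along h(t) = h + t*h_dot to the nearest singularity z in Sigma_phi."""
    return min(abs(z - h) / abs(h_dot) for z in singularities)

# tanh has poles at z = i*(pi/2 + pi*k); truncate the infinite lattice for illustration
tanh_poles = [1j * (math.pi / 2 + math.pi * k) for k in range(-5, 6)]
rho_j = step_radius(h=0.5, h_dot=1.0, singularities=tanh_poles)

# combine with the softmax ghost radius rho_a = pi / Delta_a
delta_a = 2.0  # hypothetical logit gap, for illustration only
rho_safe = min(rho_j, math.pi / delta_a)
```

For an entire activation (Σφ = ∅) the neuron term drops out and only the softmax ghost radius remains, which is exactly the tier-1 advantage in the table below.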
Activations are ranked by the distance from the real axis to their nearest complex singularity; farther is better for convergence-radius safety.
| Tier | Activation | Formula | Singularity set Σφ | Distance | Notes |
|---|---|---|---|---|---|
| 1 | Exact GELU | x · Φ(x) (erf-based) | Σ = ∅ (entire) | ∞ | No finite singularity. Structurally best. Non-monotone derivative (slight dip). |
| 1 | RIA | ∫ Φ(βt) dt = ReLU ⊛ Gaussian | Σ = ∅ (entire) | ∞ | Entire, strictly convex, monotone derivative. Recovers ReLU as β → ∞. |
| 1 | Exact GeGLU | GELU-gated linear unit | Σ = ∅ (entire) | ∞ | Gated variant with entire gate Φ(x). Radius-clean analogue of SwiGLU. |
| 1 | GaussGLU | (W_v x) ⊙ Φ(β W_g x) | Σ = ∅ (entire) | ∞ | Paper's proposed gated FFN. Gaussian CDF gate — no logistic poles. |
| 2 | SiLU / Swish | x · σ(x) | z = iπ(2k+1) | π ≈ 3.14 | Logistic poles at odd multiples of iπ. Same ghost lattice as softmax. |
| 2 | Softplus | log(1 + eˣ) | z = iπ(2k+1) | π ≈ 3.14 | Branch points where 1 + eᶻ = 0. Identical lattice to the sigmoid family. |
| 2 | Sigmoid | 1 / (1 + e⁻ˣ) | z = iπ(2k+1) | π ≈ 3.14 | Poles where 1 + e⁻ᶻ = 0. Its derivatives appear in many gates. |
| 2 | SwiGLU | Swish-gated linear unit | z = iπ(2k+1) | π ≈ 3.14 | Inherits logistic poles from its Swish gate. Widely used in modern LLMs. |
| 3 | Tanh | sinh z / cosh z | z = i(π/2 + πk) | π/2 ≈ 1.57 | Poles where cosh z = 0. Strictly closer than sigmoid — 2× worse. |
| 3 | Tanh-approx GELU | ½x(1 + tanh(√(2/π)(x + 0.044715x³))) | z = i(π/2 + πk) | π/2 ≈ 1.57 | Caution: reintroduces the tanh poles that exact GELU avoids. Common in legacy code. |
| 4 | ReLU | max(0, x) | z = 0 (kink on ℝ) | \|h\|/\|ḣ\| → 0 | Not analytic at x = 0. Kink on the real axis; distance to the singularity can be arbitrarily small. |
| 4 | Leaky ReLU / PReLU | max(αx, x) | z = 0 (kink on ℝ) | \|h\|/\|ḣ\| → 0 | Same kink structure as ReLU. Piecewise linear, not analytic at the breakpoint. |
| 4 | Hard-Swish / Hard-Sigmoid | piecewise-linear approximations | multiple kinks on ℝ | min \|h − b\|/\|ḣ\| | Multiple breakpoints. Structurally the worst class from an analyticity perspective. |
| 4 | ReGLU | ReLU-gated linear unit | z = 0 (gate kink on ℝ) | \|h\|/\|ḣ\| → 0 | Gated variant; inherits the ReLU kink in its gate branch. |
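The Distance column for the lattice families can be checked directly. A small sketch (lattices truncated to a few terms; the sample preactivations in the ReLU line are illustrative):

```python
import math

# Nearest-singularity distance from the real axis (i.e., from h = 0),
# truncating each infinite pole lattice to a few terms for illustration.
ks = range(-4, 5)
logistic_lattice = [1j * math.pi * (2 * k + 1) for k in ks]        # sigmoid / SiLU / softplus / SwiGLU
tanh_lattice = [1j * (math.pi / 2 + math.pi * k) for k in ks]      # tanh, tanh-approx GELU

d_logistic = min(abs(z) for z in logistic_lattice)  # pi
d_tanh = min(abs(z) for z in tanh_lattice)          # pi/2: strictly closer, 2x worse

# ReLU's kink sits at z = 0 on the real axis, so the distance |h|/|h_dot|
# shrinks without bound as the preactivation h approaches the kink.
d_relu = min(abs(0.0 - h) / 1.0 for h in (1.0, 0.1, 0.001))
```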
Exact GELU and RIA are both entire (tier 1), but RIA adds strict convexity and a monotone derivative. Drag the β slider to control RIA's sharpness; as β → ∞ it recovers ReLU.
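RIA's defining integral has a closed form: since d/dx [x·Φ(βx) + φ(βx)/β] = Φ(βx), with φ the standard normal pdf, we get RIA(x) = x·Φ(βx) + φ(βx)/β. A minimal sketch under that identity (function names are mine, not the paper's):

```python
import math

SQRT2 = math.sqrt(2.0)
SQRT_2PI = math.sqrt(2.0 * math.pi)

def norm_cdf(u):
    """Standard normal CDF Phi(u), via erf."""
    return 0.5 * (1.0 + math.erf(u / SQRT2))

def norm_pdf(u):
    """Standard normal pdf phi(u)."""
    return math.exp(-0.5 * u * u) / SQRT_2PI

def ria(x, beta=1.0):
    """RIA(x) = integral of Phi(beta*t) dt = x*Phi(beta*x) + phi(beta*x)/beta."""
    return x * norm_cdf(beta * x) + norm_pdf(beta * x) / beta
```

Both terms are entire in x, the derivative Φ(βx) is monotone, and for large β the gate saturates, so RIA(x) → max(0, x).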
From the convergence-radius perspective, an ideal activation would:

- be entire (Σφ = ∅), contributing no finite singularity of its own;
- have a monotone derivative;
- recover ReLU-like behavior in a limit (as RIA does for β → ∞).