Activation Design from the Convergence-Radius Lens

Which activation functions introduce complex singularities that shrink the safe step size during training — and which don't? An interactive companion to Ghosts of Softmax (Appendices 2–3).

Piyush Sao · Oak Ridge National Laboratory · github.com/piyush314/ghosts-of-softmax

01 The per-neuron radius

Every activation function φ has a set of complex singularities Σφ. Neuron j, with preactivation hj evolving as hj(t) = hj + t·ḣj along the update direction, first reaches a singularity at step size:

ρj = min_{s ∈ Σφ} |s − hj| / |ḣj|

The overall convergence radius cannot exceed minj ρj. Combined with the softmax ghost radius ρa = π/Δa, the conservative bound becomes min(minj ρj, ρa).

So the question reduces to: where is Σφ?
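Before turning to specific Σφ, the bound itself is easy to sketch numerically. The following is an illustrative sketch with toy values (function names, h, ḣ, and Δa are all made up here, not taken from the paper's code), using the logistic singularity lattice z = iπ(2k+1) discussed below:

```python
import math

def logistic_singularities(k_max=50):
    """Singularity lattice z = i*pi*(2k+1) shared by the sigmoid family (truncated)."""
    return [1j * math.pi * (2 * k + 1) for k in range(-k_max, k_max + 1)]

def neuron_radius(h_j, hdot_j, sigma):
    """rho_j = min_{s in Sigma_phi} |s - h_j| / |hdot_j|; an entire activation gives inf."""
    if not sigma:
        return math.inf
    return min(abs(s - h_j) for s in sigma) / abs(hdot_j)

def safe_radius(h, h_dot, sigma, rho_a):
    """Conservative overall bound: min(min_j rho_j, rho_a)."""
    return min(min(neuron_radius(hj, hdj, sigma) for hj, hdj in zip(h, h_dot)), rho_a)

h = [0.5, -1.0, 2.0]        # toy preactivations
h_dot = [1.0, 2.0, 0.5]     # toy update directions
rho_a = math.pi / 4         # softmax ghost radius pi / Delta_a, with a toy Delta_a = 4
print(safe_radius(h, h_dot, logistic_singularities(), rho_a))  # softmax radius binds here
```

With these toy numbers the per-neuron radii all exceed ρa, so the softmax ghost radius is the binding constraint; shrink a |ḣj| toward a singularity and the activation term takes over.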

02 Activation singularities at a glance

Ranked by distance from the real axis to the nearest complex singularity; farther is better for convergence-radius safety.

Tier 1 · entire (Σφ = ∅, distance ∞)
• Exact GELU, x·Φ(x) (erf-based). No finite singularity. Structurally best. Non-monotone derivative (slight dip).
• RIA, ∫ Φ(βt) dt = ReLU ⊛ Gaussian. Entire, strictly convex, monotone derivative. Recovers ReLU as β → ∞.
• Exact GeGLU, GELU-gated linear unit. Gated variant with entire gate Φ(x). Radius-clean analogue of SwiGLU.
• GaussGLU, (Wvx) ⊙ Φ(βWgx). The paper's proposed gated FFN: a Gaussian-CDF gate with no logistic poles.

Tier 2 · logistic poles (z = iπ(2k+1), distance π ≈ 3.14)
• SiLU / Swish, x·σ(x). Logistic poles at odd multiples of iπ. Same ghost lattice as softmax.
• Softplus, log(1 + eˣ). Branch points where 1 + eᶻ = 0. Identical lattice to the sigmoid family.
• Sigmoid, 1/(1 + e⁻ˣ). Poles where 1 + e⁻ᶻ = 0. Its derivatives appear in many gates.
• SwiGLU, Swish-gated linear unit. Inherits logistic poles from its Swish gate. Widely used in modern LLMs.

Tier 3 · tanh poles (z = i(π/2 + πk), distance π/2 ≈ 1.57)
• Tanh, sinh z / cosh z. Poles where cosh z = 0. Twice as close to ℝ as the sigmoid family, so a 2× worse bound.
• Tanh-approx GELU, ½x(1 + tanh(√(2/π)(x + 0.044715x³))). Caution: reintroduces the tanh poles that exact GELU avoids. Common in legacy code.

Tier 4 · kinks on ℝ (distance minb |h − b|/|ḣ| over breakpoints b, which can be arbitrarily small)
• ReLU, max(0, x). Not analytic at x = 0; the kink sits on the real axis itself.
• Leaky ReLU / PReLU, max(αx, x). Same kink structure as ReLU; piecewise linear, not analytic at the breakpoint.
• Hard-Swish / Hard-Sigmoid, piecewise-linear approximations. Multiple breakpoints; structurally the worst class from the analyticity perspective.
• ReGLU, ReLU-gated linear unit. Inherits the ReLU kink in its gate branch.

Entire ≠ unbounded step. A nonconstant entire function like erf(z) grows as exp(z²) off the real axis. "Infinite activation radius" means the activation contributes no finite singularity barrier — the softmax radius ρa and strip growth rate remain as separate constraints.
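Both claims in this section are easy to verify numerically: the sigmoid family blows up as you approach iπ, tanh's denominator already vanishes at iπ/2, and an entire function can still explode off the axis. A small sketch (standard library only; values and tolerances are illustrative):

```python
import cmath
import math

def sigmoid(z):
    """Logistic function; poles at z = i*pi*(2k+1), where 1 + exp(-z) = 0."""
    return 1 / (1 + cmath.exp(-z))

eps = 1e-6
# Approaching the nearest pole i*pi, |sigmoid| blows up like 1/eps ...
print(abs(sigmoid(1j * (math.pi - eps))))
# ... while halfway up the imaginary axis it is tame (|.| = 1/sqrt(2)).
print(abs(sigmoid(1j * math.pi / 2)))

# Tanh's pole is twice as close: cosh already vanishes at i*pi/2.
print(abs(cmath.cosh(1j * (math.pi / 2 - eps))))  # ~eps: tanh's denominator dying

# "Entire" still does not mean bounded: the Gaussian pdf has no poles,
# yet on the imaginary axis |phi(iy)| = exp(y^2/2)/sqrt(2*pi) explodes.
phi = lambda z: cmath.exp(-z * z / 2) / math.sqrt(2 * math.pi)
print(abs(phi(3j)))  # exp(4.5)/sqrt(2*pi), already ~36
```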

03 RIA vs GELU — interactive comparison

Both are entire (tier 1), but RIA adds strict convexity and a monotone derivative. Drag the β slider to control RIA's sharpness — at β → ∞ it recovers ReLU.

RIA:  φβ(x) = x·Φ(βx) + ϕ(βx)/β    φβ′(x) = Φ(βx) ∈ (0,1)     φβ″(x) = β·ϕ(βx) > 0
GELU: g(x) = x·Φ(x)                g′(x) = Φ(x) + x·ϕ(x)      g″(x) = (2 − x²)·ϕ(x), negative for |x| > √2

(Φ is the standard normal CDF, ϕ its pdf.)
[Interactive figure: activations and their derivatives for RIAβ, GELU, ReLU, and Softplus, with a β slider. Key contrasts: RIA′ = Φ(βx) is monotone and stays in (0,1), while GELU′ dips below zero; RIA″ = β·ϕ(βx) > 0 (strictly convex), while GELU″ goes negative (non-convex pocket).]
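The derivative identities above can be checked with finite differences, along with the ReLU limit. A minimal sketch, assuming the RIA definition x·Φ(βx) plus the Gaussian pdf term exp(−(βx)²/2)/(β√(2π)); function names are illustrative:

```python
import math

def Phi(x):
    """Standard normal CDF (the RIA/GELU gate)."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def npdf(x):
    """Standard normal pdf."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def ria(x, beta):
    """RIA: x*Phi(beta*x) + npdf(beta*x)/beta, an antiderivative of Phi(beta*x)."""
    return x * Phi(beta * x) + npdf(beta * x) / beta

# Central finite difference confirms ria' = Phi(beta*x), which lies in (0, 1).
beta, x0, h = 2.0, -0.7, 1e-5
fd = (ria(x0 + h, beta) - ria(x0 - h, beta)) / (2 * h)
print(fd, Phi(beta * x0))  # the two agree to high precision

# As beta grows, RIA collapses onto ReLU: ria(1, b) -> 1 and ria(-1, b) -> 0.
for b in (1.0, 10.0, 100.0):
    print(b, ria(1.0, b), ria(-1.0, b))
```

The second loop makes the β → ∞ claim from the intro concrete: the gap to ReLU at x = ±1 shrinks rapidly as β grows.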

04 Design principles

From the convergence-radius perspective, an ideal activation would:

1. Entire — no finite complex singularities (or at least far from ℝ)
2. ReLU-like on ℝ — suppress negative side, linear on positive
3. Monotone derivative — no dead regions or non-monotonicity
4. Controlled strip growth — moderate growth for z = x + iy in the relevant |y| range
5. Cheap & stable — avoid tanh approximations that reintroduce poles
Exact vs approximate GELU matters. The tanh approximation ½x(1 + tanh(√(2/π)(x + 0.044715x³))) reintroduces poles at i(π/2 + πk) — dropping from tier 1 (∞) to tier 3 (π/2 ≈ 1.57). Many frameworks now provide exact erf-based GELU, but legacy codebases may still use the tanh shortcut.
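A quick numeric check puts this trade-off in perspective: on the real axis the tanh shortcut tracks exact GELU to within roughly 10⁻³, so the case for the erf form is the complex-plane singularity structure, not real-axis accuracy. An illustrative sketch (standard library only):

```python
import math

def gelu_exact(x):
    """erf-based GELU, x*Phi(x): entire, no complex singularities."""
    return 0.5 * x * (1 + math.erf(x / math.sqrt(2)))

def gelu_tanh(x):
    """The tanh shortcut, which inherits tanh-family poles off the real axis."""
    inner = math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1 + math.tanh(inner))

# On the real axis the two are nearly indistinguishable ...
xs = [i / 1000 for i in range(-5000, 5001)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh| on [-5, 5]: {max_gap:.1e}")
# ... so choosing the exact form costs essentially nothing on R while
# removing the finite singularity barrier in the complex plane.
```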