Which activation functions introduce complex singularities that shrink the safe step size during training — and which don't? An interactive companion to Ghosts of Softmax (Appendices 2–3).
Every activation function φ has a set of complex singularities Σφ. A neuron with preactivation hⱼ evolving as hⱼ(t) = hⱼ + tḣⱼ along the update direction hits its nearest singularity at distance:

ρⱼ = min_{z ∈ Σφ} |z − hⱼ| / |ḣⱼ|
The overall convergence radius cannot exceed minⱼ ρⱼ. Combined with the softmax ghost radius ρₐ = π/Δₐ, the conservative bound becomes min(minⱼ ρⱼ, ρₐ).
So the question reduces to: where is Σφ?
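The per-neuron bound above is mechanical to compute once Σφ is known. A minimal sketch, with illustrative values only (the pole lattice is truncated, and the preactivation, update direction, and logit gap Δₐ are made-up numbers, not from the paper):

```python
import math

def step_radius(h, h_dot, singularities):
    """Distance along h(t) = h + t*h_dot to the nearest singularity z in Sigma_phi."""
    return min(abs(z - h) / abs(h_dot) for z in singularities)

# tanh has poles at z = i*(pi/2 + pi*k); truncate the infinite lattice for illustration
tanh_poles = [1j * (math.pi / 2 + math.pi * k) for k in range(-5, 6)]
rho_j = step_radius(h=0.5, h_dot=1.0, singularities=tanh_poles)

# combine with the softmax ghost radius rho_a = pi / Delta_a
delta_a = 2.0  # hypothetical logit gap, for illustration only
rho_safe = min(rho_j, math.pi / delta_a)
```

For an entire activation (Σφ = ∅) the neuron term drops out and only the softmax ghost radius remains, which is exactly the tier-1 advantage in the table below.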
Activations are ranked by the distance from the real axis to their nearest complex singularity; farther is better for convergence-radius safety.
| Tier | Activation | Formula | Singularity set Σφ | Distance | Notes |
|---|---|---|---|---|---|
| 1 | Exact GELU | x · Φ(x) (erf-based) | Σ = ∅ (entire) | ∞ | No finite singularity. Structurally best. Non-monotone derivative (slight dip). |
| 1 | RIA | ∫ Φ(βt) dt = ReLU ⊛ Gaussian | Σ = ∅ (entire) | ∞ | Entire, strictly convex, monotone derivative. Recovers ReLU as β → ∞. |
| 1 | Exact GeGLU | GELU-gated linear unit | Σ = ∅ (entire) | ∞ | Gated variant with entire gate Φ(x). Radius-clean analogue of SwiGLU. |
| 1 | GaussGLU | (W_v x) ⊙ Φ(β W_g x) | Σ = ∅ (entire) | ∞ | Paper's proposed gated FFN. Gaussian CDF gate — no logistic poles. |
| 2 | SiLU / Swish | x · σ(x) | z = iπ(2k+1) | π ≈ 3.14 | Logistic poles at odd multiples of iπ. Same ghost lattice as softmax. |
| 2 | Softplus | log(1 + eˣ) | z = iπ(2k+1) | π ≈ 3.14 | Branch points where 1 + eᶻ = 0. Identical lattice to the sigmoid family. |
| 2 | Sigmoid | 1 / (1 + e⁻ˣ) | z = iπ(2k+1) | π ≈ 3.14 | Poles where 1 + e⁻ᶻ = 0. Its derivatives appear in many gates. |
| 2 | SwiGLU | Swish-gated linear unit | z = iπ(2k+1) | π ≈ 3.14 | Inherits logistic poles from its Swish gate. Widely used in modern LLMs. |
| 3 | Tanh | sinh z / cosh z | z = i(π/2 + πk) | π/2 ≈ 1.57 | Poles where cosh z = 0. Strictly closer than sigmoid — 2× worse. |
| 3 | Tanh-approx GELU | ½x(1 + tanh(√(2/π)(x + 0.044715x³))) | z = i(π/2 + πk) | π/2 ≈ 1.57 | Caution: reintroduces the tanh poles that exact GELU avoids. Common in legacy code. |
| 4 | ReLU | max(0, x) | z = 0 (kink on ℝ) | \|h\|/\|ḣ\| → 0 | Not analytic at x = 0. Kink on the real axis; distance to the singularity can be arbitrarily small. |
| 4 | Leaky ReLU / PReLU | max(αx, x) | z = 0 (kink on ℝ) | \|h\|/\|ḣ\| → 0 | Same kink structure as ReLU. Piecewise linear, not analytic at the breakpoint. |
| 4 | Hard-Swish / Hard-Sigmoid | piecewise-linear approximations | multiple kinks on ℝ | min \|h − b\|/\|ḣ\| | Multiple breakpoints. Structurally the worst class from an analyticity perspective. |
| 4 | ReGLU | ReLU-gated linear unit | z = 0 (gate kink on ℝ) | \|h\|/\|ḣ\| → 0 | Gated variant; inherits the ReLU kink in its gate branch. |
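The Distance column for the lattice families can be checked directly. A small sketch (lattices truncated to a few terms; the sample preactivations in the ReLU line are illustrative):

```python
import math

# Nearest-singularity distance from the real axis (i.e., from h = 0),
# truncating each infinite pole lattice to a few terms for illustration.
ks = range(-4, 5)
logistic_lattice = [1j * math.pi * (2 * k + 1) for k in ks]        # sigmoid / SiLU / softplus / SwiGLU
tanh_lattice = [1j * (math.pi / 2 + math.pi * k) for k in ks]      # tanh, tanh-approx GELU

d_logistic = min(abs(z) for z in logistic_lattice)  # pi
d_tanh = min(abs(z) for z in tanh_lattice)          # pi/2: strictly closer, 2x worse

# ReLU's kink sits at z = 0 on the real axis, so the distance |h|/|h_dot|
# shrinks without bound as the preactivation h approaches the kink.
d_relu = min(abs(0.0 - h) / 1.0 for h in (1.0, 0.1, 0.001))
```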
Exact GELU and RIA are both entire (tier 1), but RIA adds strict convexity and a monotone derivative. Drag the β slider to control RIA's sharpness; as β → ∞ it recovers ReLU.
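RIA's defining integral has a closed form: since d/dx [x·Φ(βx) + φ(βx)/β] = Φ(βx), with φ the standard normal pdf, we get RIA(x) = x·Φ(βx) + φ(βx)/β. A minimal sketch under that identity (function names are mine, not the paper's):

```python
import math

SQRT2 = math.sqrt(2.0)
SQRT_2PI = math.sqrt(2.0 * math.pi)

def norm_cdf(u):
    """Standard normal CDF Phi(u), via erf."""
    return 0.5 * (1.0 + math.erf(u / SQRT2))

def norm_pdf(u):
    """Standard normal pdf phi(u)."""
    return math.exp(-0.5 * u * u) / SQRT_2PI

def ria(x, beta=1.0):
    """RIA(x) = integral of Phi(beta*t) dt = x*Phi(beta*x) + phi(beta*x)/beta."""
    return x * norm_cdf(beta * x) + norm_pdf(beta * x) / beta
```

Both terms are entire in x, the derivative Φ(βx) is monotone, and for large β the gate saturates, so RIA(x) → max(0, x).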
From the convergence-radius perspective, an ideal activation would:

- be entire (Σφ = ∅), contributing no finite singularity of its own;
- have a monotone derivative;
- recover ReLU-like behavior in a limit (as RIA does for β → ∞).