Analytic Normalization: Removing the Singularity at Zero

LayerNorm and RMSNorm divide by √v — which has a branch-point singularity at v = 0. The Weierstrass transform (Gaussian convolution) replaces it with an entire function, eliminating the finite ghost barrier. An interactive companion to Ghosts of Softmax, Appendix 3.

Piyush Sao · Oak Ridge National Laboratory · github.com/piyush314/ghosts-of-softmax

01 The singularity in standard normalization

LayerNorm and RMSNorm both rely on a scale factor of the form f(v) = 1/√v, where v is a variance or mean-square quantity computed from activations. On the positive reals this is perfectly well-behaved. But in the complex plane, 1/√v has a branch-point singularity at v = 0.

During the forward pass, v is always positive. But from the convergence-radius perspective, the singularity at v = 0 is still relevant — it limits the Taylor convergence radius of any function that depends on 1/√v. If a neuron's variance evolves as v(t) = v₀ + t·v̇ along the update direction, the singularity lies at distance |v₀|/|v̇| from the current point. For neurons with small variance (common in normalized networks), this can be very close.

Standard:  f(v) = 1/√v  →  branch point at v = 0  →  ρ = |v₀|/|v̇|
Smoothed:  f̃σ(v) = Weierstrassσ[1/√max(0,·)](v)  →  entire  →  ρ = ∞
Why this matters: In normalized networks, many preactivations cluster near zero. The normalization singularity at v = 0 can become the binding constraint on ρ — tighter than the softmax ghosts at distance π/Δa. The tiny-transformer experiment in the paper (Table 2) found that the FFN radius was the bottleneck in 94% of early-training steps.
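The radius limit is easy to check directly. A minimal sketch (plain Python, function names are mine): sum the binomial Taylor series of 1/√v around v₀ = 1 and watch the partial sums converge inside |v − v₀| < v₀ but blow up outside, even at points where 1/√v itself is perfectly finite:

```python
import math

def inv_sqrt_partial_sum(v0, h, n_terms):
    """Partial sum of the Taylor series of 1/sqrt(v0 + h) around v0:
    (1/sqrt(v0)) * sum_n binom(-1/2, n) (h/v0)^n, convergence radius |h| < v0."""
    total, coeff = 0.0, 1.0
    for n in range(n_terms):
        total += coeff * (h / v0) ** n
        coeff *= (-0.5 - n) / (n + 1)   # recurrence for the generalized binomial coefficient
    return total / math.sqrt(v0)

v0 = 1.0
inside = inv_sqrt_partial_sum(v0, 0.5, 80)    # |h| < v0: converges
print(abs(inside - 1 / math.sqrt(1.5)))       # tiny residual
outside = inv_sqrt_partial_sum(v0, 1.5, 80)   # |h| > v0: diverges
print(abs(outside))                           # enormous, even though 1/sqrt(2.5) is finite
```

The divergent case makes the "ghost" concrete: nothing is wrong with the function at v = 2.5, but the singularity at v = 0 still caps the radius of the expansion around v₀ = 1.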

02 The fix: Gaussian convolution

The Weierstrass transform convolves f with a Gaussian of width σ, smearing out the singularity:

f̃σ(v) = (1/(σ√(2π))) ∫₀^∞ (1/√t) · exp(−(v−t)²/(2σ²)) dt

Because the Gaussian kernel is entire in v and the integral converges locally uniformly, it defines an entire function of v. The singularity at v = 0 is gone. In closed form:

f̃σ(v) = C · exp(−v²/(4σ²)) · D₋₁/₂(−v/σ)

where D₋₁/₂ is a parabolic cylinder function — a solution of Weber's equation. Parabolic cylinder functions are entire: no poles, no branch cuts. As σ → 0, f̃σ recovers 1/√v for v > 0.

The same principle as RIA. RIA removes ReLU's kink by convolving with a Gaussian. Here, we remove 1/√v's branch point by the same mechanism. Gaussian convolution is a universal singularity eraser.
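A quick numerical sanity check (a quadrature sketch under my own discretization choices, not the paper's implementation): evaluate the convolution integral directly and confirm the result is finite at v = 0 while matching 1/√v away from zero.

```python
import math

def smoothed_inv_sqrt(v, sigma, n=20000):
    """Weierstrass transform of 1/sqrt(max(0, t)) at v, by midpoint quadrature."""
    t_max = max(v, 0.0) + 12 * sigma          # Gaussian tail negligible beyond this
    dt = t_max / n
    norm = 1.0 / (sigma * math.sqrt(2 * math.pi))
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * dt                    # midpoints avoid the 1/sqrt(t) endpoint
        total += (1.0 / math.sqrt(t)) * math.exp(-(v - t) ** 2 / (2 * sigma ** 2)) * dt
    return norm * total

sigma = 0.1
print(smoothed_inv_sqrt(0.0, sigma))   # finite, unlike 1/sqrt(0)
print(smoothed_inv_sqrt(1.0, sigma))   # close to 1/sqrt(1), up to an O(sigma^2) correction
```

For v ≫ σ the Gaussian barely feels the truncation at t = 0, so the smoothed value tracks 1/√v; near v = 0 it flattens out to a finite peak instead of diverging.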

03 Interactive comparison

See how the smoothed normalization compares to standard 1/√v. The smoothing width σ controls the tradeoff: small σ is closer to the original but retains near-singular behavior; large σ is fully smooth but deviates more.

[Interactive figure: normalization scale factor f(v) and its derivative. Curves: standard 1/√v (singular at 0); smoothed f̃σ(v) (entire); ε-clamped 1/√max(ε, v). Derivatives: the standard −1/(2v^(3/2)) blows up near 0, while the smoothed f̃′σ(v) stays bounded. Radii: the standard ρ = v/|v̇| vanishes near 0; the smoothed version has no finite singularity barrier; the softmax radius ρa = π/Δa is shown as a reference line.]

04 Normalization variants at a glance

Method · Scale factor · Singularity · Entire? · Notes

LayerNorm (Ba et al. 2016)
  Scale factor: 1/√(Var(x)+ε) · Singularity: branch point at Var(x) = −ε · Entire: no
  ε shifts the singularity off zero but doesn't remove it; in the complex plane it still limits the radius.

RMSNorm (Zhang & Sennrich 2019)
  Scale factor: 1/√(mean(x²)+ε) · Singularity: branch point at mean(x²) = −ε · Entire: no
  Same structure as LayerNorm without centering; same singularity class.

BatchNorm (Ioffe & Szegedy 2015)
  Scale factor: 1/√(Var_batch+ε) · Singularity: branch point, plus running statistics · Entire: no
  Running statistics add extra fragility: they can become corrupted during spikes (see paper Fig. 10).

ε-clamped (common implementation)
  Scale factor: 1/√max(ε, v) · Singularity: kink at v = ε · Entire: no (not analytic)
  Replaces the branch point with a ReLU-like kink. Piecewise, not analytic; strictly worse through the radius lens.

Weierstrass-smoothed (this paper, proposed)
  Scale factor: C·exp(−v²/(4σ²))·D₋₁/₂(−v/σ) · Singularity: none · Entire: yes
  Gaussian convolution of 1/√max(0, v). The parabolic cylinder function is entire: no finite ghost barrier.

Goldschmidt approximation (this paper, route 3)
  Scale factor: N-step polynomial iteration · Singularity: none (polynomial) · Entire: yes
  A fixed iteration count gives a polynomial, hence an entire function. Hardware-friendly; accuracy depends on N.

Randomized smoothing (this paper, route 2)
  Scale factor: E_ξ[(max(0, v+ξ) + ε)^(−1/2)], ξ ~ N(0, σ²) · Singularity: none in expectation · Entire: stochastic, entire in expectation
  Recovers the analytic function in expectation. Adds noise to the forward pass, which acts as a natural regularizer.

05 Three implementation routes

Deep-learning frameworks lack a native differentiable parabolic cylinder function. The paper proposes three practical paths to computing the smoothed normalization:

Taylor series

Since f̃σ is entire, its Maclaurin series converges everywhere. Precompute coefficients offline, evaluate a truncated polynomial at runtime. Accuracy controlled by truncation order.

Exact (to truncation order)
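A sketch of the offline/runtime split, under my own choices (NumPy, a Chebyshev least-squares fit as a stand-in for the exact Maclaurin coefficients, which would require the parabolic cylinder function; σ, degree, and range are illustrative):

```python
import numpy as np

def smoothed_values(vs, sigma=0.3, n=4000):
    """f~_sigma on a grid by midpoint quadrature of the Weierstrass integral (offline step)."""
    t_max = vs.max() + 12 * sigma            # Gaussian tail negligible beyond this
    t = (np.arange(n) + 0.5) * t_max / n     # midpoints avoid the 1/sqrt(t) endpoint
    dt = t_max / n
    g = np.exp(-(vs[:, None] - t[None, :]) ** 2 / (2 * sigma ** 2))
    return (g / np.sqrt(t)).sum(axis=1) * dt / (sigma * np.sqrt(2 * np.pi))

# Offline: fit a degree-16 polynomial on the working range of v.
grid = np.linspace(0.0, 2.0, 200)
coeffs = np.polynomial.chebyshev.chebfit(grid, smoothed_values(grid), 16)

# Runtime: normalization is now just polynomial evaluation.
print(np.polynomial.chebyshev.chebval(1.0, coeffs))   # close to 1/sqrt(1), up to O(sigma^2)
```

The runtime cost is a handful of fused multiply-adds per element, and the polynomial is entire by construction; all the special-function work happens offline.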

Randomized smoothing

Use f̃σ(v) = E_ξ[(max(0, v+ξ) + ε)^(−1/2)] with ξ ~ N(0, σ²). Sample ξ in the forward pass. Recovers the entire function in expectation. Acts as a natural regularizer.

Stochastic · regularizing
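A minimal sketch of the estimator (plain Python; the ε and σ values are my illustrative choices, not the paper's):

```python
import math, random

def randomized_inv_sqrt(v, sigma=0.1, eps=1e-2, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_xi[(max(0, v + xi) + eps)^(-1/2)], xi ~ N(0, sigma^2).
    Training would draw a single xi per forward pass; many samples are used here
    only to exhibit the estimator's mean."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        xi = rng.gauss(0.0, sigma)
        acc += (max(0.0, v + xi) + eps) ** -0.5
    return acc / n_samples

print(randomized_inv_sqrt(1.0))   # close to 1/sqrt(1)
print(randomized_inv_sqrt(0.0))   # finite: the eps term caps the blow-up
```

Each forward pass sees a single-sample version of this mean, so the noise it injects plays the same role as other stochastic regularizers.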

Goldschmidt iterations

Approximate 1/√v by a fixed number of Newton-style polynomial iterations. The N-step sequence is a polynomial (hence entire). Hardware-friendly — maps to fused multiply-add.

Fast · hardware-friendly
The Goldschmidt route is particularly appealing because it requires no special functions, is differentiable by default (it's just polynomial arithmetic), and can run at fixed latency — ideal for accelerator pipelines. The iteration y ← y · (3 − v·y²) / 2 converges quadratically to 1/√v from a reasonable initial guess. With N = 3–4 steps, accuracy is sufficient for FP16/BF16 training.
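The iteration from the paragraph above, as a self-contained sketch (the crude seed is my illustrative choice; real kernels seed from a bit trick or a small lookup table and then need only 3–4 steps):

```python
def goldschmidt_inv_sqrt(v, n_steps=6):
    """Fixed-count Newton/Goldschmidt-style iteration for 1/sqrt(v).
    With n_steps fixed, the whole map is a polynomial in v, hence entire."""
    y = 1.0 / (1.0 + v)                   # rough seed, valid for v > 0
    for _ in range(n_steps):
        y = y * (3.0 - v * y * y) * 0.5   # quadratically convergent step
    return y

print(goldschmidt_inv_sqrt(4.0))    # close to 0.5
print(goldschmidt_inv_sqrt(0.25))   # close to 2.0
```

Note that each step is three multiplies and one subtract, all expressible as fused multiply-adds, which is why the route maps so cleanly onto accelerator pipelines.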

06 The singularity in the complex plane

The function 1/√z has a branch cut along the negative real axis, emanating from the branch point at z = 0. Any path in parameter space that causes the variance v to approach zero from any direction hits this singularity and caps the Taylor convergence radius.
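The jump across the cut is easy to see numerically: evaluating 1/√z just above and just below the negative real axis gives values differing by about 2i (plain Python, using the principal branch that cmath implements):

```python
import cmath

# 1/sqrt(z) evaluated just above and just below the cut at z = -1
above = 1 / cmath.sqrt(complex(-1.0, +1e-9))   # approximately -i
below = 1 / cmath.sqrt(complex(-1.0, -1e-9))   # approximately +i
print(above, below, abs(above - below))        # discontinuity of size about 2
```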

[Interactive figure: the complex plane, branch cut vs entire function. Shows the branch cut of 1/√z along the negative real axis, the branch point at z = 0, the Taylor convergence disk of radius v₀, and the smoothed version, which is entire with no finite barrier.]
After smoothing, the disk disappears. The Weierstrass-smoothed version is entire — there is no finite singularity in the complex plane, so the normalization layer contributes no ghost barrier. The remaining constraint is purely from the softmax (ρa = π/Δa) or from non-entire activations.