LayerNorm and RMSNorm divide by √v — which has a branch-point singularity at v = 0. The Weierstrass transform (Gaussian convolution) replaces it with an entire function, eliminating the finite ghost barrier. An interactive companion to Ghosts of Softmax, Appendix 3.
LayerNorm and RMSNorm both rely on a scale factor of the form f(v) = 1/√v, where v is a variance or mean-square quantity computed from activations. On the positive reals this is perfectly well-behaved. But in the complex plane, 1/√v has a branch-point singularity at v = 0.
During the forward pass, v is always positive. But from the convergence-radius perspective, the singularity at v = 0 is still relevant — it limits the Taylor convergence radius of any function that depends on 1/√v. If a neuron's variance evolves as v(t) = v₀ + t·v̇ along the update direction, the singularity at v = 0 sits at parameter distance |v₀|/|v̇| from the current point. For neurons with small variance (common in normalized networks), this can be very close.
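This radius cap is easy to see numerically. A minimal sketch (the values v₀ = 0.1 and the step sizes are illustrative choices, not from the paper): the Maclaurin series of (v₀ + t)^−1/2 in t converges for |t| < v₀ and diverges beyond it.

```python
def taylor_partial_sum(v0, t, n_terms):
    """Partial sum of the Maclaurin series of (v0 + t)**-0.5 about t = 0.

    The nearest singularity (branch point at v0 + t = 0) caps the
    convergence radius at |v0|.
    """
    term, total = v0 ** -0.5, 0.0
    for n in range(n_terms):
        total += term
        # Coefficient recurrence: c_{n+1} = c_n * (-1/2 - n) / (n + 1),
        # with one extra factor of (t / v0) per power of t.
        term *= (-0.5 - n) / (n + 1) * (t / v0)
    return total

v0 = 0.1
inside  = taylor_partial_sum(v0, 0.09, 400)   # |t| < v0: converges
outside = taylor_partial_sum(v0, 0.11, 400)   # |t| > v0: partial sums blow up
exact   = (v0 + 0.09) ** -0.5                 # true value at the inside point
```

Just past the radius (|t| = 0.11 vs. v₀ = 0.1) the partial sums grow geometrically, while just inside they converge to the true value.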
The Weierstrass transform convolves f with a Gaussian of width σ, smearing out the singularity:

f̃σ(v) = (1/√(2πσ²)) ∫₀^∞ u^(−1/2) · exp(−(v−u)²/(2σ²)) du
Because the Gaussian is entire and the integrand is absolutely integrable, this integral defines an entire function of v. The singularity at v = 0 is gone. In closed form:

f̃σ(v) = C · exp(−v²/4σ²) · D₋½(−v/σ),  with C = 1/√(2σ)
where D−1/2 is a parabolic cylinder function — a solution of Weber's equation. Parabolic cylinder functions are entire: no poles, no branch cuts. As σ → 0, f̃σ recovers 1/√v for v > 0.
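The closed form can be evaluated with SciPy's parabolic cylinder routine `scipy.special.pbdv`. A sketch under stated assumptions — the constant C = 1/√(2σ) follows from carrying out the Gaussian convolution above, and the Monte Carlo check below estimates that same convolution directly:

```python
import numpy as np
from scipy.special import pbdv

def smoothed_inv_sqrt(v, sigma):
    """Weierstrass-smoothed 1/sqrt(v): C * exp(-v^2/(4 sigma^2)) * D_{-1/2}(-v/sigma),
    with C = 1/sqrt(2*sigma). Finite and smooth for all real v (entire)."""
    d = pbdv(-0.5, -v / sigma)[0]          # D_{-1/2}(-v/sigma)
    return np.exp(-v**2 / (4 * sigma**2)) * d / np.sqrt(2 * sigma)

sigma = 0.1
# Entire: finite straight through v = 0 and below it.
vals = [smoothed_inv_sqrt(v, sigma) for v in (-0.2, 0.0, 0.3)]

# Monte Carlo check against the defining convolution:
# E_xi[ (v + xi)^{-1/2} on the event v + xi > 0 ], xi ~ N(0, sigma^2).
rng = np.random.default_rng(0)
xi = rng.normal(0.0, sigma, 1_000_000)
u = 0.3 + xi
mc = np.mean(np.where(u > 0, 1.0 / np.sqrt(np.abs(u) + 1e-300), 0.0))
```

For small σ relative to v (e.g. σ = 0.1 at v = 1) the smoothed value is within about 0.4% of 1/√v, illustrating the σ → 0 recovery.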
See how the smoothed normalization compares to standard 1/√v. The smoothing width σ controls the tradeoff: small σ is closer to the original but retains near-singular behavior; large σ is fully smooth but deviates more.
| Method | Scale factor | Singularity | Entire? | Notes |
|---|---|---|---|---|
| LayerNorm (Ba et al. 2016) | 1/√(Var(x)+ε) | Branch point at Var(x) = −ε | No | ε shifts the singularity off zero but doesn't remove it. In the complex plane, it still limits the radius. |
| RMSNorm (Zhang & Sennrich 2019) | 1/√(mean(x²)+ε) | Branch point at mean(x²) = −ε | No | Same structure as LayerNorm without centering. Same singularity class. |
| BatchNorm (Ioffe & Szegedy 2015) | 1/√(Var_batch+ε) | Branch point + running stats | No | Running statistics add extra fragility — can become corrupted during spikes (see paper Fig. 10). |
| ε-clamped (common implementation) | 1/√max(ε, v) | Kink at v = ε (not analytic) | No | Replaces the branch point with a ReLU-like kink. Piecewise, not analytic — strictly worse from the radius lens. |
| Weierstrass-smoothed (this paper, proposed) | C·exp(−v²/4σ²)·D₋½(−v/σ) | None (entire) | Yes | Gaussian convolution of 1/√max(0,v). Parabolic cylinder function — entire. No finite ghost barrier. |
| Goldschmidt approx (this paper, route 3) | N-step polynomial iteration | None (polynomial) | Yes | Fixed iteration count = polynomial = entire. Hardware-friendly. Accuracy depends on N. |
| Randomized smoothing (this paper, route 2) | E_{ξ~N(0,σ²)}[(max(0,v+ξ)+ε)^−½] | Stochastic — entire in expectation | Stoch. | Recovers the analytic function in expectation. Adds noise to the forward pass — natural regularizer. |
Deep-learning frameworks lack a native differentiable parabolic cylinder function. The paper proposes three practical paths to computing the smoothed normalization:
1. Maclaurin polynomial (exact to truncation order). Since f̃σ is entire, its Maclaurin series converges everywhere. Precompute coefficients offline, evaluate a truncated polynomial at runtime. Accuracy is controlled by the truncation order.
2. Randomized smoothing (stochastic · regularizing). Use f̃σ(v) = Eξ[(max(0, v+ξ) + ε)^−1/2] with ξ ~ N(0, σ²). Sample ξ in the forward pass. This recovers the entire function in expectation and acts as a natural regularizer.
3. Goldschmidt approximation (fast · hardware-friendly). Approximate 1/√v by a fixed number of Newton-style polynomial iterations. The N-step sequence is a polynomial (hence entire), and each step maps to fused multiply-adds. The update y ← y · (3 − v·y²) / 2 converges quadratically to 1/√v from a reasonable initial guess; with N = 3–4 steps, accuracy is sufficient for FP16/BF16 training.
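The Newton-style route can be sketched as follows. The step count and the exponent-halving initial guess are illustrative choices for this sketch, not the paper's kernel; a real fused kernel would pick the seed by bit manipulation or table lookup.

```python
import numpy as np

def newton_inv_sqrt(v, n_steps=6):
    """Fixed-count Newton iteration for 1/sqrt(v).

    With the step count fixed, the map from v (and the seed) to the output
    is a polynomial — hence entire, with no finite singularity.
    """
    # Exponent-halving seed: if v = m * 2**e with m in [0.5, 1),
    # then y0 = 2**(-e//2) is within a factor ~2 of 1/sqrt(v).
    m, e = np.frexp(v)
    y = np.ldexp(np.ones_like(v), -e // 2)
    for _ in range(n_steps):
        y = y * (3.0 - v * y * y) / 2.0   # quadratic convergence; two FMAs + mul
    return y

v = np.logspace(-3.0, 3.0, 1001)
rel_err = np.max(np.abs(newton_inv_sqrt(v) * np.sqrt(v) - 1.0))
```

From this crude factor-of-two seed, 6 steps reach near machine precision; a better seed (as in production rsqrt kernels) is what lets N = 3–4 suffice at FP16/BF16 accuracy.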
The function 1/√z has a branch cut along the negative real axis, emanating from the branch point at z = 0. Any path in parameter space that causes the variance v to approach zero from any direction hits this singularity and caps the Taylor convergence radius.