Analytic Normalization: Removing the Singularity at Zero

LayerNorm and RMSNorm divide by √v — which has a branch-point singularity at v = 0. The Weierstrass transform (Gaussian convolution) replaces it with an entire function, eliminating the finite ghost barrier. An interactive companion to Ghosts of Softmax, Appendix 3.

Piyush Sao · Oak Ridge National Laboratory · github.com/piyush314/ghosts-of-softmax

01 The singularity in standard normalization

LayerNorm and RMSNorm both rely on a scale factor of the form f(v) = 1/√v, where v is a variance or mean-square quantity computed from activations. On the positive reals this is perfectly well-behaved. But in the complex plane, 1/√v has a branch-point singularity at v = 0.

During the forward pass, v is always positive. But from the convergence-radius perspective, the singularity at v = 0 is still relevant — it limits the Taylor convergence radius of any function that depends on 1/√v. If a neuron's variance evolves as v(t) = v₀ + t·v̇ along the update direction, the singularity lies at distance |v₀|/|v̇| from the current point. For neurons with small variance (common in normalized networks), this can be very close.

Standard:  f(v) = 1/√v  →  branch point at v = 0  →  ρ = |v₀|/|v̇|
Smoothed:  f̃σ(v) = Weierstrassσ[1/√max(0,·)](v)  →  entire  →  ρ = ∞
Why this matters: In normalized networks, many preactivations cluster near zero. The normalization singularity at v = 0 can become the binding constraint on ρ — tighter than the softmax ghosts at distance π/Δa. The tiny-transformer experiment in the paper (Table 2) found that the FFN radius was the bottleneck in 94% of early-training steps.
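The radius limit is easy to check directly. A minimal sketch (plain Python, function names are mine): sum the binomial Taylor series of 1/√v around v₀ = 1 and watch the partial sums converge inside |v − v₀| < v₀ but blow up outside, even at points where 1/√v itself is perfectly finite:

```python
import math

def inv_sqrt_partial_sum(v0, h, n_terms):
    """Partial sum of the Taylor series of 1/sqrt(v0 + h) around v0:
    (1/sqrt(v0)) * sum_n binom(-1/2, n) (h/v0)^n, convergence radius |h| < v0."""
    total, coeff = 0.0, 1.0
    for n in range(n_terms):
        total += coeff * (h / v0) ** n
        coeff *= (-0.5 - n) / (n + 1)   # recurrence for the generalized binomial coefficient
    return total / math.sqrt(v0)

v0 = 1.0
inside = inv_sqrt_partial_sum(v0, 0.5, 80)    # |h| < v0: converges
print(abs(inside - 1 / math.sqrt(1.5)))       # tiny residual
outside = inv_sqrt_partial_sum(v0, 1.5, 80)   # |h| > v0: diverges
print(abs(outside))                           # enormous, even though 1/sqrt(2.5) is finite
```

The divergent case makes the "ghost" concrete: nothing is wrong with the function at v = 2.5, but the singularity at v = 0 still caps the radius of the expansion around v₀ = 1.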

02 The fix: Gaussian convolution

The Weierstrass transform convolves f with a Gaussian of width σ, smearing out the singularity:

f̃σ(v) = (1/(σ√(2π))) ∫₀^∞ (1/√t) · exp(−(v−t)²/(2σ²)) dt

Because the Gaussian kernel is entire in v and the integral converges locally uniformly, it defines an entire function of v. The singularity at v = 0 is gone. In closed form:

f̃σ(v) = C · exp(−v²/(4σ²)) · D₋₁/₂(−v/σ)

where D₋₁/₂ is a parabolic cylinder function — a solution of Weber's equation. Parabolic cylinder functions are entire: no poles, no branch cuts. As σ → 0, f̃σ recovers 1/√v for v > 0.

The same principle as RIA. RIA removes ReLU's kink by convolving with a Gaussian. Here, we remove 1/√v's branch point by the same mechanism. Gaussian convolution is a universal singularity eraser.
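A quick numerical sanity check (a quadrature sketch under my own discretization choices, not the paper's implementation): evaluate the convolution integral directly and confirm the result is finite at v = 0 while matching 1/√v away from zero.

```python
import math

def smoothed_inv_sqrt(v, sigma, n=20000):
    """Weierstrass transform of 1/sqrt(max(0, t)) at v, by midpoint quadrature."""
    t_max = max(v, 0.0) + 12 * sigma          # Gaussian tail negligible beyond this
    dt = t_max / n
    norm = 1.0 / (sigma * math.sqrt(2 * math.pi))
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * dt                    # midpoints avoid the 1/sqrt(t) endpoint
        total += (1.0 / math.sqrt(t)) * math.exp(-(v - t) ** 2 / (2 * sigma ** 2)) * dt
    return norm * total

sigma = 0.1
print(smoothed_inv_sqrt(0.0, sigma))   # finite, unlike 1/sqrt(0)
print(smoothed_inv_sqrt(1.0, sigma))   # close to 1/sqrt(1), up to an O(sigma^2) correction
```

For v ≫ σ the Gaussian barely feels the truncation at t = 0, so the smoothed value tracks 1/√v; near v = 0 it flattens out to a finite peak instead of diverging.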

03 Interactive comparison

See how the smoothed normalization compares to standard 1/√v. The smoothing width σ controls the tradeoff: small σ is closer to the original but retains near-singular behavior; large σ is fully smooth but deviates more.

[Interactive figure: normalization scale factor f(v) and its derivative. Curves: standard 1/√v (singular at 0); smoothed f̃σ(v) (entire); ε-clamped 1/√max(ε, v). Derivatives: the standard −1/(2v^(3/2)) blows up near 0, while the smoothed f̃′σ(v) stays bounded. Radii: the standard ρ = v/|v̇| vanishes near 0; the smoothed version has no finite singularity barrier; the softmax radius ρa = π/Δa is shown as a reference line.]

04 Normalization variants at a glance

Method · Scale factor · Singularity · Entire? · Notes

LayerNorm (Ba et al. 2016)
  Scale factor: 1/√(Var(x)+ε) · Singularity: branch point at Var(x) = −ε · Entire: no
  ε shifts the singularity off zero but doesn't remove it; in the complex plane it still limits the radius.

RMSNorm (Zhang & Sennrich 2019)
  Scale factor: 1/√(mean(x²)+ε) · Singularity: branch point at mean(x²) = −ε · Entire: no
  Same structure as LayerNorm without centering; same singularity class.

BatchNorm (Ioffe & Szegedy 2015)
  Scale factor: 1/√(Var_batch+ε) · Singularity: branch point, plus running statistics · Entire: no
  Running statistics add extra fragility: they can become corrupted during spikes (see paper Fig. 10).

ε-clamped (common implementation)
  Scale factor: 1/√max(ε, v) · Singularity: kink at v = ε · Entire: no (not analytic)
  Replaces the branch point with a ReLU-like kink. Piecewise, not analytic; strictly worse through the radius lens.

Weierstrass-smoothed (this paper, proposed)
  Scale factor: C·exp(−v²/(4σ²))·D₋₁/₂(−v/σ) · Singularity: none · Entire: yes
  Gaussian convolution of 1/√max(0, v). The parabolic cylinder function is entire: no finite ghost barrier.

Goldschmidt approximation (this paper, route 3)
  Scale factor: N-step polynomial iteration · Singularity: none (polynomial) · Entire: yes
  A fixed iteration count gives a polynomial, hence an entire function. Hardware-friendly; accuracy depends on N.

Randomized smoothing (this paper, route 2)
  Scale factor: E_ξ[(max(0, v+ξ) + ε)^(−1/2)], ξ ~ N(0, σ²) · Singularity: none in expectation · Entire: stochastic, entire in expectation
  Recovers the analytic function in expectation. Adds noise to the forward pass, which acts as a natural regularizer.

05 Three implementation routes

Deep-learning frameworks lack a native differentiable parabolic cylinder function. The paper proposes three practical paths to computing the smoothed normalization:

Taylor series

Since f̃σ is entire, its Maclaurin series converges everywhere. Precompute coefficients offline, evaluate a truncated polynomial at runtime. Accuracy controlled by truncation order.

Exact (to truncation order)
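A sketch of the offline/runtime split, under my own choices (NumPy, a Chebyshev least-squares fit as a stand-in for the exact Maclaurin coefficients, which would require the parabolic cylinder function; σ, degree, and range are illustrative):

```python
import numpy as np

def smoothed_values(vs, sigma=0.3, n=4000):
    """f~_sigma on a grid by midpoint quadrature of the Weierstrass integral (offline step)."""
    t_max = vs.max() + 12 * sigma            # Gaussian tail negligible beyond this
    t = (np.arange(n) + 0.5) * t_max / n     # midpoints avoid the 1/sqrt(t) endpoint
    dt = t_max / n
    g = np.exp(-(vs[:, None] - t[None, :]) ** 2 / (2 * sigma ** 2))
    return (g / np.sqrt(t)).sum(axis=1) * dt / (sigma * np.sqrt(2 * np.pi))

# Offline: fit a degree-16 polynomial on the working range of v.
grid = np.linspace(0.0, 2.0, 200)
coeffs = np.polynomial.chebyshev.chebfit(grid, smoothed_values(grid), 16)

# Runtime: normalization is now just polynomial evaluation.
print(np.polynomial.chebyshev.chebval(1.0, coeffs))   # close to 1/sqrt(1), up to O(sigma^2)
```

The runtime cost is a handful of fused multiply-adds per element, and the polynomial is entire by construction; all the special-function work happens offline.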

Randomized smoothing

Use f̃σ(v) = E_ξ[(max(0, v+ξ) + ε)^(−1/2)] with ξ ~ N(0, σ²). Sample ξ in the forward pass. Recovers the entire function in expectation. Acts as a natural regularizer.

Stochastic · regularizing
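A minimal sketch of the estimator (plain Python; the ε and σ values are my illustrative choices, not the paper's):

```python
import math, random

def randomized_inv_sqrt(v, sigma=0.1, eps=1e-2, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_xi[(max(0, v + xi) + eps)^(-1/2)], xi ~ N(0, sigma^2).
    Training would draw a single xi per forward pass; many samples are used here
    only to exhibit the estimator's mean."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        xi = rng.gauss(0.0, sigma)
        acc += (max(0.0, v + xi) + eps) ** -0.5
    return acc / n_samples

print(randomized_inv_sqrt(1.0))   # close to 1/sqrt(1)
print(randomized_inv_sqrt(0.0))   # finite: the eps term caps the blow-up
```

Each forward pass sees a single-sample version of this mean, so the noise it injects plays the same role as other stochastic regularizers.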

Goldschmidt iterations

Approximate 1/√v by a fixed number of Newton-style polynomial iterations. The N-step sequence is a polynomial (hence entire). Hardware-friendly — maps to fused multiply-add.

Fast · hardware-friendly
The Goldschmidt route is particularly appealing because it requires no special functions, is differentiable by default (it's just polynomial arithmetic), and can run at fixed latency — ideal for accelerator pipelines. The iteration y ← y · (3 − v·y²) / 2 converges quadratically to 1/√v from a reasonable initial guess. With N = 3–4 steps, accuracy is sufficient for FP16/BF16 training.
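The iteration from the paragraph above, as a self-contained sketch (the crude seed is my illustrative choice; real kernels seed from a bit trick or a small lookup table and then need only 3–4 steps):

```python
def goldschmidt_inv_sqrt(v, n_steps=6):
    """Fixed-count Newton/Goldschmidt-style iteration for 1/sqrt(v).
    With n_steps fixed, the whole map is a polynomial in v, hence entire."""
    y = 1.0 / (1.0 + v)                   # rough seed, valid for v > 0
    for _ in range(n_steps):
        y = y * (3.0 - v * y * y) * 0.5   # quadratically convergent step
    return y

print(goldschmidt_inv_sqrt(4.0))    # close to 0.5
print(goldschmidt_inv_sqrt(0.25))   # close to 2.0
```

Note that each step is three multiplies and one subtract, all expressible as fused multiply-adds, which is why the route maps so cleanly onto accelerator pipelines.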

06 The singularity in the complex plane

The function 1/√z has a branch cut along the negative real axis, emanating from the branch point at z = 0. Any path in parameter space that causes the variance v to approach zero from any direction hits this singularity and caps the Taylor convergence radius.
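The jump across the cut is easy to see numerically: evaluating 1/√z just above and just below the negative real axis gives values differing by about 2i (plain Python, using the principal branch that cmath implements):

```python
import cmath

# 1/sqrt(z) evaluated just above and just below the cut at z = -1
above = 1 / cmath.sqrt(complex(-1.0, +1e-9))   # approximately -i
below = 1 / cmath.sqrt(complex(-1.0, -1e-9))   # approximately +i
print(above, below, abs(above - below))        # discontinuity of size about 2
```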

[Interactive figure: the complex plane, branch cut vs entire function. Shows the branch cut of 1/√z along the negative real axis, the branch point at z = 0, the Taylor convergence disk of radius v₀, and the smoothed version, which is entire with no finite barrier.]
After smoothing, the disk disappears. The Weierstrass-smoothed version is entire — there is no finite singularity in the complex plane, so the normalization layer contributes no ghost barrier. The remaining constraint is purely from the softmax (ρa = π/Δa) or from non-entire activations.