Tutorial 1: Adam + rho versus Fixed-LR Adam¶

This notebook compares two ways to run Adam at the same base learning rates:

  • plain fixed-LR Adam,
  • the exact directional rho controller.

The main lesson is that the exact controller should act on Adam's actual proposed step:

  • let Adam propose an update direction and step norm,
  • compute rho_a(v) along that proposed direction,
  • rescale the step only if the proposal is too large relative to rho_a.

The baseline is ordinary Adam with a fixed learning rate and no controller.

Both methods use the same model, data, and base learning rates. The key question is whether the exact controller keeps the realized normalized step ratio r = tau / rho_a near the geometric target while fixed-LR Adam overshoots.

In [1]:
from pathlib import Path
import importlib.util
import subprocess
import sys


def in_colab():
    try:
        import google.colab  # type: ignore
        return True
    except ImportError:
        return False


def find_repo_root():
    cwd = Path.cwd().resolve()
    for base in [cwd, *cwd.parents]:
        if (base / "src" / "ghosts").exists() and (
            (base / "tutorials").exists() or (base / "experiments").exists()
        ):
            return base

    if in_colab():
        repo = Path('/content/ghosts-of-softmax')
        if not repo.exists():
            subprocess.run(
                [
                    'git', 'clone', '--depth', '1',
                    'https://github.com/piyush314/ghosts-of-softmax.git',
                    str(repo),
                ],
                check=True,
            )
        return repo

    raise RuntimeError(
        'Run this notebook from inside the ghosts-of-softmax repository, '
        'or open it in Google Colab so the setup cell can clone the repo automatically.'
    )


REPO = find_repo_root()
SRC = REPO / "src"
if str(SRC) not in sys.path:
    sys.path.insert(0, str(SRC))


def load_module(name, relative_path):
    path = REPO / relative_path
    module_dir = str(path.parent)
    if module_dir not in sys.path:
        sys.path.insert(0, module_dir)
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module


OUTPUT_ROOT = Path('/tmp/ghosts-of-softmax-notebooks')
OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)
print(f"Repo root: {REPO}")
print(f"Notebook outputs: {OUTPUT_ROOT}")
Repo root: /home/runner/work/ghosts-of-softmax/ghosts-of-softmax
Notebook outputs: /tmp/ghosts-of-softmax-notebooks

1. Setup¶

We keep this tutorial small and deliberate:

  • one architecture (MLP),
  • one seed,
  • three Adam base learning rates,
  • one optional multiseed block at the end.

The three base learning rates play different roles:

  • 0.001: safely below the radius limit,
  • 0.5: large enough that the exact controller should intervene,
  • 2.0: intentionally absurd, so the difference between the two methods becomes obvious.
In [2]:
import math
import random

import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, display
from ghosts.plotting import add_end_labels, add_subtitle, apply_plot_style, finish_figure, format_percent_axis


def display_table(rows, columns=None, formats=None):
    if not rows:
        print("No rows to display.")
        return
    if columns is None:
        columns = list(rows[0].keys())
    formats = formats or {}
    parts = ['<table style="border-collapse:collapse">', '<thead><tr>']
    for col in columns:
        parts.append(f'<th style="text-align:left;padding:4px 8px;border-bottom:1px solid #ccc">{col}</th>')
    parts.append('</tr></thead><tbody>')
    for row in rows:
        parts.append('<tr>')
        for col in columns:
            value = row.get(col, '')
            if col in formats:
                value = formats[col](value)
            parts.append(f'<td style="padding:4px 8px">{value}</td>')
        parts.append('</tr>')
    parts.append('</tbody></table>')
    display(HTML(''.join(parts)))
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from torch.func import functional_call, jvp
from torch.utils.data import DataLoader, TensorDataset


torch.set_num_threads(1)

DEVICE = torch.device("cpu")
LOW_LR = 0.001
MID_LR = 0.5
HIGH_LR = 2.0
LRS = [LOW_LR, MID_LR, HIGH_LR]
EPOCHS = 20
BATCH_SIZE = 128
TARGET_R = 1.0
SEED = 7

RUN_MULTI_SEED = False
MULTI_SEEDS = [0, 1, 2, 3, 4]

PALETTE = {
    "exact": "#006BA2",
    "fixed": "#E3120B",
    "dark": "#3D3D3D",
}


def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

2. Model and data¶

As in Tutorial 0, we use the sklearn digits dataset and a small MLP. That keeps the comparison focused on the controller logic rather than on architectural details.

In [3]:
class MLP(nn.Module):
    def __init__(self, width=128):
        super().__init__()
        self.fc1 = nn.Linear(64, width)
        self.fc2 = nn.Linear(width, width)
        self.fc3 = nn.Linear(width, 10)

    def forward(self, x):
        x = F.gelu(self.fc1(x))
        x = F.gelu(self.fc2(x))
        return self.fc3(x)


def make_model():
    return MLP().to(DEVICE)


def build_digits_loaders(seed: int, batch_size: int = 128):
    digits = load_digits()
    X = digits.data.astype(np.float32) / 16.0
    y = digits.target.astype(np.int64)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    train_ds = TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
    test_ds = TensorDataset(torch.tensor(X_test), torch.tensor(y_test))
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_ds, batch_size=256, shuffle=False)
    return train_loader, test_loader


train_loader, test_loader = build_digits_loaders(SEED, BATCH_SIZE)
len(train_loader), len(test_loader)
Out[3]:
(11, 2)
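Those batch counts are easy to verify by hand (this assumes sklearn's digits dataset with 1797 samples and train_test_split's convention of rounding the test split up):

```python
import math

n = 1797                      # samples in sklearn's digits dataset
n_test = math.ceil(n * 0.25)  # train_test_split rounds the test split up -> 450
n_train = n - n_test          # -> 1347

# DataLoader yields ceil(n / batch_size) batches when drop_last=False (the default).
print(math.ceil(n_train / 128), math.ceil(n_test / 256))  # -> 11 2
```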

3. The exact Adam + $\rho$ controller¶

For Adam, the controller must act on Adam's actual proposed step, not on the raw gradient alone.

Write Adam's next unit-learning-rate proposal as

$$ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, $$

$$ \hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, $$

$$ u_{\mathrm{unit}} = \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon} $$

(and in this notebook the code also includes Adam's weight-decay term when present). This u_unit is the optimizer's proposed update for learning rate 1.

Once that proposal is available, the exact controller is

$$ \tau_{\mathrm{raw}} = \eta_{\mathrm{base}} \lVert u_{\mathrm{unit}} \rVert, \qquad v = -\frac{u_{\mathrm{unit}}}{\lVert u_{\mathrm{unit}} \rVert}, \qquad \rho_{\mathrm{batch}} = \rho_a(v), $$

$$ \eta_{\mathrm{eff}} = \min\!\left(\eta_{\mathrm{base}}, \frac{r_{\mathrm{target}}\,\rho_{\mathrm{batch}}}{\lVert u_{\mathrm{unit}} \rVert}\right). $$

So the direction always stays Adam's direction. Only the step length is changed.
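A toy numeric sketch of that clamp (illustrative numbers only, not values from the runs below):

```python
def eta_eff(eta_base, unit_norm, rho_batch, r_target=1.0):
    """Clamp the LR so the realized step tau = eta * ||u_unit||
    never exceeds r_target * rho_a along Adam's proposed direction."""
    return min(eta_base, r_target * rho_batch / unit_norm)

# Safe proposal: tau_raw = 0.001 * 4.0 = 0.004 <= rho, so the LR is untouched.
print(eta_eff(0.001, unit_norm=4.0, rho_batch=0.3))  # -> 0.001

# Aggressive proposal: tau_raw = 2.0 * 4.0 = 8.0 >> rho, so the step is
# shortened to land on the boundary: eta_eff * 4.0 = 0.3 = rho.
print(eta_eff(2.0, unit_norm=4.0, rho_batch=0.3))    # -> 0.075
```

Note that the direction never changes; only the scalar learning rate written into the parameter groups does.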

Code mapping for this notebook:

  • u_unit comes from adam_unit_step(...)
  • ||u_unit|| is unit_norm
  • \rho_a(v) is computed by batch_rho_jvp(...)
  • eta_eff is written into Adam's parameter groups before opt.step()

4. Fixed learning-rate Adam¶

The baseline is ordinary Adam with a fixed base learning rate and no rho controller.

opt = torch.optim.Adam(model.parameters(), lr=base_lr)
opt.step()

This baseline is the right comparison because it isolates exactly what the controller changes:

  • the optimizer direction is still Adam's direction,
  • the exact controller only shortens the step when tau_raw > r_target * rho_a,
  • if the base learning rate is already safe, the exact controller behaves like fixed-LR Adam.
In [4]:
def evaluate(model, loader):
    model.eval()
    total_loss = 0.0
    total_correct = 0
    total_count = 0
    with torch.no_grad():
        for xb, yb in loader:
            xb = xb.to(DEVICE)
            yb = yb.to(DEVICE)
            logits = model(xb)
            loss = F.cross_entropy(logits, yb)
            total_loss += float(loss.item()) * len(xb)
            total_correct += int((logits.argmax(dim=1) == yb).sum())
            total_count += len(xb)
    return total_loss / total_count, total_correct / total_count


def adam_unit_step(model, opt):
    group_by_param = {}
    for group in opt.param_groups:
        for p in group['params']:
            group_by_param[p] = group

    step_vecs = []
    sq = 0.0
    for p in model.parameters():
        if p not in group_by_param or p.grad is None:
            z = torch.zeros(p.numel(), device=p.device, dtype=p.dtype)
            step_vecs.append(z)
            continue

        group = group_by_param[p]
        beta1, beta2 = group['betas']
        eps = float(group['eps'])
        wd = float(group.get('weight_decay', 0.0))
        maximize = bool(group.get('maximize', False))

        g = p.grad.detach()
        if maximize:
            g = -g

        state = opt.state.get(p, {})
        if 'exp_avg' in state:
            m_prev = state['exp_avg']
            v_prev = state['exp_avg_sq']
            step_prev_obj = state.get('step', 0)
            step_prev = int(step_prev_obj.item()) if torch.is_tensor(step_prev_obj) else int(step_prev_obj)
        else:
            m_prev = torch.zeros_like(p)
            v_prev = torch.zeros_like(p)
            step_prev = 0

        step_t = step_prev + 1
        m_t = beta1 * m_prev + (1.0 - beta1) * g
        v_t = beta2 * v_prev + (1.0 - beta2) * (g * g)
        bc1 = 1.0 - (beta1 ** step_t)
        bc2 = 1.0 - (beta2 ** step_t)
        denom = v_t.sqrt() / math.sqrt(max(bc2, 1e-16)) + eps
        adam_term = (m_t / max(bc1, 1e-16)) / denom
        update = adam_term
        if wd != 0.0:
            update = update + wd * p.detach()

        step_vecs.append(update.flatten())
        sq += float((update * update).sum().item())

    step_vec = torch.cat(step_vecs)
    norm = math.sqrt(sq)
    return step_vec, norm


def batch_rho_jvp(model, xb, v_flat):
    params = dict(model.named_parameters())
    tangents = {}
    offset = 0
    for name, p in params.items():
        numel = p.numel()
        tangents[name] = v_flat[offset:offset + numel].view_as(p)
        offset += numel

    def fwd(pdict):
        return functional_call(model, pdict, (xb,))

    was_training = model.training
    model.eval()
    _, dlogits = jvp(fwd, (params,), (tangents,))
    if was_training:
        model.train()

    spread = dlogits.max(dim=1).values - dlogits.min(dim=1).values
    delta_a = float(spread.max().item())
    return math.pi / max(delta_a, 1e-12)


def run_training(mode: str, base_lr: float, seed: int):
    set_seed(seed)
    train_loader, test_loader = build_digits_loaders(seed, BATCH_SIZE)
    model = make_model()
    opt = torch.optim.Adam(model.parameters(), lr=base_lr)

    history = {
        'train_loss': [],
        'test_acc': [],
        'max_r': [],
        'mean_r': [],
        'mean_eff_lr': [],
        'mean_rho': [],
    }

    for _ in range(EPOCHS):
        model.train()
        batch_losses = []
        batch_r = []
        batch_eff_lr = []
        batch_rho = []

        for xb, yb in train_loader:
            xb = xb.to(DEVICE)
            yb = yb.to(DEVICE)
            opt.zero_grad(set_to_none=True)
            logits = model(xb)
            loss = F.cross_entropy(logits, yb)
            loss.backward()

            unit_step_vec, unit_norm = adam_unit_step(model, opt)
            if unit_norm < 1e-12:
                eff_lr = float(base_lr)
                for group in opt.param_groups:
                    group['lr'] = eff_lr
                opt.step()
                rho_a = float('inf')
                r = 0.0
            else:
                v_dir = -unit_step_vec / unit_norm
                rho_a = batch_rho_jvp(model, xb, v_dir)
                if mode == 'exact':
                    eff_lr = min(base_lr, TARGET_R * rho_a / unit_norm)
                elif mode == 'fixed':
                    eff_lr = float(base_lr)
                else:
                    raise ValueError(mode)

                for group in opt.param_groups:
                    group['lr'] = eff_lr
                opt.step()

                tau = eff_lr * unit_norm
                r = tau / rho_a if rho_a > 0 else float('inf')

            batch_losses.append(float(loss.item()))
            batch_r.append(float(r))
            batch_eff_lr.append(float(eff_lr))
            batch_rho.append(float(rho_a if math.isfinite(rho_a) else 0.0))

        _, test_acc = evaluate(model, test_loader)
        history['train_loss'].append(float(np.mean(batch_losses)))
        history['test_acc'].append(float(test_acc))
        history['max_r'].append(float(np.max(batch_r)))
        history['mean_r'].append(float(np.mean(batch_r)))
        history['mean_eff_lr'].append(float(np.mean(batch_eff_lr)))
        history['mean_rho'].append(float(np.mean(batch_rho)))

    return history


RUN_SPECS = [
    {'key': 'exact_low', 'mode': 'exact', 'base_lr': LOW_LR, 'label': f'exact Adam+rho (LR={LOW_LR})'},
    {'key': 'exact_mid', 'mode': 'exact', 'base_lr': MID_LR, 'label': f'exact Adam+rho (LR={MID_LR})'},
    {'key': 'exact_high', 'mode': 'exact', 'base_lr': HIGH_LR, 'label': f'exact Adam+rho (LR={HIGH_LR})'},
    {'key': 'fixed_low', 'mode': 'fixed', 'base_lr': LOW_LR, 'label': f'fixed-LR Adam (LR={LOW_LR})'},
    {'key': 'fixed_mid', 'mode': 'fixed', 'base_lr': MID_LR, 'label': f'fixed-LR Adam (LR={MID_LR})'},
    {'key': 'fixed_high', 'mode': 'fixed', 'base_lr': HIGH_LR, 'label': f'fixed-LR Adam (LR={HIGH_LR})'},
]


def final_summary_row(spec, hist):
    return {
        'run': spec['label'],
        'final_acc': hist['test_acc'][-1],
        'peak_r': max(hist['max_r']),
        'final_mean_r': hist['mean_r'][-1],
        'final_mean_eff_lr': hist['mean_eff_lr'][-1],
        'final_mean_rho': hist['mean_rho'][-1],
    }

5. Single-seed run¶

We now run the exact controller and fixed-LR Adam on the same model and split.

The comparison to watch is straightforward:

  • the exact controller should drive the realized step ratio toward r ~= 1 once the base learning rate is large enough to matter,
  • fixed-LR Adam may still take realized steps with r >> 1 because it never shortens Adam's proposed step.
In [5]:
single_seed = {}
for spec in RUN_SPECS:
    single_seed[spec['key']] = run_training(spec['mode'], spec['base_lr'], SEED)

summary_rows = [final_summary_row(spec, single_seed[spec['key']]) for spec in RUN_SPECS]
display_table(
    summary_rows,
    columns=['run', 'final_acc', 'peak_r', 'final_mean_r', 'final_mean_eff_lr', 'final_mean_rho'],
    formats={
        'final_acc': lambda x: f"{float(x):.3f}",
        'peak_r': lambda x: f"{float(x):.3f}",
        'final_mean_r': lambda x: f"{float(x):.3f}",
        'final_mean_eff_lr': lambda x: f"{float(x):.6f}",
        'final_mean_rho': lambda x: f"{float(x):.6f}",
    },
)
run | final_acc | peak_r | final_mean_r | final_mean_eff_lr | final_mean_rho
exact Adam+rho (LR=0.001) | 0.969 | 0.358 | 0.133 | 0.001000 | 0.200699
exact Adam+rho (LR=0.5) | 0.980 | 1.000 | 1.000 | 0.023500 | 0.302561
exact Adam+rho (LR=2.0) | 0.980 | 1.000 | 1.000 | 0.023500 | 0.302561
fixed-LR Adam (LR=0.001) | 0.969 | 0.358 | 0.133 | 0.001000 | 0.200699
fixed-LR Adam (LR=0.5) | 0.189 | 4971.622 | 53.704 | 0.500000 | 0.023873
fixed-LR Adam (LR=2.0) | 0.100 | 283819.042 | 0.173 | 2.000000 | 2.865052

6. First look: loss and accuracy¶

Start with the most legible outputs:

  • training loss on a log scale,
  • test accuracy on a linear scale.

Reading key:

  • color = method (exact vs fixed LR),
  • line style = base learning rate,
  • markers help when curves overlap.
In [6]:
apply_plot_style(font_size=10, title_size=12, label_size=10, tick_size=9)

epochs = np.arange(1, EPOCHS + 1)
style_map = {
    'exact_low': {'color': PALETTE['exact'], 'ls': '-', 'marker': 'o', 'alpha': 0.82},
    'exact_mid': {'color': PALETTE['exact'], 'ls': '--', 'marker': 'o', 'alpha': 0.82},
    'exact_high': {'color': PALETTE['exact'], 'ls': '-.', 'marker': 'o', 'alpha': 0.82},
    'fixed_low': {'color': PALETTE['fixed'], 'ls': '-', 'marker': 's', 'alpha': 0.82},
    'fixed_mid': {'color': PALETTE['fixed'], 'ls': '--', 'marker': 's', 'alpha': 0.82},
    'fixed_high': {'color': PALETTE['fixed'], 'ls': '-.', 'marker': 's', 'alpha': 0.82},
}

fig, axes = plt.subplots(1, 2, figsize=(11.8, 4.4), sharex=True)
loss_labels = []
acc_labels = []

for spec in RUN_SPECS:
    key = spec['key']
    hist = single_seed[key]
    style = style_map[key]
    axes[0].semilogy(
        epochs, hist['train_loss'],
        color=style['color'], ls=style['ls'], lw=2.2,
        marker=style['marker'], ms=4.5, markevery=2, alpha=style['alpha']
    )
    axes[1].plot(
        epochs, hist['test_acc'],
        color=style['color'], ls=style['ls'], lw=2.2,
        marker=style['marker'], ms=4.5, markevery=2, alpha=style['alpha']
    )
    weight = 'bold' if key.startswith('exact') else None
    loss_labels.append((hist['train_loss'][-1], spec['label'], style['color'], weight))
    acc_labels.append((hist['test_acc'][-1], spec['label'], style['color'], weight))

axes[0].set_title('Exact step control keeps Adam stable at aggressive LR', loc='left', fontweight='bold')
add_subtitle(axes[0], "Both runs use Adam; the controller only shortens the proposed step when it is too long.", fontsize=9)
axes[1].set_title('Accuracy stays higher when the controller enforces tau <= rho', loc='left', fontweight='bold')
add_subtitle(axes[1], 'Fixed-LR Adam overshoots once the proposed step exceeds the local radius.', fontsize=9)
axes[0].set_ylabel('training loss')
axes[1].set_ylabel('test accuracy')
axes[1].set_ylim(0.0, 1.02)
format_percent_axis(axes[1], xmax=1.0)

for ax in axes:
    ax.set_xlabel('epoch')
    ax.grid(True, alpha=0.25)

add_end_labels(axes[0], epochs, loss_labels, fontsize=7)
add_end_labels(axes[1], epochs, acc_labels, fontsize=7)
fig.suptitle('Exact directional Adam controller versus fixed-LR Adam', y=0.99, fontsize=12, fontweight='bold')
finish_figure(fig, rect=[0, 0, 1, 0.94])
plt.show()
[Figure: training loss (log scale, left) and test accuracy (right) per epoch for all six runs; exact-controller curves in blue, fixed-LR Adam in red.]

7. Diagnostic view: effective learning rate and realized step ratio¶

Now inspect the controller quantities directly.

The left panel shows the mean effective learning rate used over each epoch. The right panel shows the realized normalized step size r = tau / rho_a measured from the actual Adam update direction.

This is the most important panel in the notebook: the exact controller should lock onto the geometric target, while fixed-LR Adam can drift far above it.

In [7]:
fig, axes = plt.subplots(1, 2, figsize=(11.8, 4.4), sharex=True)
eta_labels = []
r_labels = []

for spec in RUN_SPECS:
    key = spec['key']
    hist = single_seed[key]
    style = style_map[key]
    axes[0].semilogy(
        epochs, hist['mean_eff_lr'],
        color=style['color'], ls=style['ls'], lw=2.1,
        marker=style['marker'], ms=4.0, markevery=2, alpha=0.72
    )
    axes[1].semilogy(
        epochs, hist['max_r'],
        color=style['color'], ls=style['ls'], lw=2.1,
        marker=style['marker'], ms=4.0, markevery=2, alpha=0.72
    )
    weight = 'bold' if key.startswith('exact') else None
    eta_labels.append((hist['mean_eff_lr'][-1], spec['label'], style['color'], weight))
    r_labels.append((hist['max_r'][-1], spec['label'], style['color'], weight))

axes[0].set_title('The controller reduces effective LR only when Adam would overshoot', loc='left', fontweight='bold')
add_subtitle(axes[0], "At safe LR the exact and fixed runs are nearly the same; at larger LR the controller backs off.", fontsize=9)
axes[1].set_title('Exact control keeps the realized step near the target boundary', loc='left', fontweight='bold')
add_subtitle(axes[1], 'Fixed-LR Adam can drive r far above 1 at the same base learning rate.', fontsize=9)
axes[0].set_ylabel('effective learning rate')
axes[1].set_ylabel('max r per epoch')
axes[1].axhline(TARGET_R, color=PALETTE['dark'], ls=':', lw=1.2)

for ax in axes:
    ax.set_xlabel('epoch')
    ax.grid(True, alpha=0.25, which='both')

add_end_labels(axes[0], epochs, eta_labels, fontsize=7)
add_end_labels(axes[1], epochs, r_labels, fontsize=7)
fig.suptitle('Adam controller diagnostics', y=0.99, fontsize=12, fontweight='bold')
finish_figure(fig, rect=[0, 0, 1, 0.94])
plt.show()
[Figure: mean effective learning rate (left) and max realized step ratio r per epoch (right), with a dotted line at the r = 1 target.]

8. Summary table¶

The figure gives the visual story. The table below shows the same comparison numerically.

In [8]:
# Reuse the single-seed histories computed in Section 5 instead of retraining.

summary_rows = [final_summary_row(spec, single_seed[spec['key']]) for spec in RUN_SPECS]
display_table(
    summary_rows,
    columns=['run', 'final_acc', 'peak_r', 'final_mean_r', 'final_mean_eff_lr', 'final_mean_rho'],
    formats={
        'final_acc': lambda x: f"{float(x):.3f}",
        'peak_r': lambda x: f"{float(x):.3f}",
        'final_mean_r': lambda x: f"{float(x):.3f}",
        'final_mean_eff_lr': lambda x: f"{float(x):.6f}",
        'final_mean_rho': lambda x: f"{float(x):.6f}",
    },
)
run | final_acc | peak_r | final_mean_r | final_mean_eff_lr | final_mean_rho
exact Adam+rho (LR=0.001) | 0.969 | 0.358 | 0.133 | 0.001000 | 0.200699
exact Adam+rho (LR=0.5) | 0.980 | 1.000 | 1.000 | 0.023500 | 0.302561
exact Adam+rho (LR=2.0) | 0.980 | 1.000 | 1.000 | 0.023500 | 0.302561
fixed-LR Adam (LR=0.001) | 0.969 | 0.358 | 0.133 | 0.001000 | 0.200699
fixed-LR Adam (LR=0.5) | 0.189 | 4971.622 | 53.704 | 0.500000 | 0.023873
fixed-LR Adam (LR=2.0) | 0.100 | 283819.042 | 0.173 | 2.000000 | 2.865052

9. Optional: multiseed check¶

If you want a more stable comparison, set RUN_MULTI_SEED = True and rerun the next cell. The aggregated table reports mean final accuracy and median peak step ratio across seeds.

In [9]:
if RUN_MULTI_SEED:
    grouped_rows = []
    for spec in RUN_SPECS:
        rows = []
        for seed in MULTI_SEEDS:
            hist = run_training(spec['mode'], spec['base_lr'], seed)
            rows.append({
                'final_acc': hist['test_acc'][-1],
                'peak_r': max(hist['max_r']),
            })
        grouped_rows.append({
            'run': spec['label'],
            'final_acc_mean': float(np.mean([row['final_acc'] for row in rows])),
            'final_acc_std': float(np.std([row['final_acc'] for row in rows], ddof=1)) if len(rows) > 1 else 0.0,
            'peak_r_median': float(np.median([row['peak_r'] for row in rows])),
        })

    display_table(
        grouped_rows,
        columns=['run', 'final_acc_mean', 'final_acc_std', 'peak_r_median'],
        formats={
            'final_acc_mean': lambda x: f"{float(x):.3f}",
            'final_acc_std': lambda x: f"{float(x):.3f}",
            'peak_r_median': lambda x: f"{float(x):.3f}",
        },
    )
else:
    print('Set RUN_MULTI_SEED = True to execute the multiseed check.')
Set RUN_MULTI_SEED = True to execute the multiseed check.

10. What to remember¶

  • The correct way to apply Adam + rho is to build Adam's actual proposed step u_unit first.
  • Then compute rho_a along Adam's proposed direction v = -u_unit / ||u_unit||.
  • Then set the effective learning rate to eta_eff = min(eta_base, r_target * rho_a / ||u_unit||), so the realized step length satisfies tau <= r_target * rho_a.

  • Relative to fixed-LR Adam, the controller keeps Adam's direction and only shortens the step when the proposal is too long for the local radius.

For another optimizer, the template is the same:

  1. let the optimizer propose an update,
  2. extract its direction and norm,
  3. compute rho_a along that direction,
  4. shorten the step only if it is too long.
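A hedged sketch of that four-step template for an arbitrary torch optimizer (the helper names controlled_step and rho_along are illustrative, not part of this notebook): snapshot the parameters, let the optimizer reveal its proposal by stepping once, then rescale the realized update if it is too long.

```python
import torch


def controlled_step(model, opt, rho_along, r_target=1.0):
    """Generic tau <= r_target * rho_a controller (sketch).

    rho_along(v_flat) returns rho_a for a unit direction, e.g. the
    JVP-based batch_rho_jvp above. Assumes loss.backward() has already
    populated the gradients.
    """
    before = [p.detach().clone() for p in model.parameters()]
    opt.step()  # 1. let the optimizer propose an update
    deltas = [p.detach() - b for p, b in zip(model.parameters(), before)]
    flat = torch.cat([d.flatten() for d in deltas])
    tau = float(flat.norm())              # 2. extract the step norm
    if tau < 1e-12:
        return tau, float('inf')
    rho = rho_along(flat / tau)           # 3. rho_a along the realized direction
    if tau > r_target * rho:              # 4. shorten only if too long
        scale = r_target * rho / tau
        with torch.no_grad():
            for p, b, d in zip(model.parameters(), before, deltas):
                p.copy_(b + scale * d)
    return tau, rho
```

Undoing and rescaling after opt.step() is more wasteful than writing eta_eff into the parameter groups as run_training does, but it works even for optimizers whose proposed step is awkward to reconstruct from their internal state.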