Tutorial 1: Adam + rho versus Fixed-LR Adam¶
This notebook compares two ways to run Adam at the same base learning rates:
- plain fixed-LR Adam,
- the exact directional `rho` controller.

The main lesson is that the exact controller should act on Adam's actual proposed step:
- let Adam propose an update direction and step norm,
- compute `rho_a(v)` along that proposed direction,
- rescale the step only if the proposal is too large relative to `rho_a`.
The baseline is ordinary Adam with a fixed learning rate and no controller.
Both methods use the same model, data, and base learning rates. The key question is whether the exact controller keeps the realized normalized step ratio r = tau / rho_a near the geometric target while fixed-LR Adam overshoots.
from pathlib import Path
import importlib.util
import subprocess
import sys
def in_colab():
try:
import google.colab # type: ignore
return True
except ImportError:
return False
def find_repo_root():
cwd = Path.cwd().resolve()
for base in [cwd, *cwd.parents]:
if (base / "src" / "ghosts").exists() and (
(base / "tutorials").exists() or (base / "experiments").exists()
):
return base
if in_colab():
repo = Path('/content/ghosts-of-softmax')
if not repo.exists():
subprocess.run(
[
'git', 'clone', '--depth', '1',
'https://github.com/piyush314/ghosts-of-softmax.git',
str(repo),
],
check=True,
)
return repo
raise RuntimeError(
'Run this notebook from inside the ghosts-of-softmax repository, '
'or open it in Google Colab so the setup cell can clone the repo automatically.'
)
REPO = find_repo_root()
SRC = REPO / "src"
if str(SRC) not in sys.path:
sys.path.insert(0, str(SRC))
def load_module(name, relative_path):
path = REPO / relative_path
module_dir = str(path.parent)
if module_dir not in sys.path:
sys.path.insert(0, module_dir)
spec = importlib.util.spec_from_file_location(name, path)
module = importlib.util.module_from_spec(spec)
sys.modules[name] = module
spec.loader.exec_module(module)
return module
OUTPUT_ROOT = Path('/tmp/ghosts-of-softmax-notebooks')
OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)
print(f"Repo root: {REPO}")
print(f"Notebook outputs: {OUTPUT_ROOT}")
Repo root: /home/runner/work/ghosts-of-softmax/ghosts-of-softmax
Notebook outputs: /tmp/ghosts-of-softmax-notebooks
1. Setup¶
We keep this tutorial small and deliberate:
- one architecture (`MLP`),
- one seed,
- three Adam base learning rates,
- one optional multiseed block at the end.

The three base learning rates play different roles:
- `0.001`: safely below the radius limit,
- `0.5`: large enough that the exact controller should intervene,
- `2.0`: intentionally extreme, so the difference between the two methods becomes obvious.
import math
import random
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, display
from ghosts.plotting import add_end_labels, add_subtitle, apply_plot_style, finish_figure, format_percent_axis
def display_table(rows, columns=None, formats=None):
if not rows:
print("No rows to display.")
return
if columns is None:
columns = list(rows[0].keys())
formats = formats or {}
parts = ['<table style="border-collapse:collapse">', '<thead><tr>']
for col in columns:
parts.append(f'<th style="text-align:left;padding:4px 8px;border-bottom:1px solid #ccc">{col}</th>')
parts.append('</tr></thead><tbody>')
for row in rows:
parts.append('<tr>')
for col in columns:
value = row.get(col, '')
if col in formats:
value = formats[col](value)
parts.append(f'<td style="padding:4px 8px">{value}</td>')
parts.append('</tr>')
parts.append('</tbody></table>')
display(HTML(''.join(parts)))
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from torch.func import functional_call, jvp
from torch.utils.data import DataLoader, TensorDataset
torch.set_num_threads(1)
DEVICE = torch.device("cpu")
LOW_LR = 0.001
MID_LR = 0.5
HIGH_LR = 2.0
LRS = [LOW_LR, MID_LR, HIGH_LR]
EPOCHS = 20
BATCH_SIZE = 128
TARGET_R = 1.0
SEED = 7
RUN_MULTI_SEED = False
MULTI_SEEDS = [0, 1, 2, 3, 4]
PALETTE = {
"exact": "#006BA2",
"fixed": "#E3120B",
"dark": "#3D3D3D",
}
def set_seed(seed: int):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
2. Model and data¶
As in Tutorial 0, we use the sklearn digits dataset and a small MLP. That keeps the comparison focused on the controller logic rather than on architectural details.
class MLP(nn.Module):
def __init__(self, width=128):
super().__init__()
self.fc1 = nn.Linear(64, width)
self.fc2 = nn.Linear(width, width)
self.fc3 = nn.Linear(width, 10)
def forward(self, x):
x = F.gelu(self.fc1(x))
x = F.gelu(self.fc2(x))
return self.fc3(x)
def make_model():
return MLP().to(DEVICE)
def build_digits_loaders(seed: int, batch_size: int = 128):
digits = load_digits()
X = digits.data.astype(np.float32) / 16.0
y = digits.target.astype(np.int64)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=seed, stratify=y
)
train_ds = TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
test_ds = TensorDataset(torch.tensor(X_test), torch.tensor(y_test))
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=256, shuffle=False)
return train_loader, test_loader
train_loader, test_loader = build_digits_loaders(SEED, BATCH_SIZE)
len(train_loader), len(test_loader)
(11, 2)
3. The exact Adam + $\rho$ controller¶
For Adam, the controller must act on Adam's actual proposed step, not on the raw gradient alone.
Write Adam's next unit-learning-rate proposal as
$$ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, $$
$$ \hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, $$
$$ u_{\mathrm{unit}} = \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon} $$
(and in this notebook the code also includes Adam's weight-decay term when present). This `u_unit` is the optimizer's proposed update at learning rate 1.
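The proposal above can be checked on toy numbers. The sketch below is a standalone helper (a hypothetical `adam_unit_proposal`, not the notebook's `adam_unit_step`) that computes `u_unit` for one tensor. It illustrates why the bias corrections matter: at the very first step, every coordinate of `u_unit` has magnitude close to 1 regardless of the gradient's scale.

```python
import numpy as np

def adam_unit_proposal(g, m_prev, v_prev, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam moment update plus the bias-corrected unit-LR proposal u_unit."""
    m_t = beta1 * m_prev + (1.0 - beta1) * g
    v_t = beta2 * v_prev + (1.0 - beta2) * g * g
    m_hat = m_t / (1.0 - beta1 ** t)
    v_hat = v_t / (1.0 - beta2 ** t)
    u_unit = m_hat / (np.sqrt(v_hat) + eps)
    return u_unit, m_t, v_t

# First step (t=1) from zero moments: m_hat = g and sqrt(v_hat) = |g|,
# so u_unit ~= g / |g| elementwise, i.e. magnitude ~1 in every coordinate.
g = np.array([0.5, -2.0, 0.01])
u, m, v = adam_unit_proposal(g, np.zeros(3), np.zeros(3), t=1)
print(u)
```

This scale invariance is exactly why the controller must work with `u_unit` rather than the raw gradient: the step Adam actually takes has a different norm and direction than `g`.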
Once that proposal is available, the exact controller is
$$ \tau_{\mathrm{raw}} = \eta_{\mathrm{base}} \lVert u_{\mathrm{unit}} \rVert, \qquad v = -\frac{u_{\mathrm{unit}}}{\lVert u_{\mathrm{unit}} \rVert}, \qquad \rho_{\mathrm{batch}} = \rho_a(v), $$
$$ \eta_{\mathrm{eff}} = \min\!\left(\eta_{\mathrm{base}}, \frac{r_{\mathrm{target}}\,\rho_{\mathrm{batch}}}{\lVert u_{\mathrm{unit}} \rVert}\right). $$
So the direction always stays Adam's direction. Only the step length is changed.
Code mapping for this notebook:
- `u_unit` comes from `adam_unit_step(...)`,
- `||u_unit||` is `unit_norm`,
- `rho_a(v)` is computed by `batch_rho_jvp(...)`,
- `eta_eff` is written into Adam's parameter groups before `opt.step()`.
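As a quick sanity check on the clamp rule, here is a tiny standalone helper (the name `controller_eta` and the numbers are illustrative, not from the repo) that mirrors the `eta_eff` formula:

```python
def controller_eta(base_lr, unit_norm, rho_batch, r_target=1.0):
    """eta_eff = min(eta_base, r_target * rho_batch / ||u_unit||).

    The realized step length is tau = eta_eff * unit_norm, so this
    guarantees tau <= r_target * rho_batch without ever raising the
    learning rate above base_lr.
    """
    return min(base_lr, r_target * rho_batch / unit_norm)

# Safe base LR: the bound 0.2 / 10.0 = 0.02 exceeds 0.001, so the
# controller leaves the base learning rate untouched.
print(controller_eta(base_lr=0.001, unit_norm=10.0, rho_batch=0.2))  # -> 0.001
# Aggressive base LR: the controller clamps, and the realized step
# tau = eta * unit_norm lands on r_target * rho_batch (up to rounding).
eta = controller_eta(base_lr=0.5, unit_norm=10.0, rho_batch=0.2)
print(eta)  # -> 0.02
```

Note the asymmetry: the rule only ever shortens the step, which is why the exact runs at a safe base learning rate coincide with fixed-LR Adam in the experiments below.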
4. Fixed learning-rate Adam¶
The baseline is ordinary Adam with a fixed base learning rate and no rho controller.
opt = torch.optim.Adam(model.parameters(), lr=base_lr)
opt.step()
This baseline is the right comparison because it isolates exactly what the controller changes:
- the optimizer direction is still Adam's direction,
- the exact controller only shortens the step when `tau_raw > r_target * rho_a`,
- if the base learning rate is already safe, the exact controller behaves like fixed-LR Adam.
def evaluate(model, loader):
model.eval()
total_loss = 0.0
total_correct = 0
total_count = 0
with torch.no_grad():
for xb, yb in loader:
xb = xb.to(DEVICE)
yb = yb.to(DEVICE)
logits = model(xb)
loss = F.cross_entropy(logits, yb)
total_loss += float(loss.item()) * len(xb)
total_correct += int((logits.argmax(dim=1) == yb).sum())
total_count += len(xb)
return total_loss / total_count, total_correct / total_count
def adam_unit_step(model, opt):
group_by_param = {}
for group in opt.param_groups:
for p in group['params']:
group_by_param[p] = group
step_vecs = []
sq = 0.0
for p in model.parameters():
if p not in group_by_param or p.grad is None:
z = torch.zeros(p.numel(), device=p.device, dtype=p.dtype)
step_vecs.append(z)
continue
group = group_by_param[p]
beta1, beta2 = group['betas']
eps = float(group['eps'])
wd = float(group.get('weight_decay', 0.0))
maximize = bool(group.get('maximize', False))
g = p.grad.detach()
if maximize:
g = -g
state = opt.state.get(p, {})
if 'exp_avg' in state:
m_prev = state['exp_avg']
v_prev = state['exp_avg_sq']
step_prev_obj = state.get('step', 0)
step_prev = int(step_prev_obj.item()) if torch.is_tensor(step_prev_obj) else int(step_prev_obj)
else:
m_prev = torch.zeros_like(p)
v_prev = torch.zeros_like(p)
step_prev = 0
step_t = step_prev + 1
m_t = beta1 * m_prev + (1.0 - beta1) * g
v_t = beta2 * v_prev + (1.0 - beta2) * (g * g)
bc1 = 1.0 - (beta1 ** step_t)
bc2 = 1.0 - (beta2 ** step_t)
denom = v_t.sqrt() / math.sqrt(max(bc2, 1e-16)) + eps
adam_term = (m_t / max(bc1, 1e-16)) / denom
update = adam_term
if wd != 0.0:
update = update + wd * p.detach()
step_vecs.append(update.flatten())
sq += float((update * update).sum().item())
step_vec = torch.cat(step_vecs)
norm = math.sqrt(sq)
return step_vec, norm
def batch_rho_jvp(model, xb, v_flat):
params = dict(model.named_parameters())
tangents = {}
offset = 0
for name, p in params.items():
numel = p.numel()
tangents[name] = v_flat[offset:offset + numel].view_as(p)
offset += numel
def fwd(pdict):
return functional_call(model, pdict, (xb,))
was_training = model.training
model.eval()
_, dlogits = jvp(fwd, (params,), (tangents,))
if was_training:
model.train()
spread = dlogits.max(dim=1).values - dlogits.min(dim=1).values
delta_a = float(spread.max().item())
return math.pi / max(delta_a, 1e-12)
def run_training(mode: str, base_lr: float, seed: int):
set_seed(seed)
train_loader, test_loader = build_digits_loaders(seed, BATCH_SIZE)
model = make_model()
opt = torch.optim.Adam(model.parameters(), lr=base_lr)
history = {
'train_loss': [],
'test_acc': [],
'max_r': [],
'mean_r': [],
'mean_eff_lr': [],
'mean_rho': [],
}
for _ in range(EPOCHS):
model.train()
batch_losses = []
batch_r = []
batch_eff_lr = []
batch_rho = []
for xb, yb in train_loader:
xb = xb.to(DEVICE)
yb = yb.to(DEVICE)
opt.zero_grad(set_to_none=True)
logits = model(xb)
loss = F.cross_entropy(logits, yb)
loss.backward()
unit_step_vec, unit_norm = adam_unit_step(model, opt)
if unit_norm < 1e-12:
eff_lr = float(base_lr)
for group in opt.param_groups:
group['lr'] = eff_lr
opt.step()
rho_a = float('inf')
r = 0.0
else:
v_dir = -unit_step_vec / unit_norm
rho_a = batch_rho_jvp(model, xb, v_dir)
if mode == 'exact':
eff_lr = min(base_lr, TARGET_R * rho_a / unit_norm)
elif mode == 'fixed':
eff_lr = float(base_lr)
else:
raise ValueError(mode)
for group in opt.param_groups:
group['lr'] = eff_lr
opt.step()
tau = eff_lr * unit_norm
r = tau / rho_a if rho_a > 0 else float('inf')
batch_losses.append(float(loss.item()))
batch_r.append(float(r))
batch_eff_lr.append(float(eff_lr))
batch_rho.append(float(rho_a if math.isfinite(rho_a) else 0.0))
_, test_acc = evaluate(model, test_loader)
history['train_loss'].append(float(np.mean(batch_losses)))
history['test_acc'].append(float(test_acc))
history['max_r'].append(float(np.max(batch_r)))
history['mean_r'].append(float(np.mean(batch_r)))
history['mean_eff_lr'].append(float(np.mean(batch_eff_lr)))
history['mean_rho'].append(float(np.mean(batch_rho)))
return history
RUN_SPECS = [
{'key': 'exact_low', 'mode': 'exact', 'base_lr': LOW_LR, 'label': f'exact Adam+rho (LR={LOW_LR})'},
{'key': 'exact_mid', 'mode': 'exact', 'base_lr': MID_LR, 'label': f'exact Adam+rho (LR={MID_LR})'},
{'key': 'exact_high', 'mode': 'exact', 'base_lr': HIGH_LR, 'label': f'exact Adam+rho (LR={HIGH_LR})'},
{'key': 'fixed_low', 'mode': 'fixed', 'base_lr': LOW_LR, 'label': f'fixed-LR Adam (LR={LOW_LR})'},
{'key': 'fixed_mid', 'mode': 'fixed', 'base_lr': MID_LR, 'label': f'fixed-LR Adam (LR={MID_LR})'},
{'key': 'fixed_high', 'mode': 'fixed', 'base_lr': HIGH_LR, 'label': f'fixed-LR Adam (LR={HIGH_LR})'},
]
def final_summary_row(spec, hist):
return {
'run': spec['label'],
'final_acc': hist['test_acc'][-1],
'peak_r': max(hist['max_r']),
'final_mean_r': hist['mean_r'][-1],
'final_mean_eff_lr': hist['mean_eff_lr'][-1],
'final_mean_rho': hist['mean_rho'][-1],
}
5. Single-seed run¶
We now run the exact controller and fixed-LR Adam on the same model and split.
The comparison to watch is straightforward:
- the exact controller should drive the realized step ratio toward `r ~= 1` once the base learning rate is large enough to matter,
- fixed-LR Adam may still take realized steps with `r >> 1` because it never shortens Adam's proposed step.
single_seed = {}
for spec in RUN_SPECS:
single_seed[spec['key']] = run_training(spec['mode'], spec['base_lr'], SEED)
summary_rows = [final_summary_row(spec, single_seed[spec['key']]) for spec in RUN_SPECS]
display_table(
summary_rows,
columns=['run', 'final_acc', 'peak_r', 'final_mean_r', 'final_mean_eff_lr', 'final_mean_rho'],
formats={
'final_acc': lambda x: f"{float(x):.3f}",
'peak_r': lambda x: f"{float(x):.3f}",
'final_mean_r': lambda x: f"{float(x):.3f}",
'final_mean_eff_lr': lambda x: f"{float(x):.6f}",
'final_mean_rho': lambda x: f"{float(x):.6f}",
},
)
| run | final_acc | peak_r | final_mean_r | final_mean_eff_lr | final_mean_rho |
|---|---|---|---|---|---|
| exact Adam+rho (LR=0.001) | 0.969 | 0.358 | 0.133 | 0.001000 | 0.200699 |
| exact Adam+rho (LR=0.5) | 0.980 | 1.000 | 1.000 | 0.023500 | 0.302561 |
| exact Adam+rho (LR=2.0) | 0.980 | 1.000 | 1.000 | 0.023500 | 0.302561 |
| fixed-LR Adam (LR=0.001) | 0.969 | 0.358 | 0.133 | 0.001000 | 0.200699 |
| fixed-LR Adam (LR=0.5) | 0.189 | 4971.622 | 53.704 | 0.500000 | 0.023873 |
| fixed-LR Adam (LR=2.0) | 0.100 | 283819.042 | 0.173 | 2.000000 | 2.865052 |
6. First look: loss and accuracy¶
Start with the most legible outputs:
- training loss on a log scale,
- test accuracy on a linear scale.
Reading key:
- color = method (`exact` vs `fixed LR`),
- line style = base learning rate,
- markers help when curves overlap.
apply_plot_style(font_size=10, title_size=12, label_size=10, tick_size=9)
epochs = np.arange(1, EPOCHS + 1)
style_map = {
'exact_low': {'color': PALETTE['exact'], 'ls': '-', 'marker': 'o', 'alpha': 0.82},
'exact_mid': {'color': PALETTE['exact'], 'ls': '--', 'marker': 'o', 'alpha': 0.82},
'exact_high': {'color': PALETTE['exact'], 'ls': '-.', 'marker': 'o', 'alpha': 0.82},
'fixed_low': {'color': PALETTE['fixed'], 'ls': '-', 'marker': 's', 'alpha': 0.82},
'fixed_mid': {'color': PALETTE['fixed'], 'ls': '--', 'marker': 's', 'alpha': 0.82},
'fixed_high': {'color': PALETTE['fixed'], 'ls': '-.', 'marker': 's', 'alpha': 0.82},
}
fig, axes = plt.subplots(1, 2, figsize=(11.8, 4.4), sharex=True)
loss_labels = []
acc_labels = []
for spec in RUN_SPECS:
key = spec['key']
hist = single_seed[key]
style = style_map[key]
axes[0].semilogy(
epochs, hist['train_loss'],
color=style['color'], ls=style['ls'], lw=2.2,
marker=style['marker'], ms=4.5, markevery=2, alpha=style['alpha']
)
axes[1].plot(
epochs, hist['test_acc'],
color=style['color'], ls=style['ls'], lw=2.2,
marker=style['marker'], ms=4.5, markevery=2, alpha=style['alpha']
)
weight = 'bold' if key.startswith('exact') else None
loss_labels.append((hist['train_loss'][-1], spec['label'], style['color'], weight))
acc_labels.append((hist['test_acc'][-1], spec['label'], style['color'], weight))
axes[0].set_title('Exact step control keeps Adam stable at aggressive LR', loc='left', fontweight='bold')
add_subtitle(axes[0], "Both runs use Adam; the controller only shortens the proposed step when it is too long.", fontsize=9)
axes[1].set_title('Accuracy stays higher when the controller enforces tau <= rho', loc='left', fontweight='bold')
add_subtitle(axes[1], 'Fixed-LR Adam overshoots once the proposed step exceeds the local radius.', fontsize=9)
axes[0].set_ylabel('training loss')
axes[1].set_ylabel('test accuracy')
axes[1].set_ylim(0.0, 1.02)
format_percent_axis(axes[1], xmax=1.0)
for ax in axes:
ax.set_xlabel('epoch')
ax.grid(True, alpha=0.25)
add_end_labels(axes[0], epochs, loss_labels, fontsize=7)
add_end_labels(axes[1], epochs, acc_labels, fontsize=7)
fig.suptitle('Exact directional Adam controller versus fixed-LR Adam', y=0.99, fontsize=12, fontweight='bold')
finish_figure(fig, rect=[0, 0, 1, 0.94])
plt.show()
7. Diagnostic view: effective learning rate and realized step ratio¶
Now inspect the controller quantities directly.
The left panel shows the mean effective learning rate used over each epoch.
The right panel shows the per-epoch maximum of the realized normalized step ratio `r = tau / rho_a`, measured along the actual Adam update direction.
This is the most important panel in the notebook: the exact controller should lock onto the geometric target, while fixed-LR Adam can drift far above it.
fig, axes = plt.subplots(1, 2, figsize=(11.8, 4.4), sharex=True)
eta_labels = []
r_labels = []
for spec in RUN_SPECS:
key = spec['key']
hist = single_seed[key]
style = style_map[key]
axes[0].semilogy(
epochs, hist['mean_eff_lr'],
color=style['color'], ls=style['ls'], lw=2.1,
marker=style['marker'], ms=4.0, markevery=2, alpha=0.72
)
axes[1].semilogy(
epochs, hist['max_r'],
color=style['color'], ls=style['ls'], lw=2.1,
marker=style['marker'], ms=4.0, markevery=2, alpha=0.72
)
weight = 'bold' if key.startswith('exact') else None
eta_labels.append((hist['mean_eff_lr'][-1], spec['label'], style['color'], weight))
r_labels.append((hist['max_r'][-1], spec['label'], style['color'], weight))
axes[0].set_title('The controller reduces effective LR only when Adam would overshoot', loc='left', fontweight='bold')
add_subtitle(axes[0], "At safe LR the exact and fixed runs are nearly the same; at larger LR the controller backs off.", fontsize=9)
axes[1].set_title('Exact control keeps the realized step near the target boundary', loc='left', fontweight='bold')
add_subtitle(axes[1], 'Fixed-LR Adam can drive r far above 1 at the same base learning rate.', fontsize=9)
axes[0].set_ylabel('effective learning rate')
axes[1].set_ylabel('max r per epoch')
axes[1].axhline(TARGET_R, color=PALETTE['dark'], ls=':', lw=1.2)
for ax in axes:
ax.set_xlabel('epoch')
ax.grid(True, alpha=0.25, which='both')
add_end_labels(axes[0], epochs, eta_labels, fontsize=7)
add_end_labels(axes[1], epochs, r_labels, fontsize=7)
fig.suptitle('Adam controller diagnostics', y=0.99, fontsize=12, fontweight='bold')
finish_figure(fig, rect=[0, 0, 1, 0.94])
plt.show()
8. Summary table¶
The figure gives the visual story. The table below shows the same comparison numerically.
summary_rows = [final_summary_row(spec, single_seed[spec['key']]) for spec in RUN_SPECS]
display_table(
summary_rows,
columns=['run', 'final_acc', 'peak_r', 'final_mean_r', 'final_mean_eff_lr', 'final_mean_rho'],
formats={
'final_acc': lambda x: f"{float(x):.3f}",
'peak_r': lambda x: f"{float(x):.3f}",
'final_mean_r': lambda x: f"{float(x):.3f}",
'final_mean_eff_lr': lambda x: f"{float(x):.6f}",
'final_mean_rho': lambda x: f"{float(x):.6f}",
},
)
| run | final_acc | peak_r | final_mean_r | final_mean_eff_lr | final_mean_rho |
|---|---|---|---|---|---|
| exact Adam+rho (LR=0.001) | 0.969 | 0.358 | 0.133 | 0.001000 | 0.200699 |
| exact Adam+rho (LR=0.5) | 0.980 | 1.000 | 1.000 | 0.023500 | 0.302561 |
| exact Adam+rho (LR=2.0) | 0.980 | 1.000 | 1.000 | 0.023500 | 0.302561 |
| fixed-LR Adam (LR=0.001) | 0.969 | 0.358 | 0.133 | 0.001000 | 0.200699 |
| fixed-LR Adam (LR=0.5) | 0.189 | 4971.622 | 53.704 | 0.500000 | 0.023873 |
| fixed-LR Adam (LR=2.0) | 0.100 | 283819.042 | 0.173 | 2.000000 | 2.865052 |
9. Optional: multiseed check¶
If you want a more stable comparison, set `RUN_MULTI_SEED = True` and rerun the next cell.
The aggregated table reports mean final accuracy and median peak step ratio across seeds.
if RUN_MULTI_SEED:
grouped_rows = []
for spec in RUN_SPECS:
rows = []
for seed in MULTI_SEEDS:
hist = run_training(spec['mode'], spec['base_lr'], seed)
rows.append({
'final_acc': hist['test_acc'][-1],
'peak_r': max(hist['max_r']),
})
grouped_rows.append({
'run': spec['label'],
'final_acc_mean': float(np.mean([row['final_acc'] for row in rows])),
'final_acc_std': float(np.std([row['final_acc'] for row in rows], ddof=1)) if len(rows) > 1 else 0.0,
'peak_r_median': float(np.median([row['peak_r'] for row in rows])),
})
display_table(
grouped_rows,
columns=['run', 'final_acc_mean', 'final_acc_std', 'peak_r_median'],
formats={
'final_acc_mean': lambda x: f"{float(x):.3f}",
'final_acc_std': lambda x: f"{float(x):.3f}",
'peak_r_median': lambda x: f"{float(x):.3f}",
},
)
else:
print('Set RUN_MULTI_SEED = True to execute the multiseed check.')
Set RUN_MULTI_SEED = True to execute the multiseed check.
10. What to remember¶
- The correct way to apply Adam + `rho` is to build Adam's actual proposed step `u_unit` first.
- Then compute `rho_a` along Adam's proposed direction `v = -u_unit / ||u_unit||`.
- Then set the effective learning rate to `eta_eff = min(eta_base, r_target * rho_a / ||u_unit||)`, so the realized step length satisfies `tau <= r_target * rho_a`.
- Relative to fixed-LR Adam, the controller keeps Adam's direction and only shortens the step when the proposal is too long for the local radius.

For another optimizer, the template is the same:
- let the optimizer propose an update,
- extract its direction and norm,
- compute `rho_a` along that direction,
- shorten the step only if it is too long.
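As a hedged illustration of that template, here is a minimal sketch with plain SGD, where the unit-LR proposal is simply the gradient. The function name `rho_controlled_sgd_step` is illustrative (not part of the repo), and `rho_fn` stands in for a directional-radius oracle such as `batch_rho_jvp`:

```python
import torch

def rho_controlled_sgd_step(model, rho_fn, base_lr, r_target=1.0):
    """Sketch of the four-step template for plain SGD.

    rho_fn(v_flat) is assumed to return rho_a along the flattened unit
    direction v_flat. Returns the effective learning rate actually used.
    """
    # 1. let the optimizer propose an update (for SGD: the gradient itself)
    grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    u = torch.cat(grads)
    # 2. extract its direction and norm
    norm = float(u.norm())
    if norm < 1e-12:
        return base_lr  # degenerate proposal: nothing to control
    v = -u / norm
    # 3. compute rho_a along that direction
    rho = rho_fn(v)
    # 4. shorten the step only if it is too long
    eff_lr = min(base_lr, r_target * rho / norm)
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.add_(p.grad, alpha=-eff_lr)
    return eff_lr
```

Swapping in a different optimizer only changes step 1: build that optimizer's unit-LR proposal (as `adam_unit_step` does for Adam) instead of reading the raw gradient.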