Associative Memory, States, Attention, Optimization:
A Unifying Perspective
- The same algorithm, three axes
- The sequence axis as associative memory
- The general framework
Regularized cost · Generating function $h$ · Cost hierarchy · Write strategies · Monoid & PE · Key geometry · Taxonomy · Equation tables
- The depth axis
- The data axis
Every gradient is an outer product · LR = PE · Critical batch size · Adam as diagonal Shampoo
- Cross-axis transfer
- Where the analogy breaks
- Open questions
The same algorithm, three axes
In 1960, Widrow and Hoff introduced the delta rule [1]: given a target $y$ and input $x$, update a weight matrix by the error-corrective rule
$$W \leftarrow W + \eta\,(y - Wx)\,x^\top$$

This is gradient descent on the squared loss $\frac{1}{2}\|y - Wx\|^2$, applied once per training example. Over six decades it became a cornerstone of adaptive filtering and online learning, known as the LMS or Widrow–Hoff rule.
In 2024, Yang et al. introduced DeltaNet [2], a sequence model that updates a memory matrix once per token:
$$S_t \leftarrow S_{t-1} + (v_t - S_{t-1} k_t)\, k_t^\top$$

DeltaNet knowingly borrows the delta rule — the paper cites Widrow and Hoff. But what has not been fully appreciated is how exact the correspondence is. This is literally the same algorithm with a change of variable names: keys $k_t$ play the role of inputs $x$, values $v_t$ play the role of targets $y$, and the memory $S_t$ plays the role of the weight matrix $W$ (with normalized keys and learning rate $\eta = 1$). The algebra is identical; only the axis of computation changes — from training examples to tokens.
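To make the identity concrete, here is a minimal NumPy sketch (variable names are ours): one function implements both the Widrow–Hoff step on an (input, target) pair and the DeltaNet step on a (key, value) pair.

```python
import numpy as np

def delta_rule(W, x, y, eta=1.0):
    """Error-corrective update: W <- W + eta * (y - W x) x^T."""
    return W + eta * np.outer(y - W @ x, x)

rng = np.random.default_rng(0)
W = np.zeros((2, 3))                 # weight matrix on the data axis, memory S on the sequence axis
x = rng.normal(size=3)
x /= np.linalg.norm(x)               # normalized input / key
y = rng.normal(size=2)               # target / value

W = delta_rule(W, x, y)              # one step with eta = 1
# With a unit-norm key and eta = 1, the association is stored exactly:
assert np.allclose(W @ x, y)
```

Whether `x` is a training input and `y` a regression target, or `x` a token's key and `y` its value, the update is byte-for-byte the same.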
This post is about that structure. Large language models perform iterative state updates along three axes of computation — sequence, depth, and data — that have largely been studied in isolation. Behrouz et al. [12] recently unified sequence-axis architectures as associative memories within a regularized optimization framework. We show that this extends to all three axes, and that the Bregman generating function $h$ additionally unifies retention and normalization — two design choices previously treated as independent.
The common pattern is a state update: retain part of the old state and add new information. In its simplest form:
$$S_t = A_t \circ S_{t-1} + v_t\, k_t^\top$$where $A_t$ controls forgetting and $v_t k_t^\top$ writes a new key-value association. This appears across all three axes:
| System | State | Shape | Retention | Write term | Index |
|---|---|---|---|---|---|
| GRU† | Hidden $h_t$ | vector | $\text{diag}(1-z_t)$ (gate) | $z_t \odot \tilde{h}_t$ | Time $t$ |
| SSM (S4, Mamba) | Latent $s_t$ | vector | $\bar{A}$ (dynamics, diagonal) | $\bar{B} u_t$ | Time $t$ |
| Residual network | $x^{(\ell)}$ | vector | $I$ (identity) | $f(x^{(\ell-1)})$ | Layer $\ell$ |
| Momentum | Buffer $m$ | vector | $\beta I$ | $(1-\beta)g_n$ | Step $n$ |
| Linear attention | KV memory $S_t$ | matrix | $I$ (no forget) | $v_t k_t^\top$ | Token $t$ |
| SGD + weight decay | Weights $W$ | matrix | $(1{-}\lambda\eta)I$ | $\eta\, g_n$ | Step $n$ |
† A vanilla RNN uses $h_t = \sigma(W_h h_{t-1} + W_x x_t)$, which does not cleanly separate retention from writing since $W_h$ is a full learned matrix and the nonlinearity wraps both terms. The GRU isolates retention via a diagonal gate $\text{diag}(1-z_t)$.
But this shared form hides a deeper unity: as we will show, all of these are special cases of a single regularized optimization framework.
- Sequence axis ($t$ = token): the KV cache accumulates information about past tokens. This is working memory.
- Depth axis ($\ell$ = layer): the residual stream is refined by each layer. This is processing memory.
- Data axis ($n$ = example): the weight matrix is updated by each gradient step. This is long-term memory.
Feed-forward network (FFN) layers are themselves associative memories [3]: the weight matrices store key-value associations learned during training. Training — updating $W$ via gradient descent — is precisely the state update applied along the data axis.
The sequence axis as associative memory
We develop the framework first on the sequence axis, where the connection to associative memory is most direct. Each subsection introduces one dimension of the memory system — state, cost, algorithm, retention, capacity, duality, structure — that the general framework later extends to all three axes.
Notation. Keys $k_t \in \mathbb{R}^{d_k}$, values $v_t \in \mathbb{R}^{d_v}$, queries $q \in \mathbb{R}^{d_k}$. The memory state is a matrix $S_t \in \mathbb{R}^{d_v \times d_k}$ (or a vector when $d_k = 1$). The sequence length is $N$ and the number of attention heads is $H$. For $N$ key-value pairs, $K = [k_1, \ldots, k_N]^\top \in \mathbb{R}^{N \times d_k}$ and $V = [v_1, \ldots, v_N]^\top \in \mathbb{R}^{N \times d_v}$.
Write, read, and the key-value abstraction
An associative memory stores key-value pairs $(k_i, v_i)$ and retrieves values given a query $q$. The memory state is a matrix $S \in \mathbb{R}^{d_v \times d_k}$ updated via rank-one writes:
$$S_t = A_t \circ S_{t-1} + v_t\, k_t^\top$$where $A_t$ controls forgetting and retrieval is $y = S_t\, q$. A vector state is the special case $d_k = 1$: there is no key, just a value being accumulated.
How the state interfaces with the network. When the state is a vector $s \in \mathbb{R}^d$, the network interacts with it through general linear maps: reading is $y = W_r\, s$ and writing is $\Delta s = W_w\, z$, both unconstrained matrix multiplies.
When the state is a matrix $S \in \mathbb{R}^{d_v \times d_k}$, the interface is more structured. Reading with a query $q$ gives $y = Sq = \sum_j q_j\, S_{:,j}$. In terms of the flattened state $\mathrm{vec}(S) \in \mathbb{R}^{d_v d_k}$:
$$y = (q^\top \otimes I_{d_v})\,\mathrm{vec}(S)$$

The read matrix $(q^\top \otimes I_{d_v})$ consists of $d_k$ blocks, each $q_j I_{d_v}$ — a scalar times the identity. Unlike a general linear map from $\mathbb{R}^{d_v d_k}$ to $\mathbb{R}^{d_v}$ (which would require $d_v^2 d_k$ parameters), the read is parameterized by just $d_k$ values. Within each key dimension, all $d_v$ value components are treated identically.
The write is similarly structured: $\Delta S = v\, k^\top$, i.e., $\mathrm{vec}(\Delta S) = k \otimes v$—a Kronecker product. When key and value are linear projections of the input $x$, the write becomes $\mathrm{vec}(\Delta S) = (W_K x) \otimes (W_V x)$, a quadratic feature of the input, in contrast to the linear $\Delta s = W\, x$ for a vector state.
The matrix state trades generality for capacity: its Kronecker-structured read/write cannot express arbitrary linear transformations of the flattened state, but it can store $O(d_k)$ associations instead of $O(1)$. The output projection $W_O$ applied after retrieval provides the unconstrained mixing that the structured read alone cannot.
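These structural identities are easy to check directly in NumPy (a quick sketch; the column-major flatten `order="F"` matches the column-stacking convention for $\mathrm{vec}$):

```python
import numpy as np

rng = np.random.default_rng(0)
dk, dv = 3, 2
S = rng.normal(size=(dv, dk))    # memory state
q = rng.normal(size=dk)          # query
v = rng.normal(size=dv)          # value
k = rng.normal(size=dk)          # key

# Read: y = S q equals (q^T ⊗ I_{dv}) vec(S)
vecS = S.flatten(order="F")                    # column-stacked vec(S)
y = np.kron(q, np.eye(dv)) @ vecS              # structured (dv × dv·dk) read matrix
assert np.allclose(y, S @ q)

# Write: vec(Δ S) = vec(v k^T) = k ⊗ v
assert np.allclose(np.outer(v, k).flatten(order="F"), np.kron(k, v))
```

The read matrix `np.kron(q, np.eye(dv))` has exactly the $d_k$ blocks $q_j I_{d_v}$ described above.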
The projection spectrum. More broadly, the linear map from state to attention inputs (queries, keys, values) is always a structured matrix. Different architectures choose different points on a spectrum from dense to maximally constrained:
| Structure | Params | Example | Map |
|---|---|---|---|
| Dense | $O(d \cdot d_k)$ | Standard transformer | $k = W_K\, x$ |
| Low-rank | $O(d \cdot d_c)$ | MLA (DeepSeek-V2) | $k = W_K^U(W_{DKV}\, x)$ |
| Block-diagonal | $O(d \cdot d_h)$ | Multi-head attn | Per-head $k^{(h)} = W_K^{(h)} x^{(h)}$ |
| Coarse block-diag. | $O(d \cdot d_h/g)$ | GQA | Fewer KV heads, shared across $g$ query heads |
| Kronecker | $O(d_k + d_v)$ | Matrix state read | $y = (q^\top \otimes I_{d_v})\,\mathrm{vec}(S)$ |
At the dense end, a standard transformer uses unconstrained $W_K, W_V \in \mathbb{R}^{d_k \times d}$ to project the residual stream into key-value space. Multi-head Latent Attention (MLA) compresses the residual stream into a shared low-rank latent $c_t = W_{DKV}\, x_t \in \mathbb{R}^{d_c}$, from which per-head keys and values are recovered: $k_t^{(h)} = W_{K,h}^U\, c_t$. The combined map $k_t^{(h)} = W_{K,h}^U W_{DKV}\, x_t$ is a low-rank factorization of $W_K$, reducing the KV cache from $O(H d_h)$ to $O(d_c)$ per token. Multi-head attention uses a block-diagonal projection: each head has its own $W_K^{(h)}$ acting on a disjoint slice of the residual stream. GQA coarsens this: $g$ query heads share each KV head, so the KV projection has $H/g$ blocks—interpolating between full multi-head ($g = 1$) and multi-query attention ($g = H$, a single shared KV head). At the Kronecker end, the matrix-state read is parameterized by just $d_k$ values.
The same spectrum appears on the depth axis. When the residual state is a matrix $M \in \mathbb{R}^{p \times q}$ (as in residual matrix transformers), the projections that produce attention inputs from this state inherit Kronecker structure. A low-rank projection from the matrix state (compressing $M$ to a vector before projecting) would be the depth-axis analogue of MLA. With multi-head attention, the structure is doubly constrained: block-diagonal across heads, and Kronecker-structured within each head.
What to write: the cost
Different write rules correspond to different cost functions that the write operation implicitly minimizes. This cost-update duality provides the deeper "why" for the shared structure across axes.
Additive (Hebbian). $S_t = S_{t-1} + v_t k_t^\top$ ignores what is already stored. The corresponding cost is a linear alignment: $\ell_t(S) = -\langle v_t, S k_t \rangle$. Its gradient with respect to $S$ is $\nabla_S \ell_t = -v_t k_t^\top$, so a single gradient descent step with unit step size gives $S_t = S_{t-1} - \nabla_S \ell_t = S_{t-1} + v_t k_t^\top$, recovering the Hebbian update. Over $N$ steps from $S_0 = 0$: $S_N = \sum_{i=1}^{N} v_i k_i^\top = V^\top K$, which is the unnormalized linear attention memory.
Error-corrective (delta rule). $S_t = S_{t-1} + (v_t - S_{t-1} k_t) k_t^\top$ writes only the prediction error. The corresponding cost is MSE: $\ell_t = \frac{1}{2}\|v_t - S k_t\|^2$. On the data axis, this is exactly the Widrow–Hoff / LMS rule [28].
Log-sum-exp (softmax attention). The connection to attention can be made precise through energy functions in the Hopfield framework. A Hopfield network stores $N$ patterns (keys $k_1, \ldots, k_N$) and retrieves by minimizing an energy $E(q)$ over the query. The retrieval rule is $q^* = -\nabla_q E(q)$ at a fixed point. Different energies yield different retrieval mechanisms:
| Model | Energy $E(q)$ | Retrieval $-\nabla_q E$ | Capacity |
|---|---|---|---|
| Classical Hopfield | $-\frac{1}{2}\|Kq\|^2$ | $K^\top K\, q$ → sign | $O(d)$ |
| Linear attention | $-\mathbf{1}^\top K q$ | $\sum_i k_i$ | $O(1)$ |
| Softmax attention | $-\log \sum_i e^{k_i^\top q}$ | $\sum_i \text{softmax}(Kq)_i\, k_i$ | $e^{O(d)}$ |
The key insight is how per-pattern similarities $k_i^\top q$ are aggregated: summing their squares (classical Hopfield) gives polynomial capacity; the log-sum-exp aggregation (modern Hopfield) gives exponential capacity because it is dominated by the largest similarity. Ramsauer et al. [6] showed that softmax attention is exactly one retrieval step of this modern Hopfield network: with values attached to keys, the output is $y = \sum_i \text{softmax}(Kq)_i\, v_i$, the standard attention formula. On the data axis, the loss function is the energy and gradient descent is the retrieval; the log-sum-exp cost is cross-entropy with softmax outputs.
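One retrieval step can be sketched in a few lines (a minimal illustration, not the general setting: orthonormal stored patterns and the temperature $\tau = 0.25$ are our illustrative choices):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilized softmax
    e = np.exp(z)
    return e / e.sum()

N, d, tau = 5, 16, 0.25                      # O(1) temperature: the Hopfield setting
K = np.eye(N, d)                             # orthonormal stored patterns (keys)
V = np.arange(N * d, dtype=float).reshape(N, d)   # values attached to the keys

rng = np.random.default_rng(0)
q = K[2] + 0.1 * rng.normal(size=d)          # noisy query near stored pattern 2

w = softmax(K @ q / tau)                     # one step of -∇_q E for the LSE energy
y = w @ V                                    # = Σ_i softmax(Kq/τ)_i v_i: attention output
assert np.argmax(w) == 2                     # log-sum-exp is dominated by the best match
```

The weights `w` are exactly the attention distribution; sharpening $\tau$ concentrates them on the matched pattern, which is the mechanism behind the exponential capacity.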
A striking pattern: the form of the cost determines the capacity. Linear → $O(1)$/step. Quadratic (MSE) → $O(d)$. Log-sum-exp → $e^{O(d)}$. The same cost forms appear on all three axes.
How to write: the optimization algorithm
The cost determines what to optimize; the optimization algorithm determines how. Rewriting the delta rule as $S_t = S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$ with $\beta_t = 1/\|k_t\|^2$ reveals that the transition $(I - \beta_t k_t k_t^\top)$ acts on the key space by right-multiplication. With unit-norm keys ($\beta_t = 1$), this is an orthogonal projection onto $k_t^\perp$ that removes the component in direction $k_t$. (With $\beta_t = 2/\|k_t\|^2$ it would be a Householder reflection.) Unrolling the recurrence, each stored key is transformed by an accumulated product of projections — an online Gram-Schmidt process that orthogonalizes the effective keys, which is why the delta rule achieves higher capacity than additive writes.
When iterated over a fixed set of key-value pairs (multiple epochs on the data axis), the delta rule converges to the minimum-norm interpolating solution $S^* = V^\top (K^\top)^+$ — this is the Kaczmarz method [28]. On the sequence axis, a single pass only approximates it.
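The multi-epoch claim can be checked numerically (a sketch using the row-stacked shapes from the Notation section, where the minimum-norm solution is $V^\top (K^\top)^+$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, dk, dv = 4, 6, 3                  # underdetermined: fewer pairs than key dims
K = rng.normal(size=(N, dk))         # keys as rows
V = rng.normal(size=(N, dv))         # values as rows

S = np.zeros((dv, dk))
for _ in range(2000):                # repeated epochs over the same fixed pairs
    for t in range(N):
        k, v = K[t], V[t]
        S += np.outer(v - S @ k, k) / (k @ k)   # normalized delta rule = Kaczmarz step

assert np.allclose(S @ K.T, V.T, atol=1e-6)                  # interpolates every pair
assert np.allclose(S, V.T @ np.linalg.pinv(K.T), atol=1e-6)  # min-norm pseudoinverse solution
```

Starting from $S = 0$, every update adds multiples of key rows, so the iterate stays in the row space of $K$ and the consistent system converges to the minimum-norm solution.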
The algorithm sits in a hierarchy: one-step (Hebbian) → GD (delta rule) → GD+momentum (Titans) → Newton (Longhorn / natural gradient) → FTRL (full attention / batch optimization). The same hierarchy appears on all axes.
Retention and forgetting
The retention operator $A_t$ controls how old information decays. Afzal [5] established a fundamental duality: forget gates in linear recurrences are positional encodings in softmax transformers. The cumulative product $P_{j \to t} = \prod_{s=j+1}^{t} A_s$ defines a distance between positions, $d(j,t) = -\log\|P_{j \to t}\| = \sum_{s=j+1}^t D_s$, where $D_s = -\log\|A_s\|$ is the per-step decay. This distance satisfies non-negativity ($d(j,t) \geq 0$), identity ($d(t,t) = 0$), and the chain rule $d(j,t) = d(j,s) + d(s,t)$ for $j < s < t$ — a strengthened triangle inequality. It is in general asymmetric ($d(j,t) \neq d(t,j)$), appropriate for causal modeling. The monoid of transition matrices determines the geometry:
| Transition monoid | Recurrence | Positional encoding |
|---|---|---|
| $\{I\}$ (trivial) | Linear attention | NoPE |
| $\mathbb{R}_{>0}$ (scalars) | RetNet | ALiBi |
| Diagonal matrices | Mamba2 / GLA | FoX [16] |
| $\mathrm{SO}(2)$ (rotations) | Complex-valued gates (Mamba-3 [18]) | RoPE |
| Householder reflections | DeltaNet | PaTH [17] |
The monoid structure induces a classification of PEs into additive and multiplicative forms. Scalar and diagonal monoids act on attention logits as additive biases: $\mathrm{Att}_{t,j} = q_t^\top k_j + d(j,t)$ (ALiBi, FoX). Rotation and Householder monoids act multiplicatively on keys: $\mathrm{Att}_{t,j} = q_t^\top P_{j \to t}\, k_j$ (RoPE, PaTH). Commutative monoids with $\|A_s\| < 1$ contribute additively in the log domain; orthogonal monoids ($\|A_s\| = 1$) preserve norms and act multiplicatively on the key-query inner product.
A crucial distinction: data-independent transitions (RoPE's rotation, RetNet's scalar decay) preserve the Gram matrix of stored keys — they encode position without changing correlation structure. Data-dependent transitions (DeltaNet's Householder, Mamba's diagonal gate) actively reshape key geometry — DeltaNet orthogonalizes keys, Mamba selectively attenuates dimensions. This determines whether the monoid serves as a positional encoding (data-independent) or active memory management (data-dependent).
Capacity and the SNR
When retrieving from $S = \sum_i v_i k_i^\top$ with query $q = k_*$ (the target pattern), the output decomposes into signal $(k_*^\top q)\, v_*$ plus interference from the other stored associations. For $N$ random patterns of dimension $d$, the retrieval signal-to-noise ratio is $\mathrm{SNR}_{\text{linear}} = d/N$. The memory saturates at $N_{\max} \sim d$ patterns.
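The $d/N$ scaling is easy to reproduce empirically (a rough sketch; the dimensions and trial count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, trials = 64, 16, 300           # predicted SNR ≈ d/N = 4
snrs = []
for _ in range(trials):
    K = rng.normal(size=(N, d)) / np.sqrt(d)   # ~unit-norm random keys
    V = rng.normal(size=(N, d))
    S = V.T @ K                                # Hebbian memory Σ v_i k_i^T
    q = K[0]                                   # query with a stored key
    signal = (K[0] @ q) * V[0]                 # contribution of the target pair
    noise = S @ q - signal                     # cross-talk from the other N-1 pairs
    snrs.append(np.sum(signal**2) / np.sum(noise**2))

# The empirical median should be the same order as d/N = 4:
assert 1.0 < np.median(snrs) < 16.0
```

Doubling $N$ roughly halves the measured SNR, matching the saturation at $N_{\max} \sim d$.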
Softmax attention SNR. For softmax attention with temperature $\tau$, querying with $q = k_*$ gives weight ratio $w_j / w_* = \exp(-\Delta_j / \tau)$ where $\Delta_j = \|k_*\|^2 - k_*^\top k_j$ is the logit gap. The SNR becomes:
$$\mathrm{SNR}_{\text{softmax}} \gtrsim \frac{d_v \exp\!\bigl(2(d - c\sqrt{d \log N})/\tau\bigr)}{N}$$

The temperature $\tau$ is critical. Standard transformers use $\tau = \sqrt{d_k}$; the Hopfield setting [6] uses $\tau = O(1)$:
| Setting | Temperature $\tau$ | Capacity $N_{\max}$ | SNR amplification |
|---|---|---|---|
| Linear attention | $\tau \to \infty$ (flat kernel) | $O(d_k)$ | $\times 1$ |
| Standard transformer | $\sqrt{d_k}$ | $\exp(O(\sqrt{d_k}))$ | $\times \exp(2\sqrt{d_k})$ |
| Unscaled / Hopfield | $O(1)$ | $\exp(O(d_k))$ | $\times \exp(2d_k)$ |
The often-cited $N_{\max} \sim \exp(O(d))$ assumes $\tau = O(1)$. Standard transformers achieve $\exp(O(\sqrt{d_k}))$ — still super-polynomial, but a different exponent. In practice, even $\exp(\sqrt{d_k})$ is enormous: for $d_k = 128$, $\exp(2\sqrt{128}) \approx 7 \times 10^9$, far exceeding any practical sequence length. The capacity of softmax attention is never the bottleneck; the quadratic compute cost hits first.
The $\text{SNR} = d/N$ bound assumes random keys. When keys are correlated, the pseudoinverse solution $S^* = V^\top (K^\top)^+$ amplifies noise by $1/\sigma_{\min}(K)$. For nearly collinear keys, regularization (retention $\alpha < 1$) is more appropriate than exact orthogonalization.
The kernel choice determines the capacity-computation tradeoff. This tradeoff is universal [26]: fixed-size state (SSM, capacity $O(d)$) vs. growing token cache (Transformer, capacity $\exp(O(d/\tau))$).
Log-linear attention [21] provides a principled interpolation. Guo et al. observe that all efficient attention variants share the form $\mathbf{O} = (\mathbf{A} \odot \mathbf{M})\,\mathbf{V}$, where $\mathbf{A}$ is the attention-like matrix and $\mathbf{M}$ is a causal mask whose structure determines scaling: lower-triangular (linear attention, $O(1)$ decode), semiseparable (Mamba-2, $O(1)$ decode), or hierarchical (log-linear, $O(\log N)$ decode). The key idea is to replace the fixed-size state with $L = O(\log N)$ states $S_t^{(\ell)} \in \mathbb{R}^{d_v \times d_k}$, one per level of a Fenwick tree over the prefix $[0, t]$. The output is:
$$o_t = \sum_{\ell=0}^{L-1} \lambda_t^{(\ell)}\, q_t^\top S_t^{(\ell)}$$where $\lambda_t^{(\ell)}$ are data-dependent weights for different temporal scales. When all $\lambda_t^{(\ell)}$ are equal, this reduces to ordinary linear attention. Any linear attention variant can be "lifted" to a log-linear variant by composing its mask with the hierarchical mask — Guo et al. demonstrate this for Mamba-2 and Gated DeltaNet.
Duality: recurrent and attention views
Attention (store everything). Keep all key-value pairs explicitly: $y = \sum_j w(q, k_j)\, v_j$. Lossless but costs $O(Td)$ per query.
State update (compress). Maintain a fixed-size matrix $S$ via rank-one writes. Retrieval is $O(d^2)$ regardless of stored items, but compression is lossy.
Even DeltaNet's nonlinear (delta-rule) update admits a parallel form. Yang et al. [2] showed that the recurrence $S_t = S_{t-1}(I - k_t k_t^\top) + v_t k_t^\top$ can be rewritten as $\mathbf{O} = \mathbf{A}\,\mathbf{V}$ with $\mathbf{A} = (QK^\top \odot L)(I + KK^\top \odot (L - I))^{-1} \odot \mathbf{M}$, where $L$ is the lower-triangular matrix of ones. The term $(I + KK^\top \odot (L - I))^{-1}$ captures the error-correction: it inverts out key correlations that a Hebbian write would ignore.
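The equivalence of the two views can be verified numerically (a sketch with unit-norm keys, $\beta_t = 1$; the trailing $\odot\,\mathbf{M}$ is automatic here because both factors are lower-triangular, so their product already is):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
K /= np.linalg.norm(K, axis=1, keepdims=True)   # unit-norm keys (beta_t = 1)
V = rng.normal(size=(T, d))

# Recurrent view: S_t = S_{t-1}(I - k_t k_t^T) + v_t k_t^T, o_t = S_t q_t
S = np.zeros((d, d))
O_rec = np.zeros((T, d))
for t in range(T):
    k, v = K[t], V[t]
    S = S @ (np.eye(d) - np.outer(k, k)) + np.outer(v, k)
    O_rec[t] = S @ Q[t]

# Parallel (attention) view: O = (QK^T ⊙ L)(I + KK^T ⊙ (L - I))^{-1} V
L_mask = np.tril(np.ones((T, T)))        # lower-triangular ones, incl. diagonal
strict = np.tril(np.ones((T, T)), -1)    # L - I: strictly lower part
A = ((Q @ K.T) * L_mask) @ np.linalg.inv(np.eye(T) + (K @ K.T) * strict)
O_par = A @ V

assert np.allclose(O_rec, O_par)
```

The inverse factor computes the "pseudo-values" $u_t = v_t - \sum_{i<t} (k_i^\top k_t)\, u_i$, which is exactly the accumulated error-correction of the recurrence.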
Dao and Gu [4] showed that the input-output map of a linear recurrence is a semiseparable matrix, and the recurrent and attention views are its two factorizations. The matrix mixer framework [27] generalizes this: all sequence models are $Y = MX$ where $M$ is a structured matrix. This duality extends to all three axes: Zhang [10] proved an exact depth-axis duality; the data-axis duality (online SGD vs. batch gradient) is a structural analogy.
A chunk (or mini-batch) interpolates: within a chunk, use attention; between chunks, compress into state. Mini-batch SGD is the data-axis analogue. On both axes, the optimal chunk size matches the memory capacity.
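For plain linear attention (no forgetting), the chunked computation reproduces the recurrence exactly; a sketch with chunk size 4:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, C = 8, 4, 4                    # sequence length, dim, chunk size
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

# Recurrent view: S_t = S_{t-1} + v_t k_t^T, o_t = S_t q_t
S = np.zeros((d, d))
O_rec = np.zeros((T, d))
for t in range(T):
    S += np.outer(V[t], K[t])
    O_rec[t] = S @ Q[t]

# Chunked view: attention within a chunk, compressed state between chunks
O_chk = np.zeros((T, d))
S = np.zeros((d, d))
for s in range(0, T, C):
    q, k, v = Q[s:s+C], K[s:s+C], V[s:s+C]
    intra = np.tril(q @ k.T) @ v     # causal attention inside the chunk
    inter = q @ S.T                  # read from the compressed past
    O_chk[s:s+C] = inter + intra
    S += v.T @ k                     # fold the finished chunk into the state

assert np.allclose(O_rec, O_chk)
```

This is the standard chunkwise-parallel trick: attention-like compute at $O(C^2)$ per chunk, state-like memory at $O(d^2)$ between chunks.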
Multi-head: block-diagonal memory
Multi-head attention splits the $d$-dimensional space into $H$ independent heads of dimension $d_h = d/H$, giving a block-diagonal memory matrix. This transfers across axes: hyper-connections [9] are multi-head on the depth axis; dense MoE is multi-head on the data axis. GQA uses fewer KV heads than query heads — fewer memory blocks with more diverse reads.
The general framework
The examples above — from additive writes to softmax attention — share a common structure recently identified by Behrouz et al. [12]: each model corresponds to a different instantiation of a regularized optimization problem. We now generalize each dimension of the memory system — cost, algorithm, retention, normalization, capacity — to all three axes.
The regularized cost
The most general state update along any axis can be written as:
$$W_t = \arg\min_W \; \ell_t(W; k_t, v_t) + \frac{1}{\eta_t}\, D_h(W, W_{t-1})$$where $\ell_t$ is the per-step cost, $\eta_t$ is the learning rate, and $D_h$ is a Bregman divergence with generating function $h$:
$$D_h(W, W') = h(W) - h(W') - \langle \nabla h(W'),\, W - W' \rangle$$The Bregman term penalizes deviation from the previous state — it is the retention mechanism. To connect this cost to concrete state updates, we derive the update rules by solving or approximating the optimization problem.
Euclidean case. When $h(W) = \frac{1}{2}\|W\|_F^2$, the Bregman divergence is $D_h = \frac{1}{2}\|W - W'\|_F^2$. Setting the gradient of the cost to zero gives the implicit (proximal) update:
$$W_t = W_{t-1} - \eta_t\, \nabla \ell_t(W_t)$$This is implicit because $W_t$ appears on both sides. The explicit (one-step gradient) approximation replaces $\nabla \ell_t(W_t)$ by $\nabla \ell_t(W_{t-1})$, recovering the state update $W_t \approx W_{t-1} - \eta_t\, \nabla \ell_t(W_{t-1}) = A_t \circ W_{t-1} + \psi(v_t, k_t)$. On the data axis this is SGD; on the sequence axis it is DeltaNet’s delta rule.
For the quadratic cost $\ell_t(W) = \frac{1}{2}\|v_t - Wk_t\|^2$, the implicit equation can be solved in closed form:
$$W_t = W_{t-1} + \frac{\eta_t}{1 + \eta_t \|k_t\|^2}(v_t - W_{t-1} k_t)\, k_t^\top$$

This is Longhorn: the exact minimizer of the per-step regularized cost. When $\|k_t\| = 1$ and $\eta_t \to \infty$, it reduces to the normalized delta rule (NLMS on the data axis). So DeltaNet (explicit GD) and Longhorn (exact/Newton) are two approximation levels of the same cost.
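A quick numerical check that the closed form is the exact minimizer: the gradient of the regularized objective vanishes at $W_t$ (a sketch with arbitrary $\eta_t$ and an unnormalized key):

```python
import numpy as np

rng = np.random.default_rng(2)
dk, dv, eta = 3, 2, 0.7
W_prev = rng.normal(size=(dv, dk))
k = rng.normal(size=dk)
v = rng.normal(size=dv)

# Closed-form proximal (Longhorn-style) step
W = W_prev + eta / (1 + eta * (k @ k)) * np.outer(v - W_prev @ k, k)

# Stationarity of  (1/2)||v - W k||^2 + (1/(2*eta))||W - W_prev||_F^2:
grad = -np.outer(v - W @ k, k) + (W - W_prev) / eta
assert np.allclose(grad, 0)
```

The explicit (DeltaNet) step instead evaluates the loss gradient at `W_prev`, which leaves a residual gradient at the new point; the implicit step drives it exactly to zero.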
General case (mirror descent). For general $h$, the explicit approximation gives the mirror descent update:
$$W_t = \nabla h^*\!\bigl(\nabla h(W_{t-1}) - \eta_t\, \nabla \ell_t(W_{t-1})\bigr)$$where $\nabla h^*$ is the link function mapping from dual to primal space: identity for Euclidean $h$ (standard GD), softmax for entropic $h$ (Memora), polar decomposition for Stiefel-constrained Euclidean (Muon).
Constrained dual (Frank-Wolfe). By Lagrangian duality, the regularized cost is equivalent to a hard constraint: minimize $\ell_t(W)$ subject to $W \in \mathcal{C}$, where the constraint radius is dual to $1/\eta_t$. The Frank-Wolfe (conditional gradient) algorithm solves this via:
$$s_t = \arg\min_{S \in \mathcal{C}} \langle \nabla \ell_t(W_{t-1}),\, S \rangle, \qquad W_t = (1 - \gamma_t)\, W_{t-1} + \gamma_t\, s_t$$

This is a state update with scalar retention $(1-\gamma_t)$ and a write term $\gamma_t s_t$ where the linear minimization oracle (LMO) returns a point on the boundary of $\mathcal{C}$. The LMO plays the same structural role as $\nabla h^*$: both map a gradient signal to a feasible update. The most general formulation subsumes both:
$$W_t = \arg\min_{W \in \mathcal{C}} \left\{\ell_t(W) + \frac{1}{\eta_t}\, D_h(W, W_{t-1})\right\}$$where mirror descent is the unconstrained case ($\mathcal{C} = \mathbb{R}^{d \times d}$) and Frank-Wolfe operates when the hard constraint dominates the Bregman penalty. The framework’s pair $(h, \mathcal{M}_{\text{norm}})$ parameterizes both: $h$ determines the Bregman geometry, $\mathcal{M}_{\text{norm}}$ determines the constraint set.
The cost hierarchy
The cost forms introduced on the sequence axis generalize to all axes. The same cost applied to different axes gives corresponding systems:
| Cost form | Sequence axis | Data axis | Capacity |
|---|---|---|---|
| Linear (alignment) | Linear attention | Hebbian | $O(1)$/step |
| $\ell_p$ norm ($p \geq 1$) | Moneta [12] | $\ell_p$ regression | $O(d)$ |
| Quadratic (MSE) | DeltaNet | Widrow–Hoff / SGD | $O(d)$ |
| Huber | Yaad [12] | Robust regression | $O(d)$ |
| Log-sum-exp | Softmax attention | Cross-entropy | $e^{O(d)}$ |
The log-sum-exp row is key: softmax attention and cross-entropy training both arise from the same cost form. Both achieve exponential capacity through the same mechanism — the exponential nonlinearity amplifies the gap between matched and mismatched patterns.
The write-strategy hierarchy
The algorithm hierarchy introduced on the sequence axis generalizes across all axes:
| Method | Algorithm | Sequence axis | Data axis |
|---|---|---|---|
| Additive | 1 step | Linear attention | Hebbian |
| Error-corrective | GD | DeltaNet / Gated DeltaNet [15] | Widrow–Hoff / SGD |
| + momentum | GD+mom. | Titans [12] | Adam |
| Stiefel proj. | GD | — | Muon |
| Full curvature | Newton | Longhorn | Natural gradient |
| Batch optimal | FTRL | Softmax attention | Pseudoinverse |
GD with momentum on the sequence axis gives Titans [12], which maintains two state matrices (memory + momentum) — exactly as Adam maintains $(m_t, v_t)$ on the data axis. FTRL recovers full attention on the sequence axis and batch optimization on the data axis. Bernstein and Newhouse [23] show that Adam, Shampoo, and Muon are all steepest descent under different norms: Adam under the diagonal Mahalanobis norm, Lion [Lion] under $\ell_\infty$ (giving sign of momentum), and Muon under the spectral norm (giving polar decomposition). Each norm corresponds to a different generating function $h$ and hence a different normalization manifold. LUCID [24] applies preconditioning to the sequence-axis readout, achieving gains analogous to natural gradient on the data axis.
The choice of $h$ shapes not just convergence geometry but the implicit bias of the solution found: Muon’s Stiefel projection removes the simplicity bias (preference for low-rank solutions) that SGD and Adam preserve [D&R 2025].
The constrained dual introduced above is already established on the data axis [LM 2026]: Lion and Muon are stochastic Frank-Wolfe on norm-ball constraints — $\ell_\infty$ for Lion, spectral for Muon — with weight decay $\lambda$ setting the constraint radius $1/\lambda$ and providing scalar retention $(1-\lambda\eta)$. On the sequence axis, the same construction applies. For the quadratic cost, the gradient $g_t = -(v_t - S_{t-1}k_t)\,k_t^\top$ is rank-1. The LMO over a spectral- or nuclear-norm ball returns a normalized rank-1 matrix, and the Frank-Wolfe update becomes:
$$S_t = (1-\gamma_t)\, S_{t-1} + \gamma_t R \cdot \hat{e}_t\, \hat{k}_t^\top$$a normalized delta rule with scalar forgetting: it writes a normalized error-key outer product while uniformly decaying the entire memory. This combines error correction (like DeltaNet), scalar forgetting (like weight decay), and normalized writes (like Muon) — a combination not present in any current sequence model. The key structural difference from mirror descent is the forgetting mechanism: Frank-Wolfe always gives scalar retention $(1-\gamma_t)$, while mirror descent inherits richer structure from the cost — rank-1 for DeltaNet, diagonal for Mamba.
Recent work also explores meta-modifications that sit on top of any base optimizer. Cautious optimizers [Liang 2024] mask update components whose sign disagrees with the current gradient — an input-dependent gate on the write operation, the data-axis analogue of input-dependent forget gates (Mamba, FoX) on the sequence axis. Where Mamba gates retention, cautious optimizers gate the write: both suppress components that the current input does not support.
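The cautious masking itself is one line (a sketch of the sign-agreement gate, not any particular optimizer's implementation):

```python
import numpy as np

def cautious(update, grad):
    """Zero components of the update whose sign disagrees with the gradient."""
    return np.where(update * grad > 0, update, 0.0)

# After a sharp turn in the loss landscape, momentum can point against the gradient:
m = np.array([0.5, -0.3, 0.2])        # momentum-based update direction
g = np.array([1.0,  0.4, 0.1])        # current gradient
masked = cautious(m, g)               # the disagreeing second component is zeroed
assert np.allclose(masked, [0.5, 0.0, 0.2])
```

Applied on the sequence axis, the same gate would suppress write components that the current token does not support.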
The classical associative memory community independently discovered the same progression — from Hebbian to error-corrective (Storkey [11]) to exponential-kernel (modern Hopfield [6]) — before it appeared on the sequence axis. The capacity jumps at each step are the same: $O(d/\log d) \to O(d) \to e^{O(d)}$.
Retention, normalization, and the generating function $h$
The monoid structure captures the algebraic dimension of retention. The Bregman framework adds a geometric dimension: the generating function $h$ simultaneously determines how the state forgets and how it is normalized.
| $h(W)$ | Retention | Normalization | Manifold | Axis |
|---|---|---|---|---|
| $\frac{1}{2}\|W\|_F^2$ | Scalar forget gate | RMSNorm | Sphere | Depth |
| $\sum w \log w$ | Entropic (KL) | Softmax | Simplex | Sequence |
| $\frac{1}{2}\|W\|_F^2$ (Euclidean) | Weight decay | Polar decomp. | Stiefel | Data |
The solution is computed via mirror descent: $W_t = \nabla h^*(\nabla h(W_{t-1}) - \eta_t \nabla \ell_t)$, where $\nabla h^*$ is the link function. For entropic $h$, $\nabla h^* = \mathrm{softmax}$, so the update is softmax normalization — this is exactly Memora [12]. For quadratic $h$, mirror descent reduces to standard GD, and projection onto the sphere gives RMSNorm. For Euclidean $h$ with a Stiefel manifold constraint, the projection is Muon's polar decomposition — same $h$ as RMSNorm, different manifold.
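For the entropic case, the mirror descent step is the familiar multiplicative-weights update; a minimal sketch on the simplex (the gradient and step size are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilized softmax
    e = np.exp(z)
    return e / e.sum()

# Entropic h(w) = Σ w log w:  ∇h(w) = 1 + log w,  ∇h* = softmax (on the simplex)
w = np.full(4, 0.25)                       # previous state, on the simplex
g = np.array([1.0, -1.0, 0.5, 0.0])        # gradient of the per-step cost
eta = 0.5

w_next = softmax(np.log(w) - eta * g)      # mirror descent: dual step, then ∇h*
# Equivalent multiplicative-weights form: w * exp(-eta*g), renormalized
mw = w * np.exp(-eta * g)
assert np.allclose(w_next, mw / mw.sum())
```

The normalization is not a bolted-on post-processing step: it is the link function $\nabla h^*$ that the entropic geometry forces.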
The monoid and $h$ are complementary: the monoid determines the algebraic structure of the transitions (scalar, diagonal, orthogonal), while $h$ determines the geometry of the retention (Euclidean, entropic). Together they fully specify how the memory forgets.
Key geometry and the chain from cost to capacity
The write strategy and retention jointly shape the effective Gram matrix $G_{ij} = \langle \tilde{k}_i, \tilde{k}_j \rangle$ of stored keys, which determines retrieval capacity. The optimal key geometry depends on the task:
Lookup (independent associations): orthogonal keys give zero cross-talk, maximum capacity $N_{\max} = d_k$. The delta rule drives $G \to I$ via online Gram-Schmidt.
Regression (correlated keys, smooth values): orthogonalization is counterproductive — it destroys the correlation structure needed for generalization. The right approach is regularized (ridge) regression: $S^* = V^\top K(K^\top K + \lambda I)^{-1}$.
Hierarchical keys (group structure): block-diagonal Gram matrix — correlated within groups, orthogonal across. This is exactly multi-head attention. On the data axis: weight decay is the retention, and block-diagonal structure is MoE.
The framework yields a complete chain from cost to capacity:
$$\mathcal{L} \;\xrightarrow{\;\nabla\;}\; \text{update rule} \;\xrightarrow{\;h\;}\; \text{retention + normalization} \;\xrightarrow{\;\nabla^2\;}\; \text{kernel} \;\xrightarrow{\;\text{capacity}\;}\; \text{SNR}$$

The full specification of a memory system is the tuple $(\mathcal{L}, h, \mathcal{M}, \mathcal{H})$: cost, generating function, monoid, and memory space. From these, the write rule, retention, normalization, and read operation all follow.
Taxonomy: models as tuples
The tables below classify existing models on each axis as instantiations of the $(\mathcal{L}, h, \mathcal{M})$ tuple, extending the taxonomies of Yang et al. [2] and Behrouz et al. [12] with the generating function and monoid columns.
Sequence axis. Models above the line use matrix memory ($S \in \mathbb{R}^{d_v \times d_k}$); those below use MLP memory.
| Model | Cost $\mathcal{L}$ | $h$ | Monoid $\mathcal{M}$ | Alg. |
|---|---|---|---|---|
| Linear Attn | Alignment | Eucl. | $\{I\}$ | 1-step |
| RetNet | Dot-prod. | Eucl. | Scalar | GD |
| Mamba / Mamba2 | Dot-prod. | Eucl. | Diagonal | GD |
| GLA | Dot-prod. | Eucl. | Diagonal | GD |
| Mamba-3 | Dot-prod. | Eucl. | SO(2) | GD |
| DeltaNet | MSE | Eucl. | Householder | GD |
| Gated DeltaNet | MSE | Eucl. | HH × scalar | GD |
| RWKV-7 | MSE | Eucl. | Diag × HH | GD |
| Longhorn | MSE | Eucl. | — | Newton |
| Moneta | $\ell_p$ | $\ell_q$ | — | GD |
| Yaad | Huber | Eucl. | — | GD |
| Memora | MSE | Entropic | — | GD |
| FoX | Log-sum-exp | — | Diagonal | FTRL |
| Softmax Attn | Log-sum-exp | — | — | FTRL |
| TTT-Linear | MSE | Eucl. | — | GD |
| TTT-MLP | MSE | Eucl. | — | GD |
| Titans | MSE | Eucl. | Scalar | GD+mom. |
Depth axis. No model on the depth axis has an independent per-step cost.
| Model | $h$ (Norm.) | Monoid $\mathcal{M}$ | Memory | Update rule |
|---|---|---|---|---|
| Residual connection | — | $\{I\}$ | $\mathbb{R}^d$ | $x + f(x)$ |
| Pre-Norm | Eucl. (RMSNorm) | $\{I\}$ | $\mathbb{R}^d$ | $x + f(\text{Norm}(x))$ |
| Hyper-connections | Eucl. | $\text{GL}(n)$ | $\mathbb{R}^{n \times d}$ | Learnable mixing |
| mHC (Sinkhorn) | Entropic | Birkhoff $\mathcal{B}_n$ | $\mathbb{R}^{n \times d}$ | Sinkhorn-projected mixing |
| JPmHC (Cayley) | Eucl. (Stiefel) | $O(n)$ | $\mathbb{R}^{n \times d}$ | Cayley-projected mixing |
| Res. Matrix Transf. | Eucl. | Learned | $\mathbb{R}^{p \times q}$ | $AM + vk^\top$ |
| Attention Residuals | Entropic (softmax) | — | $\mathbb{R}^{p \times q}$ | $\sum_\ell w_\ell x^{(\ell)}$ |
Data axis. Optimizers as instantiations of the regularized cost.
| Optimizer | Cost $\mathcal{L}$ | $h$ (Norm.) | Monoid $\mathcal{M}$ | Alg. |
|---|---|---|---|---|
| Hebbian | Dot-prod. | Eucl. | $\{I\}$ | 1-step |
| SGD | MSE | Eucl. | Scalar | GD |
| SGD + wd | MSE | Eucl. ($L^2$ ball) | Scalar | GD |
| Adam | Any | Eucl. (diag.) | Scalar | GD+mom. |
| Muon | Any | Eucl. (Stiefel) | — | GD |
| Lion | Any | Sign ($\ell_\infty$) | Scalar | GD+mom. |
| Shampoo / SOAP | Any | Kronecker | — | Newton |
| Kron (PSGD) | Any | Kronecker | — | Newton |
| Natural gradient | Any | Fisher | — | Newton |
| CE + nat. grad | Log-sum-exp | Fisher | — | Newton |
| Batch GD / ERM | Any | Eucl. | — | FTRL |
The key insight from these tables: models with the same cost and algorithm but different $h$ (e.g., DeltaNet vs. Memora — both MSE/GD, but Euclidean vs. entropic $h$) produce qualitatively different retention and normalization behavior. The $h$ column reveals structure invisible in the write rule alone.
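The difference is easy to see numerically. Below is a minimal NumPy sketch (toy dimensions and variable names of my choosing) contrasting a Euclidean gradient write with an entropic mirror-descent write on the same squared-error cost — the DeltaNet-style vs. Memora-style rows above. The entropic $h$ keeps each state row on the probability simplex for free; the Euclidean step does not.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, eta = 4, 3, 0.5

# State whose rows live on the probability simplex (entropic geometry)
S = np.abs(rng.normal(size=(d_v, d_k))) + 1e-3
S /= S.sum(axis=1, keepdims=True)
k = rng.normal(size=d_k); k /= np.linalg.norm(k)
v = rng.normal(size=d_v)

grad = np.outer(S @ k - v, k)        # grad of 0.5*||v - S k||^2 = (S k - v) k^T, rank one

# Euclidean h: plain gradient step (DeltaNet-style write)
S_eucl = S - eta * grad

# Entropic h: exponentiated gradient, i.e. softmax re-normalization (Memora-style)
logits = np.log(S) - eta * grad
S_ent = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# The generating function h determines normalization: the entropic write
# returns rows to the simplex automatically; the Euclidean write does not.
assert np.allclose(S_ent.sum(axis=1), 1.0)
assert S_ent.min() > 0.0
```

Same cost, same algorithm (one gradient step), different $h$ — and hence different retention and normalization behavior, as the tables indicate.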
Costs and update equations
The taxonomy tables above classify models by which design choices they make. The tables below spell out the exact cost functions and update equations, making the cross-axis correspondences precise. (See the paper for the full list of 20+ models.)
Sequence axis (selected models). State $S_t \in \mathbb{R}^{d_v \times d_k}$; keys $k_t$, values $v_t$, queries $q_t$. All read-outs are $o_t = S_t q_t$ unless noted.
| Model | Per-step cost $\ell_t$ | State update $S_t$ |
|---|---|---|
| Linear Attn | $-\langle v_t, S k_t \rangle$ | $S_{t-1} + v_t k_t^\top$ |
| RetNet | $-\langle v_t, S k_t \rangle$ | $\gamma S_{t-1} + v_t k_t^\top$ |
| Mamba / GLA | $-\langle v_t, S k_t \rangle$ | $S_{t-1} \text{Diag}(\alpha_t) + v_t k_t^\top$ |
| DeltaNet | $\frac{1}{2}\|v_t - Sk_t\|^2$ | $S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$ |
| Gated DeltaNet | $\frac{1}{2}\|v_t - Sk_t\|^2$ | $\alpha_t S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$ |
| Longhorn | $\frac{1}{2}\|v_t - Sk_t\|^2$ | $S_{t-1}(I - \frac{\beta_t k_t k_t^\top}{1 + \beta_t \|k_t\|^2}) + \frac{\beta_t v_t k_t^\top}{1 + \beta_t \|k_t\|^2}$ |
| Memora | $\frac{1}{2}\|v_t - Sk_t\|^2$ | $\text{softmax}(\log S_{t-1} - \eta(S_{t-1}k_t - v_t) k_t^\top)$ |
| Softmax Attn | $-\log\sum_j e^{k_j^\top q}$ | All $(k_j, v_j)$ stored |
| Titans | $\frac{1}{2}\|v_t - Mk_t\|^2$ | $M_t = \alpha M_{t-1} - \eta \mathcal{S}_t$; $\mathcal{S}_t = \beta \mathcal{S}_{t-1} + (1-\beta) \nabla_M \ell_t$ |
Data axis (optimizers). Weights $W_n$; input $x_n$, target $y_n$; gradient $g_n = \nabla_W \ell_n$.
| Optimizer | Per-step cost $\ell_n$ | Update $W_n$ | Seq. analogue |
|---|---|---|---|
| Hebbian | $-\langle y_n, Wx_n \rangle$ | $W_{n-1} + y_n x_n^\top$ | Linear attention |
| SGD | $\frac{1}{2}\|y_n - Wx_n\|^2$ | $W_{n-1} + \eta(y_n - W_{n-1}x_n) x_n^\top$ | DeltaNet |
| SGD + wd | $\frac{1}{2}\|y - Wx\|^2 + \frac{\lambda}{2}\|W\|_F^2$ | $(1 - \lambda\eta) W_{n-1} - \eta \nabla\ell_n$ | Gated DeltaNet |
| Adam | any $\ell_n$ | $m_n = \beta_1 m_{n-1} + (1-\beta_1) g_n$; $v_n = \beta_2 v_{n-1} + (1-\beta_2) g_n^2$; $W_n = W_{n-1} - \eta \hat{m}_n / \sqrt{\hat{v}_n}$ | Titans |
| Muon | any $\ell_n$ | $m_n = \beta m_{n-1} + (1-\beta) g_n$; $W_n = W_{n-1} - \eta\, \text{polar}(m_n)$ | — |
| Lion | any $\ell_n$ | $c_n = \beta_1 m_{n-1} + (1-\beta_1) g_n$; $W_n = (1-\lambda\eta)W_{n-1} - \eta\, \text{sign}(c_n)$; $m_n = \beta_2 m_{n-1} + (1-\beta_2) g_n$ | — |
| Natural grad. | any $\ell_n$ | $W_n = W_{n-1} - \eta F_n^{-1} g_n$ | Longhorn |
| CE + softmax | $-\log\frac{e^{w_{y_n}^\top x_n}}{\sum_c e^{w_c^\top x_n}}$ | $W_n = W_{n-1} - \eta (p_n - e_{y_n}) x_n^\top$ | Softmax attn |
Reading across the last column makes the correspondences concrete: SGD is DeltaNet on the data axis; Adam corresponds to Titans (decayed state plus momentum); natural gradient is the Newton-step analogue of Longhorn. The cost form determines the update shape, and the same cost produces the same algorithmic structure on both axes.
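The SGD ↔ DeltaNet row can be checked in a few lines. A minimal NumPy sketch (toy dimensions of my choosing) verifying that the DeltaNet write from the table is algebraically identical to a Widrow–Hoff/SGD step on the squared loss:

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, d_v, beta = 4, 3, 0.7

S = rng.normal(size=(d_v, d_k))                    # memory matrix == weight matrix
k = rng.normal(size=d_k); k /= np.linalg.norm(k)   # key == input x
v = rng.normal(size=d_v)                           # value == target y

# DeltaNet write from the table: S (I - beta k k^T) + beta v k^T
S_deltanet = S @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)

# Widrow-Hoff / SGD step on 0.5*||v - S k||^2 with learning rate beta
S_sgd = S + beta * np.outer(v - S @ k, k)

# Expanding the DeltaNet write: S - beta (S k) k^T + beta v k^T — identical.
assert np.allclose(S_deltanet, S_sgd)
```

The identity holds exactly for any key, not just normalized ones: the transition-matrix form and the error-corrective form are two factorings of the same update.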
The depth axis
The depth axis has a fundamentally different character: it is an inner loop composed with the outer axes, and it has no independent cost function.
At inference, each token triggers a complete forward pass through all layers. The residual stream $x^{(\ell)}$ is ephemeral — it exists only during this traversal. The persistent state lives on the sequence axis (KV cache). At training, the persistent state lives on the data axis (weights). The depth axis is the shared computational substrate through which both outer axes operate.
Recent work has given the depth axis richer state-update structure. Hyper-connections [9] generalize residual connections with learnable mixing matrices — a full matrix-valued transition operator $A^{(\ell)} \in \mathbb{R}^{n \times n}$ on the depth axis, acting via matrix product on $n$ parallel streams (not element-wise as on the sequence axis).
The manifold constraint on this transition operator fits directly into the Bregman framework. Bistochastic (Sinkhorn) mixing projects $A^{(\ell)}$ onto the Birkhoff polytope via alternating Bregman projections with entropic $h$ — the same $h$ that gives softmax normalization on the sequence axis. Orthogonal (Cayley) mixing [29] projects onto $O(n)$ via Euclidean $h$ with a Stiefel manifold constraint — the same $(h, \mathcal{M})$ pair that gives Muon's polar decomposition on the data axis.
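As a concrete illustration, Sinkhorn's algorithm is exactly this pair of alternating entropic Bregman projections — normalize rows, then columns, and repeat. A minimal NumPy sketch (illustrative only, not the mHC implementation):

```python
import numpy as np

def sinkhorn(A, iters=200):
    """Project a positive matrix toward the Birkhoff polytope by
    alternately normalizing rows and columns — alternating Bregman
    projections under the entropic generating function h."""
    M = np.asarray(A, dtype=float)
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)   # row projection (row-stochastic)
        M = M / M.sum(axis=0, keepdims=True)   # column projection (col-stochastic)
    return M

rng = np.random.default_rng(2)
A = np.exp(rng.normal(size=(4, 4)))            # arbitrary positive mixing logits
P = sinkhorn(A)

assert np.allclose(P.sum(axis=1), 1.0, atol=1e-6)   # rows sum to 1
assert np.allclose(P.sum(axis=0), 1.0, atol=1e-6)   # columns sum to 1
```

Each normalization is a softmax-style rescaling — the same entropic projection that normalizes attention weights on the sequence axis, here applied to the depth-axis mixing operator.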
Residual matrix transformers [7] replace the residual stream with a matrix-valued state $M^{(\ell)} \in \mathbb{R}^{p \times q}$ updated via rank-one writes — making the depth axis structurally identical to the sequence axis. Attention Residuals [20] go further, replacing fixed residual accumulation with softmax attention over the depth axis, achieving ${\sim}1.25\times$ compute-equivalent improvement.
| Mode | Inner axis | Outer axis | Persistent state |
|---|---|---|---|
| Inference | Depth ($\ell$) | Sequence ($t$) | KV cache $S_t^{(\ell)}$ |
| Training | Depth ($\ell$) | Data ($n$) | Weights $W^{(\ell)}$ |
The data axis
The data axis is where the framework originates most naturally: optimization is the state update, and the training loss is the cost. What makes the data axis different is multiple epochs, the possibility of convergence, and the typically large scale of $N$.
Every gradient is an outer product
The correspondence between optimizers and sequence models is not merely an analogy — it is a structural identity. For any layer in a neural network, the gradient is an outer product:
$$\nabla_{W^{(\ell)}} \mathcal{L} = \underbrace{\delta^{(\ell)}}_{\text{value (backprop error)}} \underbrace{(z^{(\ell-1)})^\top}_{\text{key (input activation)}}$$where $\delta^{(\ell)}$ is the error backpropagated to layer $\ell$ and $z^{(\ell-1)}$ is the input activation. Every weight matrix is therefore literally an associative memory, updated by outer products of (backpropagated error, input activation) — exactly the $v_t k_t^\top$ write of the sequence axis, with $v = -\eta\,\delta$. After $N$ gradient steps, $W_N = W_0 - \sum_n \eta_n \delta_n (z_n)^\top$ is the initialization plus a sum of outer products. Feature learning means the keys evolve to improve storage and retrieval.
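A short NumPy check (toy two-layer network, dimensions of my choosing) confirms both claims: the analytic gradient of the output layer is the outer product $\delta z^\top$, it matches a finite-difference probe, and it has rank one:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_hid, d_out = 3, 4, 2
W1 = rng.normal(size=(d_hid, d_in))
W2 = rng.normal(size=(d_out, d_hid))
x = rng.normal(size=d_in)
y = rng.normal(size=d_out)

def loss(W2_):
    z = np.tanh(W1 @ x)                 # hidden activation (the "key")
    return 0.5 * np.sum((y - W2_ @ z) ** 2)

z = np.tanh(W1 @ x)
delta = W2 @ z - y                      # backprop error at the output (the "value")
grad_outer = np.outer(delta, z)         # grad_{W2} L = delta z^T — a rank-one write

# finite-difference probe of one entry agrees with the outer-product formula
eps = 1e-6
E = np.zeros_like(W2); E[1, 2] = eps
fd = (loss(W2 + E) - loss(W2 - E)) / (2 * eps)
assert abs(fd - grad_outer[1, 2]) < 1e-6
assert np.linalg.matrix_rank(grad_outer) == 1
```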
Learning rate schedules = positional encoding
SGD with weight decay has a forget gate $(1 - \lambda\eta_n)I$. Unrolling from initialization, the influence of gradient $g_j$ on the current weights decays as $\rho(j,N) = \prod_{s=j+1}^{N}(1-\lambda\eta_s)$. In the log domain, the data-positional distance is $d(j,N) \approx \lambda \int_j^N \eta(s)\,ds$.
The learning rate schedule $\eta(s)$ determines the metric geometry of training time:
| LR schedule | Data-axis geometry | Sequence-axis analogue |
|---|---|---|
| Constant $\eta_0$ | Uniform exponential decay | ALiBi / RetNet |
| Per-param weight decay | Diagonal adaptive decay | FoX / Mamba2 |
| Cosine annealing | Linear + sinusoidal | Sinusoidal PE |
| Curriculum learning | Input-dependent gating | Mamba / FoX |
With $\lambda = 0.1$ and $\eta_0 = 10^{-3}$, gradients have a half-life of about 7,000 steps. Adam's bias correction $\hat{m}_n = m_n / (1 - \beta_1^n)$ is a schedule-free warmup built into the optimizer.
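The half-life arithmetic is a one-liner: under a constant schedule the forget gate is $(1-\lambda\eta)$ per step, so a gradient's influence halves after $\ln 2 / (\lambda\eta)$ steps. A quick check of the figure quoted above:

```python
import numpy as np

lam, eta = 0.1, 1e-3                  # weight decay lambda and constant LR eta
rho = lambda h: (1 - lam * eta) ** h  # influence of a gradient written h steps ago

half_life = np.log(0.5) / np.log(1 - lam * eta)
assert abs(half_life - np.log(2) / (lam * eta)) < 1.0   # ~ ln 2 / (lam * eta)
assert 6900 < half_life < 7000                          # ~ 6931: the "7,000 steps"
assert abs(rho(half_life) - 0.5) < 1e-12                # influence halved
```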
Critical batch size = memory capacity
On the data axis, the gradient SNR for a mini-batch of size $B$ is:
$$\mathrm{SNR}_{\text{grad}}(B) = \frac{B\,\|G\|^2}{\mathrm{tr}(\Sigma)}$$The critical batch size [8] — where SNR = 1 — is the retrieval capacity of the data-axis memory:
$$B_{\text{crit}} = \frac{\mathrm{tr}(\Sigma)}{\|G\|^2}$$The optimal batch size $B^* \sim B_{\text{crit}}$ corresponds to the optimal chunk size $C^* \sim d$ in chunked attention: both are where switching from parallel to recurrent mode becomes worthwhile. Natural gradient reduces the effective $B_{\text{crit}}$ — the data-axis analogue of softmax sharpening retrieval.
Gradient transmission bottleneck. Beyond gradient estimation noise, gradient transmission through low-rank layers is itself a capacity limit. The LM head $W \in \mathbb{R}^{V \times D}$ with $D \ll V$ compresses the $V$-dimensional logit gradient to rank at most $D$ during backpropagation. Godey and Artzi [30] show that 95–99% of the gradient norm is destroyed by this projection, yielding update directions fundamentally misaligned with the optimum. This is a hard bottleneck analogous to $N_{\max} \sim d_k$ on the sequence axis: just as a rank-$d_k$ memory matrix cannot faithfully store more than $d_k$ patterns, the rank-$D$ LM head cannot transmit more than $D$ directions of supervision signal per step. The ratio $D/V$ plays the role of $d_k / N$.
Cross-entropy as data-axis softmax attention
Cross-entropy with a softmax output layer is the data-axis instantiation of the log-sum-exp cost that generates softmax attention on the sequence axis: its gradient step $W_n \leftarrow W_{n-1} - \eta\,(p_n - e_{y_n})\,x_n^\top$ is a soft competitive write in which every class row is updated in proportion to its softmax probability. This may explain why cross-entropy dominates MSE for classification (it is the "softmax attention" of loss functions), and why Adam shows its biggest improvements with cross-entropy losses.
Adam as multi-state recurrence
Adam maintains two auxiliary states: momentum $m_n = \beta_1 m_{n-1} + (1-\beta_1)g_n$ and second moment $v_n = \beta_2 v_{n-1} + (1-\beta_2)g_n^2$. Each is a linear recurrence with its own forget gate. The effective update $m_n / \sqrt{v_n}$ is a nonlinear combination of two linear states, analogous to how attention combines KV states nonlinearly via softmax.
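Written out, both states are forget-gated linear recurrences, and momentum unrolls into a decaying sum of past writes — the same closed form as a gated sequence-model state. A minimal sketch:

```python
import numpy as np

def adam_states(grads, b1=0.9, b2=0.999):
    """Run Adam's two auxiliary states as gated linear recurrences.
    Each state is a leaky accumulator with its own scalar forget gate —
    the same structure as a forget-gated sequence-model state."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    for g in grads:
        m = b1 * m + (1 - b1) * g           # momentum: forget gate beta1
        v = b2 * v + (1 - b2) * g ** 2      # second moment: forget gate beta2
    return m, v

grads = [np.ones(3) * (i + 1) for i in range(5)]
m, v = adam_states(grads)

# Unrolled closed form: m_N = sum_j (1 - b1) * b1^(N-1-j) * g_j —
# a decaying sum of gradient writes, exactly like a gated recurrent state.
m_closed = sum((1 - 0.9) * 0.9 ** (4 - j) * grads[j] for j in range(5))
assert np.allclose(m, m_closed)
```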
Adam as diagonal Shampoo. Since each gradient is an outer product $g_n = -e_n x_n^\top$, Adam's second moment accumulates element-wise squares: $(v_n)_{ij} \approx \mathbb{E}[(e_n)_i^2] \cdot \mathbb{E}[(x_n)_j^2]$. Dividing by $\sqrt{v_n}$ normalizes both the value (error) and key (input) dimensions, but only per component. Compare with Shampoo, which uses full correlation matrices $L^{-1/2} g\, R^{-1/2}$. Adam is the diagonal approximation to Shampoo — and on the sequence axis, diagonal gating (GLA, Mamba) is the diagonal approximation to full-matrix transitions (DeltaNet). This parallel extends to a normalization hierarchy:
| Normalization | Seq. axis | Data axis | Structure |
|---|---|---|---|
| None | Linear attention | Hebbian | Raw outer product |
| Diagonal | GLA / Mamba | Adam ($1/\sqrt{v}$) | Per-component |
| Rank-1 (Householder) | DeltaNet | SGD on MSE | Per-sample key |
| Full key-side | RLS-Net (predicted) | Shampoo ($R^{-1/2}$) | Key whitening |
| Bilateral | — | Shampoo ($L, R$) | Both-side whitening |
| Sign ($\ell_\infty$) | — | Lion (sign) | Entry-wise uniform step |
| Stiefel proj. | — | Muon (polar) | Isometry constraint |
The missing entries in the sequence-axis column are predictions: "Shampoo-Net" (bilateral whitening of the state), "Muon-Net" (polar decomposition of the state), and "Adam-Net" (Titans + element-wise normalization by key/value statistics).
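The diagonal-vs-full distinction in the table is easy to demonstrate. In this NumPy sketch (synthetic correlated keys; the correlation matrix is an arbitrary choice), full key-side whitening with $R^{-1/2}$ removes cross-key correlation, while per-coordinate (Adam-like) scaling cannot:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 10_000, 4

# Keys with strong correlation between the first two coordinates
C = np.array([[1.0, 0.8, 0.0, 0.0],
              [0.8, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
K = rng.multivariate_normal(np.zeros(d), C, size=n)

R = K.T @ K / n                                  # key second moment (Shampoo's R)

# Full key-side whitening: K R^(-1/2) has identity sample covariance
w, U = np.linalg.eigh(R)
R_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
cov_full = (K @ R_inv_sqrt).T @ (K @ R_inv_sqrt) / n

# Diagonal (Adam-like) whitening: per-coordinate scaling only
K_diag = K / np.sqrt(np.diag(R))
cov_diag = K_diag.T @ K_diag / n

assert np.allclose(cov_full, np.eye(d))          # fully decorrelated
assert abs(cov_diag[0, 1]) > 0.5                 # correlation survives diagonal scaling
```

The same gap separates diagonal gating (GLA, Mamba) from full-matrix transitions (DeltaNet) on the sequence axis: diagonal statistics cannot see off-diagonal key structure.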
Cross-axis transfer
Which aspects matter where
Although the tuple $(\mathcal{L}, h, \mathcal{M}, \mathcal{H})$ applies on every axis, the different framework dimensions are not equally important on all three:
| Aspect | Sequence | Depth | Data |
|---|---|---|---|
| PE / monoid | Central (RoPE, ALiBi) | Unexplored | Via LR schedule (data order i.i.d.) |
| Forgetting | Important (recency) | Harmful (want isometry) | Important (weight decay, LR) |
| Normalization | Relevant (softmax) | Crucial (RMSNorm every layer) | Relevant (Adam, Muon) |
| Cost $\mathcal{L}$ | Implicit (optim. view) | None (inner loop) | Explicit (training loss) |
| Multi-head | Well studied (MHA, GQA) | Emerging ($n$ streams) | MoE, sharding |
| Capacity / SNR | Well studied | Less clear | Critical batch size |
Positional encoding is central on the sequence axis, where token order is fundamental. On the data axis, training examples are typically i.i.d., so ordering is a free choice rather than a constraint — curriculum learning being the exception. On the depth axis, positional encoding corresponds to inter-layer mixing geometry (hyper-connections), but this has barely been explored as a design dimension.
Forgetting shows the starkest asymmetry: desirable on the sequence axis (recency bias, finite capacity) and the data axis (plasticity via weight decay), but harmful on the depth axis (every layer should contribute). This is why the depth axis naturally favors group-valued transitions (orthogonal, isometric) while the outer axes use contractive monoids.
Normalization is ubiquitous but most critical on the depth axis, where it appears after every layer. On the sequence axis it appears as softmax in attention; on the data axis as gradient preconditioning. All three are Bregman projections under the appropriate $h$, but their frequency differs: once per layer vs. once per token vs. once per training step.
Step homogeneity reveals another fundamental asymmetry. On the sequence axis, the same model weights $W$ are applied at every step — this is what makes the computation a recurrence — and positional encoding breaks the resulting translation symmetry. On the data axis, the model weights $W_n$ are the state being updated; the shared element is the update rule (optimizer, loss function), and the LR schedule breaks that symmetry. On the depth axis, each layer has distinct parameters $W^{(\ell)}$: neither weights nor rule are shared, making depth a composition of distinct transformations rather than a recurrence. Weight-sharing architectures (Universal Transformers, ALBERT, Neural ODEs) make the depth axis structurally parallel to the others, turning composition into recurrence and making depth PE meaningful.
Unexplored combinations
The relevance map highlights that many natural combinations remain unexplored:
- The rich positional encoding literature (RoPE, ALiBi, FoX) should directly inspire learning rate schedule design.
- An "RLS-Net" that maintains the inverse key-correlation matrix as additional recurrent state could achieve optimal single-pass learning.
- "Muon-Net": apply polar decomposition to the state matrix $S_t$ after each update, projecting onto the Stiefel manifold — transferring Muon's Stiefel projection to the sequence axis.
- "Shampoo-Net": decorrelate both key and value spaces using accumulated correlations $L_t = \sum_{s \le t} \beta^{t-s} v_s v_s^\top$ and $R_t = \sum_{s \le t} \beta^{t-s} k_s k_s^\top$ — bilateral whitening, generalizing DeltaNet's key-only Householder.
- "Adam-Net": combine Titans-style momentum with per-component normalization by accumulated key/value statistics — the sequence-axis transfer of Adam's adaptive learning rate.
- Finding optimizers with rotational or reflective decay structure would complete the analogy with RoPE and PaTH on the sequence axis.
- Entropic retention on the data axis (KL-regularized training with softmax-normalized weights) is an unexplored combination.
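To make one of these concrete, here is a hedged sketch of the "Muon-Net" idea — a prediction of the framework, not an existing model — using the Newton–Schulz iteration (the same primitive Muon uses for its polar step) to retract a DeltaNet-style state onto the orthogonal manifold after each write. Dimensions and iteration count are arbitrary choices:

```python
import numpy as np

def newton_schulz_polar(M, iters=30):
    """Odd-polynomial Newton-Schulz iteration toward the polar factor of M
    (the primitive Muon applies to momentum; here applied to a state matrix)."""
    X = M / np.linalg.norm(M)              # scale so singular values lie in (0, 1]
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X    # pushes each singular value toward 1
    return X

rng = np.random.default_rng(6)
d_v, d_k = 4, 4
S = rng.normal(size=(d_v, d_k))
k = rng.normal(size=d_k); k /= np.linalg.norm(k)
v = rng.normal(size=d_v)

S = S + np.outer(v - S @ k, k)             # delta-rule write...
S = newton_schulz_polar(S)                 # ...then retract onto the Stiefel manifold
                                           # (trades exact retrieval for isometry)
sv = np.linalg.svd(S, compute_uv=False)
assert np.all(np.abs(sv - 1.0) < 1e-2)     # state is near-orthogonal after projection
```

Whether an isometric state helps language modeling is an open empirical question; the point is only that the transfer is mechanical once the $(h, \mathcal{M})$ pair is identified.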
Where the analogy breaks
Exact vs. structural analogy. The SSD duality on the sequence axis is an exact algebraic identity. The depth-axis duality (Zhang [10]) is also exact. The data-axis duality (online SGD vs. batch gradient) is a structural analogy — the "batch attention" over the entire dataset is typically infeasible.
Single-pass vs. iterated. On the sequence axis, the model makes a single pass. On the data axis, multiple epochs allow convergence to solutions that a single pass can only approximate.
Depth has no independent cost. The depth axis has no per-step cost. The forward pass is inference, not optimization. The cost that shapes it is the global training loss, propagated backward through backprop.
Forgetting is asymmetric across axes. On the sequence and data axes, controlled forgetting is desirable (recency bias, plasticity). On the depth axis, forgetting is harmful — every layer should contribute. This is why bistochastic transitions ($|\lambda_i| < 1$) cause "spectral stalling" on the depth axis, while orthogonal transitions ($|\lambda| = 1$) preserve dynamical isometry. The framework's tuple applies on all axes, but the optimal choice of monoid differs: contractive monoids for the outer axes, isometric groups for the inner axis.
Weight sharing is asymmetric. On the sequence and data axes, each step applies the same shared rule — model weights (sequence) or update procedure (data) — to a varying input. This step homogeneity is what makes the computation a recurrence and PE / LR schedules meaningful as symmetry breakers. The depth axis breaks this pattern: each layer has distinct parameters $W^{(\ell)}$, making the forward pass a composition of $L$ different functions. Because $L$ is finite and fixed, distinct per-layer weights are feasible — unlike the sequence axis (unbounded $T$) or data axis (unbounded $N$) where sharing is forced. This explains why depth PE is only natural in weight-sharing architectures where the layer index is not already encoded in the parameters.
MLP memories go beyond matrix state. Titans [12] and TTT use multi-layer MLPs as the memory, updated by backpropagation on each token. The connection is direct: a matrix memory $S$ is a single-layer linear network $f_S(q) = Sq$. An MLP memory $f_\theta(q)$ generalizes this to a deeper, nonlinear function class.
The write operation is identical in both cases: gradient descent on the per-step cost, $\theta_t = \theta_{t-1} - \eta_t \nabla_\theta \ell_t$. What changes is the gradient structure. For a linear memory, $\nabla_S \ell_t = -(v_t - Sk_t)\,k_t^\top$ is a rank-1 outer product and the state recurrence is affine in $S$ — the clean algebraic structure that gives monoids and parallel scans. For an MLP, the per-layer gradients are still outer products, but the chain rule through the activations makes them depend nonlinearly on the parameters, so the token-to-token recurrence is no longer affine and the scan structure is lost.
What’s preserved: the regularized cost framework and the $(\mathcal{L}, h, \mathcal{M})$ triple apply unchanged — they are orthogonal to the function class. Forgetting acts on $\theta$ via weight decay: scalar retention, the same structure that Frank-Wolfe gives. What breaks: the semiseparable structure (nonlinear transition), the SNR capacity bound ($O(p/d_v)$ for $p$ parameters instead of $O(d_k)$), and the linear read operation. The MLP case illustrates a general principle: the optimization-based write framework transfers to any parameterized function class; the algebraic structure is specific to the linear case.
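The affine-vs-nonlinear distinction can be checked directly: a GD step on a linear memory is an affine map of the state (it commutes with averaging), while the same step on a tiny MLP memory is not. A NumPy sketch with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(7)
d_k, d_v, d_h, eta = 4, 3, 5, 0.3
k = rng.normal(size=d_k); v = rng.normal(size=d_v)

# Linear memory: one GD step on 0.5*||v - S k||^2 is an AFFINE map of the state,
# S -> S (I - eta k k^T) + eta v k^T — what enables monoids and parallel scans.
def step_linear(S):
    return S - eta * np.outer(S @ k - v, k)

S1, S2 = rng.normal(size=(d_v, d_k)), rng.normal(size=(d_v, d_k))
lhs = step_linear(0.5 * (S1 + S2))
rhs = 0.5 * (step_linear(S1) + step_linear(S2))
assert np.allclose(lhs, rhs)               # affine in S: step commutes with averaging

# MLP memory f(q) = W2 tanh(W1 q): the GD step is NONLINEAR in the parameters,
# so the token-to-token recurrence is no longer an affine scan.
def step_mlp(W1, W2):
    z = np.tanh(W1 @ k)
    e = W2 @ z - v                          # output error
    dW2 = np.outer(e, z)
    dW1 = np.outer((W2.T @ e) * (1 - z ** 2), k)
    return W1 - eta * dW1, W2 - eta * dW2

A1, B1 = rng.normal(size=(d_h, d_k)), rng.normal(size=(d_v, d_h))
A2, B2 = rng.normal(size=(d_h, d_k)), rng.normal(size=(d_v, d_h))
mid = step_mlp(0.5 * (A1 + A2), 0.5 * (B1 + B2))
avg = 0.5 * (step_mlp(A1, B1)[0] + step_mlp(A2, B2)[0])
assert not np.allclose(mid[0], avg)        # affinity fails for the MLP memory
```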
Nonlinearity complicates transfer. Hu et al. [22] show that SSD fails for softmax attention due to rank explosion. M²RNN [19] and COFFEE [25] introduce nonlinearities that break semiseparable structure but improve expressiveness.
Open questions
- The rotation monoid on the data axis. RoPE uses rotational transitions. What optimizer has a rotational forget gate?
- Quantifying exponential capacity. Can we verify the predicted exponential gap in $B_{\text{crit}}$ for cross-entropy vs. MSE?
- Depth-axis positional encoding. What geometry do hyper-connections induce across layers?
- The normalization–Bregman conjecture. Is every normalization in deep learning a Bregman projection $\nabla h^*$ whose generating function $h$ also determines retention?
- An RLS-Net. Can the pseudoinverse be achieved efficiently in a single-pass recurrent architecture?
References
- Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. IRE WESCON Convention Record.
- Yang, S., Wang, B., Zhang, Y., Shen, Y., and Kim, Y. (2024). Parallelizing linear transformers with the delta rule over sequence length. NeurIPS.
- Zhong, S., Xu, M., Ao, T., and Shi, G. (2025). Understanding transformer from the perspective of associative memory. arXiv:2505.19488.
- Dao, T. and Gu, A. (2024). Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. ICML.
- Afzal, A. (2026). On the legacy of linear transformers in positional embeddings. arshiaafzal.github.io/blog/2026/pe/
- Ramsauer, H. et al. (2021). Hopfield networks is all you need. ICLR.
- Mak, B. and Flanigan, J. (2025). Residual matrix transformers. ICML. arXiv:2506.22696.
- McCandlish, S., Kaplan, J., Amodei, D., and OpenAI Dota Team (2018). An empirical model of large-batch training. arXiv:1812.06162.
- Zhu, D. et al. (2024). Hyper-connections. arXiv:2409.19606.
- Zhang, Y. (2026). Residual stream duality in modern transformer architectures. arXiv:2603.16039.
- Storkey, A. J. (1997). Increasing the capacity of a Hopfield network without sacrificing functionality. ICANN.
- Behrouz, A., Razaviyayn, M., Zhong, P., and Mirrokni, V. (2025). It's all connected: A journey through test-time memorization, attention, state space models, and recurrent neural networks. arXiv:2504.13173.
- Wang, K. A., Shi, J., and Fox, E. B. (2025). Test-time regression: A unifying framework. arXiv:2501.12352.
- von Oswald, J. et al. (2023). Transformers learn in-context by gradient descent. arXiv:2212.07677.
- Yang, S., Kautz, J., and Hatamizadeh, A. (2025). Gated Delta Networks. ICLR. arXiv:2412.06464.
- Lin, Z., Nikishin, E., He, X. O., and Courville, A. (2025). FoX: Forgetting Transformer. ICLR. arXiv:2503.02130.
- Yang, S. et al. (2025). PaTH: Positional encodings via accumulated Householder transformations. NeurIPS. arXiv:2505.16381.
- Lahoti, A. et al. (2026). Mamba-3. ICLR. arXiv:2603.15569.
- Mishra, M., Tan, S., Stoica, I., Gonzalez, J., and Dao, T. (2026). M²RNN: Matrix-valued recurrent neural networks for scalable language modeling. arXiv:2603.14360.
- Chen, G. et al. (2026). Attention Residuals. arXiv:2603.15031.
- Guo, H. et al. (2025). Log-linear attention. ICLR 2026. arXiv:2506.04761.
- Hu, J. Y.-C. et al. (2025). On structured state-space duality. arXiv:2510.04944.
- Bernstein, J. and Newhouse, L. (2024). Old optimizer, new norm: An anthology. arXiv:2409.20325.
- Duvvuri, S. S. et al. (2026). LUCID: Attention with preconditioned representations. arXiv:2602.10410.
- Zattra, R. et al. (2025). Context-selective state space models: Feedback is all you need. arXiv:2510.14027.
- Gu, A. (2025). On the tradeoffs of SSMs and Transformers. goombalab.github.io.
- Hwang, S., Lahoti, A., Dao, T., and Gu, A. (2024). Hydra: Bidirectional state space models through generalized matrix mixers. arXiv:2407.09941.
- Kaczmarz, S. (1937). Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Pol. Sci. Lett.
- Sengupta, B., Wang, J., and Brunswic, L. (2025). JPmHC: Dynamical isometry via orthogonal hyper-connections. arXiv:2602.18308.
- Godey, N. and Artzi, Y. (2026). Lost in Backpropagation: The LM Head is a Gradient Bottleneck. arXiv:2603.10145.
- Chen, X. et al. (2024). Symbolic discovery of optimization algorithms. NeurIPS.
- Liang, K. et al. (2024). Cautious optimizers. arXiv:2411.16085.
- Dragutinović, S. and Ranganath, R. (2025). To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters. arXiv:2603.00742.
- Anonymous (2026). Lions and Muons: Optimization via Stochastic Frank-Wolfe. Under review at ICLR 2026.