In this post, we show that a modified attention map, in which softmax is replaced by a different normalizing map, has a simpler total derivative. The goal is to improve the computational efficiency of backpropagation through attention-type maps.

$$ \newcommand{\Attn}{\mathrm{Att}} \newcommand{\bR}{\mathbb{R}} $$

Standard attention

The standard attention map is defined by

$$ \begin{align*} \Attn : \bR^{n \times d} \times \bR^{n \times d} \times \bR^{n \times d} &\to \bR^{n \times d} \\ (Q, K, V) &\mapsto \Attn(Q, K, V) = \sigma(QK^t) V, \end{align*} $$

where \(\sigma : \bR^n \to \bR^n\) is the softmax map applied row-wise. This means that

$$ \begin{align*} \Attn(Q, K, V) &= \sum_{i=1}^{n} e_i \sigma((e_i^t Q K^t)^t)^t V \\ &= \sum_{i=1}^{n} e_i \sigma(K Q^t e_i)^t V, \end{align*} $$

where \(e_i\) is the \(i\)th Euclidean basis vector in \(\bR^n\). This expression could be written in matrix form, of course, but it seems cleaner to put everything on a single line.

Note: Our definition of the attention map does not scale the entries of \(QK^t\) by \(1/\sqrt{d}\). This is just for convenience, to avoid writing the square root everywhere.
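As a quick illustration of the definition, here is a minimal PyTorch sketch of this (unscaled) attention map; the function name attn is chosen here purely for illustration:

import torch

def attn(Q, K, V):
    """Unscaled attention: row-wise softmax of Q K^t, applied to V."""
    return torch.softmax(Q @ K.T, dim=-1) @ V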

The total derivative of \(\Attn\) is

$$ \begin{align*} d \Attn(\Theta) \cdot \tilde{\Theta} = d_Q \Attn(\Theta) \cdot \tilde{Q} + d_K \Attn(\Theta) \cdot \tilde{K} + d_V \Attn(\Theta) \cdot \tilde{V}, \end{align*} $$

where \(\Theta = (Q,K,V)\) and \(\tilde{\Theta} = (\tilde{Q},\tilde{K},\tilde{V})\).

By the Leibniz rule, the partial derivatives of \(\Attn\) are

$$ \begin{align*} d_Q \Attn(\Theta) \cdot \tilde{Q} &= \sum_{i=1}^{n} e_i [d \sigma(p_i) \cdot \tilde{q}_i]^t V \\ d_K \Attn(\Theta) \cdot \tilde{K} &= \sum_{i=1}^{n} e_i [d \sigma(p_i) \cdot \tilde{k}_i]^t V \\ d_V \Attn(\Theta) \cdot \tilde{V} &= \sum_{i=1}^{n} e_i \sigma(p_i)^t \tilde{V} = \Attn(Q, K, \tilde{V}), \end{align*} $$

where \(p_i = K Q^t e_i\), \(\tilde{q}_i = K \tilde{Q}^t e_i\), and \(\tilde{k}_i = \tilde{K} Q^t e_i\). Combining terms, we have

$$ \begin{align*} d \Attn(\Theta) \cdot \tilde{\Theta} &= \sum_{i=1}^{n} e_i [d \sigma(p_i) \cdot \tilde{z}_i]^t V + \Attn(Q, K, \tilde{V}), \end{align*} $$

where \(\tilde{z}_i = \tilde{q}_i + \tilde{k}_i\). Using the well-known formula \(d \sigma(x) \cdot h = \sigma(x) \odot h - \langle \sigma(x), h \rangle \, \sigma(x)\), we have

$$ \begin{align*} &{} \sum_{i=1}^{n} e_i [d \sigma(p_i) \cdot \tilde{z}_i]^t V \\ &\qquad = \sum_{i=1}^{n} e_i (\sigma(p_i) \odot \tilde{z}_i)^t V - \sum_{i=1}^{n} \langle \sigma(p_i), \tilde{z}_i \rangle e_i \sigma(p_i)^t V \\ &\qquad = \sum_{i=1}^{n} e_i (\sigma(p_i) \odot \tilde{z}_i)^t V - (\tilde{\iota}(Q,K) \otimes 1_d^t) \odot \Attn(Q,K,V), \end{align*} $$

where \(\odot\) is the element-wise product and \(\tilde{\iota}(Q,K) \otimes 1_d^t\) is the Kronecker product of

$$ \tilde{\iota}(Q,K) = \begin{bmatrix} \langle \sigma(p_1), \tilde{z}_1 \rangle \\ \vdots \\ \langle \sigma(p_n), \tilde{z}_n \rangle \end{bmatrix} \qquad \mbox{and} \qquad 1_d^t = (1, \dots, 1) \in \bR^{1 \times d}. $$

In total, we have

$$ \begin{align*} d\Attn(\Theta) \cdot \tilde{\Theta} &= \sum_{i=1}^{n} e_i ({\color{cornflowerblue}\sigma(p_i)} \odot \tilde{z}_i)^t V - (\underbrace{\tilde{\iota}(Q,K)}_{\color{cornflowerblue}\sigma(p_i)} \otimes 1_d^t) \odot {\color{cornflowerblue}\Attn(\Theta)} + \underbrace{\Attn(Q,K,\tilde{V})}_{\color{cornflowerblue}\sum_{i=1}^{n} e_i \sigma(p_i)^t}. \end{align*} $$

The quantities in blue can be re-used from the forward pass.
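To make the bookkeeping concrete, here is a minimal PyTorch sketch of this Jacobian-vector product under the conventions above (no \(1/\sqrt{d}\) scaling), which can be checked numerically against torch.func.jvp; the function names are chosen here for illustration:

import torch

def attn(Q, K, V):
    """Unscaled attention: row-wise softmax of Q K^t, applied to V."""
    return torch.softmax(Q @ K.T, dim=-1) @ V

def attn_jvp(Q, K, V, Qt, Kt, Vt):
    """d Attn(Theta) . Theta~, following the formula above."""
    A = torch.softmax(Q @ K.T, dim=-1)   # rows are sigma(p_i); reusable from the forward pass
    out = A @ V                          # Attn(Theta); reusable from the forward pass
    Z = Qt @ K.T + Q @ Kt.T              # rows are z~_i = q~_i + k~_i
    AZ = A * Z                           # rows are sigma(p_i), element-wise multiplied by z~_i
    iota = AZ.sum(dim=-1, keepdim=True)  # entries <sigma(p_i), z~_i>
    return AZ @ V - iota * out + A @ Vt

n, d = 4, 3
Q, K, V, Qt, Kt, Vt = (torch.randn(n, d, dtype=torch.double) for _ in range(6))
_, expected = torch.func.jvp(attn, (Q, K, V), (Qt, Kt, Vt))
assert torch.allclose(attn_jvp(Q, K, V, Qt, Kt, Vt), expected)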

In the next section, we consider how the above result changes if we replace \(\sigma\).

Other normalizing maps

In this section, \(\beta: \bR^n \to \bR^n\) is a smooth map with the following homogeneity property: There exists a smooth function \(f : \bR^n \times \bR^n \to \bR\) such that

$$ \begin{align*} d \beta(x) \cdot h = f (x, h) \beta(x) \end{align*} $$

for all \(x, h \in \bR^n\).

Replacing \(\sigma\) with \(\beta\), the \(\beta\)-attention map is defined by

$$ \begin{align*} \Attn_\beta : \bR^{n \times d} \times \bR^{n \times d} \times \bR^{n \times d} &\to \bR^{n \times d} \\ (Q, K, V) &\mapsto \Attn_\beta(Q, K, V) = \beta(QK^t) V, \end{align*} $$

where \(\beta\) is applied row-wise.

Repeating the analysis of the previous section, we have

$$ \begin{align*} \Attn_\beta (\Theta) &= \sum_{i=1}^{n} e_i \beta(p_i)^t V \end{align*} $$

and the total derivative of \(\Attn_\beta\) at \(\Theta\) is

$$ \begin{align*} d \Attn_\beta (\Theta) \cdot \tilde{\Theta} &= \sum_{i=1}^{n} e_i [d \beta(p_i) \cdot \tilde{z}_i]^t V + \Attn_\beta (Q, K, \tilde{V}) \\ &= \sum_{i=1}^{n} f(p_i, \tilde{z}_i) e_i \beta(p_i)^t V + \Attn_\beta (Q, K, \tilde{V}) \\ &= (\tilde{f}(Q,K) \otimes 1_d^t) \odot \Attn_\beta (\Theta) + \Attn_\beta (Q, K, \tilde{V}), \end{align*} $$

where \(p_i\), \(\tilde{z}_i\) are defined as in the previous section and

$$ \tilde{f}(Q,K) = \begin{bmatrix} f(p_1, \tilde{z}_1) \\ \vdots \\ f(p_n, \tilde{z}_n) \end{bmatrix}. $$

In total, we have

$$ \begin{align*} d \Attn_\beta (\Theta) \cdot \tilde{\Theta} &= (\underbrace{\tilde{f}(Q,K)}_{\color{cornflowerblue}p_i} \otimes 1_d^t) \odot {\color{cornflowerblue}\Attn_\beta (\Theta)} + \underbrace{\Attn_\beta(Q,K,\tilde{V})}_{\color{cornflowerblue}\sum_{i=1}^{n} e_i \beta(p_i)^t}. \end{align*} $$

The quantities in blue can be re-used from the forward pass.
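To illustrate how little is needed beyond the forward pass, here is a minimal PyTorch sketch of this formula, assuming a row-wise \(\beta\) with the homogeneity property and a vectorized implementation of \(f\); the names beta_attn_jvp and f are chosen here for illustration:

def beta_attn_jvp(beta, f, Q, K, V, Qt, Kt, Vt):
    """d Attn_beta(Theta) . Theta~ for a beta with the homogeneity property.

    beta: applied row-wise to the (n, n) score matrix.
    f:    f(P, Z) returns the (n, 1) column vector with entries f(p_i, z~_i).
    """
    P = Q @ K.T              # rows are p_i
    B = beta(P)              # rows are beta(p_i); reusable from the forward pass
    out = B @ V              # Attn_beta(Theta); reusable from the forward pass
    Z = Qt @ K.T + Q @ Kt.T  # rows are z~_i
    return f(P, Z) * out + B @ Vt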

The punchline is that the homogeneity property could make the computation of \(d\Attn_\beta(\Theta) \cdot \tilde{\Theta}\) more efficient by “removing” the first term of \(d\Attn(\Theta)\cdot \tilde{\Theta}\).

For a particular example, consider the simple normalizing map

$$ \begin{align*} \beta(x) = \frac{x}{1 + \| x \|}. \end{align*} $$

The total derivative of \(\beta\) is

$$ \begin{align*} d\beta(x) \cdot h &= \begin{cases} \displaystyle \frac{h}{1 + \|x\|} - \frac{\langle x, h \rangle x}{\|x\| (1 + \|x\|)^2}, & x \neq 0_n \\ h, & x = 0_n. \end{cases} \end{align*} $$
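For \(x \neq 0_n\), this is the quotient rule combined with \(d \| x \| \cdot h = \langle x, h \rangle / \| x \|\):

$$ \begin{align*} d\beta(x) \cdot h &= \frac{h}{1 + \|x\|} - \frac{x \, [d \|x\| \cdot h]}{(1 + \|x\|)^2} = \frac{h}{1 + \|x\|} - \frac{\langle x, h \rangle x}{\|x\| (1 + \|x\|)^2}. \end{align*} $$

At \(x = 0_n\), differentiability with \(d\beta(0_n) = \mathrm{id}\) follows from \(\| \beta(h) - h \| = \|h\|^2 / (1 + \|h\|) = o(\|h\|)\).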

By definition, \(f\) must satisfy

$$ \begin{align*} f(x, h) \frac{x}{1 + \| x\|} &= \begin{cases} \displaystyle \frac{h}{1 + \|x\|} - \frac{\langle x, h \rangle x}{\| x\| (1 + \|x\|)^2}, & x \neq 0_n \\ h, & x = 0_n. \end{cases} \end{align*} $$

This is clearly not possible for \(x = 0_n\) (the left-hand side vanishes because \(\beta(0_n) = 0_n\), while the right-hand side is \(h\)), so from this point forward let’s assume that we are working away from \(x = 0_n\) (this can be made rigorous, but we’ll skip that).

Taking the inner product of both sides with \(x\) and rearranging, we see that

$$ \begin{align*} f(x, h) &= \frac{\langle x, h \rangle}{\|x\|^2} - \frac{\langle x, h \rangle}{\| x\| (1 + \|x\|)} \\ &= \frac{\langle x, h \rangle}{\|x\|^2 (1 + \|x\|)}. \end{align*} $$
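A minimal PyTorch sketch of the corresponding vectorized \(f\) (valid away from \(x = 0_n\)), with the name f_beta chosen here for illustration; this is the function that would be passed as f in the beta_attn_jvp sketch above:

import torch

def f_beta(P, Z):
    """Row-wise f(p_i, z~_i) = <p_i, z~_i> / (||p_i||^2 (1 + ||p_i||)); returns shape (n, 1)."""
    inner = (P * Z).sum(dim=-1, keepdim=True)   # <p_i, z~_i>
    norm = torch.norm(P, dim=-1, keepdim=True)  # ||p_i||
    return inner / (norm**2 * (1.0 + norm))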

Plausibility of replacing \(\sigma\)

Putting aside potential efficiency gains, can we learn effectively with \(\Attn_\beta\)?

To quickly test this for

$$ \begin{align*} \beta(x) = \frac{x}{1 + \| x \|}, \end{align*} $$

we can build on the nanoGPT project. In a nutshell, we need to implement \(\beta\), disable flash attention, and adjust the causal masking to accommodate \(\beta\).

The nn.Module that implements \(\beta\) is very straightforward:

"""Implementation of beta map."""

import torch
import torch.nn as nn
from torch import Tensor

class Beta(nn.Module):
    """Row-wise normalizing map beta(x) = x / (1 + ||x||)."""

    def __init__(self) -> None:
        super().__init__()

    def forward(self, x: Tensor) -> Tensor:
        """Compute beta along the last dimension of x."""
        return x / (1.0 + torch.norm(x, dim=-1, keepdim=True))
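As for adjusting the causal masking, one plausible approach (an assumption here, not necessarily what was done in the actual experiment) is to zero out the masked score entries before applying \(\beta\): a zero entry of \(x\) is mapped to a zero entry of \(\beta(x)\), so masked positions contribute nothing to the weighted sum over values. A hypothetical usage sketch, assuming the Beta module above is in scope:

import torch

T = 8
scores = torch.randn(T, T)                    # q @ k^t scores for one head
causal = torch.tril(torch.ones(T, T)).bool()  # lower-triangular causal mask
scores = scores.masked_fill(~causal, 0.0)     # zero (rather than -inf) for beta
weights = Beta()(scores)                      # masked entries remain exactly zero
# attended = weights @ v, for a value tensor v of shape (T, d)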

Training the character-level Tiny Shakespeare model for 10000 iterations with the nanoGPT defaults gives:

step 10000: train loss 1.0040, val loss 1.5550

The final validation loss is comparable to the results obtained using standard attention:

step 10000: train loss 0.7140, val loss 1.6204