In this post, we review a Newton-like method for learning rate selection. It works as a wrapper around any optimizer (SGD, Adam, AdamW, and so on) and imposes no constraints on the optimizer itself.

The PyTorch implementation is available here.

$$ \newcommand{\bN}{\mathbb{N}} \newcommand{\bR}{\mathbb{R}} \newcommand{\bZ}{\mathbb{Z}} \newcommand{\rmvec}{\mathrm{vec}} \newcommand{\sL}{\mathscr{L}} \newcommand{\tr}{\mathrm{trace}} \definecolor{lesserbox}{rgb}{0.85, 0.95, 1.0} \definecolor{magicmint}{rgb}{0.67, 0.94, 0.82} $$

In general

Given a smooth function \(f : \bR^n \to \bR\), the gradient descent iterations are

$$ \theta_{t+1} \leftarrow \theta_t - \alpha_t \omega(\theta_t, \nabla f(\theta_t)). $$

Here \(\alpha_t \in \bR_{\geq 0}\) is the learning rate at step \(t\), and \(\omega : \bR^n \times \bR^n \to \bR^n\) is the optimizer.

Define the “value per learning rate” functions \(g_t : \bR_{\geq 0} \to \bR\) by

$$ g_t(\alpha) = f(\theta_t - \alpha \, \omega(\theta_t, \nabla f(\theta_t))). $$

Following [1], [2], we propose to minimize \(f\) by the Newton-like iterations

$$ \left\{ \begin{align*} \theta_{t+1} &\leftarrow \theta_t - \alpha_t \omega(\theta_t, \nabla f(\theta_t)) \\ \alpha_{t+1} &\leftarrow \alpha_t - \frac{g'_t(\alpha_t)}{g''_t(\alpha_t)}. \end{align*} \right. $$

In the rest of this post, we will simply write

$$ \omega_t = \omega(\theta_t, \nabla f(\theta_t)). $$

To write the Newton-like iterations more explicitly, observe that

$$ \begin{align*} g'_t(\alpha_t) &= - df(\theta_t - \alpha_t \omega_t) \cdot \omega_t \\ &= - \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle. \\ \end{align*} $$

Similarly, we have

$$ \begin{align*} g''_t(\alpha_t) &= d^2 f(\theta_t - \alpha_t \omega_t) \cdot (\omega_t, \omega_t) \\ &= \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t, \end{align*} $$

where \(H(f)(\theta)\) is the Hessian matrix of \(f\) evaluated at \(\theta\). The Newton-like iterations are

$$ \colorbox{lesserbox} { $ \left\{ \begin{align*} \theta_{t+1} &\leftarrow \theta_t - \alpha_t \omega_t \\ \alpha_{t+1} &\leftarrow \alpha_t + \frac{ \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t }. \end{align*} \right. $ } $$
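To make the iteration concrete, here is a minimal NumPy sketch on a toy quadratic \(f(\theta) = \tfrac{1}{2} \theta^T A \theta\), using plain gradient descent as the optimizer (so \(\omega_t = \nabla f(\theta_t)\)) and the exact Hessian, which is available here only because the problem is tiny. The setup is illustrative and is not the linked implementation.

```python
import numpy as np

# Toy quadratic f(theta) = 0.5 * theta^T A theta, with gradient A theta and
# constant Hessian A. Plain gradient descent supplies omega_t.
A = np.diag([1.0, 10.0])

def f(theta):
    return 0.5 * theta @ A @ theta

def grad_f(theta):
    return A @ theta

def hess_f(theta):
    return A  # constant for a quadratic

theta = np.array([1.0, 1.0])
alpha = 0.01

for t in range(20):
    omega = grad_f(theta)                   # omega_t for plain gradient descent
    lookahead = theta - alpha * omega       # theta_t - alpha_t * omega_t
    g1 = -(grad_f(lookahead) @ omega)       # g_t'(alpha_t)
    g2 = omega @ hess_f(lookahead) @ omega  # g_t''(alpha_t)
    theta = lookahead                       # theta_{t+1}
    alpha = alpha - g1 / g2                 # Newton step on the learning rate
    print(t, f(theta), alpha)
```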

We would like to avoid computing the Hessian-vector products, which is impractical at scale. To do so, we will use an approximation.

Retsinas et al. approximation

The authors of [1] derive their approximation of the Hessian-vector products starting with a first-order Taylor series approximation of the partial derivatives of \(f\):

$$ \begin{align*} d_i f (\theta_t) &= d_i f (\theta_t - \alpha_t \omega_t + \alpha_t \omega_t) \\ &\approx d_i f (\theta_t - \alpha_t \omega_t) + \alpha_t d(d_i f)(\theta_t - \alpha_t \omega_t) \cdot \omega_t. \end{align*} $$

We can write these approximations more explicitly as

$$ \begin{align*} d_i f (\theta_t) &\approx d_i f (\theta_t - \alpha_t \omega_t) + \alpha_t \begin{bmatrix} d^2_{i,1} f(\theta_t - \alpha_t \omega_t) & \cdots & d^2_{i,n} f(\theta_t - \alpha_t \omega_t) \end{bmatrix} \omega_t. \end{align*} $$

Stacking these quantities, we have

$$ \begin{align*} \nabla f (\theta_t) \approx \nabla f (\theta_t - \alpha_t \omega_t) + \alpha_t H(f)(\theta_t - \alpha_t \omega_t) \omega_t, \end{align*} $$

where \(H(f)(\theta)\) is the Hessian matrix of \(f\) evaluated at \(\theta\).

Rearranging and multiplying on the left by \(\omega_t^T\), we obtain

$$ \begin{align*} \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t \approx \frac{1}{\alpha_t} \langle \nabla f (\theta_t) - \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle, \quad \alpha_t > 0. \end{align*} $$

This reproduces Equation (7) in [1].
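As a quick sanity check, the following sketch compares the exact quadratic form \(\omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \, \omega_t\) with this gradient-difference estimate on a simple quartic \(f(\theta) = \sum_i (\theta_i^4 + \theta_i^2)\); the test function and values are illustrative.

```python
import numpy as np

# Compare the exact Hessian quadratic form with the gradient-difference
# estimate of Equation (7) on a simple quartic test function.
def grad_f(theta):
    return 4 * theta**3 + 2 * theta

def hess_f(theta):
    return np.diag(12 * theta**2 + 2)

theta = np.array([0.5, -1.0])
omega = grad_f(theta)  # plain gradient descent direction

for alpha in [1e-1, 1e-2, 1e-3]:
    lookahead = theta - alpha * omega
    exact = omega @ hess_f(lookahead) @ omega
    estimate = (grad_f(theta) - grad_f(lookahead)) @ omega / alpha
    print(alpha, exact, estimate)
```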

Using this approximation, the \(\alpha\) iteration becomes

$$ \colorbox{lesserbox} { $ \alpha_{t+1} \leftarrow \alpha_t + \frac{ \alpha_t \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle } { \langle \nabla f (\theta_t) - \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle }. $ } $$

Adding a slow-adaptation parameter \(\gamma \in (0, 1]\) gives Equation (8) in [1]:

$$ \alpha_{t+1} \leftarrow \alpha_t \left( 1 + \gamma \frac{ \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle } { \langle \nabla f (\theta_t) - \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle } \right). $$
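A minimal sketch of this update as a standalone Python function follows; the function name, argument layout, and the \(\varepsilon\) guard against a vanishing denominator are our own choices, not part of [1].

```python
import numpy as np

def retsinas_alpha_update(alpha, gamma, grad_theta, grad_lookahead, omega, eps=1e-12):
    """Hessian-free learning-rate step in the style of Equation (8) of [1].

    grad_theta:     gradient of f at theta_t
    grad_lookahead: gradient of f at theta_t - alpha * omega
    omega:          optimizer direction omega_t
    eps:            guard against a vanishing denominator (our addition)
    """
    num = np.dot(grad_lookahead, omega)
    den = np.dot(grad_theta - grad_lookahead, omega)
    return alpha * (1.0 + gamma * num / (den + eps))
```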

We are interested in how this compares to \(\alpha\) iterations which are derived using different approximations of the Hessian-vector product.

Alternative approximation 1

Our starting point is a totally standard second-order Taylor series approximation of \(f\):

$$ \begin{align*} f(\theta_t) &= f(\theta_t - \alpha_t \omega_t + \alpha_t \omega_t) \\ &\approx f(\theta_t - \alpha_t \omega_t) + \alpha_t df(\theta_t - \alpha_t \omega_t) \cdot \omega_t + \frac{\alpha_t^2}{2} d^2 f(\theta_t - \alpha_t \omega_t) \cdot (\omega_t, \omega_t) \\ &= f(\theta_t - \alpha_t \omega_t) + \alpha_t \langle \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle + \frac{\alpha_t^2}{2} \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t. \end{align*} $$

Rearranging, we obtain

$$ \begin{align*} &\omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t \\ &\qquad \approx \frac{2}{\alpha_t^2} (f(\theta_t) - f(\theta_t - \alpha_t \omega_t)) - \frac{2}{\alpha_t} \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle, \quad \alpha_t > 0. \end{align*} $$

Using this approximation, the \(\alpha\) iteration becomes

$$ \colorbox{lesserbox} { $ \begin{align*} \alpha_{t+1} &\leftarrow \alpha_t + \frac{ \alpha_t^2 \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ 2 (f(\theta_t) - f(\theta_t - \alpha_t \omega_t) - \alpha_t \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle) }. \end{align*} $ } $$

How do the two approximations compare? Setting them equal, we see that

$$ \begin{align*} &\frac{1}{\alpha_t} \langle \nabla f(\theta_t), \omega_t \rangle - \frac{1}{\alpha_t} \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle \\ &\qquad = \frac{2}{\alpha_t^2} (f(\theta_t) - f(\theta_t - \alpha_t \omega_t)) - \frac{2}{\alpha_t} \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle \end{align*} $$

if and only if

$$ \begin{align*} \frac{\langle \nabla f(\theta_t), \omega_t \rangle + \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle}{2} = \frac{f(\theta_t) - f(\theta_t - \alpha_t \omega_t)}{\alpha_t}. \end{align*} $$

From this it seems the developments in [1] involve an implicit first-order Taylor series approximation to the lookahead value difference \(f(\theta_t) - f(\theta_t - \alpha_t \omega_t)\). To understand the size of the error term incurred by using this approximation, observe that

$$ \frac{f(\theta_t) - f(\theta_t - \alpha_t \omega_t)}{\alpha_t} = \langle \nabla f(\theta_t), \omega_t \rangle - \frac{\alpha_t}{2} \omega_t^T H(f)(\theta_t) \omega_t + O(\alpha_t^2). $$

On the other hand,

$$ \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle = \langle \nabla f(\theta_t), \omega_t \rangle - \alpha_t \omega_t^T H(f)(\theta_t) \omega_t + O(\alpha_t^2) $$

which implies that

$$ \begin{align*} \frac{\langle \nabla f(\theta_t), \omega_t \rangle + \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle}{2} = \langle \nabla f(\theta_t), \omega_t \rangle - \frac{\alpha_t}{2} \omega_t^T H(f)(\theta_t) \omega_t + O(\alpha_t^2). \end{align*} $$

We conclude that

$$ \begin{align*} \frac{\langle \nabla f(\theta_t), \omega_t \rangle + \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle}{2} = \frac{f(\theta_t) - f(\theta_t - \alpha_t \omega_t)}{\alpha_t} + O(\alpha_t^2). \end{align*} $$

This means that the developments in [1] incur an extra \(O(\alpha_t^2)\) error term by approximating the lookahead value difference.

This extra approximation seems to offer no advantage, since in any case we need to compute \(f(\theta_t)\) and \(f(\theta_t - \alpha_t \omega_t)\), the latter as the “forward” part of the \(\nabla f(\theta_t - \alpha_t \omega_t)\) computation.
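The \(O(\alpha_t^2)\) claim is easy to check numerically; in the following sketch the gap between the averaged inner products and the value-difference quotient shrinks roughly like \(\alpha_t^2\). The test function is illustrative.

```python
import numpy as np

# Check that the averaged inner products and the value-difference quotient
# agree up to O(alpha^2) on a non-quadratic test function.
def f(theta):
    return np.sum(theta**4 + theta**2)

def grad_f(theta):
    return 4 * theta**3 + 2 * theta

theta = np.array([0.3, -0.7])
omega = grad_f(theta)  # plain gradient descent direction

for alpha in [1e-1, 1e-2, 1e-3]:
    lookahead = theta - alpha * omega
    lhs = 0.5 * ((grad_f(theta) + grad_f(lookahead)) @ omega)
    rhs = (f(theta) - f(lookahead)) / alpha
    print(alpha, abs(lhs - rhs))  # shrinks roughly like alpha**2
```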

Adding a slow-adaptation parameter \(\gamma \in (0, 1]\) gives the final \(\alpha\) iteration:

$$ \begin{align*} \alpha_{t+1} &\leftarrow \alpha_t \left( 1 + \gamma \frac{ \alpha_t \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ 2 (f(\theta_t) - f(\theta_t - \alpha_t \omega_t) - \alpha_t \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle) } \right). \end{align*} $$
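As with the previous approximation, this can be written as a small standalone update; the function name and the \(\varepsilon\) guard are our own choices.

```python
import numpy as np

def alt1_alpha_update(alpha, gamma, f_theta, f_lookahead, grad_lookahead, omega, eps=1e-12):
    """Learning-rate step from alternative approximation 1.

    f_theta:        f(theta_t)
    f_lookahead:    f(theta_t - alpha * omega)
    grad_lookahead: gradient of f at theta_t - alpha * omega
    omega:          optimizer direction omega_t
    """
    inner = np.dot(grad_lookahead, omega)
    den = 2.0 * (f_theta - f_lookahead - alpha * inner)
    return alpha * (1.0 + gamma * alpha * inner / (den + eps))
```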

Alternative approximation 2

As in the previous section, we start with second-order Taylor series approximations of \(f\) around \(\theta_t - \alpha_t \omega_t\), this time expanding both \(f(\theta_t)\) and \(f(\theta_t + \alpha_t \omega_t)\):

$$ \begin{align*} f(\theta_t) &\approx f(\theta_t - \alpha_t \omega_t) + \alpha_t \langle \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle + \frac{\alpha_t^2}{2} \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t, \\ f(\theta_t + \alpha_t \omega_t) &\approx f(\theta_t - \alpha_t \omega_t) + 2 \alpha_t \langle \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle + 2 \alpha_t^2 \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t. \end{align*} $$

Combining these so that the gradient terms cancel, it follows that

$$ \frac{f(\theta_t - \alpha_t \omega_t) + f(\theta_t + \alpha_t \omega_t) - 2 f(\theta_t)}{\alpha_t^2} \approx \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t. $$

Using this approximation, the \(\alpha\) iteration becomes

$$ \colorbox{lesserbox} { $ \begin{align*} \alpha_{t+1} &\leftarrow \alpha_t + \frac{ \alpha_t^2 \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ f(\theta_t - \alpha_t \omega_t) + f(\theta_t + \alpha_t \omega_t) - 2f(\theta_t) }. \end{align*} $ } $$

Adding a slow-adaptation parameter \(\gamma \in (0, 1]\) gives the final \(\alpha\) iteration:

$$ \begin{align*} \alpha_{t+1} &\leftarrow \alpha_t \left( 1 + \gamma \frac{ \alpha_t \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ f(\theta_t - \alpha_t \omega_t) + f(\theta_t + \alpha_t \omega_t) - 2f(\theta_t) } \right). \end{align*} $$

The numerator can also be estimated without gradient computations, since

$$ \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle \approx \frac{f(\theta_t + \alpha_t \omega_t) - f(\theta_t - \alpha_t \omega_t)}{2 \alpha_t}. $$
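Putting the two pieces together gives a fully gradient-free sketch of the update, using only the three function values \(f(\theta_t)\), \(f(\theta_t - \alpha_t \omega_t)\), and \(f(\theta_t + \alpha_t \omega_t)\); again the function name and the \(\varepsilon\) guard are our own choices.

```python
def alt2_alpha_update(alpha, gamma, f_theta, f_minus, f_plus, eps=1e-12):
    """Gradient-free learning-rate step from alternative approximation 2.

    f_theta: f(theta_t)
    f_minus: f(theta_t - alpha * omega_t)
    f_plus:  f(theta_t + alpha * omega_t)
    """
    inner = (f_plus - f_minus) / (2.0 * alpha)  # central-difference estimate of the numerator
    den = f_minus + f_plus - 2.0 * f_theta      # second-difference estimate of the curvature
    return alpha * (1.0 + gamma * alpha * inner / (den + eps))
```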

ML context

In the context of machine learning, there is (at least in principle) a single loss function to be minimized. In practice, however, each gradient descent iteration minimizes a distinct proxy loss function, since the training batch changes from step to step. To account for this, we can rewrite the Newton-like iterations as

$$ \colorbox{lesserbox} { $ \left\{ \begin{align*} \theta_{t+1} &\leftarrow \theta_t - \alpha_t \omega_t \\ \alpha_{t+1} &\leftarrow \alpha_t + \frac{ \langle \nabla L_t(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ \omega_t^T H(L_t)(\theta_t - \alpha_t \omega_t) \omega_t }, \end{align*} \right. $ } $$

where \(L_t\) is the proxy loss function at iteration \(t\) and \(\omega_t = \omega(\theta_t, \nabla L_t(\theta_t))\).

For example, the \(\alpha\) iteration using the Retsinas et al. approximation is

$$ \alpha_{t+1} \leftarrow \alpha_t + \frac{ \alpha_t \langle \nabla L_t(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ \langle \nabla L_t(\theta_t) - \nabla L_t(\theta_t - \alpha_t \omega_t), \omega_t \rangle }. $$
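A minimal PyTorch sketch of one coupled \((\theta, \alpha)\) step on a single batch follows. The helper names (`optimizer_direction`, the `batch` layout) are hypothetical, the \(10^{-12}\) guard is our own choice, and this is not the implementation linked above.

```python
import torch

def training_step(model, loss_fn, batch, optimizer_direction, alpha, gamma):
    """One (theta, alpha) update on a single batch with the Retsinas et al. approximation.

    optimizer_direction: hypothetical callable mapping the tuple of gradients of
    L_t at theta_t to the update directions omega_t (the identity for plain SGD).
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the proxy loss L_t at theta_t.
    loss_t = loss_fn(model(batch["x"]), batch["y"])
    grads_t = torch.autograd.grad(loss_t, params)
    omegas = optimizer_direction(grads_t)

    # Step to theta_{t+1} = theta_t - alpha * omega_t (also the lookahead point).
    with torch.no_grad():
        for p, w in zip(params, omegas):
            p.sub_(alpha * w)

    # Gradient of the same proxy loss L_t at the lookahead point.
    loss_look = loss_fn(model(batch["x"]), batch["y"])
    grads_look = torch.autograd.grad(loss_look, params)

    num = sum((gl * w).sum() for gl, w in zip(grads_look, omegas))
    den = sum(((gt - gl) * w).sum() for gt, gl, w in zip(grads_t, grads_look, omegas))
    alpha = alpha * float(1.0 + gamma * num / (den + 1e-12))
    return float(loss_t), alpha
```

In a full training loop, the returned \(\alpha\) would then be clamped as discussed in the next section.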

Implementation

To prevent \(\alpha\) from vanishing or growing without bound, we can clamp it in addition to making it adapt slowly.
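For example, a minimal sketch of such a clamp (the bounds are illustrative defaults, not values from [1]):

```python
def clamp_alpha(alpha, alpha_min=1e-8, alpha_max=1.0):
    """Keep the adapted learning rate within a fixed range after each update."""
    return max(alpha_min, min(alpha, alpha_max))
```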

References

  1. G. Retsinas, G. Sfikas, P. Filntisis, and P. Maragos, “Newton-Based Trainable Learning Rate,” ICASSP 2023.
  2. G. Retsinas, G. Sfikas, P. Filntisis, and P. Maragos, “Trainable Learning Rate,” 2022 (retracted).