In this post, we review the Newton-like method for learning rate selection. It works as a wrapper around any optimizer (SGD, Adam, AdamW, and so on), with no constraints imposed on the optimizer itself.
The PyTorch implementation is available here.
$$ \newcommand{\bN}{\mathbb{N}} \newcommand{\bR}{\mathbb{R}} \newcommand{\bZ}{\mathbb{Z}} \newcommand{\rmvec}{\mathrm{vec}} \newcommand{\sL}{\mathscr{L}} \newcommand{\tr}{\mathrm{trace}} \definecolor{lesserbox}{rgb}{0.85, 0.95, 1.0} \definecolor{magicmint}{rgb}{0.67, 0.94, 0.82} $$

In general
Given a smooth function \(f : \bR^n \to \bR\), the gradient descent iterations are
$$ \theta_{t+1} \leftarrow \theta_t - \alpha_t \omega(\theta_t, \nabla f(\theta_t)). $$Here \(\alpha_t \in \bR_{\geq 0}\) is the learning rate at step \(t\), and \(\omega : \bR^n \times \bR^n \to \bR^n\) is the optimizer.
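For example, plain gradient descent and gradient descent with weight decay \(\lambda\) correspond to
$$ \omega(\theta_t, \nabla f(\theta_t)) = \nabla f(\theta_t) \qquad \text{and} \qquad \omega(\theta_t, \nabla f(\theta_t)) = \nabla f(\theta_t) + \lambda \theta_t, $$respectively; stateful optimizers such as Adam fit the same pattern once we allow \(\omega\) to depend on the optimizer's internal state (for instance its moment estimates) as well.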
Define the “value per learning rate” functions \(g_t : \bR_{\geq 0} \to \bR\) by
$$ g_t(\alpha) = f(\theta_t - \alpha \omega(\theta_t, \nabla f(\theta_t))). $$Following [1], [2], we propose to minimize \(f\) by the Newton-like iterations
$$ \left\{ \begin{align*} \theta_{t+1} &\leftarrow \theta_t - \alpha_t \omega(\theta_t, \nabla f(\theta_t)) \\ \alpha_{t+1} &\leftarrow \alpha_t - \frac{g'_t(\alpha_t)}{g''_t(\alpha_t)}. \end{align*} \right. $$In the rest of this post, we will simply write
$$ \omega_t = \omega(\theta_t, \nabla f(\theta_t)). $$To write the Newton-like iterations more explicitly, observe that
$$ \begin{align*} g'_t(\alpha_t) &= - df(\theta_t - \alpha_t \omega_t) \cdot \omega_t \\ &= - \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle. \\ \end{align*} $$Similarly, we have
$$ \begin{align*} g''_t(\alpha_t) &= d^2 f(\theta_t - \alpha_t \omega_t) \cdot (\omega_t, \omega_t) \\ &= \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t, \end{align*} $$where \(H(f)(\theta)\) is the Hessian matrix of \(f\) evaluated at \(\theta\). The Newton-like iterations are
$$ \colorbox{lesserbox} { $ \left\{ \begin{align*} \theta_{t+1} &\leftarrow \theta_t - \alpha_t \omega_t \\ \alpha_{t+1} &\leftarrow \alpha_t + \frac{ \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t }. \end{align*} \right. $ } $$We would like to avoid computing the Hessian-vector products, which is impractical for large models. To do so, we will use an approximation.
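For reference, here is a minimal PyTorch sketch of one exact iteration on a toy objective, with the Hessian-vector product computed by double backpropagation; the objective and variable names are illustrative, not taken from [1] or from the linked implementation. This is the computation the approximations below avoid.

```python
import torch

# Toy smooth objective f : R^n -> R (an illustrative choice).
def f(theta):
    return (theta ** 4).sum() + (theta ** 2).sum()

theta = torch.tensor([1.5, -0.7], requires_grad=True)
alpha = 0.1

# Plain-SGD optimizer map: omega(theta_t, grad f(theta_t)) = grad f(theta_t).
omega = torch.autograd.grad(f(theta), theta)[0]

# Lookahead point theta_t - alpha_t * omega_t, as a fresh leaf for double backprop.
lookahead = (theta - alpha * omega).detach().requires_grad_(True)

# g'(alpha_t) = -<grad f(lookahead), omega_t>.
grad_look = torch.autograd.grad(f(lookahead), lookahead, create_graph=True)[0]
g_prime = -torch.dot(grad_look, omega)

# g''(alpha_t) = omega_t^T H(f)(lookahead) omega_t, via a Hessian-vector product.
hvp = torch.autograd.grad(torch.dot(grad_look, omega), lookahead)[0]
g_double_prime = torch.dot(omega, hvp)

# Newton-like updates.
theta = (theta - alpha * omega).detach().requires_grad_(True)
alpha = alpha - (g_prime / g_double_prime).item()
```

Per step, this costs an extra forward/backward at the lookahead point plus a double-backward pass for the Hessian-vector product, which is what motivates the cheaper approximations below.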
Retsinas et al. approximation
The authors of [1] derive their approximation of the Hessian-vector products starting with a first-order Taylor series approximation of the partial derivatives of \(f\):
$$ \begin{align*} d_i f (\theta_t) &= d_i f (\theta_t - \alpha_t \omega_t + \alpha_t \omega_t) \\ &\approx d_i f (\theta_t - \alpha_t \omega_t) + \alpha_t d(d_i f)(\theta_t - \alpha_t \omega_t) \cdot \omega_t. \end{align*} $$We can write these approximations more explicitly as
$$ \begin{align*} d_i f (\theta_t) &\approx d_i f (\theta_t - \alpha_t \omega_t) + \alpha_t \begin{bmatrix} d^2_{i,1} f(\theta_t - \alpha_t \omega_t) & \cdots & d^2_{i,n} f(\theta_t - \alpha_t \omega_t) \end{bmatrix} \omega_t. \end{align*} $$Stacking these quantities, we have
$$ \begin{align*} \nabla f (\theta_t) \approx \nabla f (\theta_t - \alpha_t \omega_t) + \alpha_t H(f)(\theta_t - \alpha_t \omega_t) \omega_t, \end{align*} $$where \(H(f)(\theta)\) is the Hessian matrix of \(f\) evaluated at \(\theta\).
Rearranging and multiplying on the left by \(\omega_t^T\), we obtain
$$ \begin{align*} \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t \approx \frac{1}{\alpha_t} \langle \nabla f (\theta_t) - \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle, \quad \alpha_t > 0. \end{align*} $$This reproduces Equation (7) in [1].
Using this approximation, the \(\alpha\) iteration becomes
$$ \colorbox{lesserbox} { $ \alpha_{t+1} \leftarrow \alpha_t + \frac{ \alpha_t \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle } { \langle \nabla f (\theta_t) - \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle }. $ } $$Adding a slow-adaptation parameter \(\gamma \in (0, 1]\) gives Equation (8) in [1]:
$$ \alpha_{t+1} \leftarrow \alpha_t \left( 1 + \gamma \frac{ \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle } { \langle \nabla f (\theta_t) - \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle } \right). $$We are interested in how this compares to \(\alpha\) iterations which are derived using different approximations of the Hessian-vector product.
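As a concrete illustration, here is a minimal sketch of this update rule wrapped around plain gradient descent on a toy quadratic; the objective and variable names are my own, not those of the reference implementation.

```python
import numpy as np

# Toy quadratic objective and its gradient (illustrative, not from [1]).
A = np.diag([1.0, 10.0])

def f(theta):
    return 0.5 * theta @ A @ theta

def grad_f(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
alpha, gamma = 0.01, 0.1

for t in range(100):
    omega = grad_f(theta)              # plain SGD: omega_t = grad f(theta_t)
    lookahead = theta - alpha * omega  # theta_t - alpha_t * omega_t
    g_look = grad_f(lookahead)

    # Equation (8) of [1]: alpha_{t+1} = alpha_t * (1 + gamma * num / den).
    num = np.dot(g_look, omega)
    den = np.dot(grad_f(theta) - g_look, omega)

    theta = lookahead                  # theta_{t+1} = theta_t - alpha_t * omega_t
    alpha = alpha * (1.0 + gamma * num / den)
```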
Alternative approximation 1
Our starting point is a completely standard second-order Taylor series approximation of \(f(\theta_t)\) about the lookahead point \(\theta_t - \alpha_t \omega_t\):
$$ \begin{align*} f(\theta_t) &= f(\theta_t - \alpha_t \omega_t + \alpha_t \omega_t) \\ &\approx f(\theta_t - \alpha_t \omega_t) + \alpha_t df(\theta_t - \alpha_t \omega_t) \cdot \omega_t + \frac{\alpha_t^2}{2} d^2 f(\theta_t - \alpha_t \omega_t) \cdot (\omega_t, \omega_t) \\ &= f(\theta_t - \alpha_t \omega_t) + \alpha_t \langle \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle + \frac{\alpha_t^2}{2} \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t. \end{align*} $$Rearranging, we obtain
$$ \begin{align*} &\omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t \\ &\qquad \approx \frac{2}{\alpha_t^2} (f(\theta_t) - f(\theta_t - \alpha_t \omega_t)) - \frac{2}{\alpha_t} \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle, \quad \alpha_t > 0. \end{align*} $$Using this approximation, the \(\alpha\) iteration becomes
$$ \colorbox{lesserbox} { $ \begin{align*} \alpha_{t+1} &\leftarrow \alpha_t + \frac{ \alpha_t^2 \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ 2 (f(\theta_t) - f(\theta_t - \alpha_t \omega_t) - \alpha_t \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle) }. \end{align*} $ } $$How do the two approximations compare? Setting them equal, we see that
$$ \begin{align*} &\frac{1}{\alpha_t} \langle \nabla f(\theta_t), \omega_t \rangle - \frac{1}{\alpha_t} \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle \\ &\qquad = \frac{2}{\alpha_t^2} (f(\theta_t) - f(\theta_t - \alpha_t \omega_t)) - \frac{2}{\alpha_t} \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle \end{align*} $$if and only if
$$ \begin{align*} \frac{\langle \nabla f(\theta_t), \omega_t \rangle + \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle}{2} = \frac{f(\theta_t) - f(\theta_t - \alpha_t \omega_t)}{\alpha_t}. \end{align*} $$From this we see that the developments in [1] implicitly approximate the lookahead value difference \(f(\theta_t) - f(\theta_t - \alpha_t \omega_t)\) by \(\alpha_t\) times the average of the two directional derivatives. To understand the size of the error incurred by this approximation, observe that
$$ \frac{f(\theta_t) - f(\theta_t - \alpha_t \omega_t)}{\alpha_t} = \langle \nabla f(\theta_t), \omega_t \rangle - \frac{\alpha_t}{2} \omega_t^T H(f)(\theta_t) \omega_t + O(\alpha_t^2). $$On the other hand,
$$ \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle = \langle \nabla f(\theta_t), \omega_t \rangle - \alpha_t \omega_t^T H(f)(\theta_t) \omega_t + O(\alpha_t^2) $$which implies that
$$ \begin{align*} \frac{\langle \nabla f(\theta_t), \omega_t \rangle + \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle}{2} = \langle \nabla f(\theta_t), \omega_t \rangle - \frac{\alpha_t}{2} \omega_t^T H(f)(\theta_t) \omega_t + O(\alpha_t^2). \end{align*} $$We conclude that
$$ \begin{align*} \frac{\langle \nabla f(\theta_t), \omega_t \rangle + \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle}{2} = \frac{f(\theta_t) - f(\theta_t - \alpha_t \omega_t)}{\alpha_t} + O(\alpha_t^2). \end{align*} $$This means that the developments in [1] incur an extra \(O(\alpha_t^2)\) error term by approximating the lookahead value difference.
There seems to be no advantage to this approximation, since we need to compute \(f(\theta_t)\) and \(f(\theta_t - \alpha_t \omega_t)\) in any case, the latter as the “forward” part of the \(\nabla f(\theta_t - \alpha_t \omega_t)\) computation.
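As a quick numerical sanity check of this error estimate, on a smooth test function of my own choosing, the gap between the averaged directional derivatives and the difference quotient should shrink by roughly a factor of 100 each time \(\alpha_t\) shrinks by a factor of 10:

```python
import numpy as np

# Smooth 1D test function with a nonzero third derivative (my own choice).
f = lambda x: np.exp(x) + x ** 3
df = lambda x: np.exp(x) + 3 * x ** 2

theta, omega = 0.3, 1.0  # current point and update direction
for alpha in [1e-1, 1e-2, 1e-3]:
    lookahead = theta - alpha * omega
    avg_grad = 0.5 * (df(theta) + df(lookahead)) * omega
    diff_quot = (f(theta) - f(lookahead)) / alpha
    print(alpha, abs(avg_grad - diff_quot))  # error shrinks roughly like alpha**2
```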
Adding a slow-adaptation parameter \(\gamma \in (0, 1]\) gives the final \(\alpha\) iteration:
$$ \begin{align*} \alpha_{t+1} &\leftarrow \alpha_t \left( 1 + \gamma \frac{ \alpha_t \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ 2 (f(\theta_t) - f(\theta_t - \alpha_t \omega_t) - \alpha_t \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle) } \right). \end{align*} $$
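Relative to the Retsinas et al. rule, only the denominator changes. A minimal sketch of the corresponding update step, assuming illustrative names and analytic callables `f` and `grad_f` for the objective and its gradient:

```python
import numpy as np

def alpha_update_alt1(alpha, gamma, f, grad_f, theta, omega):
    """Alpha step for alternative approximation 1 (illustrative sketch)."""
    lookahead = theta - alpha * omega
    inner = np.dot(grad_f(lookahead), omega)
    # Denominator uses the exact lookahead value difference f(theta) - f(lookahead).
    denom = 2.0 * (f(theta) - f(lookahead) - alpha * inner)
    return alpha * (1.0 + gamma * alpha * inner / denom)
```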
Alternative approximation 2
As in the previous section, we start with second-order Taylor series approximations about the lookahead point \(\theta_t - \alpha_t \omega_t\), this time of both \(f(\theta_t)\) and \(f(\theta_t + \alpha_t \omega_t)\):
$$ \begin{align*} f(\theta_t) &\approx f(\theta_t - \alpha_t \omega_t) + \alpha_t \langle \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle + \frac{\alpha_t^2}{2} \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t, \\ f(\theta_t + \alpha_t \omega_t) &\approx f(\theta_t - \alpha_t \omega_t) + 2 \alpha_t \langle \nabla f (\theta_t - \alpha_t \omega_t), \omega_t \rangle + 2 \alpha_t^2 \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t. \end{align*} $$In the combination \(f(\theta_t - \alpha_t \omega_t) + f(\theta_t + \alpha_t \omega_t) - 2 f(\theta_t)\), the value and gradient terms cancel, and it follows that
$$ \frac{f(\theta_t - \alpha_t \omega_t) + f(\theta_t + \alpha_t \omega_t) - 2 f(\theta_t)}{\alpha_t^2} \approx \omega_t^T H(f)(\theta_t - \alpha_t \omega_t) \omega_t. $$Using this approximation, the \(\alpha\) iteration becomes
$$ \colorbox{lesserbox} { $ \begin{align*} \alpha_{t+1} &\leftarrow \alpha_t + \frac{ \alpha_t^2 \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ f(\theta_t - \alpha_t \omega_t) + f(\theta_t + \alpha_t \omega_t) - 2f(\theta_t) }. \end{align*} $ } $$Adding a slow-adaptation parameter \(\gamma \in (0, 1]\) gives the final \(\alpha\) iteration:
$$ \begin{align*} \alpha_{t+1} &\leftarrow \alpha_t \left( 1 + \gamma \frac{ \alpha_t \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ f(\theta_t - \alpha_t \omega_t) + f(\theta_t + \alpha_t \omega_t) - 2f(\theta_t) } \right). \end{align*} $$The numerator can also be estimated without gradient computations, since
$$ \langle \nabla f(\theta_t - \alpha_t \omega_t), \omega_t \rangle \approx \frac{f(\theta_t + \alpha_t \omega_t) - f(\theta_t - \alpha_t \omega_t)}{2 \alpha_t}. $$
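Combining the two finite differences, the \(\alpha\) update needs only the three loss values \(f(\theta_t - \alpha_t \omega_t)\), \(f(\theta_t)\), and \(f(\theta_t + \alpha_t \omega_t)\), and no extra gradient computation. A minimal sketch with illustrative names:

```python
def alpha_update_alt2(alpha, gamma, f, theta, omega):
    """Alpha step for alternative approximation 2, gradient-free variant (sketch)."""
    f_minus = f(theta - alpha * omega)  # lookahead value f(theta_t - alpha_t * omega_t)
    f_mid = f(theta)
    f_plus = f(theta + alpha * omega)

    inner = (f_plus - f_minus) / (2.0 * alpha)    # central-difference <grad f(lookahead), omega_t>
    second_diff = f_minus + f_plus - 2.0 * f_mid  # approximates alpha_t^2 * omega_t^T H omega_t
    return alpha * (1.0 + gamma * alpha * inner / second_diff)
```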
ML context
In the context of machine learning, there is (at least in principle) a single loss function to be minimized. In practice, however, a distinct proxy loss function is used in each gradient descent iteration, since the training batch changes from iteration to iteration. To account for this, we can rewrite the Newton-like iterations as
$$ \colorbox{lesserbox} { $ \left\{ \begin{align*} \theta_{t+1} &\leftarrow \theta_t - \alpha_t \omega_t \\ \alpha_{t+1} &\leftarrow \alpha_t + \frac{ \langle \nabla L_t(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ \omega_t^T H(L_t)(\theta_t - \alpha_t \omega_t) \omega_t }. \end{align*} \right. $ } $$where \(L_t\) is the proxy loss function at iteration \(t\) and \(\omega_t = \omega(\theta_t, \nabla L_t(\theta_t))\).
For example, the \(\alpha\) iteration using the Retsinas et al. approximation is
$$ \begin{align*} \alpha_{t+1} \leftarrow \alpha_t + \frac{ \alpha_t \langle \nabla L_t(\theta_t - \alpha_t \omega_t), \omega_t \rangle }{ \langle \nabla L_t(\theta_t) - \nabla L_t(\theta_t - \alpha_t \omega_t), \omega_t \rangle }. \end{align*} $$
Implementation
To prevent \(\alpha\) from vanishing or growing without bound, we can clamp it to an interval \([\alpha_{\min}, \alpha_{\max}]\) in addition to making it adapt slowly via \(\gamma\).
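A minimal sketch of how such a wrapper might look around any `torch.optim` optimizer, using the Retsinas et al. \(\alpha\) update with clamping; the `loss_fn(model, batch)` signature, the `state` dictionary, and the recovery of \(\omega_t\) from the parameter change are assumptions of this sketch, not the linked implementation.

```python
import torch

def newton_lr_step(model, loss_fn, batch, optimizer, state,
                   gamma=0.1, alpha_min=1e-6, alpha_max=1.0):
    """One training step with the Newton-like learning rate update (hedged sketch).

    Assumes the wrapped optimizer's step is proportional to its learning rate,
    so omega_t can be recovered from the parameter change.
    """
    alpha = state["alpha"]
    for group in optimizer.param_groups:
        group["lr"] = alpha
    params = [p for group in optimizer.param_groups for p in group["params"]]

    # Gradient of the proxy loss L_t at theta_t.
    optimizer.zero_grad()
    loss_fn(model, batch).backward()
    grads_before = [p.grad.detach().clone() for p in params]
    thetas_before = [p.detach().clone() for p in params]

    # theta_{t+1} = theta_t - alpha_t * omega_t (whatever direction the optimizer uses).
    optimizer.step()
    omegas = [(tb - p.detach()) / alpha for tb, p in zip(thetas_before, params)]

    # Gradient of the same proxy loss L_t at the lookahead point theta_{t+1}.
    optimizer.zero_grad()
    loss_fn(model, batch).backward()
    grads_after = [p.grad.detach() for p in params]

    num = sum(torch.sum(ga * om) for ga, om in zip(grads_after, omegas))
    den = sum(torch.sum((gb - ga) * om)
              for gb, ga, om in zip(grads_before, grads_after, omegas))

    # Slowly adapted, clamped learning rate for the next step.
    new_alpha = alpha * (1.0 + gamma * (num / den).item())
    state["alpha"] = float(min(max(new_alpha, alpha_min), alpha_max))
```

For an optimizer whose step is not simply proportional to its learning rate, \(\omega_t\) would have to be obtained differently, for example by querying the optimizer's update direction directly.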
References
[1] G. Retsinas, G. Sfikas, P. Filntisis and P. Maragos, “Newton-Based Trainable Learning Rate,” ICASSP 2023.
[2] G. Retsinas, G. Sfikas, P. Filntisis and P. Maragos, “Trainable Learning Rate,” 2022 (retracted).