Hi there!

This is Scott’s blog, mostly about the mathematics of machine learning…

The generalized Newton's method for learning rate selection

In this post, we review the generalized Newton’s method (GeN) proposed in [1]. We then explicitly compute the learning rates prescribed by the exact version of GeN for a simple problem instance. Finally, we give a high-level overview of a PyTorch implementation that runs the exact version of GeN for stochastic gradient descent. ...
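As a rough illustration of the underlying idea (not the post’s GeN implementation): for plain gradient descent with a single parameter tensor, the step size that minimizes the second-order Taylor expansion of the loss along the gradient direction has a closed form, which can be computed with a Hessian-vector product. The function name and toy loss below are made up for the sketch.

```python
import torch

# Minimal sketch (assumptions: plain gradient descent, one parameter tensor,
# and a locally valid quadratic model of the loss; not the post's GeN code).
# The step minimizing the second-order Taylor expansion along the gradient
# direction is eta* = (g . g) / (g . H g).
def quadratic_fit_lr(loss_fn, w):
    loss = loss_fn(w)
    (g,) = torch.autograd.grad(loss, w, create_graph=True)
    (hg,) = torch.autograd.grad((g * g.detach()).sum(), w)  # Hessian-vector product H @ g
    g = g.detach()
    return (g @ g) / (g @ hg)

# Toy usage: for the loss 0.5 * ||w||^2 the Hessian is the identity, so eta* = 1.
w = torch.tensor([1.0, -2.0], requires_grad=True)
print(quadratic_fit_lr(lambda w: 0.5 * (w @ w), w))  # tensor(1.)
```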

December 25, 2024

Input-output sensitivity in LSTM networks

The literature is full of claims that the LSTM architecture is well suited to learning input-output dependence over long time lags, and there is ample empirical evidence supporting them. Nevertheless, I couldn’t find a proof, at least not in the form of a direct analysis of input-output sensitivity. In this post, we get the ball rolling on such an analysis. First, we derive a recurrence relation relating the input-output sensitivities over arbitrarily long time lags. Then, we use the recurrence to show that a particular arrangement of hidden states preserves input-output sensitivity. ...
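For a sense of what “input-output sensitivity” measures, here is a small empirical probe (not the post’s recurrence-based analysis): use autograd to compare how strongly the final hidden state depends on the earliest versus the most recent input. The sizes and sequence length are arbitrary choices for the sketch.

```python
import torch

# Empirical probe: gradient of a scalar summary of h_T with respect to every
# input time step, giving a coarse sensitivity measure per lag.
torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=4, hidden_size=8)

T = 50
x = torch.randn(T, 1, 4, requires_grad=True)  # (seq_len, batch, input_size)
out, (h_T, c_T) = lstm(x)

# One backward pass gives d(sum(h_T)) / dx for all time steps at once.
h_T.sum().backward()
print(x.grad[0].norm())   # sensitivity of h_T to the earliest input
print(x.grad[-1].norm())  # sensitivity of h_T to the most recent input
```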

October 6, 2024

The gradient of cosine similarity

In this post, we compute the gradient of the cosine similarity function. Then, we numerically verify the result’s correctness using PyTorch’s gradcheck() function. ...
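For reference, the gradient of cos(a, b) with respect to a is b/(‖a‖‖b‖) − cos(a, b)·a/‖a‖². A short verification sketch in the spirit of the post: gradcheck compares autograd’s analytic gradient of cosine similarity against finite differences (double precision, since gradcheck’s default tolerances assume it).

```python
import torch
from torch.autograd import gradcheck

# Verify autograd's gradient of cosine similarity against finite differences.
a = torch.randn(5, dtype=torch.double, requires_grad=True)
b = torch.randn(5, dtype=torch.double, requires_grad=True)

cos_sim = lambda a, b: torch.nn.functional.cosine_similarity(a, b, dim=0)
print(gradcheck(cos_sim, (a, b)))  # True if analytic and numeric gradients agree
```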

October 4, 2024

An explicit formula for the gradient of scaled dot-product attention

In this post, we derive an explicit formula for the gradient of the scaled dot-product attention map. Then, we numerically verify the formula’s correctness using PyTorch’s gradcheck() function. ...
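A verification sketch in the spirit of the post (the explicit attention function and tensor shapes below are illustrative choices, not the post’s exact setup): write scaled dot-product attention out by hand and let gradcheck compare autograd’s gradient with finite differences.

```python
import torch
from torch.autograd import gradcheck

# Scaled dot-product attention written explicitly, so gradcheck can test
# the full chain: scores -> softmax -> weighted sum of values.
def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d**0.5
    return torch.softmax(scores, dim=-1) @ V

Q = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
K = torch.randn(5, 4, dtype=torch.double, requires_grad=True)
V = torch.randn(5, 4, dtype=torch.double, requires_grad=True)
print(gradcheck(attention, (Q, K, V)))  # True if the gradients agree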

September 19, 2024

Modified attention maps with simpler total derivatives

In this post, we show that a modified attention map, in which softmax is replaced by a different normalizing map, has a simpler total derivative. The goal is to improve the computational efficiency of backpropagation through attention-type maps. ...
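To illustrate the flavor of the comparison (the post’s actual replacement normalizer is not reproduced here; the sum-normalizer below is chosen purely as an example), contrast the Jacobian of softmax with that of a plain sum-normalization, which involves no exponentials:

```latex
% s = softmax(x); p(x) = x / (1^T x) is an illustrative alternative normalizer.
\[
  \frac{\partial\, \mathrm{softmax}(x)}{\partial x} = \operatorname{diag}(s) - s s^{\top},
  \qquad
  \frac{\partial}{\partial x}\!\left(\frac{x}{\mathbf{1}^{\top} x}\right)
    = \frac{I - p\,\mathbf{1}^{\top}}{\mathbf{1}^{\top} x}.
\]
```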

August 9, 2024

Higher-order total derivatives of softmax

In this post, we compute the second-order and third-order total derivatives of the softmax map. At some point, I thought that the higher-order derivatives could be used to cheaply compute approximations of the softmax map, but that’s a story for another post. ...
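A numerical cross-check sketch (not the post’s closed-form derivation, and assuming PyTorch 2.x so that torch.func is available): nesting jacrev yields the first-, second-, and third-order total derivatives of softmax as explicit tensors, which can be compared against any hand-derived formulas.

```python
import torch
from torch.func import jacrev

# Higher-order total derivatives of softmax, obtained by nesting reverse-mode
# Jacobians; useful for spot-checking closed-form expressions.
softmax = lambda x: torch.softmax(x, dim=0)

x = torch.randn(4, dtype=torch.double)
D1 = jacrev(softmax)(x)                  # shape (4, 4)
D2 = jacrev(jacrev(softmax))(x)          # shape (4, 4, 4)
D3 = jacrev(jacrev(jacrev(softmax)))(x)  # shape (4, 4, 4, 4)
print(D1.shape, D2.shape, D3.shape)
```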

August 8, 2024