Hi there!

This is Scott’s blog, mostly about the mathematics of machine learning…

The generalized Newton's method for learning rate selection

In this post, we review the generalized Newton’s method (GeN) proposed in [1]. We then explicitly compute the learning rates prescribed by the exact version of GeN for a simple problem instance. Finally, we give a high-level overview of a PyTorch implementation that runs the exact version of GeN for stochastic gradient descent. ...
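As a rough illustration of the underlying idea (not the post’s GeN implementation): for plain gradient descent with a single parameter tensor, the step size that minimizes the second-order Taylor expansion of the loss along the gradient direction has a closed form, which can be computed with a Hessian-vector product. The function name and toy loss below are made up for the sketch.

```python
import torch

# Minimal sketch (assumptions: plain gradient descent, one parameter tensor,
# and a locally valid quadratic model of the loss; not the post's GeN code).
# The step minimizing the second-order Taylor expansion along the gradient
# direction is eta* = (g . g) / (g . H g).
def quadratic_fit_lr(loss_fn, w):
    loss = loss_fn(w)
    (g,) = torch.autograd.grad(loss, w, create_graph=True)
    (hg,) = torch.autograd.grad((g * g.detach()).sum(), w)  # Hessian-vector product H @ g
    g = g.detach()
    return (g @ g) / (g @ hg)

# Toy usage: for the loss 0.5 * ||w||^2 the Hessian is the identity, so eta* = 1.
w = torch.tensor([1.0, -2.0], requires_grad=True)
print(quadratic_fit_lr(lambda w: 0.5 * (w @ w), w))  # tensor(1.)
```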

December 25, 2024

Input-output sensitivity in LSTM networks

The literature is full of claims that the LSTM architecture is well suited to learning input-output dependence over long time lags, and there is ample empirical evidence supporting them. Nevertheless, I couldn’t find a proof, at least not in the form of a direct analysis of input-output sensitivity. In this post, we get the ball rolling on such an analysis. First, we derive a recurrence relation relating the input-output sensitivities over arbitrarily long time lags. Then, we use the recurrence to show that a particular arrangement of hidden states preserves input-output sensitivity. ...
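For a sense of what “input-output sensitivity” measures, here is a small empirical probe (not the post’s recurrence-based analysis): use autograd to compare how strongly the final hidden state depends on the earliest versus the most recent input. The sizes and sequence length are arbitrary choices for the sketch.

```python
import torch

# Empirical probe: gradient of a scalar summary of h_T with respect to every
# input time step, giving a coarse sensitivity measure per lag.
torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=4, hidden_size=8)

T = 50
x = torch.randn(T, 1, 4, requires_grad=True)  # (seq_len, batch, input_size)
out, (h_T, c_T) = lstm(x)

# One backward pass gives d(sum(h_T)) / dx for all time steps at once.
h_T.sum().backward()
print(x.grad[0].norm())   # sensitivity of h_T to the earliest input
print(x.grad[-1].norm())  # sensitivity of h_T to the most recent input
```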

October 6, 2024

The gradient of cosine similarity

In this post, we compute the gradient of the cosine similarity function. Then, we numerically verify the result’s correctness using PyTorch’s gradcheck() function. ...
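For reference, the gradient of cos(a, b) with respect to a is b/(‖a‖‖b‖) − cos(a, b)·a/‖a‖². A short verification sketch in the spirit of the post: gradcheck compares autograd’s analytic gradient of cosine similarity against finite differences (double precision, since gradcheck’s default tolerances assume it).

```python
import torch
from torch.autograd import gradcheck

# Verify autograd's gradient of cosine similarity against finite differences.
a = torch.randn(5, dtype=torch.double, requires_grad=True)
b = torch.randn(5, dtype=torch.double, requires_grad=True)

cos_sim = lambda a, b: torch.nn.functional.cosine_similarity(a, b, dim=0)
print(gradcheck(cos_sim, (a, b)))  # True if analytic and numeric gradients agree
```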

October 4, 2024

An explicit formula for the gradient of scaled dot-product attention

In this post, we derive an explicit formula for the gradient of the scaled dot-product attention map. Then, we numerically verify the formula’s correctness using PyTorch’s gradcheck() function. ...
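A verification sketch in the spirit of the post (the explicit attention function and tensor shapes below are illustrative choices, not the post’s exact setup): write scaled dot-product attention out by hand and let gradcheck compare autograd’s gradient with finite differences.

```python
import torch
from torch.autograd import gradcheck

# Scaled dot-product attention written explicitly, so gradcheck can test
# the full chain: scores -> softmax -> weighted sum of values.
def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d**0.5
    return torch.softmax(scores, dim=-1) @ V

Q = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
K = torch.randn(5, 4, dtype=torch.double, requires_grad=True)
V = torch.randn(5, 4, dtype=torch.double, requires_grad=True)
print(gradcheck(attention, (Q, K, V)))  # True if the gradients agree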

September 19, 2024

Modified attention maps with simpler total derivatives

In this post, we show that a modified attention map, in which softmax is replaced by a different normalizing map, has a simpler total derivative. The goal is to improve the computational efficiency of backpropagation through attention-type maps. ...
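To illustrate the flavor of the comparison (the post’s actual replacement normalizer is not reproduced here; the sum-normalizer below is chosen purely as an example), contrast the Jacobian of softmax with that of a plain sum-normalization, which involves no exponentials:

```latex
% s = softmax(x); p(x) = x / (1^T x) is an illustrative alternative normalizer.
\[
  \frac{\partial\, \mathrm{softmax}(x)}{\partial x} = \operatorname{diag}(s) - s s^{\top},
  \qquad
  \frac{\partial}{\partial x}\!\left(\frac{x}{\mathbf{1}^{\top} x}\right)
    = \frac{I - p\,\mathbf{1}^{\top}}{\mathbf{1}^{\top} x}.
\]
```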

August 9, 2024

Higher-order total derivatives of softmax

In this post, we compute the second-order and third-order total derivatives of the softmax map. At some point, I thought that the higher-order derivatives could be used to cheaply compute approximations of the softmax map, but that’s a story for another post. ...
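A numerical cross-check sketch (not the post’s closed-form derivation, and assuming PyTorch 2.x so that torch.func is available): nesting jacrev yields the first-, second-, and third-order total derivatives of softmax as explicit tensors, which can be compared against any hand-derived formulas.

```python
import torch
from torch.func import jacrev

# Higher-order total derivatives of softmax, obtained by nesting reverse-mode
# Jacobians; useful for spot-checking closed-form expressions.
softmax = lambda x: torch.softmax(x, dim=0)

x = torch.randn(4, dtype=torch.double)
D1 = jacrev(softmax)(x)                  # shape (4, 4)
D2 = jacrev(jacrev(softmax))(x)          # shape (4, 4, 4)
D3 = jacrev(jacrev(jacrev(softmax)))(x)  # shape (4, 4, 4, 4)
print(D1.shape, D2.shape, D3.shape)
```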

August 8, 2024