Paper Summary #9  Sophia: A Scalable Stochastic Secondorder Optimizer for Language Model Pretraining
Understanding Sophia  A new fast, scalable secondorder optimizer which beats Adam on LLM pretraining
Paper: Sophia: A Scalable Stochastic Secondorder Optimizer for Language Model Pretraining
Link: https://arxiv.org/abs/2305.14342
Authors: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
Code: https://github.com/Liuhong99/Sophia
I have also released an annotated version of the paper. If you are interested, you can find it here.
Sophia is probably one of the most interesting papers I have read recently and I really liked how well it was written. This post is basically the notes that I had made while reading the paper, which is why it is not exactly a blog post and most of it is verbatim copied from the paper. But since there are a lot of optimizationtheory related concepts which have been mentioned in the paper, I have tried to add my own set of references which I have read in the past that helped me understand the paper better. Hopefully it helps someone!
Overview

The goal of this work is to propose a new optimizer for pretraining LLMs that can improve pretraining efficiency with a faster optimizer, which either reduces the time and cost to achieve the same pretraining loss, or alternatively achieves better pretraining loss with the same budget.

Adam and its variants have become the somewhat default optimizers which are used in LLM pretraining.

Designing fast optimizers for LLMs is challenging because 
 The benefit of the firstorder preconditioner (the $1/\sqrt{v_t}$ factor) in Adam is still not well understood.
 The choice of preconditioners is constrained because we can only afford lightweight options whose overhead can be offset by the speedup in the number of iterations.
 Among the recent works on lightweight gradientbased preconditioners, Lion stood out as it is substantially faster than Adam on vision Transformers and diffusion models but only achieves limited speedup on LLMs.

This paper introduces Sophia, Secondorder Clipped Stochastic Optimization, a lightweight secondorder optimizer that uses an inexpensive stochastic estimate of the diagonal of the Hessian as a preconditioner and a clipping mechanism to control the worstcase update size.

Key Results 
 Sophia achieves the same validation pretraining loss with 50% fewer number of steps than Adam.
 Sophia maintains almost the memory and average time per step and therefore the speedup also translates to 50% less total compute and 50% less wallclock time.
 The scaling law based on model size from 125M to 770M is in favor of Sophia over Adam  the gap between Sophia and Adam with 100K steps increases as the model size increases. Sophia on a 540Mparameter model with 100K steps gives the same validation loss as Adam on a 770Mparameter model with 100K steps.

Checking the performance and scalability of this optimizer for pretraining much larger model sizes would (although expensive) be interesting to see.

Sophia estimates the diagonal entries of the Hessian of the loss using a minibatch of examples every $k$ steps ($k=10$ in the paper).

The paper considers two options for diagonal Hessian estimators 
 Hutchinson’s unbiased estimator  an unbiased estimator that uses a Hessianvector product with the same runtime as a minibatch gradient up to a constant factor
 GaussNewtonBartlett (GNB) estimator  a biased estimator that uses one minibatch gradient calculated with resampled labels.

Both the estimators introduce only a 5% overhead per step (on average).

Sophia updates the parameter with an exponential moving average (EMA) of the gradient divided by the EMA of the diagonal Hessian estimate, subsequently clipped by a scalar.

Due to the Hessianbased preconditioner, Sophia adapts more efficiently, than Adam does, to the heterogeneous curvatures in different parameter dimensions, which can often occur in the landscape of LLMs losses and cause instability or slowdown.

In terms of the loss landscape, Sophia has a more aggressive preconditioner than Adam  Sophia applies a stronger penalization to updates in sharp dimensions (where the Hessian is large) than the flat dimensions (where the Hessian is small), ensuring a uniform loss decrease across all parameter dimensions. In contrast, Adam’s updates are mostly uniform across all parameter dimensions, leading to a slower loss decrease in flat dimensions.

Sophia’s clipping mechanism controls the worstcase size of the updates in all directions, safeguarding against the negative impact of inaccurate Hessian estimates, rapid Hessian changes over time, and nonconvex landscape.
Motivations
Heterogeneous Curvatures
 Loss functions in modern deep learning problems often have different curvatures across different parameter dimensions.
 The paper demonstrates the limitations of Adam and Gradient Descent by considering a two dimensional loss function  $L(\theta_{[1]},\theta_{[2]}) = L_1(\theta_{[1]}) + L_2(\theta_{[2]})$ where $$L_{1}(\theta_{[1]}) = 8(\theta_{[1]}  1)^2(1.3\theta_{[1]}^2 + 2\theta_{[1]} + 1)$$ $$L_2(\theta_{[2]}) = 1/2 (\theta_{[2]}  4)^2$$
 Here $L_1$ is much sharper than $L_2$.
 Another optimizer, SignGD is also compared which is quite old but can be understood as a simplified version of Adam, which does not involve taking the EMA for gradients and second moments of the gradients. The update then simplifies to  $\eta \cdot \nabla L(\theta)/\nabla L(\theta) = \eta\cdot \textup{sign}(\nabla L(\theta))$.
Limitations of GD and SignGD (Adam)
 The optimal learning rate of Gradient Descent should be proportional to the inverse of the curvature, i.e., the Hessian/second derivative at the local minimum. A good set of resources to understand this in detail are some notes from University of Toronto Section 2.2 and Section 4.1.
 So, if the curvatures of $L_1$ and $L_2$ at the local minima are $h_1$ and $h_2$ respectively (and thus $h_1 > h_2$ ). So the largest shared learning rate can only be $1/h_1$. Hence, the convergence in $\theta_{[2]}$ dimension is slow as also shown in the figure above.
 The update size of SignGD is the learning rate $\eta$ in all dimensions. Hence, intuitively, the same update size translates to less progress in decreasing the loss in the flat direction than in the sharp direction. In the yellow curve in the above figure, the progress of SignGD in the flat dimension $\theta_{[2]}$ is slow and along $\theta_{[1]}$, theiterate quickly travels to the valley in the first three steps and then starts to bounce. To fully converge in the sharp dimension, the learning rate $\eta$ needs to decay to $0$, which will exacerbate the slow convergence in the flat dimension $θ_{[2]}$. The trajectory of Adam is similar to SignGD and shown by the red curve in the figure.
 The behavior of SignGD and Adam above indicates that a more aggressive preconditioning is needed  sharp dimensions should have relatively smaller updates than flat dimensions so that the decrease of loss is equalized in all dimensions.
 Prior work on second order optimization suggest that the optimal preconditioner is the Hessian which captures the curvature on each dimension.
 The Newton’s method, does something similar  the update is the gradient divided by the Hessian in each dimension.$$\theta_{[1]} \leftarrow \theta_{[1]}  \eta \cdot L’_1(\theta_{[1]}) / h_1\ \ \textup{and}\ \ \theta_{[2]} \leftarrow \theta_{[2]}  \eta \cdot L’_2(\theta_{[2]})/ h_2$$
Limitations of Newton’s method
 Vanilla Newton’s method could converge to a global maximum when the local curvature is negative.
 As shown in the blue curve in the figure, Newton’s method quickly converges to a saddle point instead of a local minimum.
 Since the curvature may change rapidly along a trajectory, the second order information can often become unreliable.
 Sophia addresses this by considering only preconditioners that capture positive curvature, and introduce a percoordinate clipping mechanism to mitigate the rapid change of Hessian. Applying those changes results in the following update  $$\theta_{[1]} \leftarrow \theta_{[1]}  \eta \cdot \textup{clip}(\frac{ L’_1(\theta_{[1]})}{\max{h_1,\epsilon}} ,\rho)\ \textup{and}\ \theta_{[2]} \leftarrow \theta_{[2]}  \eta \cdot \textup{clip}(\frac{ L’_2(\theta_{[2]})}{\max{h_2,\epsilon}},\rho)$$
 Here, $\rho$ is a constant to control the worstcase update size, $\epsilon$ is a very small constant (e.g., 1e12) to avoid dividing by 0.
 The beauty here is that when the curvature of some dimension is rapidly changing or negative and thus the secondorder information is misleading and possibly leads to a huge update before clipping, the clipping mechanism kicks in and the optimizer defaults to SignGD (even though this is suboptimal for benign situations).
 The black curve in the figure starts off similarly to SignGD due to the clipping mechanism in the nonconvex region, making descent opposed to converging to a local maximum. In the convex valley, it converges to the global minimum with a few steps.
 Compared with SignGD and Adam, it makes much faster progress in the flat dimension $\theta_{[2]}$ (because the update is bigger in dimension $\theta_{[2]}$), while avoiding bouncing in the sharp dimension $\theta_{[1]}$ (because the update is significantly shrunk in the sharp dimension $\theta_{[1]}$).
Sophia: Secondorder Clipped Stochastic Optimization
EMA of diagonal Hessian estimates
 The diagonal Hessian estimates definitely have overheads, so it is computed at every $k$ steps.
 At every time step $t$, where $t \ \textup{mod} \ k = 1$, the estimator returns an estimate $h_t$ of the diagonal of the Hessian of the minibatch loss.
 Similar to the gradient of the minibatch loss function, the estimated diagonal Hessian can also have large noise.
 The EMA performed at every $k$ steps helps denoise the estimates. $$h_t = \beta_2 h_{tk} + (1  \beta_2) \hat{h}_{t} \ \textup{ if } t \ \mathrm{mod} \ k = 1 ; \textup{ else } h_t = h_{t1}$$
Percoordinate clipping
 I personally found this very innovative.
 As mentioned in the previous section, the inaccuracy of Hessian estimates and the change of Hessian along the trajectory can make the secondorder information unreliable.
 $$\theta_{t+1} \leftarrow \theta_{t}  \eta_t \cdot \textup{clip}(m_t / \max{h_t,\epsilon}, \rho)$$
 When any entry of $h_t$ is negative, e.g., $h_t[i] < 0$, the corresponding entry in the preconditioned gradient $m_t[i]/\max{h_t[i],\epsilon} = m_t[i]/\epsilon$ is extremely large and has the same sign as $m_t[i]$, and thus $\eta\cdot \textup{clip}(m_t[i] / \max{h_t[i],\epsilon}, \rho) = \eta\rho\cdot \textup{sign}(m_t[i])$, which is the same as stochastic momentum SignSGD.
 Sophia uses stochastic momentum SignSGD as a backup when the Hessian is negative (or mistakenly estimated to be negative or very small.) We also note that t
 The clipping mechanism controls the worstcase size of the updates in all parameter dimensions to be at most $\rho$, which also improves the stability (which could be a severe issue for secondorder methods).
Diagonal Hessian Estimators
 I’ll keep this section short and brief, for the simple reason that the math is extremely interesting but also a bit difficult to understand. So, I’ll add references to the best of my knowledge and I encourage the reader to go through them thoroughly.
Hutchinson’s unbiased estimator
 For any loss function $\ell(\theta)$ on parameters $\theta\in \mathbb{R}^d$, the Hutchinson’s estimator can be used to obtain an unbiased estimator for the diagonal of the Hessian.
 First, draw $u\in \mathbb{R}^d$ from the spherical Gaussian distribution $\mathcal{N}(0,\mathrm{I}_d)$, and then output $\hat{h} = u \odot (\nabla^2 \ell(\theta) u)$, where $\odot$ denotes the elementwise product, and $\nabla^2 \ell(\theta) u$ is the HVP of the Hessian with $u$.
 Using HVP (from JAX or Pytorch allows us to efficiently compute the product of the Hessian and a vector without the actual computation of the full Hessian matrix.
 Effectively, $\mathbb{E}[\hat h] = \mathrm{diag}(\nabla^2 \ell(\theta))$.
 How does it work? The idea for the above comes from the original paper from Hutchinson  “A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines” which was about estimating the trace of a matrix. The idea there was 
 Dimensionally, and somewhat intuitively, it kinda makes sense to change the dot product to an elementwise product to get the nonaggregated estimate of the diagonal (since trace is the sum of the diagonal elements).
 This idea was also used in the “Hessian Diagonal Approximation” section of the AdaHessian paper, and I would also encourage the reader to go through Section 2 of the paper  “An Estimator for the Diagonal of a Matrix”, which the AdaHessian paper cites as well.
GaussNewtonBartlett (GNB) estimator
 The math for this section is particularly interesting and I would redirect you to my annotated paper where I have mentioned references and also worked out some of the math.
 I’d highly recommend reading Section 4.4 of these notes and optionally Section 4 of these notes (not absolutely required, just helps to understand the footnote in the Sophia paper).
 Ultimately, the GNB estimator used by the authors is $$B \cdot \nabla_\theta \widehat L(\theta) \odot \nabla_\theta \widehat L(\theta)$$ where $\widehat L(\theta) = \frac{1}{B}\sum_{b=1}^B \ce(f(\theta, x_b), \hat{y}_b)$, here $\hat{y}_b$ are not the labels corresponding to $x_b$. They are just __sampled labels__ from the minibatch.
 The reason we can do that is in the math, which comes from the combination of
 The claim in the paper that the secondorder derivative of the loss w.r.t. the logits only depends on the logits and the true labels $y$.
 Bartlett’s first identity, which generally holds for the negative loglikelihood loss of any probabilistic model and which states  $$\forall b, \mathbb{E}_{\hat{y}_b}\nabla L\textup{ce}(f(\theta,x_b),\hat{y}_b) = 0$$
 Also, the above estimator is an unbiased estimator for the diagonal of the GaussNewton matrix, which is a biased estimator for the diagonal of the Hessian.
 Additional References  Bartlett Identities
Comparison of Hessian estimators
 The Hutchinson’s estimator does not assume any structure of the loss, but requires a Hessianvector product.
 The GNB estimator only estimates the GaussNewton term but always gives a positive semidefinite (nonnegative) diagonal Hessian estimate. The PSDness ensures that the preconditioned update is always a descent direction.
 The GaussNewton Matrix is guaranteed to be PSD if the abovementioned secondorder derivative of the loss w.r.t. the logits is PSD. For proof, refer this.
Experiments
 There are tons of details in the paper regarding the experiments. I’ll just mention the key points here, just in case I have to look up something quickly.
Experimental Setup
Baselines
 Main comparison with AdamW and Lion
 AdamW for GPT2 hyperparams  WD = 0.1, $\beta_1 = 0.9$ and $\beta_2 = 0.95$
 Lion for GPT2 hyperparams  $\beta_1 = 0.95$ and $\beta_2 = 0.98$
Implementation
 Batch Size = 480
 Cosine LR schedule with the final LR equal to 0.05 times the peak LR
 Standard gradient clipping (by norm) threshold 1.0
 Fixed 2k steps of LR warmup
 For Sophia, the authors use $\beta_1 = 0.96$, $\beta_2 = 0.99$, $\epsilon=$ 1e12 and update diagonal Hessian every 10 steps.
 SophiaH (which refers to Sophia with Hutchinson estimator) uses $\rho=0.01$, and only a subset of 32 examples from the minibatch to calculate the diagonal Hessian to further reduce overhead.
 SophiaG (which refers to Sophia with GNB estimator) uses $\rho=20$, and use a subset of 240 examples from the minibatch to calculate the diagonal GaussNewton.
 All models are trained in bfloat16.
 The 125M and 355M models are trained on A5000 GPUs, while the 770M models are trained on A100 GPUs. Total amount of compute spent on all experiments is about 6000 hours on A100s and 10000 hours on A5000s. This amounts to 4.38e21 FLOPs.
Results
 Sophia consistently achieves better validation loss than AdamW and Lion.
 As the model size grows, the gap between Sophia and baselines also becomes larger. SophiaH and SophiaG both achieve a 0.04 smaller validation loss on the 355M model, SophiaH achieves a 0.05 smaller validation loss on the 770M model with the same 100k steps.
 According to the scaling laws in this regime, an improvement in loss of 0.05 is equivalent to 2x improvement in terms of number of steps or total compute to achieve the same validation loss
 Sophia is 2x faster in terms of number of steps, total compute and wallclock time.
 The scaling law is in favor of SophiaH over AdamW
 The 540M model trained by SophiaH has smaller loss than the 770M model trained by AdamW.
 The 355M model trained by SophiaH has comparable loss as the 540M model trained by AdamW.
 Fewshot Evaluation on Downstream Tasks (SuperGLUE)
 GPT2 medium and GPT2 large pretrained with Sophia have better fewshot accuracy on most subtasks.
 Models pretrained with SophiaH have comparable fewshot accuracy as models pretrained with AdamW for 2x number of steps.
 Sensitivity to $\rho$ and $\beta_2$, and transferability of hyperparameters
 While performing grid search on hyperparams on a smaller 30M model, the authors found that all combinations have a similar performance. Moreover, this hyperparameter choice is transferable across model sizes. For all the experiments on 125M, 355M and 770M, we use the hyperparameters searched on the 30M model, which is ρ = 0.01, β2 = 0.99.
 Training Stability
 Gradient clipping (by norm) is an important technique in language model pretraining as it avoids messing up the moment of gradients with one minibatch gradient computed from rare data.
 In practice, the frequency that gradients clipping is triggered is related to the training stability  if the gradient is frequently clipped, the iterate can be in a very unstable state.
 Although all methods use the same clipping threshold 1.0, SophiaH seldomly triggers gradient clipping, while AdamW and Lion trigger gradient clipping in more than 10% of the steps
 Another common trick of pretraining deep Transformers is scaling the product of keys and values by the inverse of the layer index [Mistral]. This stabilizes training and increases the largest possible learning rate.
 Without this trick, the maximum learning rate of AdamW and Lion on GPT2 medium (355M) can only be 1.5e4, which is much smaller than 3e4 with the trick (the loss will blow up with 3e4 without the trick). Moreover, the loss decreases much slower without the trick as shown. In all the experiments, SophiaH does not require scaling the product of keys and values by the inverse of the layer index.
Limitations
 Scaling up to larger models and datasets
 The paper only experiments with GPT2 pretraining on OpenWebText for model sizes up to 770M params.
 Although it is faster and better than Adam and Lion in these set of experiments, and the scaling laws and pretraining stability are encouraging, it remains to be seen how well Sophia scales on larger models.
 Holistic downstream evaluation
 The paper has only experimented with 4 SuperGLUE tasks and although the results are encouraging, a better downstream evaluation is still important.
 To note  The limitation in downstream evaluation is also due to the limited model size, because language models at this scale do not have enough capabilities such as incontext learning, and mathematical reasoning.
 Evaluation on other domains
 This paper focuses on optimizers for LLMs, it should be evaluated in other domains like CV, RL and Multimodal tasks.
A very exciting paper. Hope people can test it out on even bigger models and across multiple domains and we may potentially have an optimizer finally dethroning Adam!