Latest posts
- Writing an LLM from scratch, part 32b -- Interventions: gradient clipping (Feb 05, 2026)
I'm still working on training the best GPT-2-small-sized base model that I can with a FLOPs budget roughly equal to two days of training on my own machine -- my "extra credit" exercise after working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the mi…
- Writing an LLM from scratch, part 32c -- Interventions: removing dropout (Feb 05, 2026)
This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around I saw what gradient clipping can do -- it improved loss over the baseline by 0.014, bringing it down from 3.692 to 3.67…
- Writing an LLM from scratch, part 32d -- Interventions: adding attention bias (Feb 06, 2026)
I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This is the third intervention I'm trying: adding bias to the attention weight matrices. In the code from the book, we have this: class MultiHeadAttention(nn.Module): def __init__( self,…
- Writing an LLM from scratch, part 32e -- Interventions: the learning rate (Mar 10, 2026)
I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In my training code, I have this code to create the optimiser: optimizer = torch.optim.AdamW( model.parameters(), lr=0.0004, weight_decay=0.1 ) The values in there -- 0.0004 for the learning…
- Writing an LLM from scratch, part 32f -- Interventions: weight decay (Mar 23, 2026)
I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In my training code, I have this code to create the optimiser: optimizer = torch.optim.AdamW( model.parameters(), lr=0.0004, weight_decay=0.1 ) In my last post I looked into the learning rat…
- Writing an LLM from scratch, part 32g -- Interventions: weight tying (Mar 24, 2026)
In Sebastian Raschka's book "Build a Large Language Model (from Scratch)", he writes that weight tying, while it reduces the parameter count of a model, in his experience makes it worse. As such, apparently people don't use it in modern LLMs. Intuitively, that makes sense -- I'll explain why in this post. But as I'm trying various interventions to see if I can get my model -- based on Raschka's…
- Automating starting Lambda Labs instances (Apr 02, 2026)
I've been trying to get an 8x A100 instance on Lambda Labs to do a training run for my LLM from scratch series, but they're really busy at the moment, and it's rare to see anything available. Thanks to the wonders of agentic coding, I spent an hour today getting something up and running to help, which I've called lambda-manager. It has three commands: list-instance-types, which prints which kinds of instanc…
- Writing an LLM from scratch, part 32h -- Interventions: full fat float32 (Apr 03, 2026)
This is the last of the interventions I'm trying out to see if I can improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Back when I did my first training run for a base model, on my local RTX 3090, I used two optimisations: Setting the 32-bit floating point matrix multiplication precision…
- Writing an LLM from scratch, part 32i -- Interventions: what is in the noise? (Apr 07, 2026)
Towards the end of last year, I trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". The result was a pretty decent little model, but it wasn't as good as the original GPT-2-small, despite having more parameters (because it wasn't using weight-tying). Specifically: on a particular t…
- Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud (Apr 09, 2026)
Since early February, I've been trying various interventions on a 163M-parameter GPT-2-style model that I trained from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". My original model got a loss of 3.944 on my test set, while the original GPT-2 weights got 3.500 on the same dataset. I wanted to see if I could close that ga…
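The first intervention in the series above is gradient clipping by global norm. As a minimal, dependency-free sketch of the idea (in real PyTorch training code this is what `torch.nn.utils.clip_grad_norm_` does, applied across all parameter gradients between the backward pass and the optimiser step):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale a flat list of gradient values in place so that their
    global L2 norm is at most max_norm; return the pre-clip norm.

    Plain-Python illustration only -- the training runs in the posts
    use torch.nn.utils.clip_grad_norm_ on the model's parameters.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        # Rescale every component by the same factor, preserving
        # the gradient's direction while shrinking its magnitude.
        scale = max_norm / total_norm
        grads[:] = [g * scale for g in grads]
    return total_norm

grads = [3.0, 4.0]                    # global L2 norm = 5.0
norm = clip_grad_norm(grads, 2.5)     # exceeds 2.5, so scale by 0.5
print(norm, grads)                    # 5.0 [1.5, 2.0]
```

Gradients whose global norm is already below the threshold are left untouched; only unusually large update steps are shrunk, which is why clipping tends to smooth out loss spikes rather than change typical training behaviour.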