Feed Atlas
OPML directory + server-side RSS reader

gilesthomas.com


Latest posts

  • Writing an LLM from scratch, part 32b -- Interventions: gradient clipping
    Feb 05, 2026

    I'm still working on training the best GPT-2-small-sized base model that I can within a FLOP budget roughly equal to two days of training on my own machine -- my "extra credit" exercise after having worked through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the mi
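
    The post's gradient-clipping intervention can be sketched as follows -- a minimal, illustrative example, not the post's actual training loop; the model and the max_norm value here are placeholders.

```python
import torch
import torch.nn as nn

# Toy model and a deliberately large loss, just to produce big gradients.
model = nn.Linear(4, 4)
loss = model(torch.randn(8, 4)).pow(2).sum()
loss.backward()

# Rescale all gradients in-place so their combined L2 norm is at most
# max_norm; the call returns the norm *before* clipping.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Norm of the gradients after clipping.
clipped_norm = torch.sqrt(
    sum(p.grad.pow(2).sum() for p in model.parameters())
)
```

    In a real training loop the clip call goes between `loss.backward()` and `optimizer.step()`.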

  • Writing an LLM from scratch, part 32c -- Interventions: removing dropout
    Feb 05, 2026

    This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around I saw what gradient clipping can do -- it improved loss over the baseline by 0.014, bringing it down from 3.692 to 3.67
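
    Removing dropout typically amounts to setting the dropout probability to zero, which makes `nn.Dropout` an identity even in training mode -- a small sketch (the layer and shapes below are illustrative, not from the post's code):

```python
import torch
import torch.nn as nn

# p=0.0 effectively disables dropout: nothing is zeroed, nothing is rescaled.
dropout = nn.Dropout(p=0.0)
dropout.train()  # even in train mode, the layer passes inputs through unchanged

x = torch.randn(16, 8)
out = dropout(x)
```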

  • Writing an LLM from scratch, part 32d -- Interventions: adding attention bias
    Feb 06, 2026

    I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This is the third intervention I'm trying: adding bias to the attention weight matrices. In the code from the book, we have this: class MultiHeadAttention(nn.Module): def __init__( self,
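
    The intervention boils down to passing `bias=True` to the query/key/value projection layers. A stripped-down, hypothetical module showing just that flag (the book's real `MultiHeadAttention` has heads, masking, and an output projection on top of this):

```python
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    # Illustrative only: just the three projections, so the qkv_bias
    # flag -- the thing this intervention changes -- is visible.
    def __init__(self, d_model, qkv_bias=True):
        super().__init__()
        self.W_query = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.W_key = nn.Linear(d_model, d_model, bias=qkv_bias)
        self.W_value = nn.Linear(d_model, d_model, bias=qkv_bias)

attn_with_bias = TinyAttention(8, qkv_bias=True)
attn_without = TinyAttention(8, qkv_bias=False)
```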

  • Writing an LLM from scratch, part 32e -- Interventions: the learning rate
    Mar 10, 2026

    I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In my training code, I have this code to create the optimiser: optimizer = torch.optim.AdamW( model.parameters(), lr=0.0004, weight_decay=0.1 ) The values in there -- 0.0004 for the learning
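
    One way to probe that hard-coded learning rate is a small sweep: train the same problem at several values of `lr` and compare the resulting losses. A toy sketch of that idea (minimising w², not the post's actual model -- the grid of values is illustrative):

```python
import torch

def loss_after_steps(lr, steps=20):
    # Minimise f(w) = w^2 with AdamW at the given learning rate.
    w = torch.tensor([5.0], requires_grad=True)
    opt = torch.optim.AdamW([w], lr=lr, weight_decay=0.1)
    for _ in range(steps):
        opt.zero_grad()
        loss = (w ** 2).sum()
        loss.backward()
        opt.step()
    return float((w ** 2).sum())

# Sweep a few candidate learning rates around the post's 0.0004.
results = {lr: loss_after_steps(lr) for lr in (1e-4, 4e-4, 1e-3)}
```

    On this trivial convex problem a larger learning rate simply converges faster; on a real LLM the sweet spot is an empirical question, which is the point of the intervention.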

  • Writing an LLM from scratch, part 32f -- Interventions: weight decay
    Mar 23, 2026

    I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In my training code, I have this code to create the optimiser: optimizer = torch.optim.AdamW( model.parameters(), lr=0.0004, weight_decay=0.1 ) In my last post I looked into the learning rat
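
    What that `weight_decay=0.1` actually does in AdamW is decoupled decay: each step multiplies the weights by (1 - lr × weight_decay), independently of the gradient-based update. A tiny demonstration with a zero gradient, so only the decay term acts (the parameter here is illustrative):

```python
import torch

# A parameter of ones, with the post's lr and weight_decay values.
w = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.AdamW([w], lr=0.0004, weight_decay=0.1)

w.grad = torch.zeros_like(w)  # zero gradient: only the decay term acts
opt.step()

# After one step each weight should be 1 * (1 - lr * weight_decay).
expected = 1.0 * (1 - 0.0004 * 0.1)
```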

  • Writing an LLM from scratch, part 32g -- Interventions: weight tying
    Mar 24, 2026

    In Sebastian Raschka's book "Build a Large Language Model (from Scratch)", he writes that weight tying, while it reduces the parameter count of a model, in his experience makes the model worse. As such, apparently people don't use it in modern LLMs. Intuitively, that makes sense -- I'll explain why in this post. But as I'm trying various interventions to see if I can get my model -- based on Raschka's
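
    Weight tying itself is a one-line change: point the output head's weight at the token-embedding matrix, so the two layers share one (vocab × d_model) tensor. A minimal sketch with made-up sizes (not the post's 163M-parameter model):

```python
import torch
import torch.nn as nn

vocab, d = 100, 16

class TinyLM(nn.Module):
    # Illustrative: just the embedding and the output head, the two
    # matrices involved in weight tying.
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d)            # (vocab, d)
        self.out_head = nn.Linear(d, vocab, bias=False)  # weight is (vocab, d)

model = TinyLM()
params_before = sum(p.numel() for p in model.parameters())

# Tie the output head to the embedding matrix. parameters() deduplicates
# shared tensors, so the parameter count drops by vocab * d.
model.out_head.weight = model.tok_emb.weight
params_after = sum(p.numel() for p in model.parameters())
```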

  • Automating starting Lambda Labs instances
    Apr 02, 2026

    I've been trying to get an 8x A100 instance on Lambda Labs to do a training run for my LLM from scratch series, but they're really busy at the moment, and it's rare to see anything available. Thanks to the wonders of agentic coding, I spent an hour today getting something up and running to help, which I've called lambda-manager. It has three commands: list-instance-types, which prints which kinds of instanc
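
    The core of a tool like this is a polling loop: repeatedly check the availability listing until the wanted instance type shows up. The post doesn't show lambda-manager's internals, so this is a hypothetical sketch; `fetch_available` stands in for whatever wrapper around the Lambda Cloud API the real tool uses.

```python
import time

def wait_for_instance_type(fetch_available, wanted, poll_seconds=60, max_polls=None):
    """Poll until `wanted` appears in the availability listing.

    `fetch_available` is any callable returning the names of the
    instance types that can currently be launched (hypothetical
    interface -- the post's lambda-manager presumably does something
    along these lines).
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if wanted in fetch_available():
            return True
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)
    return False
```

    The callable-injection design also makes the loop easy to test without touching the network.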

  • Writing an LLM from scratch, part 32h -- Interventions: full fat float32
    Apr 03, 2026

    This is the last of the interventions I'm trying out to see if I can improve the test loss for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Back when I did my first training run for a base model, on my local RTX 3090, I used two optimisations: Setting the 32-bit floating point matrix multiplication precision
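
    The two optimisations the excerpt alludes to can be sketched like this -- the "full fat float32" intervention is simply leaving both of them off. (The tiny layer and CPU autocast below are illustrative; the post's runs used an RTX 3090.)

```python
import torch

# Optimisation 1: relax float32 matmul precision so TF32 kernels can be used.
torch.set_float32_matmul_precision("high")
precision = torch.get_float32_matmul_precision()

# Optimisation 2: run the forward pass under mixed-precision autocast.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = torch.nn.Linear(4, 4)(torch.randn(2, 4))

# "Full fat float32": revert to the default, i.e. do neither of the above.
torch.set_float32_matmul_precision("highest")
```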

  • Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?
    Apr 07, 2026

    Towards the end of last year, I trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". The result was a pretty decent little model, but it wasn't as good as the original GPT-2-small, despite having more parameters (because it wasn't using weight-tying). Specifically: on a particular t
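
    A natural way to get at "what is in the noise" is to repeat a run with nothing changed except the random seed and look at the spread of final losses. A toy sketch of that experiment (a one-layer model and re-sampled data per seed, not the post's actual setup):

```python
import statistics
import torch
import torch.nn as nn

def final_loss(seed, steps=30):
    # Re-run the same tiny training job with a different seed; both the
    # initialisation and the sampled data depend on it, so the spread of
    # results is a crude measure of run-to-run noise.
    torch.manual_seed(seed)
    model = nn.Linear(8, 1)
    opt = torch.optim.AdamW(model.parameters(), lr=0.01)
    x, y = torch.randn(64, 8), torch.randn(64, 1)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return float(loss)

losses = [final_loss(s) for s in range(5)]
spread = statistics.stdev(losses)
```

    If an intervention's improvement is smaller than this kind of spread, it may just be noise -- which is the question the post digs into.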

  • Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud
    Apr 09, 2026

    Since early February, I've been trying various interventions on a 163M-parameter GPT-2-style model that I trained from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". My original model got a loss of 3.944 on my test set, while the original GPT-2 weights got 3.500 on the same dataset. I wanted to see if I could close that ga