Latest posts
- How an LLM becomes more coherent as we train it (Apr 17, 2026)
I remember finding it interesting when, back in 2015, Andrej Karpathy posted about RNNs and gave an example of how their output improves over the course of a training run. What might that look like for a (relatively) modern transformers-based LLM? I recently trained a GPT-2-small-style LLM, with 163 million parameters, on about 3.2 billion tokens (that's about 12.8 GiB of text) from the Hugging Fa
- Writing an LLM from scratch, part 32l -- Interventions: updated instruction fine-tuning results (Apr 20, 2026)
I've been working on a GPT-2-small-style LLM based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and have tried a bunch of different things to see if I could get it to approach the quality of the original OpenAI GPT-2-small, measured in terms of loss on a held-back test dataset. After working through them, in my last post, I managed to train one that was almost (if no
- Writing an LLM from scratch, part 32m -- Interventions: conclusion (Apr 21, 2026)
Last November, when I finished the main body of "Build a Large Language Model (from Scratch)", I set myself a number of follow-on goals. One was "training the full GPT-2 base model myself". I've reached the end of that journey, with a model that is almost -- if not quite -- as good as GPT-2 small, trained in 44 hours on my own machine, so I thought it would be worth summarising how it went. In Dec
- Writing an LLM from scratch, part 33 -- what I learned from finally getting round to the appendices (Apr 22, 2026)
After finishing the main body of "Build a Large Language Model (from Scratch)", I set myself three follow-on goals. The first was training a full GPT-2-small-style base model myself. That was reasonably easy to do but unlocked a bunch of irresistible side quests; having finally got to the end of those, it's time to move on to the others: reading through the book's appendices, and building my own
- 10Gb/s Ethernet: what I had to (re)learn (Apr 28, 2026)
My ISP recently started offering a 10Gb option, and my "shiny new thing!" Pavlovian response immediately kicked in. So of course, I had to upgrade the wired networking in my home -- which meant I had to learn a few things to get it all working, and relearn a bunch of stuff I'd forgotten over the years. Wired networking for home and small offices hasn't really moved forward that much in the last 2
- 10Gb/s Ethernet: what I actually did to get it working in my home (Apr 29, 2026)
Having learned enough about 10Gb/s Ethernet to be comfortable about setting it up in my house, it was time to bite the bullet: order it from the ISP, buy some kit, and get started. I already had 2.5Gb/s working. The apartment has structured cabling -- each room has one or more RJ45 sockets in the wall, and there's a patch panel downstairs by our front door that has a matching patch socket for eac
- 10Gb/s Ethernet: using mini-heatsinks with a 10GBASE-T SFP+ module (May 18, 2026)
In my last post I showed the somewhat-scary temperatures I was getting on the MikroTik 10GBASE-T SFP+ module I have plugged into nigel, the 10Gb/s switch I have in my study. As I mentioned then, the plan was to try using some of the mini-heatsinks that people use on Raspberry Pis, to see if that would help. Here's how it went. I bought a 40-piece set of heatsinks made by the improbably-named VooGe
- On first looking into JAX (May 30, 2026)
"Much have I travell'd in the realms of gold," -- On First Looking into Chapman's Homer. I've been working with PyTorch quite a lot for the last couple of years, and feel like I've come to a reasonably solid understanding of how it all fits together. Working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)", training my own LLMs locally and in the cloud, rebuilding Andrej Ka
- Using Safetensors with Flax (Jun 04, 2026)
I'm porting my PyTorch LLM code to JAX, using Flax as the neural network layer. For various reasons I wanted to use Safetensors to store checkpoints of the model. It took a little while to get it working; here's the trick I learned. If you look at the Safetensors docs, you'll see that it doesn't mention a JAX implementation -- indeed, searching for "safetensors jax" at the time I'm writing this g
- JAX backends and devices (Jun 05, 2026)
There's nothing like writing your own code with a framework to clarify how things fit together! Continuing with my port of my PyTorch LLM code to JAX, I wanted to load up a large dataset: the 10,248,871,837 16-bit unsigned integers in the train split of gpjt/fineweb-gpt2-tokens. That's just over 19GiB of data.
    from safetensors.flax import load_file
    ...
    full_dataset = load_file(dataset_dir / f"tra