Latest posts
- Writing an LLM from scratch, part 32b -- Interventions: gradient clipping (Feb 05, 2026)
I'm still working on training the best GPT-2-small-sized base model that I can with a FLOPs budget roughly equal to two days of training on my own machine -- my "extra credit" exercise after working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the mi…
- Writing an LLM from scratch, part 32c -- Interventions: removing dropout (Feb 05, 2026)
This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around I saw what gradient clipping can do -- it improved loss over the baseline by 0.014, bringing it down from 3.692 to 3.67…
- Writing an LLM from scratch, part 32d -- Interventions: adding attention bias (Feb 06, 2026)
I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This is the third intervention I'm trying: adding bias to the attention weight matrices. In the code from the book, we have this: class MultiHeadAttention(nn.Module): def __init__( self,…
- Writing an LLM from scratch, part 32e -- Interventions: the learning rate (Mar 10, 2026)
I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In my training code, I have this code to create the optimiser: optimizer = torch.optim.AdamW( model.parameters(), lr=0.0004, weight_decay=0.1 ) The values in there -- 0.0004 for the learning…
- Writing an LLM from scratch, part 32f -- Interventions: weight decay (Mar 23, 2026)
I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In my training code, I have this code to create the optimiser: optimizer = torch.optim.AdamW( model.parameters(), lr=0.0004, weight_decay=0.1 ) In my last post I looked into the learning rat…
- Writing an LLM from scratch, part 32g -- Interventions: weight tying (Mar 24, 2026)
In Sebastian Raschka's book "Build a Large Language Model (from Scratch)", he writes that weight tying, while it reduces the parameter count of a model, in his experience makes it worse. As such, apparently people don't use it in modern LLMs. Intuitively, that makes sense -- I'll explain why in this post. But as I'm trying various interventions to see if I can get my model -- based on Raschka's…
- Automating starting Lambda Labs instances (Apr 02, 2026)
I've been trying to get an 8x A100 instance on Lambda Labs to do a training run for my LLM from scratch series, but they're really busy at the moment, and it's rare to see anything available. Thanks to the wonders of agentic coding, I spent an hour today getting something up and running to help, which I've called lambda-manager. It has three commands: list-instance-types, which prints which kinds of instanc…
- Writing an LLM from scratch, part 32h -- Interventions: full fat float32 (Apr 03, 2026)
This is the last of the interventions I'm trying out to see if I can improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Back when I did my first training run for a base model, on my local RTX 3090, I used two optimisations: Setting the 32-bit floating point matrix multiplication precision…
- Writing an LLM from scratch, part 32i -- Interventions: what is in the noise? (Apr 07, 2026)
Towards the end of last year, I trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". The result was a pretty decent little model, but it wasn't as good as the original GPT-2-small, despite having more parameters (because it wasn't using weight-tying). Specifically: on a particular t…
- Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud (Apr 09, 2026)
Since early February, I've been trying various interventions on a 163M-parameter GPT-2-style model that I trained from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". My original model got a loss of 3.944 on my test set, while the original GPT-2 weights got 3.500 on the same dataset. I wanted to see if I could close that ga…
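The first intervention in the series above is gradient clipping by global norm. As a minimal, dependency-free sketch of the idea (in real PyTorch training code this is what `torch.nn.utils.clip_grad_norm_` does, applied across all parameter gradients between the backward pass and the optimiser step):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale a flat list of gradient values in place so that their
    global L2 norm is at most max_norm; return the pre-clip norm.

    Plain-Python illustration only -- the training runs in the posts
    use torch.nn.utils.clip_grad_norm_ on the model's parameters.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        # Rescale every component by the same factor, preserving
        # the gradient's direction while shrinking its magnitude.
        scale = max_norm / total_norm
        grads[:] = [g * scale for g in grads]
    return total_norm

grads = [3.0, 4.0]                    # global L2 norm = 5.0
norm = clip_grad_norm(grads, 2.5)     # exceeds 2.5, so scale by 0.5
print(norm, grads)                    # 5.0 [1.5, 2.0]
```

Gradients whose global norm is already below the threshold are left untouched; only unusually large update steps are shrunk, which is why clipping tends to smooth out loss spikes rather than change typical training behaviour.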