Latest posts
- Why smart instruction-following makes prompt injection easier (Nov 12, 2025)
Back when I first started looking into LLMs, I noticed that I could use what I've since called the transcript hack to get LLMs to work as chatbots without specific fine-tuning. It's occurred to me that this partly explains why protection against prompt injection is so hard in practice. The transcript hack involved presenting chat text as something that made sense in the context of next-token pred…
- Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090 (Dec 02, 2025)
Having worked through the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I wanted to try an experiment: is it possible to train a base model of my own, on my own hardware? The book shows you how to train your LLM, does a basic training run on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that n…
- Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud (Jan 07, 2026)
I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that: I can learn what you need to ch…
- Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results (Jan 09, 2026)
I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weigh…
- Writing an LLM from scratch, part 31 -- the models are now on Hugging Face (Jan 17, 2026)
As part of my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I've trained seven base models completely from scratch based on the book's GPT-2 code -- three locally, and four in the cloud. I plan to train more as I work on ways to improve the quality of the trained models, in the hope that I can get to something close…
- Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer) (Jan 28, 2026)
I spent some time recently getting some models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2 small sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone who was interested. I managed to get it done, but it was kind of tricky to get right. The Hugging Face documentation is great if you're using the built-in models, but the c…
- Writing an LLM from scratch, part 32a -- Interventions: training a baseline model (Feb 04, 2026)
I'm rounding out my series of posts on Sebastian Raschka's book "Build a Large Language Model (from Scratch)" by seeing how I could train the best base model I can from scratch on my own hardware. I started by training one in two days on my RTX 3090, and found that while it was a decent little model, it wasn't as good as the original GPT-2 small, either in terms of the loss it got on my test datas…
- Writing an LLM from scratch, part 32b -- Interventions: gradient clipping (Feb 05, 2026)
I'm still working on training the best GPT-2 small sized base model that I can with a number of FLOPs roughly equal to two days on my own machine -- my "extra credit" exercise after having worked through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the mi…
- Writing an LLM from scratch, part 32c -- Interventions: removing dropout (Feb 05, 2026)
This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around I saw what gradient clipping can do -- it improved loss over the baseline by 0.014, bringing it down from 3.692 to 3.67…
- Writing an LLM from scratch, part 32d -- Interventions: adding attention bias (Feb 06, 2026)
I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This is the third intervention I'm trying: adding bias to the attention weight matrices. In the code from the book, we have this: class MultiHeadAttention(nn.Module): def __init__(self, …
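The last excerpt above is cut off mid-definition, but the intervention it describes -- toggling bias on the query/key/value projections -- can be sketched in a few lines. This is a minimal illustration, not the book's exact `MultiHeadAttention` class (the real version also takes context-length and dropout parameters); the `qkv_bias` flag is the part the post is about:

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Simplified causal multi-head attention; qkv_bias is the intervention."""

    def __init__(self, d_in, d_out, num_heads, qkv_bias=True):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        # qkv_bias=True adds a learned bias vector to each projection --
        # the change the part-32d post experiments with.
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)

    def forward(self, x):
        b, t, _ = x.shape

        def split_heads(proj):
            # (b, t, d_out) -> (b, num_heads, t, head_dim)
            return proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split_heads(self.W_query), split_heads(self.W_key), split_heads(self.W_value)
        # Causal mask: each position may only attend to itself and earlier tokens
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        scores = scores.masked_fill(mask, float("-inf"))
        context = torch.softmax(scores, dim=-1) @ v
        context = context.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(context)
```

With `qkv_bias=False` the three projections are pure matrix multiplies, as in the original book code; flipping the flag adds three small bias vectors per layer, which is a cheap change to test as an intervention.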