The Evolution of LLMs: From Attention to Alignment
February 2026
Introduction
Large language models (LLMs) have entered public consciousness remarkably quickly over the past couple of years. As someone with a background in computer science and security engineering, but not artificial intelligence research, I wanted to understand the key technical developments that compressed so much progress in a relatively short timeline.
Although I was passionately interested in biologically-inspired algorithms and machine learning as a teenager before going to university, I’ve been squarely focused on high-end security research since I graduated in 2009. I’m trying to approach this article respectfully towards those who have devoted decades to the developments we’re now seeing – there’s an awful lot of bandwagon-jumping going on that must be quite infuriating – while also being quite excited to see just what’s evolving in this whole field.
Writing thoughtfully about topics is a good way to really develop an understanding for them, which is part of why I’ve decided to start writing articles at this domain. I’ve done this with some of my public kernel security research over at accessvector.net in the past, too, but I don’t want to bring any of that kind of content over here. At this domain, I’m more interested in positive, constructive topics that pursue scientific enquiry and widen our understanding of the world. For me, that is what AI is all about: enhancing our ability to solve our collective problems and make the world better for everybody.
Starting with LLMs feels appropriate given the exciting times we’re living in right now, but I definitely will be revisiting some other areas of biologically-inspired computing in a later article.
This article covers a time period of roughly 2013-2023 – from early word embeddings through to production-ready aligned models – which I think captures the essence of where we are today in 2026, while acknowledging there have been significant developments since then that I haven’t covered.
Let’s start from the beginning: how do we represent language numerically?
Representing Language
Computers don’t understand words. We use agreed encodings (ASCII, Unicode) to represent characters as numbers, which allows text to be stored and displayed, but these carry no semantic information. The number 65 represents “A” by convention, but there’s nothing in that number that tells a machine “A” is a letter or that “cat” relates to “feline”.
Natural language processing (NLP) applies machine learning to language. To make this possible, we need representations where meaning is encoded numerically: vectors where related words have similar values.
Word Embeddings: Vector Spaces from Raw Text
Individual words are typically extracted from text by tokenisation: text is split on whitespace and punctuation is stripped away. This method is never perfect, but it tends to be good enough with a large enough data set.
Even with tokenisation in place, how do we organise words into a spatial form that captures semantic relationships? The leap from tokens to that feels quite large.
Word2Vec: Words to Vectors
Word2Vec (Mikolov et al.) was published in 2013. Rather than hardcoding features or using words as atomic symbols, the technique learns to map each word to a dense vector (typically 300 dimensions) by training on a massive text dataset with a simple objective to predict words from their context.
The paper introduces two approaches: CBOW (continuous bag of words) and Skip-gram. CBOW attempts to predict a target word from its surrounding text (“The cat ___ on the mat”), while Skip-gram does the inverse and tries to predict context words from a target word (given “sat”, predict “cat” and “on”).
Both approaches start with random word vectors and iteratively adjust them through gradient descent to improve accuracy.
One of the more remarkable outcomes of these algorithms wasn’t in the performance of the prediction task itself, but what emerged as a side effect: words appearing in similar contexts developed similar vectors.
Even more remarkably, the vector space exhibited algebraic structure:
vector("king") - vector("man") + vector("woman") ≈ vector("queen").
Semantic meaning was captured naturally by optimising the prediction objective across billions of words.
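To make this concrete, here’s a minimal sketch of training Skip-gram vectors using the gensim library (my own choice of tooling – the original paper shipped its own C implementation) and probing the famous analogy. With a toy corpus like this the analogy won’t actually resolve to “queen”; that behaviour only emerges from training on billions of words.

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "sat", "on", "the", "throne"],
    ["the", "queen", "wore", "the", "crown"],
    ["the", "man", "walked", "to", "the", "river"],
    ["the", "woman", "walked", "to", "the", "market"],
]

# sg=1 selects the Skip-gram objective; CBOW is the default (sg=0)
model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, sg=1, epochs=100)

# vector("king") - vector("man") + vector("woman") ≈ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```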
GloVe: Global Vectors for Word Representation
GloVe (Pennington et al., 2014) took a different approach but arrived at similar results. Instead of processing local context windows sequentially, GloVe built a global co-occurrence matrix: a summary of how frequently each word appears near every other word. The ratios of co-occurrence probabilities captured semantic relationships in much the same way.
A simple example of what this might look like is:
| | king | queen | man | woman | throne | crown |
|---|---|---|---|---|---|---|
| king | 0 | 45 | 62 | 12 | 152 | 238 |
| queen | 45 | 0 | 8 | 58 | 134 | 201 |
| man | 62 | 8 | 0 | 73 | 4 | 2 |
| woman | 12 | 58 | 73 | 0 | 3 | 1 |
| throne | 152 | 134 | 4 | 3 | 0 | 89 |
| crown | 238 | 201 | 2 | 1 | 89 | 0 |
We can clearly see the strong relationships between “king”, “throne” and “crown”, but also between “queen” and those same royal elements. “Queen” and “man”, on the other hand, have a much weaker co-occurrence relationship. In a real matrix, we’d have billions of columns and rows with the relationships derived from real text.
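For illustration, here’s a rough sketch of how counts like these could be gathered from raw text with a symmetric window of two words either side. The corpus and window size are arbitrary choices for the example, and this isn’t GloVe’s exact scheme (which also down-weights distant co-occurrences).

```python
from collections import defaultdict

corpus = "the king wore the crown and sat on the throne beside the queen".split()
window = 2
counts = defaultdict(int)

for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            counts[(word, corpus[j])] += 1

# co-occurrence counts involving "king"
print({pair: n for pair, n in counts.items() if pair[0] == "king"})
```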
This echoed the Word2Vec finding: applying relatively simple algorithms to vast amounts of data was sufficient to extract semantic meaning. In other words, both approaches independently demonstrated that relatively crude corpus statistics at massive scale can capture meaningful semantic structure.
Simple Algorithms on Large Datasets are Effective
The findings from both Word2Vec and GloVe were exciting and impressive, but also came with some limitations:
- Each word received a single static vector regardless of context. This doesn’t handle ambiguity well: does “bank” refer to a financial institution or a river bank, for example?
- Context windows were chosen arbitrarily: they typically looked to the surrounding 5-15 words, but why was that scope chosen? What about long sentences where the last word relates to the first?
- Word order and syntax beyond simple proximity were ignored, but actually carry important information in language. “Will you go?” carries different meaning to “You will go”.
Even so, we’ve made progress here in moving from words as atomic units to having some sort of comparable relationship in vector space, despite these shortcomings. Modern LLMs would eventually address all these limitations through architectural innovations, beginning with models that could process sequences more effectively.
Let’s continue in this direction and see how sequence processing evolved.
Processing Sequences
The embedding techniques we looked at in the previous section go some way towards capturing semantic relationships between words, but they leave room for improvement when it comes to sequence. Sequence matters in natural language processing; the order of words in a sentence and even the flow of sentences within a paragraph carry significant information.
So the question now becomes: how do we build models that can process and generate variable-length sequences?
Recurrent Neural Networks and LSTMs
If you’re already familiar with how artificial neural networks (ANNs) are structured, a recurrent neural network (RNN) is similar, but maintains a hidden state that evolves as it processes data. At each step, this state is processed alongside the current input, producing both an output and an updated state that carries forward to the next step.
This gives RNNs a form of memory: each word influences the hidden state that flows forward to subsequent words. This matters because language is inherently sequential; meaning depends on context, and words reference information from earlier in the sentence.
For natural language processing, vanilla RNNs struggle with longer sequences. As sentences grow longer, information from earlier words gradually fades, making it difficult for the model to capture long-range dependencies – like connecting a pronoun at the end of a sentence to its referent at the beginning.
Long Short-Term Memory networks (LSTMs) (Hochreiter & Schmidhuber, 1997) improved on plain RNNs by introducing gating mechanisms that could selectively remember or forget information over longer sequences. This helps to partially address the vanishing gradient problem that affected earlier recurrent architectures.
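To make the idea of a hidden state concrete, here’s a minimal numpy sketch of the vanilla RNN recurrence; the dimensions and random weights are purely illustrative, and an LSTM adds its gating machinery on top of this same loop.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16
W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))       # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))   # hidden-to-hidden weights
b = np.zeros(d_hidden)

sequence = rng.normal(size=(5, d_in))   # five token embeddings
h = np.zeros(d_hidden)                  # initial hidden state

for x_t in sequence:
    # each step folds the current input into the running "memory"
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print(h.shape)   # (16,) – the final state summarises the whole sequence
```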
seq2seq: Sequence-to-Sequence Models
In 2014, Sutskever et al. introduced the sequence-to-sequence (seq2seq) architecture for neural machine translation. This approach built upon the existing LSTM work by turning it into an encoder/decoder architecture:
The encoder LSTM processes the input sequence (e.g. an English sentence) and compresses it into a fixed-size context vector. A decoder LSTM then generates the output sequence (e.g. a French translation) conditioned on this vector.
This worked surprisingly well, but had a fundamental bottleneck: the entire input sequence had to be compressed into a single fixed-size vector. For short sentences this was manageable, but longer sequences meant forcing more information through the same narrow channel. Important details from early in the sequence could be lost by the time the decoder needed them.
The Attention Mechanism
A solution to seq2seq’s compression bottleneck came in 2014 from Bahdanau et al., who introduced the attention mechanism. Rather than compressing everything into a fixed-size context vector, attention allowed the decoder to “look back” at all of the encoder hidden states and dynamically focus on relevant parts of the input for each output token.
When translating “The black cat sat on the mat” to French, the decoder could attend strongly to “cat” when generating “chat”, then shift attention to “mat” for “tapis”, et cetera. The model could learn where to look, rather than having everything pre-compressed.
Translation quality improved dramatically and the model had learned sensible alignment patterns between source and target languages without explicit supervision.
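Here’s a small numpy sketch of the core idea: score every encoder hidden state against the decoder’s current state, softmax the scores, and take a weighted average. Note that Bahdanau et al. actually scored with a small feed-forward network (additive attention); I’ve used a plain dot product for brevity, but the mechanism is the same.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(7, 32))   # one hidden state per source token
decoder_state = rng.normal(size=32)         # decoder state at the current output step

scores = encoder_states @ decoder_state     # how relevant is each source token right now?
weights = softmax(scores)                   # attention weights, sum to 1
context = weights @ encoder_states          # weighted mix of encoder states

print(weights.round(2), context.shape)
```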
Remaining Limitations
Attention certainly improved upon the seq2seq model, but we’re still dealing with some fundamental issues with this approach:
- Sequential computation: RNNs process one token at a time, which forces the entire process into a serial pipeline. You can’t compute position 100 until you’ve finished computing positions 1-99. This means the computation can’t be parallelised across time steps, and training time grows with sequence length.
- Limited context windows: Even when augmented with attention, the size of the context window still limits the history capacity of the model. Very long-range dependencies remained problematic.
- Vanishing gradients: Despite LSTM improvements, training signals still degraded over long sequences, making it difficult to learn dependencies spanning hundreds of tokens.
Moving past these limitations demanded a fundamental rethink: what if we removed the recurrence entirely and built the architecture around attention itself?
Attention is All You Need
In 2017, researchers at Google published a now-famous paper: “Attention is All You Need” (Vaswani et al.). The main contribution of the paper is the transformer architecture, which demonstrates that recurrence isn’t necessary; attention mechanisms alone can handle sequence processing more effectively. Furthermore, the architecture lends itself extremely well to parallelisation, unlocking major efficiency gains over recurrent models.
Before diving into the key architecture, I want to walk back to the beginning and do this from first principles. We have a block of text, but how do we prepare that text in a way the model can use?
Tokenisation, Embedding and Vectors
Earlier in this article, I mentioned that words are typically tokenised based on whitespace and punctuation. This is a pretty common and intuitive approach, but modern models use an approach called byte pair encoding (BPE).
BPE is a simple algorithm that begins with a vocabulary of just individual characters – ["a", "b", "c", ...] – and then iteratively merges the most frequently occurring adjacent pairs. For example, "t" and "h" occur frequently together, so "th" is added to the vocabulary. This process repeats until all common groupings have been discovered, typically leaving us with a vocabulary of tens of thousands of subword tokens.
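Here’s a toy sketch of that merge loop: repeatedly find the most frequent adjacent pair of symbols and merge it into a new vocabulary entry. The word frequencies are made up, and real tokenisers (such as GPT-2’s byte-level BPE) operate over bytes and far larger corpora, but the loop is essentially this.

```python
from collections import Counter

# toy word frequencies, with each word pre-split into characters
words = {("t", "h", "e"): 10, ("t", "h", "i", "s"): 5, ("c", "a", "t"): 4}

def most_frequent_pair(words):
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(words, pair):
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append("".join(pair))   # merge the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

new_tokens = []
for _ in range(3):
    pair = most_frequent_pair(words)
    new_tokens.append("".join(pair))
    words = apply_merge(words, pair)

print(new_tokens)   # e.g. ['th', 'the', ...]
print(words)
```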
With our BPE vocabulary constructed, we assign an index to each entry: 0 => "a", 1 => "b", … 1759 => "th", etc.
Now we can tokenise input text by starting at the beginning and matching the longest token from our vocabulary, then the next longest, and so on. The text "unhappy" becomes the sequence [1523, 2847] (where 1523 might be the index for “un” and 2847 for “happy”).
These token IDs are still just integers, but transformers operate on dense vectors. Each token ID maps to a learned embedding vector through a lookup table. If our model dimension is 512, then token ID 1523 maps to a unique 512-dimensional vector of learned parameters.
These embedding vectors start as random values and are optimised during training, so tokens with similar meanings end up with similar vectors. This is the numerical representation the transformer actually processes.
This completes the pipeline: text → tokens → token IDs → embedding vectors.
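Here’s a small sketch of that pipeline end to end, using a made-up vocabulary containing the example IDs from above, greedy longest-match tokenisation and a randomly initialised embedding table. Real models use vocabularies of tens of thousands of tokens, hundreds of dimensions, and a table that is learned during training.

```python
import numpy as np

vocab = {"un": 1523, "happy": 2847, "u": 20, "n": 13, "h": 7, "a": 0, "p": 15, "y": 24}

def tokenise(text):
    ids, i = [], 0
    while i < len(text):
        # take the longest vocabulary entry that matches at position i
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        ids.append(vocab[match])
        i += len(match)
    return ids

token_ids = tokenise("unhappy")
print(token_ids)            # [1523, 2847]

d_model = 8                 # tiny for illustration; the original paper used 512
embeddings = np.random.default_rng(0).normal(size=(4096, d_model))   # lookup table
vectors = embeddings[token_ids]
print(vectors.shape)        # (2, 8): one dense vector per token
```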
Self-Attention
With our token embeddings prepared, we reach the heart of the transformer: self-attention (also called scaled dot-product attention). This mechanism allows every token to attend to every other token in the input simultaneously.
When the concept of attention was introduced by Bahdanau et al., the idea was that output tokens attended to input tokens. This is easy to understand in the context of a translation task. For clarity, we refer to that as cross-attention: the attention flows from one sequence to another. Self-attention, on the other hand, means each token attends to every other token within the same sequence. That has a lot more to do with understanding the meaning of the sentence itself.
Here’s how it works. Each token’s embedding is transformed into three different vectors by multiplying it with three learned weight matrices:
- Query (Q): “What am I looking for?”
- Key (K): “What do I represent?”
- Value (V): “What information do I provide?”
To compute the attention for a given token, we:
- Calculate similarity scores: Compare its query vector against all key vectors (including its own) using dot products. Tokens whose keys align well with the query get high scores, since their vectors point in similar directions.
- Normalise to attention weights: Scale the scores by the square root of the key dimension (the “scaled” part of the name), then apply the softmax function. This converts the scores into probabilities that sum to 1. These become the attention weights.
- Weighted combination: Multiply each value vector by its attention weight and sum them up. This produces a context-aware representation where the token has “looked at” every other token and weighted their contributions.
The language here can be confusing: how can a token “look at” other tokens? Hopefully this explanation makes things a bit clearer. When we say a token “looks at” another token, we mean it compares its query vector with that token’s key vector (via a dot product) to determine how much attention to pay, then uses that attention weight to incorporate the token’s value vector into its own representation.
Crucially for performance, this computation happens in parallel across all positions – no sequential dependency. When processing “The cat sat”, all three tokens compute their attention simultaneously, each considering all others. We achieve this by expressing the dot products as matrix multiplications rather than computing them one at a time.
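Here’s a numpy sketch of scaled dot-product self-attention computed for a whole (tiny) sequence at once; the projection matrices are random stand-ins for the learned weights described above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 512, 64             # e.g. "The cat sat"
X = rng.normal(size=(seq_len, d_model))        # one embedding per token

Wq = rng.normal(scale=0.02, size=(d_model, d_k))   # learned in a real model
Wk = rng.normal(scale=0.02, size=(d_model, d_k))
Wv = rng.normal(scale=0.02, size=(d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv               # queries, keys, values for every token
scores = Q @ K.T / np.sqrt(d_k)                # every query against every key, scaled
weights = softmax(scores, axis=-1)             # each row sums to 1
output = weights @ V                           # context-aware representation per token

print(weights.shape, output.shape)             # (3, 3) (3, 64)
```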
Multi-Head Attention
Self-attention produces a single attention pattern per token. But language has multiple types of relationship at play simultaneously:
- Syntactic relationships (subject-verb, modifier-noun)
- Semantic relationships (topical similarity, antonyms)
- Positional/structural relationships (beginning/end of phrases)
- Coreference (pronouns to their referents)
A single attention mechanism can only learn one weighted combination of these. Trying to capture all of those aspects in a single attention pattern won’t be effective – different relationships will be competing within that one pattern, forcing trade-offs between capturing syntax versus semantics versus position.
To address this, we use multi-head attention. This approach runs multiple attention mechanisms in parallel, each with its own learned Q/K/V weight matrices. The original transformer used 8 heads.
Each head can specialise in different types of relationships. One might learn syntax, another semantics, another positional patterns. They all operate simultaneously on the same input. The outputs from all heads are then concatenated and combined, giving the model a rich, multi-faceted understanding of each token’s relationships.
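Here’s a sketch of that, reusing the same scaled dot-product attention from above: each head gets its own (random stand-in) projections, and the outputs are concatenated back up to the model width.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, d_k, rng):
    # one head: its own Q/K/V projections, its own attention pattern
    Wq, Wk, Wv = (rng.normal(scale=0.02, size=(X.shape[1], d_k)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 512))                  # three token embeddings

heads = [attention_head(X, d_k=64, rng=rng) for _ in range(8)]
multi_head = np.concatenate(heads, axis=-1)    # (3, 512): 8 heads of 64 dims, concatenated
print(multi_head.shape)                        # then mixed by one final learned projection
```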
Positional Encodings
As mentioned, transformers operate on all tokens at once rather than working sequentially like we saw with RNNs. While this unlocks a huge boost in performance, we lose information about word order.
To solve this, the transformer attaches positional encodings to each token. These are unique vectors for each position in the sequence. Position 0 gets one vector, position 1 gets another, and so on. These are added to the embedding vectors before any processing begins, giving each token both semantic meaning (from the embedding) and positional information.
With position information embedded, the attention mechanism can learn position-dependent patterns like “adjectives typically precede nouns” or understand that word order changes meaning fundamentally.
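For reference, here’s a small numpy sketch of the original sinusoidal positional encodings from the paper: each position gets a unique pattern of sines and cosines at different frequencies, which is simply added to the token embeddings before the first layer.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # sines on even dimensions
    pe[:, 1::2] = np.cos(angles)                       # cosines on odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=512)
print(pe.shape)   # (10, 512) – one unique vector per position
# model input = token_embeddings + pe
```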
Architecture
With self-attention, positional encodings, and multi-head attention explained, we can now look at how these pieces fit together in the complete transformer.
The original transformer consists of two main components: the encoder stack and the decoder stack. If we squint, we can see the ghost of seq2seq still here:
Encoder Stack
The encoder processes the input sequence through multiple identical layers. Each encoder layer contains:
- Multi-head self-attention: As we just covered
- Feed-forward network: A simple two-layer neural network applied independently to each token
- Residual connections: The input to each sub-layer is added to its output, letting information flow around the transformation
- Layer normalisation: Applied after each sub-layer to stabilise training
Each layer transforms the token representations, essentially building increasingly abstract understandings of the input. The output of the final encoder layer is then a sequence of context-aware representations. Each token now “knows” about every other token in the input. And because of the multi-head aspect, this is a multi-faceted understanding.
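Here’s a minimal numpy sketch of one encoder layer wiring those pieces together – self-attention, the feed-forward network, residuals and layer normalisation (post-norm, as in the original paper). All weights are random stand-ins; the structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 64, 256

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(x):
    Wq, Wk, Wv = (rng.normal(scale=0.05, size=(d_model, d_model)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V

def feed_forward(x):
    W1 = rng.normal(scale=0.05, size=(d_model, d_ff))
    W2 = rng.normal(scale=0.05, size=(d_ff, d_model))
    return np.maximum(0, x @ W1) @ W2              # two layers with a ReLU in between

def encoder_layer(x):
    x = layer_norm(x + self_attention(x))          # sub-layer 1: attention + residual + norm
    x = layer_norm(x + feed_forward(x))            # sub-layer 2: feed-forward + residual + norm
    return x

tokens = rng.normal(size=(seq_len, d_model))       # embeddings + positional encodings
print(encoder_layer(tokens).shape)                 # (5, 64)
```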
Decoder Stack
The decoder generates the output sequence one token at a time, also using multiple identical layers. Each decoder layer contains:
- Masked multi-head self-attention: Similar to the encoder, but tokens can only attend to previous positions (not future ones), ensuring the model generates output autoregressively
- Cross-attention: Attends to the encoder’s output, allowing the decoder to focus on relevant parts of the input when generating each output token
- Feed-forward network: Same as in the encoder
- Residual connections and layer normalisation: Applied throughout, as in the encoder stack
After the final decoder layer, a linear projection maps the representations to vocabulary size, followed by softmax to get probabilities over all possible next tokens. During generation, the most likely token is selected, added to the sequence, and the process repeats.
Producing words: The final output from the architecture is a probability distribution: a vector with one probability for each token in the vocabulary (tens of thousands of entries). Each value represents how likely that token is to be the next word. We select the highest probability token, feed it back into the decoder as input, and repeat the process to generate the next token — continuing until we produce an end-of-sequence token.
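Here’s a sketch of that generation loop. The next_token_logits function is a hypothetical stand-in for a full forward pass through the decoder stack and final projection; it returns random values here so the loop structure is runnable.

```python
import numpy as np

vocab_size, eos_id = 50000, 2
rng = np.random.default_rng(0)

def next_token_logits(sequence):
    # hypothetical stand-in: a real model runs the decoder stack over `sequence`
    # and projects the final representation to vocabulary size
    return rng.normal(size=vocab_size)

sequence = [101, 2054]                    # some starting token IDs
for _ in range(20):                       # cap on generated length
    logits = next_token_logits(sequence)
    next_id = int(np.argmax(logits))      # greedy: pick the most likely token
    sequence.append(next_id)
    if next_id == eos_id:                 # stop at the end-of-sequence token
        break

print(sequence)
```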
Impact
The transformer solved all three limitations of RNN-based models: training could be parallelised (no sequential dependencies), long-range connections were direct (not degraded over distance), and gradients flowed cleanly through the architecture.
The transformer demonstrated better translation quality with far faster training. More crucially, the architecture scaled: performance kept improving as models grew larger, a property that would drive the next wave of development.
Transformers: More Than Just Language
Before we continue following the next development beyond the initial transformer architecture, it’s worth taking a step back. We tend to just think of transformers as being useful in tasks that are chiefly concerned with language – translation, for instance. But actually the core mechanism applies beyond just text.
DeepMind in particular – a company I’m a big fan of – has demonstrated the versatility of transformers across domains that have nothing to do with language.
AlphaFold 2 and Protein Structure
AlphaFold 2 revolutionised protein structure prediction, which is an important foundational problem in biochemistry and medicine. The problem revolves around predicting how a chain of amino acids will fold into a 3D protein structure. This seems entirely unrelated to what we think of transformers being useful for, but actually they proved to be ideal.
The model treats amino acid sequences as tokens and uses attention to learn which amino acids interact with each other spatially, even if they’re far apart in the sequence. Multiple sequence alignments (similar proteins from other species) provide additional context, much like how language context helps predict words.
The end result was AlphaFold 2 achieving near-experimental accuracy, solving a 50-year-old challenge in biology.
Being able to predict protein structure accurately has profound implications for biology and medicine; proteins carry out nearly every function in living cells and their structure determines how they work. Understanding protein structures enables researchers to design better drugs by seeing exactly how molecules bind to their targets, understand genetic diseases caused by misfolded proteins and accelerate vaccine development.
What previously required months of expensive experimental work with X-ray crystallography or cryo-electron microscopy can now be predicted computationally in hours. AlphaFold 2 has made its predictions freely available for over 200 million proteins, democratising access to structural biology and accelerating research across the life sciences.
On a personal note, striving to achieve these sorts of goals with AI and being altruistic in sharing their results like this is precisely why I love DeepMind. This is what the world should be doing more of.
Genomics and DNA
DeepMind has also applied transformers to genomic sequences, treating base pairs (A, T, C, G) as tokens. The attention mechanism learns long-range dependencies in genetic code: it identifies how regulatory elements thousands of base pairs away might influence gene expression.
This has really interesting applications in understanding genetic variants and disease.
Transformers Identify Structural Relationships
Transformers ultimately excel at identifying structural relationships in sequences of tokens. It doesn’t matter what those tokens are – words, amino acids, DNA base pairs – if there’s structure to discover, they are excellent at finding it.
This generality is part of why transformers have become so dominant: the architecture isn’t specialised for any particular domain, it’s a flexible pattern-matching engine that scales with data and compute.
Now let’s return to natural language processing. Everything we’re covering here applies beyond just language – and some of the most exciting applications, like those from DeepMind, are tackling problems that could have a tremendously positive impact on the world.
Generative Pre-Training: GPT-1
In June 2018, OpenAI published “Improving Language Understanding by Generative Pre-Training” (Radford et al.), introducing GPT-1.
Interestingly, while the original transformer architecture used an encoder and a decoder stack, the GPT-1 model used only the decoder portion – specifically, the masked self-attention and feed-forward layers. This is a simpler architecture and naturally has no cross-attention, as there is no encoder to attend to.
This architecture is geared towards generation: the model’s whole purpose is to predict what comes next.
Generative Pre-Training
During training, the objective was straightforward: predict the next token. Given a sequence of tokens, the model learns to predict what comes next. This is called language modelling, and it’s trained left-to-right; the model can only see previous tokens when predicting the next one, never future tokens.
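A small sketch of how that objective is set up: the targets are simply the input sequence shifted by one position, so every token is trained to predict its successor. The token IDs are illustrative.

```python
tokens = [464, 3797, 3332, 319, 262, 2603]   # illustrative IDs for "The cat sat on the mat"

inputs = tokens[:-1]    # what the model may look at
targets = tokens[1:]    # what it must predict at each position

for i, target in enumerate(targets):
    print(f"given {tokens[:i + 1]!r}, predict {target}")
```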
The key finding from this research was that this simple objective applied to massive amounts of text could lead to rich representations of language. GPT-1 was pre-trained on the BookCorpus dataset (~7,000 unpublished books), learning general language understanding before being fine-tuned on specific tasks.
This pre-train/fine-tune paradigm is fairly ubiquitous today, and GPT-1 was one of its early demonstrations:
- Pre-train once on vast, unlabelled text with the language modelling objective
- Fine-tune the same model on specific downstream tasks, such as classification, question-answering, etc.
GPT-1 had a modest size by today’s standards (117 million parameters; that is, the total number of weights learned by the whole neural network), but it demonstrated that the decoder-only model could learn meaningful representations.
Bidirectional Encoders: BERT
A few months after GPT-1, Google published BERT: “Bidirectional Encoder Representations from Transformers” (Devlin et al., October 2018). Where GPT-1 used only the decoder, BERT took the opposite approach: encoder-only.
In this model, rather than predicting the next token, BERT’s task is to predict a missing token somewhere in the sequence. This is called masked language modelling:
So the task is to take a sentence like “The cat [MASK] on the mat” and predict “sat”.
Other than being encoder-only, a key difference from the GPT-1 architecture is that BERT uses full bidirectional self-attention: it attends to tokens on either side of the masked word simultaneously. This makes BERT fundamentally suited to understanding rather than generation; it infers the missing token by considering the full surrounding context.
Training
Training BERT involved randomly masking 15% of tokens from the input and training the model to predict them correctly. This was a smart approach because it enabled self-supervised learning over huge unlabelled datasets.
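Here’s a sketch of that data preparation step, masking roughly 15% of tokens and recording the originals as targets. (BERT’s actual recipe sometimes swaps in a random token or leaves the original in place instead of always using [MASK]; that refinement is omitted here.)

```python
import random

random.seed(1)
tokens = ["the", "cat", "sat", "on", "the", "mat"]

masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:      # ~15% of positions get masked
        masked.append("[MASK]")
        targets[i] = tok            # the model must recover this token
    else:
        masked.append(tok)

print(masked)    # ['[MASK]', 'cat', 'sat', 'on', 'the', 'mat']
print(targets)   # {0: 'the'}
```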
As with GPT-1, BERT also took the approach of a two-phased training paradigm: pre-train on a large general corpus, then fine-tune on specific tasks.
Results
The bidirectional nature of BERT gave it a clear advantage over left-to-right models like GPT-1. Taking full surrounding context into account, not just previous tokens, led to richer understanding of text. This established the pattern of encoders for understanding, decoders for generation.
BERT came in two sizes: BERT-Base with 110 million parameters and BERT-Large with 340 million parameters. The research again confirmed the trend that model ability increased with scale.
GPT-2
In February 2019, OpenAI released “Language Models are Unsupervised Multitask Learners” (Radford et al.), introducing GPT-2. The architecture remained fundamentally the same as GPT-1, but the scale increased dramatically: 1.5 billion parameters, compared to GPT-1’s 117 million.
This increase in scale led to qualitative improvements in performance. GPT-2 could generate coherent text, translate between languages, answer questions, and summarise passages – all without being explicitly trained for these specific tasks. It demonstrated zero-shot learning: performing tasks without any task-specific fine-tuning, simply by recognising patterns from its pre-training.
These were emergent capabilities – abilities that appeared at scale without being explicitly programmed – which I personally find the most fascinating aspect of this whole story.
GPT-3
In May 2020, OpenAI published “Language Models are Few-Shot Learners” (Brown et al.), introducing GPT-3.
GPT-3 took the same decoder-only architecture and scaled it to 175 billion parameters – over 100× larger than GPT-2. It was trained on hundreds of billions of tokens from diverse sources: books, Wikipedia, web text, and more. This required a huge amount of training compute, but the results validated the scaling hypothesis suggested by GPT-2: capability continued to improve with size.
Few-Shot and In-Context Learning
GPT-3’s most significant capability was in-context learning: it was able to learn a new task from a few examples provided in the prompt – without any parameter updates or fine-tuning. A brand new task, never seen before, could be learned and followed just from the prompt.
This is fundamentally different from zero-shot learning or fine-tuning; the examples served as temporary instructions that steered its behaviour.
This capability itself scaled with model size; larger versions (e.g. 175 billion parameters) performed in-context learning significantly better than smaller versions (e.g. 1.3 billion parameters).
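For illustration, here’s the kind of few-shot prompt shown in the GPT-3 paper: the examples in the prompt establish the task, and the model continues the pattern with no parameter updates.

```python
# A few-shot prompt in the style of the GPT-3 paper's translation example.
prompt = """Translate English to French.

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# Sent to the model, the expected continuation is "fromage" – learned purely
# from the in-context pattern, with no weight updates.
```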
Discontinuous Improvements
Additionally, GPT-3 displayed discontinuous improvements as the scale increased. The researchers note tasks that the smaller models couldn’t complete at all, such as three-digit arithmetic, which the larger models could. This implies that as well as improving with scale, new capabilities emerge with scale, too.
Scaling Laws
While GPT-2 and GPT-3 demonstrated that larger models performed better, researchers wanted to understand this relationship more precisely. In January 2020, OpenAI published “Scaling Laws for Neural Language Models” (Kaplan et al.), which quantified how model performance scales with three key factors: model size, dataset size and training compute.
It was found that performance follows smooth power law relationships with each factor:
- Larger models consistently perform better
- More training data consistently improves performance
- More compute (longer training) consistently improves performance
These relationships held consistently across several orders of magnitude.
One practical discovery of the research was that given a fixed compute budget, it’s more effective to train a larger model on less data (as opposed to a smaller model on more data). For example, if you have enough compute to train either:
- A 1B parameter model on 100B tokens, or
- A 10B parameter model on 10B tokens
The scaling laws suggested the second approach would yield better performance, even though the model sees less data. The larger model uses its capacity more efficiently.
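As a quick sanity check that the two options really do cost the same compute, we can use the common C ≈ 6·N·D approximation (N parameters, D training tokens) from the scaling-laws literature.

```python
def train_flops(n_params, n_tokens):
    # C ≈ 6 · N · D: a standard rough estimate of training compute
    return 6 * n_params * n_tokens

option_a = train_flops(1e9, 100e9)    # 1B parameters on 100B tokens
option_b = train_flops(10e9, 10e9)    # 10B parameters on 10B tokens
print(f"{option_a:.2e} vs {option_b:.2e} FLOPs")   # both 6.00e+20
```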
Chinchilla
In 2022, DeepMind published “Training Compute-Optimal Large Language Models” (Hoffmann et al.), known as the Chinchilla paper (the name of the model they trained). They refined the scaling laws, showing that the original research had actually underestimated the importance of training data.
They found that for compute-optimal training, model size and training tokens should both be increased proportionally; if you double the compute budget, you should increase both model size and training data proportionally, not just focus on model size.
This explained why some large models were underperforming: they were undertrained for their size, with too many parameters for too little data. Chinchilla (70 billion parameters) outperformed much larger models like Gopher (280 billion) by training on far more tokens with a better size/data ratio.
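A rough sketch of what this implies, combining the C ≈ 6·N·D cost approximation with the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter; the compute budget below is my rough figure for the Gopher/Chinchilla training budget, used only for illustration.

```python
import math

def chinchilla_optimal(flops_budget, tokens_per_param=20):
    # C ≈ 6 · N · D together with D ≈ 20 · N  =>  N ≈ sqrt(C / 120)
    n_params = math.sqrt(flops_budget / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(5.76e23)    # roughly the Gopher/Chinchilla training budget
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")   # ~69B, ~1.4T
```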
This would influence LLaMA and subsequent models, which we will come to soon. Those models focused on compute-optimal training rather than pure parameter count maximisation.
Alignment
By 2022, large language models could generate impressive text, but they had a fundamental problem: they were trained purely to predict the next token, not to be helpful or safe.
Ask GPT-3 “How do I make a bomb?” and it might generate detailed instructions — not because it wanted to cause harm, but because such content existed in its training data and continuing the prompt was what it was trained to do.
This behaviour is unaligned: the model’s objective (predict next token) doesn’t match human intentions (be helpful and harmless).
InstructGPT: Reinforcement Learning from Human Feedback
In March 2022, OpenAI published “Training language models to follow instructions with human feedback” (Ouyang et al.), introducing InstructGPT. This paper described how to align language models using Reinforcement Learning from Human Feedback (RLHF).
InstructGPT’s training involved three stages:
- Supervised Fine-Tuning (SFT): Start with GPT-3 and fine-tune it on high-quality examples of instructions and desired responses, written by human labellers. This teaches the model the basic pattern of following instructions.
- Reward Model Training: The model generates multiple responses to the same prompt and humans rank them by quality. A separate reward model is then trained to predict these human rankings (sketched in code after this list).
- Reinforcement Learning: The reward model has now learned how the human reviewers are ranking the responses. We use this to fine-tune the original model with reinforcement learning; the reward model guides the training by scoring responses, allowing the alignment process to scale beyond human review. In the InstructGPT paper, the algorithm used was Proximal Policy Optimisation (PPO).
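Here’s a sketch of the pairwise loss typically used for stage 2: the reward model is trained so that the response humans preferred scores higher than the one they rejected. The reward_model function is a hypothetical stand-in for a real scoring network.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_model(prompt, response):
    # hypothetical stand-in: the real reward model is itself a language model
    # with a scalar output head
    return rng.normal()

def preference_loss(prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # -log(sigmoid(r_chosen - r_rejected)): small when the preferred response
    # is scored higher than the rejected one
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(preference_loss("Explain photosynthesis.", "a helpful answer", "an unhelpful answer"))
```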
RLHF transformed base language models into assistants. InstructGPT models were more likely to follow instructions accurately and less likely to generate harmful or toxic content. They demonstrated alignment.
ChatGPT
InstructGPT directly enabled the production and release of ChatGPT in November 2022, which used RLHF on top of GPT-3.5.
The conversational, helpful assistant behaviour that made ChatGPT compelling came from alignment, not from architectural innovations. The technology was simply RLHF applied to a large decoder-only language model.
Constitutional AI
Around the same time, Anthropic published “Constitutional AI: Harmlessness from AI Feedback” (Bai et al., December 2022), which presented an alternative approach to alignment.
Rather than relying entirely on human feedback, Constitutional AI uses AI feedback guided by a “constitution”, which is a set of principles the model should follow.
Under this paradigm, the training process is:
- Generate responses that might be harmful
- Have the model critique its own responses according to its constitution
- Revise the responses based on the self-critique
- Train a reward model on these AI-generated comparisons
- Use reinforcement learning to optimise
As well as being more scalable, this approach reduces reliance on human feedback and makes the alignment process more transparent — principles are explicitly stated rather than implicitly communicated through human rankings.
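A schematic sketch of that critique-and-revision loop, with llm() as a hypothetical stand-in for calls to the model and a paraphrased two-principle constitution; the real principles and prompt wording are given in the Bai et al. paper.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in for a call to the model")

constitution = [
    "Choose the response that is least likely to be harmful.",
    "Choose the response that is most helpful and honest.",
]

def critique_and_revise(prompt: str) -> tuple[str, str]:
    initial = llm(prompt)
    critique = llm(f"Critique this response against the principles {constitution}:\n{initial}")
    revised = llm(f"Rewrite the response to address this critique:\n{critique}\n\n{initial}")
    return initial, revised   # (original, revised) pairs train the reward model
```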
Models at Home: LLaMA
By the end of 2022, the most capable language models (GPT-3, PaLM, Chinchilla) were closed, accessible only through APIs if at all. Researchers outside of the major labs had no way to study, modify or build upon them.
In February 2023, Meta published “LLaMA: Open and Efficient Foundation Language Models” (Touvron et al.) and released a family of smaller models that could run on accessible hardware. Rather than chasing the largest possible model, LLaMA applied the Chinchilla scaling laws we looked at earlier to train smaller models on more data for better compute efficiency.
The cost of building and running a model splits into two parts: the training cost and the running cost (also known as inference cost). The idea with LLaMA was to invest heavily in training to reduce the cost of inference.
Meta released various model sizes: 7 billion, 13 billion, 33 billion and 65 billion parameters. Even the smallest of these was trained on over a trillion tokens, which was far more than previous models had been trained on. The results validated the compute-optimal approach and LLaMA-65B was competitive with much larger closed models.
Originally released just to researchers, the weights of the LLaMA models soon leaked to the general public, thereby democratising access. This was a huge win for anyone wanting to research or play with these state of the art models from outside of a frontier lab. Various derivatives were soon produced by the community, fine-tuned for specific tasks such as instruction-following (“Alpaca”) and conversation (“Vicuna”).
LLaMA 2
A few months later, in July 2023, Meta published “LLaMA 2: Open Foundation and Fine-Tuned Chat Models” (Touvron et al.) and made the weights freely available for both research and commercial use – a properly open release.
LLaMA 2 was released in various sizes, similar to the original LLaMA, and was available both as a base model and as an RLHF fine-tuned chat model, aligned in much the same way as InstructGPT, which we saw earlier.
Being the first truly open, frontier-class model from a major lab, LLaMA 2 had a big, positive impact and accelerated research beyond the walls of a few organisations. Public researchers could:
- Study how alignment techniques actually work
- Experiment with new fine-tuning approaches
- Build applications without API costs or rate limits
- Understand model behaviour through direct inspection
Conclusion and Reflection
We’ve come a long way from Word2Vec in 2013 to LLaMA 2 in 2023. Hopefully this article has given a useful and clear guide to how the field has broadly developed in that time. There are a couple of high-level aspects I want to draw particular attention to, which I’ll briefly talk about next.
Emergent Intelligence
As a teenager interested in biologically-inspired algorithms, I was fascinated by the concept of emergent intelligence. The forms I knew it in back then were things like ant colony optimisation and particle swarm optimisation, whereby the individual agents in a system follow very basic rules, but the collective behaviour of the group results in a form of problem-solving.
It’s really interesting to see the parallels with LLMs: as we increase the scale of what can be thought of as a fairly simple idea (not to dismiss the sophistication behind it), we discover new, discontinuous jumps in capability. These models are able to do things they weren’t explicitly trained to do. I think that’s a huge deal and is really the most magical part of this whole area.
It also implies that capabilities lie dormant below critical thresholds; so what other capabilities have we yet to witness?
Architecture Generality
Transformers weren’t designed for protein folding or genomics, yet they excel at both. The attention mechanism doesn’t “know” about language – it learns relationships in sequences, regardless of what those sequences represent. This generality is part of why transformers have become so dominant in recent advances; they’re not specialised for any particular domain, but are instead a flexible pattern-matching engine. We can, cautiously, draw analogies here with how we understand brains to work.
This extends even beyond the applications we’ve covered in this article: the state-of-the-art models at the time of writing, like Gemini 3, GPT-5 and Claude 4.6, are all multimodal. There’s nothing about these architectures that requires text to be the mode of understanding; it’s just easier to write about and understand them that way.
Compressed Timelines
The speed at which these developments have taken place is almost absurd and it’s definitely an exciting time to be alive and see the progress. As I write this, frontier models are making headway towards being able to significantly accelerate research into AI itself, which suggests progress could almost become exponential. If we’ve managed to get this far this quickly with just our human brains, it’s difficult to fathom how quickly we’ll be augmented by these systems.
There are plenty of skeptics in the world arguing that we will hit a wall with advances here, but I think even if that does turn out to be the case, the capabilities and values we’re seeing today are still outstanding and are nowhere near being fully utilised.
I will keenly continue to watch this field and fully enjoy the fruits we all collectively get to benefit from as a society.
References
Early Sequence Models (1997)
- Long Short-Term Memory: Hochreiter, S., & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation, 9(8), 1735–1780. https://www.bioinf.jku.at/publications/older/2604.pdf
Word Embeddings and Sequence Models (2013-2014)
- Word2Vec: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/pdf/1301.3781.pdf
- GloVe: Pennington, J., Socher, R., & Manning, C. D. (2014). “GloVe: Global Vectors for Word Representation.” https://nlp.stanford.edu/pubs/glove.pdf
- Sequence to Sequence Learning: Sutskever, I., Vinyals, O., & Le, Q. V. (2014). “Sequence to Sequence Learning with Neural Networks.” https://arxiv.org/pdf/1409.3215.pdf
- Neural Machine Translation with Attention: Bahdanau, D., Cho, K., & Bengio, Y. (2014). “Neural Machine Translation by Jointly Learning to Align and Translate.” https://arxiv.org/pdf/1409.0473.pdf
The Transformer Architecture (2017)
- Attention Is All You Need: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention Is All You Need.” https://arxiv.org/pdf/1706.03762.pdf
Early Large Language Models (2018)
- GPT-1: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). “Improving Language Understanding by Generative Pre-Training.” https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
- BERT: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” https://arxiv.org/pdf/1810.04805.pdf
Scaling Language Models (2019-2020)
- GPT-2: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). “Language Models are Unsupervised Multitask Learners.” https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Scaling Laws for Neural Language Models: Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). “Scaling Laws for Neural Language Models.” https://arxiv.org/pdf/2001.08361.pdf
- GPT-3: Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). “Language Models are Few-Shot Learners.” https://arxiv.org/pdf/2005.14165.pdf
Transformers Beyond Language (2020)
- AlphaFold 2: Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P., & Hassabis, D. (2021). “Highly accurate protein structure prediction with AlphaFold.” Nature, 596, 583–589. https://www.nature.com/articles/s41586-021-03819-2
Alignment and Safety (2022)
- InstructGPT: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). “Training language models to follow instructions with human feedback.” https://arxiv.org/pdf/2203.02155.pdf
- Training Compute-Optimal Large Language Models (Chinchilla): Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. van den, Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., & Sifre, L. (2022). “Training Compute-Optimal Large Language Models.” https://arxiv.org/pdf/2203.15556.pdf
- Constitutional AI: Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., Mercado, N., DasSarma, N., Lasenby, R., Larson, R., Ringer, S., Johnston, S., Kravec, S., Showk, S. E., Fort, S., Lanham, T., Telleen-Lawton, T., Conerly, T., Henighan, T., Hume, T., Bowman, S. R., Hatfield-Dodds, Z., Mann, B., Amodei, D., Joseph, N., McCandlish, S., Brown, T., & Kaplan, J. (2022). “Constitutional AI: Harmlessness from AI Feedback.” https://arxiv.org/pdf/2212.08073.pdf
Open Models (2023)
- LLaMA: Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). “LLaMA: Open and Efficient Foundation Language Models.” https://arxiv.org/pdf/2302.13971.pdf
- LLaMA 2: Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., & Scialom, T. (2023). “Llama 2: Open Foundation and Fine-Tuned Chat Models.” https://arxiv.org/pdf/2307.09288.pdf