Chris Jarrett-Davies

Large Language Models 2013–2023: Architecture, Scale and Alignment

February 2026

Introduction

Large language models (LLMs) have progressed from research artefacts to widely deployed systems within a remarkably short period of time. This article examines the technical developments that enabled that acceleration.

This article covers roughly 2013–2023, from early word embeddings to production-ready aligned models. That window captures the structural shifts that define the current generation of systems.

The emphasis here is on architectural and scaling dynamics rather than individual product releases.

We begin with numerical representations of language.

Representing Language

Computers manipulate symbols, not meaning. Encodings such as ASCII and Unicode map characters to numbers so text can be stored and transmitted, but those mappings contain no semantic structure. The number 65 represents “A” by convention, but nothing in that value encodes that “A” is a letter or that “cat” relates to “feline”.

Natural language processing (NLP) applies machine learning to language. This requires representations in which meaning is encoded numerically: vectors where related words occupy nearby regions in space.

Word Embeddings: Vector Spaces from Raw Text

Words are typically extracted from text through tokenisation: splitting on whitespace and removing punctuation. The process is imperfect, but with sufficient data it is usually adequate.

Tokenisation produces discrete symbols. The remaining problem is to embed those symbols into a space that captures semantic relationships.

Word2Vec: Words to Vectors

Word2Vec (Mikolov et al.) was published in 2013. Rather than hardcoding features or using words as atomic symbols, the technique learns to map each word to a dense vector (typically 300 dimensions) by training on a massive text dataset with a simple objective to predict words from their context.

The paper introduces two approaches: CBOW (continuous bag of words) and Skip-gram. CBOW attempts to predict a target word from its surrounding text (“The cat ___ on the mat”), while Skip-gram does the inverse and tries to predict context words from a target word (given “sat”, predict “cat” and “on”).

Both approaches start with random word vectors and iteratively adjust them through gradient descent to improve accuracy.

The most important outcome of these algorithms was not the prediction task itself, but an emergent property: words appearing in similar contexts developed similar vectors.

The learned vector space exhibited algebraic structure: vector("king") - vector("man") + vector("woman") ≈ vector("queen"). Semantic meanings were captured naturally by optimising the prediction objective across billions of words.

Visualising geometric relationships between words.
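
The analogy arithmetic can be reproduced with toy vectors. The two-dimensional embeddings below are hand-picked for illustration (real learned vectors have hundreds of dimensions and are learned, not chosen); only the arithmetic is faithful to the technique:

```python
import numpy as np

# Toy 2-dimensional embeddings (illustrative values, not learned):
# axis 0 loosely encodes "royalty", axis 1 loosely encodes gender-as-used-in-context.
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(target, vectors, exclude=()):
    """Return the word whose vector is closest to `target` by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(candidates[w], target))

# vector("king") - vector("man") + vector("woman") lands nearest "queen".
result = nearest(vectors["king"] - vectors["man"] + vectors["woman"],
                 vectors, exclude={"king", "man", "woman"})
print(result)  # queen
```

Excluding the input words from the search is standard practice when evaluating analogies, since the query vector is often closest to one of its own inputs.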

GloVe: Global Vectors for Word Representation

GloVe (Pennington et al., 2014) took a different approach but arrived at similar results. Instead of processing local context windows sequentially, GloVe constructed a co-occurrence matrix summarising how frequently words appeared near other words. The ratios of co-occurrence probabilities captured semantic relationships in a similar way.

A simple example of what this might look like is:

         king  queen  man  woman  throne  crown
king        0     45   62     12     152    238
queen      45      0    8     58     134    201
man        62      8    0     73       4      2
woman      12     58   73      0       3      1
throne    152    134    4      3       0     89
crown     238    201    2      1      89      0

The relationships are visible even in this simplified example: not only are there strong relationships between “king”, “throne” and “crown”, but also between “queen” and those same royal elements. “Queen” and “man”, on the other hand, have a much weaker co-occurrence relationship. A real matrix would have tens or hundreds of thousands of rows and columns, one per vocabulary entry, with counts derived from billions of words of real text.
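
Building such a matrix takes only a few lines. The toy corpus and window size below are illustrative:

```python
from collections import Counter

corpus = "the king sat on the throne . the queen wore the crown".split()
window = 4  # count pairs of words appearing within 4 positions of each other

cooc = Counter()
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[(word, corpus[j])] += 1

print(cooc[("king", "throne")], cooc[("queen", "crown")], cooc[("king", "crown")])
```

Even in this single sentence, “king” co-occurs with “throne” and “queen” with “crown”, while “king” and “crown” are too far apart to register.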

This echoed the Word2Vec finding: both approaches independently demonstrated that relatively simple algorithms, applied to large-scale corpora, were sufficient to extract meaningful semantic structure from crude corpus statistics.

Simple Algorithms on Large Datasets are Effective

The findings from both Word2Vec and GloVe were significant, but also came with some limitations:

• Static representations: each word receives a single vector regardless of context, so “bank” has the same vector in “river bank” and “bank account”
• Fixed context windows: only local co-occurrence is modelled, so word order and long-range structure are ignored
• No sequence modelling: the output is a set of word vectors, not a model that can process or generate sentences

Despite these limitations, this marked a transition from treating words as atomic symbols to representing them as points in a continuous semantic space. Modern language models address these shortcomings through architectures capable of modelling full sequences rather than fixed context windows.

The next development focused on sequence modelling.

Processing Sequences

The embedding techniques discussed previously capture semantic similarity between words, but they do not model sequence. Order matters in language; the position of words within a sentence and the structure of sentences within a paragraph both carry information.

The next challenge is to build models capable of processing and generating variable-length sequences.

Recurrent Neural Networks and LSTMs

A recurrent neural network (RNN) extends a standard artificial neural network by maintaining a hidden state that evolves as data is processed. At each step, this state is processed alongside the current input, producing both an output and an updated state that carries forward to the next step.

How state flows through a recurrent neural network.

This gives RNNs a form of memory: each token updates a hidden state that is carried forward through the sequence. This matters because language is inherently sequential; meaning depends on context, and words reference information from earlier in the sentence.
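
The recurrent update can be sketched in a few lines. The dimensions and randomly initialised weights below are placeholders for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 8  # illustrative sizes

# Learned parameters in a real model; randomly initialised for this sketch.
W_x = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

def rnn_step(h, x):
    """One recurrent update: new hidden state from current input and previous state."""
    return np.tanh(W_x @ x + W_h @ h + b)

# Process a sequence of 5 token vectors; the state threads through every step.
h = np.zeros(d_hidden)
for x in rng.normal(size=(5, d_in)):
    h = rnn_step(h, x)
print(h.shape)  # (8,)
```

The same weight matrices are reused at every step; only the hidden state changes as the sequence is consumed.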

In practice, vanilla RNNs struggle with longer sequences. As sentences grow, information from earlier tokens gradually fades, making it difficult to capture long-range dependencies, such as linking a pronoun at the end of a sentence to its referent at the beginning.

Long Short-Term Memory networks (LSTMs) (Hochreiter & Schmidhuber, 1997) improved on plain RNNs by introducing gating mechanisms that could selectively remember or forget information over longer sequences. This partially mitigates the vanishing gradient problem that affected earlier recurrent architectures.

seq2seq: Sequence-to-Sequence Models

In 2014, Sutskever et al. introduced the sequence-to-sequence (seq2seq) architecture for neural machine translation. This approach built upon the existing LSTM work by turning it into an encoder/decoder architecture:

Sequence-to-sequence architecture.

The encoder LSTM processes the input sentence (e.g. an English sentence) and produces a fixed-size context vector. From this, a decoder LSTM then generates the output sequence (e.g. a French translation) conditioned on this vector.

This approach performed well, but had a fundamental bottleneck: the entire input sequence had to be compressed into a single fixed-size vector. For short sentences this was manageable, but longer sequences meant forcing more information through the same narrow channel. Important details from early in the sequence could be lost by the time the decoder needed them.

The Attention Mechanism

In 2014, Bahdanau et al. introduced the attention mechanism as a solution to this compression bottleneck. Rather than compressing everything into a fixed-size context vector, the decoder could access all encoder hidden states and dynamically weight them when producing each output token.

When translating “The black cat sat on the mat” to French, the decoder could attend strongly to “cat” when generating “chat”, then shift attention to “mat” for “tapis”. The model could learn where to look, rather than having everything pre-compressed.

Visualising attention.
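
The weighting step can be sketched as follows. Dot-product scoring is used here for clarity (Bahdanau et al. used a small additive network), and random values stand in for real encoder and decoder activations:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 7, 16  # "The black cat sat on the mat" -> 7 encoder states

encoder_states = rng.normal(size=(seq_len, d))  # one vector per input token
decoder_state = rng.normal(size=d)              # current decoder hidden state

# Score every encoder state against the current decoder state.
scores = encoder_states @ decoder_state
weights = np.exp(scores - scores.max())
weights /= weights.sum()                        # softmax: weights sum to 1

# The context vector is a weighted mix of ALL encoder states, recomputed
# afresh for every output token -- no fixed-size bottleneck.
context = weights @ encoder_states
print(weights.sum(), context.shape)
```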

Translation quality improved significantly and the model had learned sensible alignment patterns between source and target languages without explicit supervision.

Remaining Limitations

Attention certainly improved upon the seq2seq model, but some fundamental issues remained:

• Sequential computation: tokens are still processed one step at a time, so training cannot be parallelised across positions
• Long-range dependencies: information still flows through a chain of recurrent states, degrading over long sequences
• Training cost: long sequences mean long chains of computation, with gradients propagated through every step

Addressing these limitations required a fundamental architectural change: removing recurrence entirely and building the model around attention.

Attention is All You Need

In 2017, researchers at Google published Attention is All You Need (Vaswani et al.). The paper introduced the transformer architecture, demonstrating that recurrence is not required for sequence modelling. Attention mechanisms alone are sufficient, and the architecture parallelises efficiently on modern hardware.

Before examining the architecture itself, we first consider how text is prepared for transformer models.

Tokenisation, Embedding and Vectors

Earlier in this article, I mentioned that words are typically tokenised based on whitespace and punctuation. This approach is common and intuitive, but modern models typically use byte pair encoding (BPE).

BPE begins with a vocabulary of individual characters and iteratively merges the most frequently occurring adjacent pairs. For example, "t" and "h" occur frequently together, so "th" is added to the vocabulary. This process repeats until all common groupings have been discovered, typically leaving us with a vocabulary of tens of thousands of subword tokens.
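
The merge loop can be sketched directly. The word frequencies below are illustrative:

```python
from collections import Counter

# Corpus words with frequencies; each word starts as a sequence of characters.
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if word[i:i + 2] == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

vocab = set(c for w in words for c in w)
for _ in range(3):                     # three merge rounds for the sketch
    pair = most_frequent_pair(words)
    vocab.add(pair[0] + pair[1])       # the merged pair becomes a new token
    words = merge(words, pair)
print(sorted(vocab - set("abcdefghijklmnopqrstuvwxyz")))
```

Repeating this for tens of thousands of rounds on a real corpus yields the subword vocabulary described above.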

With our BPE vocabulary constructed, we assign an index to each entry: 0 => "a", 1 => "b", … 1759 => "th", etc.

Input text is tokenised by greedily matching the longest tokens in the vocabulary from left to right. The text "unhappy" becomes the sequence [1523, 2847] (where 1523 might be the index for “un” and 2847 for “happy”).

These token IDs are integers, but transformers operate on dense vectors. Each token ID maps to a learned embedding vector through a lookup table. If our model dimension is 512, then token ID 1523 maps to a unique 512-dimensional vector of learned parameters.

These embedding vectors start as random values and are optimised during training, so tokens with similar meanings end up with similar vectors. This is the numerical representation the transformer actually processes.

Transforming text into tokens and through into embeddings.

This completes the pipeline: text → tokens → token IDs → embedding vectors.
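
The final lookup step can be sketched as follows, reusing the hypothetical indices from above; the table values are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = {"un": 1523, "happy": 2847}   # hypothetical slice of a BPE vocabulary
d_model, vocab_size = 512, 4000       # illustrative sizes

# The embedding table: one learned d_model-dimensional row per token ID.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = [vocab["un"], vocab["happy"]]   # "unhappy" -> [1523, 2847]
embeddings = embedding_table[token_ids]     # pure lookup, no arithmetic
print(embeddings.shape)  # (2, 512)
```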

Self-Attention

With our token embeddings prepared, we reach the heart of the transformer: self-attention (also called scaled dot-product attention). This mechanism allows every token to attend to every other token in the input simultaneously.

When the concept of attention was introduced by Bahdanau et al., the idea was that output tokens attended to input tokens. This is most easily illustrated in translation tasks. For clarity, we refer to that as cross-attention: the attention flows from one type of token to another. Self-attention instead models internal dependencies within a single sequence.

Visualising self-attention.

Here’s how it works. Each token’s embedding is transformed into three different vectors by multiplying it with three learned weight matrices:

  1. Query (Q): what this token is looking for in other tokens
  2. Key (K): what this token offers for other tokens to match against
  3. Value (V): the information this token passes along once attended to

To compute the attention for a given token, we:

  1. Calculate similarity scores: Compare its query vector against all key vectors (including its own) using dot products. Tokens whose keys align well with the query get high scores, since their dot product is large.
  2. Normalise to attention weights: Apply the softmax function to the results. This converts the scores into probabilities that sum to 1. These become the attention weights.
  3. Weighted combination: Multiply each value vector by its attention weight and sum them up. This produces a context-aware representation where the token has “looked at” every other token and weighted their contributions.

When we say a token “looks at” another token, we mean it’s comparing its query vector with that token’s key vector (via dot product) to determine how much attention to pay, then using that attention weight to incorporate that token’s value vector into its representation.

Crucially, this computation is performed in parallel across all positions. When processing “The cat sat”, all three tokens compute their attention simultaneously, each considering all others. We achieve this simultaneous processing by performing the dot product operations using matrices rather than doing one at a time.
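
The three steps above, expressed in matrix form, give scaled dot-product attention; the division by √d_k is the “scaled” part, keeping the softmax well-behaved as dimensions grow. Sizes and weights below are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
seq_len, d_model, d_k = 3, 8, 8        # "The cat sat": three tokens

X = rng.normal(size=(seq_len, d_model))           # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v               # all positions at once
weights = softmax(Q @ K.T / np.sqrt(d_k))         # (3, 3) attention weights
output = weights @ V                              # context-aware representations

print(weights.shape, output.shape)
```

Every row of `weights` is one token’s attention distribution over all three positions; the whole computation is a handful of matrix multiplications, with no sequential loop.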

Multi-Head Attention

Self-attention produces a single attention pattern per token. Language contains multiple simultaneous relationship types:

• Syntactic: subject–verb agreement, adjective–noun modification
• Semantic: topical and meaning-based associations
• Referential: pronouns linking back to their referents
• Positional: relationships that depend on word order and proximity

A single attention mechanism cannot effectively model all of these simultaneously: competing relationships would have to share the same weighting pattern.

To address this, we use multi-head attention. This approach runs multiple attention mechanisms in parallel, each with its own learned Q/K/V weight matrices. The original transformer used 8 heads.

Each head can specialise in different types of relationships. Different heads often specialise in syntactic, semantic, or positional patterns and they all operate simultaneously on the same input. The outputs from all heads are then concatenated and combined, giving the model a rich, multi-faceted understanding of each token’s relationships.
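
A minimal multi-head sketch: project, split the result into heads, attend within each head, then concatenate and mix. Sizes and random weights are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
seq_len, d_model, n_heads = 5, 64, 8
d_head = d_model // n_heads                      # 8 dimensions per head

X = rng.normal(size=(seq_len, d_model))
# Q/K/V projections plus the output projection that recombines the heads.
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(M):
    """(seq, d_model) -> (heads, seq, d_head)"""
    return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # per-head
heads = weights @ V                              # (heads, seq, d_head)

# Concatenate the heads back to (seq, d_model), then mix with W_o.
concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
output = concat @ W_o
print(output.shape)  # (5, 64)
```

Each of the 8 heads computes its own (5, 5) attention pattern over the same input, which is what allows them to specialise.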

Positional Encodings

As mentioned, transformers operate on all tokens at once rather than working sequentially like we saw with RNNs. While this unlocks substantial performance gains, it removes explicit ordering information.

To solve this, the transformer attaches positional encodings to each token. These are unique vectors for each position in the sequence. Position 0 gets one vector, position 1 gets another, and so on. These are added to the embedding vectors before any processing begins, giving each token both semantic meaning (from the embedding) and positional information.

With position information embedded, the attention mechanism can learn position-dependent patterns like “adjectives typically precede nouns” or understand that word order changes meaning fundamentally.
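
The original paper used fixed sinusoidal encodings, which can be generated as follows:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings from the original transformer paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=512)
# Added to the token embeddings before the first layer: X = embeddings + pe
print(pe.shape)  # (50, 512)
```

Each position receives a unique pattern of sine and cosine values at different frequencies; learned positional embeddings are a common alternative in later models.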

Architecture

With self-attention, positional encodings, and multi-head attention explained, we can now look at how these pieces fit together in the complete transformer.

The original transformer consists of two main components: the encoder stack and the decoder stack. The overall structure retains the encoder–decoder form introduced by seq2seq models:

The transformer architecture.

Encoder Stack

The encoder processes the input sequence through multiple identical layers. Each encoder layer contains:

  1. Multi-head self-attention: As we just covered
  2. Feed-forward network: A simple two-layer neural network applied independently to each token
  3. Residual connections: The input to each sub-layer is added to its output, preserving information and easing gradient flow
  4. Layer normalisation: Applied after each sub-layer to stabilise training

Each layer transforms the token representations, essentially building increasingly abstract understandings of the input. The output of the final encoder layer is then a sequence of context-aware representations. Each token representation incorporates information from every other token in the sequence. And because of the multi-head aspect, this is a multi-faceted understanding.

Decoder Stack

The decoder generates the output sequence one token at a time, also using multiple identical layers. Each decoder layer contains:

  1. Masked multi-head self-attention: Similar to the encoder, but tokens can only attend to previous positions (not future ones), enforcing autoregressive generation
  2. Cross-attention: Attends to the encoder’s output, allowing the decoder to focus on relevant parts of the input when generating each output token
  3. Feed-forward network: Same as in the encoder
  4. Residual connections and layer normalisation: Applied throughout, as in the encoder stack
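
The mask in step 1 can be constructed explicitly. Adding it to the attention scores before the softmax drives the weights for future positions to zero:

```python
import numpy as np

seq_len = 4
# Upper-triangular mask: position i may attend to positions 0..i only.
# -inf scores become zero weights after the softmax.
causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
print(causal_mask)
```

In practice this mask is simply added to the (seq, seq) score matrix, so the same parallel matrix computation serves both encoder-style and autoregressive attention.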

After the final decoder layer, a linear projection maps the representations to vocabulary size, followed by softmax to get probabilities over all possible next tokens. During generation, the most likely token is selected, added to the sequence, and the process repeats.

Producing words: The final output from the architecture is a probability distribution: a vector with one probability for each token in the vocabulary (tens of thousands of entries). Each value represents how likely that token is to be the next word. During generation, a token is sampled or selected from this distribution and appended to the sequence. The process repeats until an end-of-sequence token is produced.
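
A sketch of this final step, with random values standing in for the decoder output and the learned projection:

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, vocab_size = 512, 32000      # illustrative sizes

h_last = rng.normal(size=d_model)     # final decoder state at the last position
W_out = rng.normal(size=(d_model, vocab_size))  # learned linear projection

logits = h_last @ W_out               # one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax: a distribution over the vocabulary

next_token = int(np.argmax(probs))    # greedy selection
# (sampling instead: next_token = rng.choice(vocab_size, p=probs))
print(probs.sum(), next_token)
```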

Impact

The transformer addressed the key limitations of RNN-based models: computation was parallelisable, long-range dependencies were modelled directly, and gradients propagated more effectively.

Translation performance improved substantially, and training efficiency increased. More importantly, the architecture scaled predictably: performance continued improving as model and dataset sizes increased.

Transformers: More Than Just Language

Before continuing with language models, it is useful to note that the transformer architecture is not limited to text. Though transformers are often associated primarily with language tasks such as translation, the core mechanism applies more broadly.

DeepMind has demonstrated the versatility of transformers across domains that have nothing to do with language.

AlphaFold 2 and Protein Structure

AlphaFold 2 significantly advanced protein structure prediction, which is a foundational problem in biochemistry and medicine. The problem revolves around predicting how a chain of amino acids will fold into a 3D protein structure. Although this domain appears unrelated to language, the underlying modelling problem is still sequential.

The model treats amino acid sequences as tokens and uses attention to model long-range spatial interactions between residues, even when they are distant in sequence order. Multiple sequence alignments (similar proteins from other species) provide additional context, much like how language context helps predict words.

AlphaFold 2 predicts 3D protein structure from amino acid sequences using transformers.

AlphaFold 2 achieved accuracy approaching experimental methods, addressing a longstanding challenge in structural biology.

Accurate protein structure prediction has significant implications for biology and medicine; proteins carry out nearly every function in living cells and their structure determines how they work. Understanding protein structures enables researchers to design better drugs by seeing exactly how molecules bind to their targets, understand genetic diseases caused by misfolded proteins and accelerate vaccine development.

Structures that previously required months of experimental work using X-ray crystallography or cryo-electron microscopy can now often be predicted computationally in hours. AlphaFold 2 has made its predictions freely available for over 200 million proteins, substantially expanding access to structural data across the life sciences.

Genomics and DNA

Transformers have also been applied to genomic sequences, treating base pairs (A, T, C, G) as tokens. The attention mechanism learns long-range dependencies in genetic code: it identifies how regulatory elements thousands of base pairs away might influence gene expression.

This has applications in understanding genetic variation and disease mechanisms.

Transformers Identify Structural Relationships

Transformers are effective at modelling structural relationships in sequences of tokens. The tokens may represent words, amino acids, or DNA base pairs; the mechanism is unchanged and when statistical structure is present, attention mechanisms can learn it.

This generality is part of why transformers have become so dominant: the architecture is not domain-specific, it is a general-purpose sequence modelling framework that scales with data and compute.

We now return to language modelling. The architectural principles described here are domain-agnostic; they apply wherever sequential structure can be represented as tokens.

Generative Pre-Training: GPT-1

In June 2018, OpenAI published Improving Language Understanding by Generative Pre-Training (Radford et al.), introducing GPT-1.

The original transformer architecture contained both encoder and decoder stacks. GPT-1 instead used only the decoder stack, retaining masked self-attention and feed-forward layers. This simplified architecture omits cross-attention, as there is no encoder representation to attend to:

GPT-1: Decoder-only architecture.

This configuration is inherently generative: the model is trained to predict the next token in a sequence.

Generative Pre-Training

During training, the objective was simple: predict the next token. This is known as language modelling and is trained left-to-right: the model conditions only on previous tokens and never on future ones.

The central finding was that this simple objective, applied to large-scale text corpora, produced rich internal representations of language. GPT-1 was pre-trained on the BookCorpus dataset, comprising approximately 7,000 unpublished books, and then fine-tuned on supervised downstream tasks.

This paradigm is now standard, but GPT-1 formalised it:

  1. Pre-train on large-scale unlabelled text using the language modelling objective
  2. Fine-tune the same model on specific downstream tasks, such as classification, question-answering, etc.

Crucially, this approach decoupled representation learning from task-specific supervision. General language structure was learned first, task behaviour was learned later.

GPT-1 contained 117 million parameters, modest by current standards, but it demonstrated that a decoder-only transformer could learn transferable language representations.

Bidirectional Encoders: BERT

A few months after GPT-1, Google published BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., October 2018), introducing BERT (Bidirectional Encoder Representations from Transformers). Whereas GPT-1 used a decoder-only architecture, BERT adopted an encoder-only design.

Instead of predicting the next token in sequence, BERT is trained to predict masked tokens within the input. This is called masked language modelling:

BERT architecture.

For example, given a sentence such as “The cat [MASK] on the mat”, the model must predict “sat”.

A key architectural distinction from GPT-1 is that BERT uses full bidirectional self-attention, allowing each token to attend to tokens on both sides simultaneously. This makes BERT well suited to representation learning and classification tasks rather than autoregressive generation.

Training

Training BERT involved randomly masking 15% of tokens from the input and training the model to predict them correctly. This enabled large-scale self-supervised learning over unlabelled corpora.
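
A sketch of the masking step. The sentence and random draw are illustrative, and this omits a refinement from the paper (a fraction of the selected tokens are replaced with random tokens or left unchanged rather than masked):

```python
import numpy as np

rng = np.random.default_rng(6)
tokens = "the cat sat on the mat and watched the dog".split()

mask_prob = 0.15
is_masked = rng.random(len(tokens)) < mask_prob   # select ~15% of positions
inputs = ["[MASK]" if m else t for t, m in zip(tokens, is_masked)]
targets = [t if m else None for t, m in zip(tokens, is_masked)]  # predict these

print(inputs)   # masked positions vary with the random draw
```

The model sees `inputs` and is trained to recover the original tokens at the masked positions, using context from both directions.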

As with GPT-1, BERT followed the two-phase paradigm: pre-train on a general corpus, then fine-tune on downstream tasks.

Results

Bidirectional attention provided advantages over strictly left-to-right models such as GPT-1 on many benchmark tasks. Taking full surrounding context into account, not just previous tokens, produced stronger contextual representations. This reinforced the architectural pattern of encoder models for contextual representation and decoder models for generation.

The distinction between bidirectional encoding and autoregressive decoding would remain central to transformer development, influencing how models were deployed across tasks.

BERT was released in two sizes: BERT-Base (110 million parameters) and BERT-Large (340 million parameters). Performance improved with model size, reinforcing the emerging relationship between scale and capability.

GPT-2

In February 2019, OpenAI released “Language Models are Unsupervised Multitask Learners” (Radford et al.), introducing GPT-2. The architecture remained largely unchanged from GPT-1, but the model scale increased substantially: 1.5 billion parameters compared to 117 million.

The increase in scale produced marked improvements across a range of tasks. GPT-2 generated extended passages of text, performed translation, answered questions, and summarised content without task-specific fine-tuning. It demonstrated zero-shot learning: performing tasks without gradient updates on task-specific data, relying solely on patterns learned during pre-training.

These behaviours are often described as emergent capabilities: abilities that appear once models reach sufficient scale, despite not being directly optimised for those tasks during training. This suggested that scale alone could induce qualitatively new behaviour without architectural change.

Importantly, the training objective remained unchanged: next-token prediction. The shift in behaviour was attributable primarily to model and data scale.

GPT-3

In May 2020, OpenAI published “Language Models are Few-Shot Learners” (Brown et al.), introducing GPT-3.

GPT-3 retained the decoder-only architecture and scaled it to 175 billion parameters, more than 100× larger than GPT-2. It was trained on hundreds of billions of tokens from diverse sources: books, Wikipedia, web text, and more. Training required substantial computational resources, but the results reinforced the scaling trend observed with GPT-2: performance continued improving as model size increased.

Few-Shot and In-Context Learning

A defining capability of GPT-3 was in-context learning: the model could adapt to a novel task from a few examples provided in the prompt, without any parameter updates or fine-tuning.

This differs from both zero-shot learning and fine-tuning: the examples act as transient conditioning rather than triggering parameter updates.
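
A few-shot prompt is just text. The example below mirrors the translation format used in the GPT-3 paper; the worked examples establish the pattern, and the model completes the final line:

```python
# Worked examples precede the query; the model infers the task purely from
# the pattern in the prompt, with no parameter updates.
prompt = """Translate English to French.

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""
print(prompt)
```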

In-context learning performance improved with model size; larger variants performed substantially better than smaller ones.

Discontinuous Improvements

Some tasks exhibited non-linear improvements as scale increased. For example, smaller models failed at certain three-digit arithmetic tasks, whereas larger models succeeded. This suggests that scaling does not merely improve existing behaviour; it can enable qualitatively new behaviours once certain capacity thresholds are crossed.

Importantly, the training objective remained unchanged. Architectural continuity combined with scale was sufficient to produce these behavioural shifts.

Scaling Laws

GPT-2 and GPT-3 suggested that larger models performed better, but the relationship required formal characterisation. In January 2020, OpenAI published Scaling Laws for Neural Language Models (Kaplan et al.), which quantified how performance scales with three primary variables: model size, dataset size and training compute.

The study found that loss decreases according to smooth power-law relationships with respect to each variable:

• Increasing model parameters reduces loss predictably
• Increasing training data reduces loss predictably
• Increasing training compute reduces loss predictably

These relationships held consistently across several orders of magnitude.
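
The parameter term of such a power law can be sketched as follows. The constants are roughly the values fitted in the paper for the parameter-only law, but should be treated as illustrative:

```python
# Power-law loss as a function of parameter count: L(N) = (N_c / N) ** alpha.
# Constants roughly follow Kaplan et al.'s fitted values; treat as illustrative.
def loss_from_params(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> loss {loss_from_params(n):.3f}")
```

The smooth, monotonic decrease is the point: doubling parameters buys a predictable loss reduction, across many orders of magnitude.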

A practical implication was that, given a fixed compute budget, it is more effective to train a larger model on less data than a smaller model on more data. For example, if you have enough compute to train either:

  1. A smaller model on a large dataset, or
  2. A larger model on a smaller dataset

The scaling laws predicted the second configuration would achieve lower loss, even though the model sees less data. The larger model utilises its representational capacity more effectively under the same compute constraint.

Chinchilla

In 2022, DeepMind published Training Compute-Optimal Large Language Models (Hoffmann et al.), commonly referred to as the Chinchilla paper. The study refined the earlier scaling analysis, demonstrating that optimal performance depends on the joint scaling of model size and training data.

For compute-optimal training, both model size and training tokens must increase proportionally as compute increases.

This clarified why some models underperformed: they were either over-parameterised relative to data or under-parameterised relative to compute. The 70-billion-parameter Chinchilla model outperformed the 280-billion-parameter Gopher model by training on substantially more tokens with a better parameter-to-data ratio.
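
A commonly cited rule of thumb from the paper is roughly 20 training tokens per parameter for compute-optimal training:

```python
# Chinchilla's headline heuristic: scale data with model size,
# roughly 20 training tokens per parameter (a rule of thumb, not a law).
def chinchilla_tokens(n_params, tokens_per_param=20):
    return n_params * tokens_per_param

print(f"{chinchilla_tokens(70e9):.2e}")  # 70B params -> ~1.4e12 tokens
```

Chinchilla itself fits this ratio: 70 billion parameters trained on roughly 1.4 trillion tokens.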

These findings influenced subsequent models, including LLaMA, which prioritised compute-optimal training over maximising parameter count alone.

Alignment

By 2022, large language models could generate fluent and contextually appropriate text, but they retained a fundamental limitation: they were optimised purely for next-token prediction, not for helpfulness or safety.

For example, prompting GPT-3 with “How do I make a bomb?” could yield detailed instructions, not because the system intended harm, but because such continuations existed in its training data and next-token prediction was its only objective.

This behaviour is described as unaligned: the training objective (predict the next token) does not reflect the desired behavioural objective (be helpful and harmless).

InstructGPT: Reinforcement Learning from Human Feedback

In March 2022, OpenAI published Training language models to follow instructions with human feedback (Ouyang et al.), introducing InstructGPT and formalising reinforcement learning from human feedback (RLHF) for large language models.

InstructGPT’s training involved three stages:

  1. Supervised Fine-Tuning (SFT): Begin with a pretrained GPT-3 model and fine-tune it on high-quality instruction–response pairs produced by human annotators.
  2. Reward Model Training: The model generates multiple responses to the same prompt and humans rank them by quality. A separate reward model is trained to predict these human preference rankings.
  3. Reinforcement Learning: The reward model is then used to optimise the original language model using reinforcement learning. Responses are scored by the reward model, and the policy is updated to increase expected reward while remaining close to the pretrained distribution. In the InstructGPT paper, Proximal Policy Optimisation (PPO) was used for this step.

InstructGPT: Reinforcement Learning from Human Feedback
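
The reward model in stage 2 is typically trained with a pairwise preference loss, pushing the reward of the human-preferred response above the rejected one. A sketch with made-up reward scores:

```python
import numpy as np

# Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
# Minimising it increases the margin between preferred and rejected responses.
def preference_loss(r_chosen, r_rejected):
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(preference_loss(2.0, 0.5))   # small loss: ranking already correct
print(preference_loss(0.5, 2.0))   # large loss: ranking violated
```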

RLHF shifted base language models toward assistant-style behaviour. InstructGPT models were more likely to follow instructions accurately, less likely to generate harmful or toxic content, and more closely aligned with human preferences.

ChatGPT

InstructGPT directly enabled the release of ChatGPT in November 2022, which applied RLHF to a GPT-3.5 base model.

The conversational assistant behaviour associated with ChatGPT arose primarily from alignment techniques rather than architectural changes. The underlying architecture remained a large decoder-only transformer; the behavioural shift was driven by RLHF.

Constitutional AI

Around the same time, Anthropic published Constitutional AI: Harmlessness from AI Feedback (Bai et al., December 2022), proposing an alternative alignment method.

Instead of relying solely on human preference data, Constitutional AI incorporates model-generated critiques guided by an explicit set of principles.

Under this paradigm, the training process is:

  1. Generate potentially harmful responses
  2. Critique responses according to a predefined constitution
  3. Revise responses based on the critique
  4. Train a reward model on these comparisons
  5. Optimise the policy using reinforcement learning

Constitutional AI: Reinforcement Learning from AI Feedback

This approach reduces reliance on human annotation and makes the alignment criteria explicit: principles are codified directly rather than inferred from aggregated human rankings.

Models at Home: LLaMA

By the end of 2022, the most capable language models (GPT-3, PaLM, Chinchilla) were closed and accessible primarily via APIs. This limited the ability of external researchers to study, modify, or extend these systems.

In February 2023, Meta published LLaMA: Open and Efficient Foundation Language Models (Touvron et al.) releasing a family of comparatively smaller models designed to run on accessible hardware. Rather than maximising parameter count, LLaMA applied the Chinchilla scaling principles to train smaller models on substantially more data for improved compute efficiency.

The cost of large language models can be divided into training cost and inference cost. LLaMA prioritised higher training investment to reduce downstream inference cost.

Model sizes ranged from 7 billion to 65 billion parameters. Even the smallest model was trained on approximately one trillion tokens, substantially more than many earlier models. The results supported the compute-optimal training approach, and LLaMA-65B was competitive with substantially larger closed models.

Although initially released under restricted terms, the model weights were subsequently leaked and became widely available. This enabled independent researchers and practitioners to experiment with high-capability models outside major laboratories. Community fine-tuned derivatives emerged, including instruction-tuned variants such as Alpaca and conversational models such as Vicuna.

LLaMA 2

In July 2023, Meta published LLaMA 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al.), providing a broader release of model weights for research and commercial use. LLaMA 2 represented one of the first major frontier-class models released with broad usage permissions.

LLaMA 2 was released in multiple sizes and offered both base and RLHF-aligned variants.

The broader availability of LLaMA 2 enabled external researchers to examine alignment techniques directly, experiment with fine-tuning strategies, build applications without API constraints, and analyse model behaviour through weight-level inspection.

Conclusion

From Word2Vec in 2013 to LLaMA 2 in 2023, the dominant pattern has been continuity rather than disruption. Architectural innovation established the transformer as a general-purpose sequence model. Scaling laws formalised how performance improves with model size, data, and compute. Alignment techniques modified behaviour without altering the underlying architecture.

Three structural themes emerge from this trajectory.

1. Scale Produces Qualitative Change

Across GPT-2 and GPT-3, increasing scale did not merely improve accuracy on existing tasks; it enabled qualitatively new behaviours such as in-context learning. The training objective remained constant, yet capability shifted. This suggests that certain behaviours emerge once representational capacity crosses specific thresholds.

2. Architectural Generality

The transformer architecture proved effective across language, protein structure prediction, and genomics. The mechanism is domain-agnostic: attention models relationships within sequences, regardless of whether the tokens represent words, amino acids, or base pairs. This generality has contributed significantly to its dominance.

3. Compressed Development Cycles

The interval between major capability shifts has been short. Architectural stabilisation in 2017 was followed by rapid scaling, formal scaling analysis, and alignment layering within a few years. Progress has been driven less by paradigm replacement and more by disciplined scaling and optimisation.

Closing Observation

The defining feature of this decade has not been isolated breakthroughs, but the compounding effect of a stable architecture scaled and behaviourally aligned. The transformer did not need to be replaced; it needed to be scaled, trained efficiently, and behaviourally constrained.

That structural pattern is more informative than any individual model release.

References

Early Sequence Models (1997)

Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation.

Word Embeddings and Sequence Models (2013–2014)

Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space.
Pennington, J., Socher, R. & Manning, C. (2014). GloVe: Global Vectors for Word Representation.
Sutskever, I., Vinyals, O. & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks.
Bahdanau, D., Cho, K. & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate.

The Transformer Architecture (2017)

Vaswani, A. et al. (2017). Attention Is All You Need.

Early Large Language Models (2018)

Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training.
Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Scaling Language Models (2019–2022)

Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners.
Brown, T. et al. (2020). Language Models are Few-Shot Learners.
Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models.
Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models.

Transformers Beyond Language (2020–2021)

Jumper, J. et al. (2021). Highly Accurate Protein Structure Prediction with AlphaFold. Nature.

Alignment and Safety (2022)

Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback.
Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback.

Open Models (2023)

Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models.
Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models.