Transformers

A. Introduction to the Transformer

The transformer is the standard architecture behind today’s large language models (LLMs). Transformers have completely changed the way we do speech and language processing, and every chapter that follows will build on this idea.

What is a Transformer?

A transformer is a neural network with a special structure that uses self-attention (or multi-head attention).

Transformer Architecture

The figure below shows the main components of the transformer:

Diagram of transformer architecture

The transformer has three main parts:

1. Input Encoding

2. Transformer Blocks (the core of the model)

3. Language Modeling Head

Summary

In short: a transformer is a neural network that predicts tokens one by one. It works by first turning tokens into vectors, then processing them through stacks of attention-based layers, and finally predicting the next token using a softmax output head.

B. Attention

In older models like word2vec, each word always had one fixed meaning. For example, “bank” got a single vector whether it meant a river bank or a financial bank.

But in real life, the meaning of a word depends on context (the other words around it). That’s what transformers + attention solve.

🐔 Example 1: "it" with different meanings

  1. The chicken didn’t cross the road because it was too tired. → Here, it = chicken.
  2. The chicken didn’t cross the road because it was too wide. → Here, it = road.

👉 The word “it” changes its meaning depending on what came before. Transformers figure this out using attention: they look at all the words in the sentence and decide which ones are important for understanding “it”.

Attention is the mechanism in the transformer that weighs and combines the representations from appropriate other tokens in the context from layer k−1 to build the representation for tokens in layer k.

Simple Attention Diagram

Attention diagram — "it" attends to earlier tokens


The diagram above shows how the token it (upper row) assigns attention to earlier tokens. Thickness of the arrow indicates attention weight (stronger → thicker). Here chicken receives the highest attention (0.60), road gets some attention (0.30), and other nearby tokens receive small weights.

⏳ Example 2: Reading step by step

When reading left to right:

The chicken didn’t cross the road because it...

At this moment, we don’t yet know if “it” refers to chicken or road. So the model may pay attention to both until the next word (“tired” or “wide”) makes it clear.

🔑 Example 3: Grammar & Meaning

  1. The keys to the cabinet are on the table.
    Subject = keys (plural).
    Verb = are (plural).
    Attention helps the model connect keys with are, even though cabinet is closer.
  2. I walked along the pond, and noticed one of the trees along the bank.
    “bank” here means river bank, not financial bank.
    The model knows this because of nearby words like pond and trees.

🎯 How Attention Works

At each layer of the transformer, every token computes attention weights over the other tokens in its context and uses them to decide how much each one contributes to its new representation.

In the diagram, the word “it” attends strongly to chicken and road because these are the most likely references.

✅ In short, here is what attention computes:

“When I’m looking at the current word, which earlier words should I pay more attention to, and by how much?”

It does this by giving weights (importance values) to all the earlier words and then combining them into a new representation.

At a given layer of a transformer, attention builds a new vector representation for each token by selectively combining information from earlier tokens. For token position i, attention takes the current representation xᵢ and the set of representations x₁, …, xᵢ and produces a new vector aᵢ that summarizes the context most relevant to xᵢ.

In causal (left-to-right) language models the context for position i is the tokens up to and including position i (no future tokens). Attention is computed independently at every position, so a self-attention layer maps the full input sequence (x₁,…,xₙ) to an output sequence (a₁,…,aₙ) of the same length.

Intuition

Attention answers the question: “Given the current token, which earlier tokens are most useful for understanding it?” It does so by (1) scoring how similar each earlier token is to the current token, (2) turning those scores into normalized weights (probabilities), and (3) computing a weighted sum of the earlier token vectors using those weights.

Step-by-step — the simplified attention formula

Step 1 — inputs

We assume token vectors x₁, x₂, …, xₙ. For the current position i we will produce aᵢ, the attention output for position i.

Step 2 — raw similarity scores (dot product)

For each earlier position j ≤ i compute a scalar score measuring similarity between xᵢ and xⱼ. The simplest choice is the dot product:

score(x_i, x_j) = x_i · x_j

Intuition: if two vectors point in similar directions, their dot product is large, which suggests the earlier token xⱼ is relevant to xᵢ.

Step 3 — normalize scores to weights (softmax)

Raw scores can be any real numbers. We convert them into a probability distribution over the earlier tokens so the values become interpretable as "how much to use each token". This is done with the softmax:

a_{ij} = softmax_j( score(x_i, x_j) )  for j ≤ i

The result a_{ij} is nonnegative and the weights for all j ≤ i sum to 1. Typically the weight on j = i (the token itself) is large, but other tokens may also receive substantial weight if they are similar to xᵢ.

Step 4 — weighted sum

Use the weights to combine the earlier vectors into the attention output:

a_i = Σ_{j ≤ i} a_{ij} · x_j

In words: multiply each prior vector xⱼ by its attention weight a_{ij} and add them up. The result aᵢ is a new contextualized vector for position i.
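To make the three steps concrete, here is a minimal NumPy sketch of this simplified, weights-only form of attention; the token vectors are made-up values for illustration:

    import numpy as np

    def softmax(z):
        z = z - np.max(z)              # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def simple_attention(X, i):
        """Simplified attention for position i over tokens x_1 .. x_i.

        X: (n, d) array of token vectors. Returns a_i, the new contextual vector.
        """
        context = X[: i + 1]           # causal: only positions j <= i
        scores = context @ X[i]        # Step 2: dot-product similarities
        weights = softmax(scores)      # Step 3: normalize to weights a_ij
        return weights @ context       # Step 4: weighted sum of the x_j

    # Toy example: three 4-dimensional token vectors.
    X = np.array([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 2.0, 0.0, 2.0],
                  [1.0, 1.0, 1.0, 1.0]])
    print(simple_attention(X, i=2))    # contextualized vector for the third token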

Why these steps make sense

  1. Dot product → similarity: vectors that are similar point in similar directions; the dot product is a direct measure of that.
  2. Softmax → normalized importance: exponent + normalization produces positive weights that sum to one, so the model forms a convex combination (a weighted average) of prior vectors.
  3. Weighted sum → contextual vector: the output ai is a blended vector that pulls information from the tokens that matter most for the current position.

In language, this lets a token's representation incorporate information from words that might be far away in the sentence: pronoun resolution, agreement (subject ↔ verb), and disambiguating word senses are all examples where attention helps.

Summary

Attention converts raw similarity scores between the current token and each prior token into a probability distribution (softmax), then builds a context-aware vector as the weighted sum of prior vectors. This simple mechanism—score, normalize, mix—repeats at every layer and every position, and is the key building block that lets transformers form rich contextualized token embeddings.

Attention — A Step-by-Step Example

Let’s carefully walk through how self-attention works inside a transformer. We’ll use a short sentence: “The cat sat”. Our focus will be on the token “sat” (x₃), and how the model builds its new representation a₃ by looking back at earlier tokens (x₁ = “the”, x₂ = “cat”, and itself).

Diagram: the current token xᵢ = x₃ (“sat”) attends to x₁ (“the”), x₂ (“cat”), and itself; dot scores (1.0, 3.0, 2.0) pass through a softmax to give weights (≈0.09, 0.67, 0.24), and aᵢ = 0.09·x₁ + 0.67·x₂ + 0.24·x₃.

This diagram shows one token (current xᵢ = sat) attending to earlier tokens x₁, x₂, x₃. Raw dot-product scores (1.0, 3.0, 2.0) are normalized by softmax to weights (~0.09, 0.67, 0.24), then the output aᵢ is the weighted sum of the earlier vectors.

Step 1 — Compare with earlier tokens

The current word “sat” (x₃) is compared with each token before it (including itself). These comparisons are done using the dot product. For our example, the similarity scores come out as:

    score(x₃, x₁) = 1.0,  score(x₃, x₂) = 3.0,  score(x₃, x₃) = 2.0

Step 2 — Convert scores into probabilities

Raw scores can be large or negative, so we normalize them using the softmax function. This turns the scores into probabilities that always add up to 1:

    softmax([1.0, 3.0, 2.0]) → [0.09, 0.67, 0.24]
  

Meaning: the word “sat” attends mostly to “cat” (67%), some to itself (24%), and very little to “the” (9%).

Step 3 — Build the new representation

Finally, we compute the new output vector a₃ by taking a weighted sum of all the inputs:

    a₃ = 0.09·x₁ + 0.67·x₂ + 0.24·x₃
  

This means the new meaning of “sat” now strongly includes information about “cat” (the subject), making it easier for the model to understand who did the action.
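These numbers are easy to verify: only the raw scores are needed, since the vectors x₁, x₂, x₃ stay symbolic in this example.

    import numpy as np

    scores = np.array([1.0, 3.0, 2.0])    # score(x3,x1), score(x3,x2), score(x3,x3)
    weights = np.exp(scores) / np.exp(scores).sum()
    print(weights.round(2))                # -> [0.09 0.67 0.24]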

✅ Key takeaway

Attention doesn’t just look at the current word in isolation. Instead, it asks: “Which earlier words matter most for understanding this one?”. The answer is encoded in the attention weights, and the new vector aᵢ is built from a blend of those important words.

Understanding a Single Attention Head

The concept of attention lies at the very heart of the Transformer architecture. In this chapter, we will explore step by step how a single attention head works. Rather than thinking of it as an abstract equation, we will break it down into an intuitive process. Imagine that every word in a sentence has the ability to ask: “Whom should I pay attention to, and by how much?”

Step 1: Input Representation

We begin with an input embedding for each token in the sequence. If the model dimension is d, each token is represented as a vector:

xᵢ ∈ ℝ^(1×d)

Here, xᵢ is the embedding of the i-th word.

Step 2: Creating Queries, Keys, and Values

From each input vector, the model creates three different projections: a Query (Q), a Key (K), and a Value (V). These are obtained by multiplying the input with learned weight matrices:

qᵢ = xᵢ W_Q,   kᵢ = xᵢ W_K,   vᵢ = xᵢ W_V

The role of these projections is to allow the same word to play different parts in the attention process: the query asks the question, the key provides the address to be matched, and the value carries the information to be retrieved.

Step 3: Matching Queries with Keys

Once we have queries and keys, we can measure how strongly a word should attend to others. This is done by taking the dot product between a query and a key, scaled by the dimension of the key vectors:

score(xᵢ, xⱼ) = (qᵢ · kⱼ) / √d_k

If the dot product is large, the query and key are similar, which means the word xᵢ should pay closer attention to xⱼ.

Step 4: From Scores to Attention Weights

These raw scores are then normalized using the softmax function. This ensures that the attention weights form a probability distribution:

a_{ij} = softmax(score(xᵢ, xⱼ))

Step 5: Weighted Sum of Values

Using these attention weights, we take a weighted sum of the Value vectors. This results in a new vector representation for the current token, one that incorporates contextual information from the tokens it attended to.

Step 6: Output Projection

Finally, this output is passed through another learned linear transformation, represented by W_O, to bring it back into the same dimension as the input embeddings.

Diagram of a single attention head: input xᵢ → Query (Q), Key (K), Value (V) → Q·Kᵀ → softmax → Σ (weights × V) → output.

Summary — Single Attention Head

Each token xᵢ creates three versions of itself: a Query (Q), a Key (K), and a Value (V).

Compare: The query of the current token is compared with the keys of all tokens to produce similarity scores.

Normalize: The raw similarity scores are converted into attention weights using a softmax, so they form a probability distribution (weights sum to 1).

Mix: These attention weights are used to compute a weighted sum of the value vectors, producing a new contextual representation (the attended output).

Project back: The head output is mapped back to the model dimension using the output matrix W_O, so the final output has the same size as the original token vector.

In compact form: each xᵢ → (Q, K, V) → scores via Q·Kᵀ → softmax → weights → weighted sum of V → final projection with W_O.
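As a concrete sketch of this compact form, the following NumPy code runs one causal attention head over a whole toy sequence; the weight matrices are random stand-ins for learned parameters, and d_v = d_k is assumed for simplicity:

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention_head(X, W_Q, W_K, W_V, W_O):
        """One causal self-attention head over a whole sequence X of shape (N, d)."""
        N, d_k = X.shape[0], W_K.shape[1]
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # project each token to q, k, v
        scores = Q @ K.T / np.sqrt(d_k)          # all pairwise q_i . k_j, scaled
        mask = np.triu(np.full((N, N), -np.inf), k=1)
        weights = softmax(scores + mask)         # each row sums to 1 over j <= i
        return weights @ V @ W_O                 # mix values, project back to d

    rng = np.random.default_rng(0)
    N, d, d_k = 5, 8, 4                          # d_v = d_k here for simplicity
    X = rng.normal(size=(N, d))
    W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
    W_O = rng.normal(size=(d_k, d))
    print(attention_head(X, W_Q, W_K, W_V, W_O).shape)   # (5, 8): same shape as X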

Multi-Head Attention

Intuition

Multi-head attention enriches the model’s representation by allowing it to examine the context from several different perspectives simultaneously. While one head may align strongly with semantically related words, another may emphasize positional structure, and another may attend to rare but important connections. The combination of all these heads, followed by a projection back to the model space, gives the transformer both breadth and depth in capturing dependencies across a sequence.

A transformer does not rely on a single attention mechanism. Instead, it uses multiple parallel attention heads within the same layer. The idea is that each head can focus on a different aspect of the context. One head may capture short-range dependencies, another may focus on long-distance relationships, while yet another may specialize in syntactic or semantic cues. By combining these heads, the model gains a richer and more flexible representation of the input sequence.

Head-Specific Projections

Each attention head has its own set of learnable parameters. Given an input vector xᵢ at position i, the model projects it into separate query, key, and value vectors for each head. For head c, this is written as:

qᶜᵢ = xᵢ · W_Qᶜ
kᶜⱼ = xⱼ · W_Kᶜ
vᶜⱼ = xⱼ · W_Vᶜ

Here, W_Qᶜ, W_Kᶜ, and W_Vᶜ are parameter matrices of dimensions d × d_k, d × d_k, and d × d_v respectively. This means every head learns its own mapping from the model dimension d into smaller subspaces of size d_k and d_v.

Attention Within a Head

Once queries and keys are defined, the similarity between a query at position i and a key at position j is measured using the dot product. To stabilize gradients, this score is scaled by the square root of d_k:

scoreᶜ(xᵢ, xⱼ) = (qᶜᵢ · kᶜⱼ) / √(d_k)

These scores are then normalized with a softmax across all context positions to obtain attention weights:

aᶜ_{i j} = softmax(scoreᶜ(xᵢ, xⱼ))

Finally, the output of head c for position i is a weighted sum of the value vectors:

headᶜᵢ = Σⱼ aᶜ_{i j} · vᶜⱼ

Combining Multiple Heads

Each head produces an output vector of size 1 × d_v. If there are A heads, their results are concatenated to form a vector of size 1 × (A·d_v). This combined vector is then projected back into the model dimension d using an additional matrix W_O:

aᵢ = ( head¹ᵢ ⊕ head²ᵢ ⊕ … ⊕ headᴬᵢ ) · W_O

Here, ⊕ denotes concatenation. The matrix W_O has dimensions (A·d_v) × d, ensuring the final multi-head output for each position returns to the expected model dimension.
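The same idea extends to several heads; here is a hedged NumPy sketch in which each head has its own (W_Q, W_K, W_V) triple and the concatenated outputs are projected by W_O (all matrices are random stand-ins for learned parameters):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(X, heads, W_O):
        """heads: list of (W_Q, W_K, W_V) triples, one per head. X: (N, d)."""
        N = X.shape[0]
        mask = np.triu(np.full((N, N), -np.inf), k=1)      # causal mask
        outputs = []
        for W_Q, W_K, W_V in heads:
            Q, K, V = X @ W_Q, X @ W_K, X @ W_V
            weights = softmax(Q @ K.T / np.sqrt(W_K.shape[1]) + mask)
            outputs.append(weights @ V)                    # head output, (N, d_v)
        concat = np.concatenate(outputs, axis=-1)          # (N, A * d_v)
        return concat @ W_O                                # project back to (N, d)

    rng = np.random.default_rng(1)
    N, d, d_k, d_v, A = 6, 16, 4, 4, 4
    X = rng.normal(size=(N, d))
    heads = [tuple(rng.normal(size=(d, dim)) for dim in (d_k, d_k, d_v))
             for _ in range(A)]
    W_O = rng.normal(size=(A * d_v, d))
    print(multi_head_attention(X, heads, W_O).shape)       # (6, 16)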

Diagram of multi-head attention

C. Transformer Block Overview

A transformer block is a modular unit used repeatedly in transformer models. Each block transforms the d-dimensional vector for one token in two main ways:

  1. by letting the token attend to other tokens (self-attention),
  2. and by applying a position-wise feedforward network (FFN) to each token.
Complementing these are residual connections and layer normalization, which stabilize training and preserve information.

1. Residual Stream

Imagine a vertical pipe carrying a vector for a single token upward through the transformer's layers. Each component reads the current vector in the pipe, computes an output, and adds that output back into the pipe. This "residual stream" view emphasizes that information accumulates rather than being overwritten.

Why residuals?

Residual (skip) connections help gradients flow backward during training and guarantee the input signal remains available at every stage. Practically, they prevent the network from forgetting the original token embedding as deeper transforms are applied.

Diagram of the residual stream
Figure: Simplified transformer block showing residual additions.

2. Feedforward Network (FFN)

The feedforward network is applied independently to each token. It does not mix information across positions — that is the job of attention. The FFN is identical across all positions but varies between layers (each layer has its own learned weights).

2.1 Mathematical form

The FFN for token x_i is typically written as:

FFN(x_i) = ReLU(x_i W_1 + b_1) W_2 + b_2

Here, W_1 maps the d-dimensional token vector to a wider hidden dimension d_ff, and W_2 maps back to the original dimension d. Commonly, d_ff > d (for example, 2048 vs 512). The intermediate ReLU (or GELU) gives the network nonlinearity.

2.2 Intuition

The FFN expands the representation into a higher-dimensional space where complex features can be computed, then compresses it back into the model dimension. Think of it as a per-token "processor" that performs richer transformations on that token's information after it has gathered context from attention.
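Here is a minimal sketch of the position-wise FFN, using the example dimensions from above (d = 512, d_ff = 2048) and small random stand-ins for the learned weights:

    import numpy as np

    def ffn(X, W_1, b_1, W_2, b_2):
        """Position-wise FFN: the same weights are applied to every row (token)."""
        hidden = np.maximum(0.0, X @ W_1 + b_1)   # ReLU(x W_1 + b_1), shape (N, d_ff)
        return hidden @ W_2 + b_2                 # compress back to (N, d)

    rng = np.random.default_rng(2)
    N, d, d_ff = 4, 512, 2048
    X = rng.normal(size=(N, d))
    W_1, b_1 = rng.normal(size=(d, d_ff)) * 0.02, np.zeros(d_ff)
    W_2, b_2 = rng.normal(size=(d_ff, d)) * 0.02, np.zeros(d)
    print(ffn(X, W_1, b_1, W_2, b_2).shape)       # (4, 512): same shape as X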

3. Layer Normalization (LayerNorm)

LayerNorm standardizes the components of a single token vector so they have zero mean and unit variance, with learnable scale and shift. It is applied twice in each transformer block: once before attention and once before the feedforward network.

Step-by-step computation

Consider an example token vector x = [1, 2, 3, 4] (so d = 4):

  1. Mean: μ = (1+2+3+4)/4 = 2.5.
  2. Variance: ((1−2.5)² + (2−2.5)² + (3−2.5)² + (4−2.5)²)/4 = 1.25.
  3. Standard Deviation (Std): σ = √1.25 ≈ 1.1180.
  4. Normalized vector: x̂ ≈ [−1.3416, −0.4472, 0.4472, 1.3416].
  5. Finally, add two learnable parameters:
    • γ (scaling factor, or “gain”)
    • β (shifting factor, or “offset”)
    The final formula is:
    LayerNorm(x) = γ · x̂ + β
    This keeps the vector centered around 0 with unit variance, but also allows the network to adjust the scale and shift using γ and β.
Note: LayerNorm operates across the features of a single token (the d components), not across tokens in a batch. This differs from BatchNorm, which normalizes across a batch dimension.
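The worked example above is easy to reproduce in code; this sketch uses γ = 1 and β = 0, their typical values at initialization:

    import numpy as np

    def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        """Normalize the d features of one token vector, then scale and shift."""
        mu = x.mean()                    # mean over features, not over tokens
        sigma = x.std()                  # population standard deviation
        return gamma * (x - mu) / (sigma + eps) + beta

    x = np.array([1.0, 2.0, 3.0, 4.0])
    print(layer_norm(x).round(4))        # -> [-1.3416 -0.4472  0.4472  1.3416]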

4. Full transformer block — step-by-step equations

Below is the canonical sequence of operations for a single token x_i inside one transformer block:

t1_i = LayerNorm(x_i)
t2_i = MultiHeadAttention(t1_i, [t1_1, ..., t1_N])
t3_i = t2_i + x_i
t4_i = LayerNorm(t3_i)
t5_i = FFN(t4_i)
h_i  = t5_i + t3_i

  • MultiHeadAttention(·) is the component that mixes information across tokens (every token's t1 vector is available to the attention mechanism).
  • The two additions (+ x_i and + t3_i) are residual connections that preserve earlier representations.
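Putting the equations together, here is a rough sketch of one block's forward pass over the whole sequence at once; the attention, FFN, and LayerNorm sublayers are passed in as callables, with toy stand-ins below rather than learned parameters:

    import numpy as np

    def transformer_block(X, attn, ffn, layer_norm):
        """One pre-norm block. X: (N, d); attn, ffn, layer_norm are callables."""
        T1 = np.stack([layer_norm(x) for x in X])    # t1_i = LayerNorm(x_i)
        T3 = attn(T1) + X                            # t3_i = attention out + residual
        T4 = np.stack([layer_norm(x) for x in T3])   # t4_i = LayerNorm(t3_i)
        return ffn(T4) + T3                          # h_i = FFN(t4_i) + residual

    # Toy usage with stand-in sublayers (a real model uses the learned versions):
    rng = np.random.default_rng(3)
    N, d = 4, 8
    X = rng.normal(size=(N, d))
    W_mix, W_ff = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
    H = transformer_block(
        X,
        attn=lambda T: T @ W_mix,                    # placeholder "attention"
        ffn=lambda T: np.maximum(0.0, T @ W_ff),     # placeholder FFN
        layer_norm=lambda x: (x - x.mean()) / (x.std() + 1e-5),
    )
    print(H.shape)   # (4, 8): same shape as X, so blocks stack cleanly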

5. Where does cross-token information enter?

Only the attention mechanism reads other token streams. Attention pulls information from other residual streams (other token positions) and writes its result back into the current token's stream via the residual addition. FFN and LayerNorm act only on the local token vector.

Elhage et al. (residual movement)

In their analysis, attention heads can be viewed as literally moving pieces of information from one token's residual stream to another token's stream. This means the final vector at a position can contain subspaces encoding other tokens' content.

Diagram of cross-token information movement
Figure: An attention head can move information from token A’s residual stream into token B’s residual stream.

6. Stacking blocks — building deep models

Transformer models are created by stacking many identical blocks. Because each block's input and output dimensions match (both are d), stacking is straightforward. Typical layer counts:

Model                  Typical layers
T5 / GPT-3 small       ~12
GPT-3 large            ~96
Modern LLMs (varies)   100s or more

At shallow layers, the residual stream mostly represents the current token. At deeper layers, the stream often encodes information useful for predicting the next token — this is a result of training objectives such as next-token prediction.

7. Practical notes and variations

Implementations vary in detail: the original transformer applied LayerNorm after each sublayer (post-norm), while most modern models apply it before each sublayer (pre-norm, as in the equations above), and GELU is a common substitute for ReLU in the FFN.

8. Summary

Each transformer block:

  1. normalizes its input and mixes information across tokens with multi-head self-attention, adding the result back through a residual connection,
  2. then normalizes again and transforms each token independently with the feedforward network, through a second residual connection.

Stack many such blocks and you have the deep transformer architectures used for tasks in NLP, vision, and beyond.

9. Parallelizing Transformer Computations with Matrices

So far, we have described the transformer block as if it were computing the output for one token at a time. For example, in self-attention, we showed how a single token vector \(x_i\) produces a query, key, and value, and how these interact with other tokens. But in reality, transformers do not process tokens one by one. Instead, they make use of parallel computation, which allows them to handle all tokens at once. This parallelism is one of the main reasons transformers are so efficient and scalable.

1. Representing the Input as a Matrix

Imagine we have an input sequence with \(N\) tokens. Each token is represented by an embedding vector of dimension \(d\). Instead of treating each token separately, we can stack all token vectors into a single matrix:

\[ X \in \mathbb{R}^{N \times d} \]

Here:

  • N is the number of tokens in the sequence,
  • d is the embedding dimension,
  • each row of X is the embedding vector of one token.

For example, if we have a sentence with 4 tokens and embeddings of size 3, our matrix looks like:

\[ X = \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} \\ x_{2,1} & x_{2,2} & x_{2,3} \\ x_{3,1} & x_{3,2} & x_{3,3} \\ x_{4,1} & x_{4,2} & x_{4,3} \end{bmatrix} \]

This matrix representation lets us apply powerful matrix multiplication routines to all tokens at once, instead of looping through them individually.

2. Parallelizing Attention (Single Head)

Recall that for each token we compute three vectors: Query (Q), Key (K), and Value (V).

For a single token, the computation is:

\[ q_i = x_i W_Q, \quad k_i = x_i W_K, \quad v_i = x_i W_V \]

Instead of computing these for each token separately, we multiply the entire matrix \(X\) by the projection matrices:

\[ Q = X W_Q, \quad K = X W_K, \quad V = X W_V \]

Now:

  • \(Q = X W_Q\) holds all queries,
  • \(K = X W_K\) holds all keys,
  • \(V = X W_V\) holds all values.

Each row of \(Q, K, V\) corresponds to the query, key, or value of one token.

3. Computing All Attention Scores

To know how much each token should attend to others, we compare queries and keys. For one token, this is just a dot product:

\[ \text{score}(i, j) = \frac{q_i \cdot k_j}{\sqrt{d_k}} \]

But in parallel, we can compute all scores at once using:

\[ QK^\top \in \mathbb{R}^{N \times N} \]

This gives us a full attention score matrix, where entry \((i, j)\) is the similarity between token \(i\)'s query and token \(j\)'s key.

Diagram of scores
Figure: The N × N QKᵀ matrix showing how it computes all qᵢ · kⱼ comparisons in a single matrix multiplication.

4. Masking Future Tokens

In language modeling, a token must not look at tokens that come after it. Otherwise, predicting the next word would be trivial. To prevent this, we use a mask.

The mask \(M\) is defined as:

\[ M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases} \]

The masked score matrix becomes:

\[ \text{MaskedScores} = \frac{QK^\top}{\sqrt{d_k}} + M \]
Diagram of scores
Figure: The N × N QKᵀ matrix showing the qᵢ · kⱼ values, with the upper-triangle portion of the comparisons matrix masked out (set to −∞, which the softmax will turn to zero).

After applying the softmax function, the forbidden positions become 0.
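To see this concretely, the sketch below builds the mask M for N = 4 and applies a row-wise softmax to uniform stand-in scores; every position above the diagonal ends up with exactly zero weight:

    import numpy as np

    N = 4
    M = np.triu(np.full((N, N), -np.inf), k=1)   # 0 on/below diagonal, -inf above
    scores = np.ones((N, N))                     # stand-in for QK^T / sqrt(d_k)
    masked = scores + M
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    print(weights.round(2))
    # Row i is spread over positions j <= i and exactly 0 for j > i:
    # [[1.   0.   0.   0.  ]
    #  [0.5  0.5  0.   0.  ]
    #  [0.33 0.33 0.33 0.  ]
    #  [0.25 0.25 0.25 0.25]]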

5. Producing the Weighted Sum

Once we have normalized attention weights, we compute the output for each token by taking a weighted sum of the value vectors:

\[ \text{head} = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V \]

This gives us an output of shape \([N \times d_v]\), where each row is the updated representation of a token.

6. Multi-Head Attention

A single head may not be sufficient to capture all relationships. That’s why transformers use multi-head attention. Each head has its own projection matrices \(W_{Q_i}, W_{K_i}, W_{V_i}\), and computes its own head:

\[ \text{head}_i = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}} + M\right)V_i \]

Then, we concatenate all heads:

\[ \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_A) \in \mathbb{R}^{N \times (A d_v)} \]

Finally, we project back to the original model dimension:

\[ \text{MultiHeadAttention}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_A) W_O \]
Diagram of multi-head attention computed in parallel
Figure: Schematic of the attention computation for a single attention head in parallel. The first row shows the computation of the Q, K, and V matrices. The second row shows the computation of QKᵀ, the masking (the softmax computation and the normalization by dimensionality are not shown), and then the weighted sum of the value vectors to get the final attention vectors.

7. Adding Feedforward and Residual Connections

After attention, the transformer block applies:

  1. Residual connection: add the original input back to the attention output.
  2. Layer normalization: normalize the input of each sublayer (as LayerNorm(X) and LayerNorm(O) below).
  3. Feedforward network: process each token independently.

In matrix form:

\[ O = X + \text{MultiHeadAttention}(\text{LayerNorm}(X)) \] \[ H = O + \text{FFN}(\text{LayerNorm}(O)) \]

The key point: both the input \(X\) and the output \(H\) have shape \([N \times d]\). This makes it possible to stack many transformer blocks one after another.

8. Why Parallelization Matters

By parallelizing computations:

  • Efficiency: We avoid loops over tokens and instead rely on matrix multiplication, which GPUs and TPUs can perform extremely fast.
  • Scalability: Transformers can handle thousands of tokens in a single pass.
  • Quadratic cost: Computing \(QK^\top\) requires \(O(N^2)\) operations. This becomes expensive for very long inputs, which is why researchers design special architectures for long-context transformers.

In short: transformers achieve their power not just from the attention mechanism, but also from the clever use of parallel computation via matrix operations.

Diagram of the full transformer architecture

D. The Language Modeling Head in Transformers

1. Introduction

Transformers by themselves are general-purpose sequence models: they take a sequence of token embeddings, pass them through stacked layers of self-attention and feedforward networks, and produce contextualized hidden representations for each token. These hidden states capture rich information about both the input token and its surrounding context.

However, the Transformer backbone does not inherently know how to predict words. To use Transformers for language modeling—that is, assigning probabilities to sequences of words—we need an additional component on top of the backbone: the language modeling head.

This head converts the final hidden representation of each token into a probability distribution over the vocabulary, enabling the model to predict the next token, fill in masked tokens, or generate text.

2. Language Models as Word Predictors

Language models are fundamentally word predictors. Given a sequence of words, they estimate the probability of the next word. For example, given the context:

P(fish | Thanks for all the)

A language model computes a conditional probability distribution over all possible next words. Earlier approaches such as n-gram models used counts of word occurrences within a fixed context of size n − 1. With transformers, however, the context is defined by the attention window, which can be very large (e.g., 32K tokens, or even millions with long-context architectures).

3. Architecture of the Language Modeling Head

The goal of the language modeling head is to take the hidden state output from the final transformer layer and transform it into a probability distribution over the vocabulary.

Diagram of language modeling head
Figure: The language modeling head: the circuit at the top of a transformer that maps from the output embedding for token N from the last transformer layer (h^L_N) to a probability distribution over words in the vocabulary V.

3.1 Input Representation

Let h^L_N ∈ ℝ^d denote the hidden state vector for the last token at position N from the final transformer block L. This vector encodes the context up to position N.

3.2 Linear Projection

The first step is to map this hidden state into a logit vector of vocabulary size |V|. This is done with a linear transformation:

u = h^L_N Wᵀ + b

Here:

  • W ∈ ℝ^(|V| × d) is the projection weight matrix,
  • b ∈ ℝ^(|V|) is the bias vector,
  • u ∈ ℝ^(|V|) is the logit vector, containing one score per vocabulary word.

3.3 Weight Tying and the Unembedding Layer

Instead of learning a new projection matrix, modern models often use weight tying. The same embedding matrix E that maps input tokens to embeddings is reused (via transposition) to map hidden states back to vocabulary space. That is:

  • E ∈ ℝ^(|V| × d) is the embedding matrix,
  • Eᵀ ∈ ℝ^(d × |V|) serves as the unembedding layer.

Thus the logits are computed as:

u = h^L_N Eᵀ

This design reduces parameters and ensures that embeddings are optimized both for input and output mappings.

3.4 Softmax to Probabilities

The logits are converted into probabilities over the vocabulary using the softmax function:

y = softmax(u)

where y is a probability distribution of size |V|.
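Here is a minimal sketch of the head with tied weights: the final hidden state is multiplied by Eᵀ and the resulting logits are pushed through a softmax (E and h are random stand-ins for learned values):

    import numpy as np

    rng = np.random.default_rng(4)
    V, d = 10_000, 512                    # vocabulary size and model dimension
    E = rng.normal(size=(V, d)) * 0.02    # tied embedding matrix (stand-in)
    h = rng.normal(size=d)                # h^L_N: final hidden state of last token

    u = h @ E.T                           # logits: one score per vocabulary word
    y = np.exp(u - u.max())
    y = y / y.sum()                       # softmax -> probability distribution
    print(y.shape, round(float(y.sum()), 6))   # (10000,) 1.0
    print(int(y.argmax()))                # greedy decoding: top-probability token id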

4. Applications of the LM Head

4.1 Sequence Probability

The probability distribution can be used to compute the likelihood of entire sequences, which is useful in tasks like perplexity evaluation.

4.2 Text Generation

For generation, we sample tokens from y. Options include:

  • Greedy decoding: choose the highest probability word.
  • Sampling: draw tokens stochastically, possibly with techniques like temperature scaling, top-k, or nucleus sampling.

The chosen token is then fed back as input to predict subsequent tokens.

5. The Logit Lens

The logit lens (Nostalgebraist, 2020) is a diagnostic tool for interpretability. By applying the unembedding matrix Eᵀ to hidden states at intermediate layers (not just the final layer), we can observe approximate distributions over the vocabulary. Although these internal states were not explicitly trained to represent predictions, the logit lens offers valuable insights into how information evolves across layers.
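As a sketch of the idea, suppose we have saved the last token's hidden state at every layer. Applying the same unembedding matrix at each layer shows which token the model would predict at that depth (the arrays below are random stand-ins, not real model states):

    import numpy as np

    def logit_lens(hidden_states, E):
        """Project each layer's hidden state through the unembedding E^T.

        hidden_states: (L, d), one vector per layer for one fixed position.
        Returns the id of the most likely token at each layer.
        """
        logits = hidden_states @ E.T      # (L, |V|): one score row per layer
        return logits.argmax(axis=-1)

    rng = np.random.default_rng(5)
    L, d, V = 12, 64, 1000                # layers, model dim, vocab (stand-ins)
    hidden_states = rng.normal(size=(L, d))
    E = rng.normal(size=(V, d))
    print(logit_lens(hidden_states, E))   # one "current guess" token per layer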

6. Decoder-Only Transformers

Causal language models like GPT are often called decoder-only transformers. This terminology reflects that they use only the decoder half of the original encoder–decoder transformer architecture. In this setup:

  • Masked self-attention ensures tokens cannot attend to future positions.
  • The LM head produces the next-token distribution at each step.
Diagram of decoder only
Figure: A transformer language model (decoder-only), stacking transformer blocks and mapping from an input token wᵢ to a predicted next token wᵢ₊₁.

7. Summary

The language modeling head is the component that transforms abstract hidden representations into concrete predictions over words. Its steps are:

  1. Take the hidden state h^L_N from the final transformer layer.
  2. Project it into vocabulary space using a linear transformation (often tied to embeddings).
  3. Apply softmax to obtain probabilities y.
  4. Use these probabilities for evaluation or generation.

Together with the transformer backbone, the language modeling head enables powerful generative models that define the state of the art in natural language processing.