Transformer model architecture

[Figure: the Transformer model architecture]

Left half: Encoder
Right half: Decoder

Both encoder and decoder are composed of a stack of identical layers.

Encoder

Input Embedding

Embeddings are generated from a given sequence of text, numbers, or other signals.
e.g. “My dog eats fish”
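
A minimal PyTorch sketch of this embedding step; the vocabulary size, the token ids, and the use of `nn.Embedding` are illustrative assumptions, not details from the note itself.

```python
import torch
import torch.nn as nn

vocab_size = 10000   # assumed vocabulary size (illustrative)
d_model = 512        # embedding dimension used in the reference paper [1]

embedding = nn.Embedding(vocab_size, d_model)

# "My dog eats fish" after tokenisation -> hypothetical token ids
token_ids = torch.tensor([[21, 347, 905, 412]])   # shape: (batch=1, seq_len=4)
x = embedding(token_ids)                          # shape: (1, 4, 512)
print(x.shape)
```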

Positional Encoding

Updates the embedding depending on the position of a particular value in a given sequence.
e.g. (not a perfect example) “My dog eats fish” vs. “My favourite animal is dog”
In both sentences, the word “dog” carries a different meaning depending on its position.
Thus, a position factor must be added to the embeddings.
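
A sketch of the sinusoidal positional encoding used in [1], which is added element-wise to the token embeddings; the function name and shapes are illustrative.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)      # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                  # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# The position information is simply added to the token embeddings:
# x = embedding(token_ids) + sinusoidal_positional_encoding(seq_len=4, d_model=512)
```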

Multi-head Attention

Each word in the given sequence of text will have 3 vectors:

  1. Q (query)
  2. K (key)
  3. V (value)
    The attention layer takes these 3 vectors and generates an attention vector, which (roughly) determines the weight of that word within the overall sentence (see the sketch after this list).
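
A sketch of the Scaled Dot-Product Attention that operates on these Q, K, V vectors, i.e. softmax(QK^T / sqrt(d_k)) V as defined in [1]; the helper name and the optional mask argument are my own choices.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)           # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # block disallowed positions
    weights = F.softmax(scores, dim=-1)                         # attention weight of each word
    return weights @ v                                          # weighted sum of the value vectors
```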


The attention vector for each word can be calculated using Scaled Dot-Product Attention.
But instead of performing a single attention function with d_model-dimensional keys, values, and queries, Multi-Head Attention is performed to get the benefit of parallelisation: the attention is calculated in h parallel heads. The dimension for each attention head is kept (according to [1]) at d_k = d_v = d_model / h, so the overall computation performed is about the same as a single head with full dimensionality.
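
A minimal multi-head attention sketch along these lines, with d_k = d_v = d_model / h; the projection layout (one big linear per Q, K, V that is then split into heads) is one plausible implementation, not the only way to write it.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h       # d_k = d_v = d_model / h
        self.w_q = nn.Linear(d_model, d_model)   # projections for Q, K, V
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final output projection

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)
        # Project, then split d_model into h heads of size d_k each.
        def split(x, proj):
            return proj(x).view(batch, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        # Scaled dot-product attention, computed for all h heads in parallel.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        # Concatenate the heads back into a d_model-sized vector per position.
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.h * self.d_k)
        return self.w_o(out)
```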

Feed Forward

This is just a neural network: a fully connected feed-forward network, which is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.
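
A sketch of this position-wise feed-forward block; d_ff = 2048 is the inner dimension used in [1], everything else here is illustrative.

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # first linear transformation
            nn.ReLU(),                  # ReLU activation in between
            nn.Linear(d_ff, d_model),   # second linear transformation
        )

    def forward(self, x):
        # nn.Linear acts on the last dimension, so the network is applied
        # to each position (token) separately and identically.
        return self.net(x)
```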

Add and Normalisation

Adds a residual (skip) connection around the sub-layer and then normalises the values; normalisation can be done batch-wise or layer-wise, and the original Transformer uses layer normalisation.
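
A sketch of the Add & Norm step, assuming layer normalisation as in [1]; `sublayer_out` stands for the output of the preceding attention or feed-forward block.

```python
import torch.nn as nn

class AddAndNorm(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_out):
        # Residual (skip) connection followed by layer normalisation:
        # LayerNorm(x + Sublayer(x))
        return self.norm(x + sublayer_out)
```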


Decoder

Contains all the layers present in the Encoder (except Input Embedding), and these layers work in the same way. The layers unique to the decoder are:

Output Embedding

Embeds the outputs generated so far, i.e. the tokens predicted at the previous decoding steps, which are fed back into the decoder.
The output embeddings are offset (shifted) by one position, and masking is applied in the decoder's Multi-Head Attention layer. This ensures that the predictions for position i can depend only on the known outputs at positions less than i.
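
A sketch of the causal (look-ahead) mask that enforces this: position i may attend only to positions up to i. A mask like this can be passed to an attention implementation such as the sketch above, where masked positions get -inf before the softmax and therefore zero attention weight.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```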

Linear and Softmax

The Linear layer is again a feed-forward layer, which converts the attention vectors into a vector whose dimension matches the number of possible output words / values (the vocabulary). The Softmax layer then outputs the probabilities of the next possible word / value in the generated sequence.
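
A sketch of this final projection + softmax, reusing the illustrative vocab_size and d_model from earlier; the decoder output here is a random stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 10000            # illustrative sizes

to_vocab = nn.Linear(d_model, vocab_size)   # linear layer: d_model -> vocabulary size

decoder_out = torch.randn(1, 4, d_model)    # stand-in for the decoder stack output
logits = to_vocab(decoder_out)              # (1, 4, vocab_size)
probs = F.softmax(logits, dim=-1)           # probabilities over the vocabulary at each position
next_token = probs[:, -1].argmax(dim=-1)    # most likely next word / value in the sequence
```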

Further Reading

  1. MQA vs GQA: https://towardsdatascience.com/demystifying-gqa-grouped-query-attention-3fb97b678e4a
  2. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — https://arxiv.org/pdf/2305.13245.pdf
  3. MQA: Fast Transformer Decoding: One Write-Head is All You Need — https://arxiv.org/abs/1911.02150

Footnotes

  1. Reference: Attention is All You Need