Wednesday, December 27, 2023

Terms in LLM Architecture Part 2 - Multi-Head Attention and the Masked Multi-Head Attention Mechanism

Multi-Head Attention

Multi-head attention is key to the Transformer architecture. The Transformer is a deep learning model introduced in the paper "Attention Is All You Need". Multi-head attention enables the model to focus on different parts of the input sequence simultaneously, capturing diverse types of information, and it is used in NLP tasks such as machine translation and language modelling.

The multi-head attention computation involves linear projections, scaled dot-products, and softmax operations, as shown below.
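Concretely, for queries Q, keys K, and values V, the paper defines scaled dot-product attention and its multi-head extension as follows (d_k is the key dimension; W^O and the per-head W_i^Q, W_i^K, W_i^V are learned projection matrices):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})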

The key concepts associated with multi-head attention are:

Attention Mechanism: 

This allows the model to selectively focus on different parts of the input sequence when making predictions. Instead of treating all words equally, the attention mechanism assigns varying levels of importance to different words.

Single-Head Attention:

In this setting, the model learns a single set of attention weights for each position in the input sequence, so each position produces just one attention pattern over the sequence.
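A minimal NumPy sketch of single-head scaled dot-product attention (the function names and shapes here are illustrative, not from any particular library):

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # one attention distribution per query position
    return weights @ V                   # weighted sum of the values

Each row of weights sums to 1; these are the varying levels of importance assigned to the other positions.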

Multi-Head Attention:

This extends the idea of attention by using multiple heads (sets of attention weights) in parallel. Each head operates independently on the input sequence and produces its own output. The outputs from the different heads are then concatenated and passed through a final linear projection to produce a single result.
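A sketch of how the heads could be run and recombined, reusing single_head_attention from above (assuming an illustrative d_model that divides evenly across the heads):

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Slice the projections into num_heads independent heads.
    outputs = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        outputs.append(single_head_attention(Q[:, s], K[:, s], V[:, s]))
    # Concatenate the head outputs and apply the final linear projection.
    return np.concatenate(outputs, axis=-1) @ W_o

In practice frameworks compute all heads with a single batched reshape rather than a Python loop, but the computation is the same.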

Parameter Sharing:

In multi-head attention, the model learns a separate set of parameters (weights) for each attention head. However, the computational cost can be controlled through parameter sharing, where the same set of weights is applied to the input sequence across all attention heads, as sketched below.
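One possible way to realize this sharing, sketched purely for illustration (the standard Transformer learns separate per-head projections, so treat this as an assumed variant rather than the usual formulation):

def shared_projection_attention(X, W_shared, W_o, num_heads):
    # Every head reuses the single projection W_shared: (d_model, d_model)
    # instead of learning its own W_q, W_k, W_v per head.
    Q = K = V = X @ W_shared
    d_head = X.shape[-1] // num_heads
    outputs = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        outputs.append(single_head_attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(outputs, axis=-1) @ W_o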

Masked Multi-Head Attention

This is a variant of multi-head attention used in the Transformer architecture. Masking is introduced to prevent attending to future positions during training, particularly in autoregressive language modelling and sequence-to-sequence tasks. It ensures that the model attends only to positions that have already been generated.

In language modelling or sequence generation, where the model generates one token at a time in an autoregressive manner, it is crucial to prevent information leakage from future positions. Without masking, the model could inadvertently attend during training to tokens that it has not yet generated, leading to suboptimal results.
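A minimal sketch of the causal mask, reusing the illustrative softmax helper from above: scores above the diagonal (future positions) are set to -inf, so the softmax assigns them zero weight.

def masked_attention(Q, K, V):
    # Identical to single-head attention except for the causal mask.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    seq_len = scores.shape[0]
    # True above the diagonal, i.e. at positions that lie in the future.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)   # future positions now get weight 0
    return weights @ V

For a 4-token sequence, row i of weights has nonzero entries only in columns 0 through i, so position i can look at itself and earlier tokens but never ahead.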



