
CS50AI - Natural language prediction analysis

Check out the code for this project here

Analyzing NLP Attention Mechanisms with BERT

For this project I built a Python program to predict missing words using Google's BERT, and then visually analysed how the model understands context. I made this project for Harvard's CS50's Introduction to Artificial Intelligence with Python, and it was a good way for me to deepen my understanding of natural language processing.


Masked Language Modeling

Masked language modelling means taking a sentence and hiding some of its words behind a [MASK] token. Take the sentence "I went to the [MASK] to deposit my money": we can tell the missing word is "bank" purely from the surrounding context words, and recovering that word is exactly the task the model is trained on.
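The masking step itself can be shown with a minimal sketch in plain Python. This toy `mask_word` helper is my own illustration, not part of the project code, and it works on whole words, whereas real BERT operates on sub-word token IDs:

```python
def mask_word(sentence: str, word: str) -> str:
    """Replace the first matching word with the [MASK] token."""
    return " ".join(
        "[MASK]" if token == word else token
        for token in sentence.split()
    )

print(mask_word("I went to the bank to deposit my money", "bank"))
# I went to the [MASK] to deposit my money
```

The model's job is then to fill that [MASK] slot back in using only the words around it.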

I used the Hugging Face transformers library to load a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. Because BERT is bidirectional, it can look at the entire sentence at once, unlike older language models that read text in only one direction.

I then have a mask.py script containing a get_mask_token_index function. This takes a user's input sentence and uses TensorFlow's tf.where function to find the exact tensor index of the [MASK] token. The script then passes the tokenised text through the neural network and uses tf.math.top_k to output the top 3 most likely replacement words from the predicted probability distribution.
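The two steps mask.py performs can be sketched without any framework at all. This is a plain-Python illustration, not the project's TensorFlow code (which uses tf.where and tf.math.top_k on tensors), and the token IDs and scores below are made-up values for demonstration; 103 is BERT's conventional ID for [MASK]:

```python
MASK_TOKEN_ID = 103  # BERT's conventional ID for the [MASK] token

def get_mask_token_index(mask_token_id, input_ids):
    """Return the position of the mask token, or None if absent."""
    for i, token_id in enumerate(input_ids):
        if token_id == mask_token_id:
            return i
    return None

def top_k(scores, k=3):
    """Return the indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

input_ids = [101, 1045, 2253, 2000, 1996, 103, 102]  # hypothetical token IDs
print(get_mask_token_index(MASK_TOKEN_ID, input_ids))  # 5

scores = [0.1, 0.05, 0.6, 0.2, 0.05]  # hypothetical vocabulary scores
print(top_k(scores))  # [2, 3, 0]
```

In the real script the scores come from the model's output logits over BERT's full vocabulary, and the top-3 indices are decoded back into words.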

What is "Attention"?

We then analyse our prediction. BERT is structured so that each layer has multiple "attention heads", and the job of each head is to calculate an attention weight between every single pair of words in the sentence.
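What a head computes for one word can be sketched as a softmax over similarity scores against every word in the sentence, so each row of attention weights sums to 1. The raw scores below are toy numbers I made up; real heads derive them from learned query/key projections:

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy similarity scores for "cat" against ["the", "cat", "sat"]
weights = softmax([2.0, 0.5, 1.0])
print([round(w, 2) for w in weights])  # one attention row, summing to 1
```

The full output of one head is a square matrix of such rows, one per word, and that matrix is what the visualisation below turns into pixels.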

I then use the visualize_attentions and get_color_for_attention_score functions to visualise this output. The script extracts the attention weights and translates them into 0-255 grayscale RGB tuples, then uses PIL to render 144 separate grayscale grid diagrams (12 layers × 12 heads) for any given sentence. The brighter a pixel, the stronger the attention weight between those two words.
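The score-to-colour mapping is the simplest part and is easy to sketch: an attention weight in [0, 1] scales linearly to a grayscale value, so 0 renders black and 1 renders white. This is a minimal stand-alone version of the role get_color_for_attention_score plays in the project:

```python
def get_color_for_attention_score(score: float) -> tuple:
    """Map an attention score in [0, 1] to a grayscale RGB tuple."""
    value = round(score * 255)
    return (value, value, value)

print(get_color_for_attention_score(0.0))   # (0, 0, 0)    black
print(get_color_for_attention_score(1.0))   # (255, 255, 255)  white
print(get_color_for_attention_score(0.25))  # (64, 64, 64)  dark gray
```

Each cell of a head's attention matrix gets painted with this colour, producing the grid diagrams described above.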

What we discovered

By looking at these diagrams we can see that the model has picked up rules of human grammar without ever being taught them:

  • Layer 7, Head 12 (Nouns and Determiners): From my analysis we can see that this attention head pays close attention to determiners. For example, in the sentences "The cat [MASK] on the mat" and "The dog [MASK] on the log", the model clearly showed the nouns ("cat", "dog") heavily attending to the word "the".

  • Layer 1, Head 5 (Pronouns and Verbs): From the analysis we can see that this head links pronouns to the verbs following them. For example, in the test sentences "He runs [MASK] the store" and "She runs [MASK] the park", the pronouns ("He", "She") cast a bright attention weight onto the verb "runs".

BERT here is not actually learning to understand English in any human sense. Still, it is interesting to see the rules of human grammar emerge in its attention patterns.