
CS50AI - Sentence parser

Check out the code for this project here

Building a Sentence Parser in Python

For this project I built a Python script that automatically parses sentences and extracts their core components. I made it for Harvard's course CS50's Introduction to Artificial Intelligence with Python, and it was a good introduction to Natural Language Processing (NLP).


Context-Free Grammar (CFG)

For the computer to extract the meaning of a sentence, we first need to teach it grammatical structure, which we can do with a Context-Free Grammar (CFG).

A CFG is a set of hardcoded rules that define how the sentences of a language are generated. The symbols used in these rules fall into two categories:

  • Terminals: The actual vocabulary words.

  • Non-terminals: The grammatical structures (e.g., Noun N, Verb V, Adjective Phrase AP).

To make sure the AI understands how sentences are structured, we define the hierarchical rewrite rules of English in the NONTERMINALS string. For example, the script defines a Noun Phrase (NP) as a standalone Noun, a Determiner plus a Noun, or a Determiner followed by an Adjective Phrase and a Noun (NP -> Det AP N). These rules are what allow the program to dissect the text.
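To make the rule format concrete, here is a minimal sketch of how such a grammar might be written and loaded with nltk. The vocabulary and most of the rules are made up for illustration (only the NP rules follow the description above), and the TERMINALS string is an assumed counterpart to NONTERMINALS, so treat this as a toy grammar rather than the one in my script.

```python
import nltk

# Hypothetical toy vocabulary, for illustration only.
TERMINALS = """
Det -> "the" | "a"
Adj -> "red" | "little"
N -> "holmes" | "armchair" | "pipe"
V -> "sat" | "smiled" | "lit"
P -> "in"
"""

# Rewrite rules; the NP line mirrors the three forms described in the text.
NONTERMINALS = """
S -> NP VP
NP -> N | Det N | Det AP N
AP -> Adj | Adj AP
VP -> V | V NP | V PP
PP -> P NP
"""

# The left-hand side of the first rule (S) becomes the start symbol.
grammar = nltk.CFG.fromstring(NONTERMINALS + TERMINALS)

# List every way the grammar can expand a Noun Phrase.
np_rules = grammar.productions(lhs=nltk.Nonterminal("NP"))
for rule in np_rules:
    print(rule)
```

Each alternative separated by `|` becomes its own production, so the single NP line expands into three separate rules.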

NLTK and Preprocessing

My script uses Python's Natural Language Toolkit (nltk). First, a preprocess function cleans the data. It tokenises the sentence using nltk.word_tokenize, then converts every token to lowercase and removes the punctuation.
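A preprocessing step along these lines could look like the sketch below. The filtering rule (keep only tokens that contain at least one letter) and the fallback tokenizer are my assumptions for this example, not necessarily what the original script does.

```python
import nltk
from nltk.tokenize import wordpunct_tokenize


def preprocess(sentence):
    """Return lowercase word tokens, dropping tokens that contain no letters."""
    try:
        # word_tokenize relies on NLTK's Punkt data having been downloaded.
        tokens = nltk.word_tokenize(sentence)
    except LookupError:
        # Assumption for this sketch: fall back to a regex-based tokenizer
        # that needs no extra data files.
        tokens = wordpunct_tokenize(sentence)
    # Lowercase everything and discard pure punctuation or number tokens.
    return [t.lower() for t in tokens if any(c.isalpha() for c in t)]


print(preprocess("Holmes sat in the armchair."))
```

Lowercasing matters because the terminals in the grammar are matched as exact strings, so "Holmes" and "holmes" would otherwise count as different words.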

The nltk parser then takes the CFG rules we defined previously and works backward from the words: it tries to piece the individual word tokens together into larger phrases. If the sentence is grammatically valid according to the CFG rules, the parser generates a syntactic tree.
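As a rough illustration of the parsing step, the sketch below feeds a token list to nltk.ChartParser. I am assuming a chart parser here, and the tiny grammar and the parse helper are hypothetical and only cover the example sentence.

```python
import nltk

# Hypothetical miniature grammar, just large enough for the example sentence.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> N | Det N
VP -> V | V PP
PP -> P NP
Det -> "the"
N -> "holmes" | "armchair"
V -> "sat"
P -> "in"
""")
parser = nltk.ChartParser(grammar)


def parse(tokens):
    """Return every syntactic tree the grammar allows for the tokens."""
    try:
        return list(parser.parse(tokens))
    except ValueError:
        # Raised when a token does not appear anywhere in the grammar.
        return []


trees = parse(["holmes", "sat", "in", "the", "armchair"])
for tree in trees:
    tree.pretty_print()
```

A sentence that uses known words in an order the rules do not allow simply yields an empty list, while an ambiguous sentence can yield more than one tree.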

Noun Phrase Chunking

Once we have the syntactic tree, the program performs Noun Phrase Chunking, which identifies the noun phrases in the sentence. Noun phrases are a good indicator of what a sentence is about, so an AI reading a text needs to find them to work out its subject.

To do this, the np_chunk function converts the standard tree into an nltk.tree.ParentedTree, which lets the script traverse the tree and look up the parent of any node. The script finds every plain noun (N) node, takes its parent node, and returns the complete Noun Phrase chunk.
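The traversal described above can be sketched roughly as follows. The labels N and NP, the hand-written example tree, and the check that the parent really is an NP are assumptions made for this illustration, so the details may differ from my actual np_chunk.

```python
from nltk.tree import ParentedTree, Tree


def np_chunk(tree):
    """Return the NP subtrees that are the direct parent of a noun (N)."""
    # ParentedTree nodes know their parent, which a plain Tree does not.
    parented = ParentedTree.convert(tree)
    chunks = []
    for noun in parented.subtrees(lambda t: t.label() == "N"):
        parent = noun.parent()
        if parent is not None and parent.label() == "NP":
            # Avoid adding the same NP twice if it contains several nouns.
            if not any(parent is chunk for chunk in chunks):
                chunks.append(parent)
    return chunks


# Hand-written tree standing in for the parser's output.
tree = Tree.fromstring(
    "(S (NP (N holmes)) (VP (V sat) (PP (P in) (NP (Det the) (N armchair)))))"
)
for chunk in np_chunk(tree):
    print(" ".join(chunk.leaves()))
```

Starting from the nouns and walking up one level means each chunk is the smallest NP around a noun, so larger phrases that merely contain another NP are not returned.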