CS50AI - "Gene predictor"

Check out the code for this project here

Building a Bayesian Gene Predictor

In this project, instead of dealing with clear rules as I have in some of my other AI projects here, I am dealing with probabilities and uncertainty. The project is an AI that predicts the likelihood of genetic inheritance. I made it as part of CS50's Introduction to Artificial Intelligence with Python, a Harvard course. I had to calculate the probability of a person possessing a specific genetic mutation based on their family tree.

The Architecture

A person can inherit one copy of a mutated gene from their mother and/or one from their father. This means they can have 0, 1, or 2 copies of the mutated gene. In this problem we don't know each person's exact genetic makeup; we only know which physical traits their parents exhibit.

Here I modelled the family tree as a Bayesian network. The probabilities the network needs are stored in a dictionary, which contains:

  • Unconditional Probabilities: The default chance of anyone having the gene (if we don't know their parents).

  • Conditional Probabilities: The chance of exhibiting the physical trait given the number of mutated genes they possess.

  • Mutation Rate: A flat 1% chance (0.01) that a gene mutates or repairs itself during inheritance.
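As a rough sketch, the dictionary could look something like this. Only the 0.01 mutation rate comes from the description above; the name PROBS, the key names, and all the other numbers are illustrative assumptions, not necessarily the values used in the project.

```python
# Illustrative probability table. Only the 0.01 mutation rate is taken
# from the text; every other number here is an assumed example value.
PROBS = {
    # Unconditional probability of carrying 0, 1, or 2 mutated copies
    "gene": {2: 0.01, 1: 0.03, 0: 0.96},
    # Probability of showing the trait (True) or not (False),
    # given the number of mutated copies
    "trait": {
        2: {True: 0.65, False: 0.35},
        1: {True: 0.56, False: 0.44},
        0: {True: 0.01, False: 0.99},
    },
    # Chance that a gene flips state while being passed to a child
    "mutation": 0.01,
}
```

Each inner distribution sums to 1, which is a handy sanity check before using the table.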

Joint Probabilities and Mutations

To find the probabilities for the family, the program must evaluate all the possible genetic combinations for every person. For each possible case the program calculates the joint probability, which is the likelihood of that scenario happening across the entire family.

The joint probability of a specific family state is the product, over every individual, of the probability of that individual's state given their parents' states. We can represent this as:

$$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i))$$

I calculate the above with my joint_probability function. This function loops through every person and isolates their mum and dad. The hardest part of this function is dealing with the mutation rate. I wrote some logic to calculate mum_probability and dad_probability. If a parent has 0 copies of the mutated gene, the child still has a 1% chance of inheriting a mutated copy from that parent through a random mutation.
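Here is a minimal sketch of how that inheritance logic might look for a single child. The helper names pass_probability and child_gene_probability are hypothetical, and I am assuming each parent passes on one of their two copies at random; the real joint_probability function may be organised differently.

```python
MUTATION = 0.01  # mutation rate from the text above


def pass_probability(parent_copies, mutation=MUTATION):
    """Chance that a parent with 0, 1, or 2 mutated copies passes one on."""
    if parent_copies == 2:
        # Always passes a mutated copy, unless it repairs itself
        return 1 - mutation
    if parent_copies == 1:
        # 0.5 * (1 - mutation) + 0.5 * mutation simplifies to 0.5
        return 0.5
    # 0 copies: the child only gets a mutated copy through a mutation
    return mutation


def child_gene_probability(child_copies, mum_copies, dad_copies):
    """P(child has child_copies mutated genes | parents' copy counts)."""
    mum_probability = pass_probability(mum_copies)
    dad_probability = pass_probability(dad_copies)
    if child_copies == 2:
        # One mutated copy from each parent
        return mum_probability * dad_probability
    if child_copies == 1:
        # From mum and not dad, or from dad and not mum
        return (mum_probability * (1 - dad_probability)
                + (1 - mum_probability) * dad_probability)
    # Neither parent passes on a mutated copy
    return (1 - mum_probability) * (1 - dad_probability)
```

For example, child_gene_probability(1, 2, 0) covers two cases: mum passes her gene and dad's does not mutate (0.99 × 0.99), or mum's gene repairs itself and dad's mutates (0.01 × 0.01).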

Normalisation

As we don't know the actual genetics of the family members, the main function loops through a powerset of all possible combinations. It then immediately filters out any cases that contradict the CSV data.
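The enumeration could be sketched roughly as follows. I am assuming here that people is a dictionary mapping each name to a record whose "trait" value is True, False, or None when the CSV leaves it blank; the function names are hypothetical too.

```python
import itertools


def powerset(items):
    """Return every subset of items as a list of sets."""
    items = list(items)
    return [
        set(combo)
        for size in range(len(items) + 1)
        for combo in itertools.combinations(items, size)
    ]


def consistent_with_evidence(have_trait, people):
    """True if the candidate trait set agrees with every known trait."""
    return all(
        person["trait"] is None or person["trait"] == (name in have_trait)
        for name, person in people.items()
    )


def enumerate_cases(people):
    """Yield every (one_gene, two_genes, have_trait) case not ruled out."""
    names = set(people)
    for have_trait in powerset(names):
        # Skip cases that contradict the known trait data
        if not consistent_with_evidence(have_trait, people):
            continue
        for one_gene in powerset(names):
            # Nobody can be in both the one-copy and two-copy groups
            for two_genes in powerset(names - one_gene):
                yield one_gene, two_genes, have_trait
```

With n people there are 3^n gene assignments for each surviving trait assignment, so this brute-force approach only suits small families.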

I then have an update function that takes the joint probability of each case that wasn't filtered out and adds it to a running total for each person. Finally, a normalize function takes these raw probability scores, adds them together, and scales them by dividing each value by the total. This ensures the probabilities of each person's genes (0, 1, or 2) add up to exactly 1.0.
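A minimal sketch of these two steps, assuming the running totals live in a nested dictionary keyed by person, then by "gene" or "trait". That layout, the trait totals, and the function signatures are my assumptions for illustration.

```python
def update(probabilities, one_gene, two_genes, have_trait, p):
    """Add joint probability p to each person's matching running totals."""
    for person in probabilities:
        copies = 2 if person in two_genes else 1 if person in one_gene else 0
        probabilities[person]["gene"][copies] += p
        probabilities[person]["trait"][person in have_trait] += p


def normalize(probabilities):
    """Scale each distribution so that its values sum to 1."""
    for person in probabilities:
        for field in ("gene", "trait"):
            total = sum(probabilities[person][field].values())
            if total == 0:
                continue  # nothing accumulated, so avoid dividing by zero
            for value in probabilities[person][field]:
                probabilities[person][field][value] /= total
```

For instance, if Harry accumulates 0.2 for one copy and 0.6 for zero copies, normalising gives 0.25 and 0.75, preserving the ratio between them.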