Using AI and machine learning to improve polygenic risk prediction of disease

Primary supervisor

Enes Makalic

Research area

Machine Learning

We are interested in understanding genetic variation among individuals and how it relates to disease. To do this, we study genomic markers or variants called single nucleotide polymorphisms, or SNPs for short. A SNP is a single base position in DNA that varies among human individuals. The Human Genome Project has found that these single-letter changes occur all over the human genome; each person has about 5 million of them! While most SNPs have no effect, some can influence traits or increase the risk of certain diseases. Understanding SNPs is critical for personalized medicine and genetic research.

To understand the association between SNPs and a disease or trait, researchers conduct genome-wide association studies (GWAS). A GWAS measures millions of genomic markers across the genome, typically in a few hundred or a few thousand individuals. The aim of a GWAS is to identify SNPs that are associated with the risk of disease. SNPs associated with disease are used to construct polygenic risk scores (PRS): a weighted sum of the risks from tens to millions of independent disease-associated SNPs from across the genome. The conventional, or gold-standard, approach to analysing GWAS data is to fit a separate regression model for each SNP, perhaps adjusting for other covariates such as age and sex.
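The two ideas in the paragraph above, per-SNP regression and a PRS as a weighted sum, can be sketched in a few lines of Python. This is only an illustrative toy and not the project's method. It assumes a quantitative trait, simple linear regression with no covariates, and a tiny made-up dataset. The function names `marginal_effect` and `polygenic_risk_score` are hypothetical.

```python
def marginal_effect(allele_counts, trait):
    """Least-squares slope from regressing the trait on ONE SNP.

    allele_counts: risk-allele counts (0, 1 or 2), one per individual.
    trait: quantitative trait values, one per individual.
    No covariates (e.g. age, sex) are adjusted for in this sketch.
    """
    n = len(trait)
    mean_g = sum(allele_counts) / n
    mean_y = sum(trait) / n
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(allele_counts, trait))
    sxx = sum((g - mean_g) ** 2 for g in allele_counts)
    if sxx == 0:
        return 0.0  # SNP does not vary in this sample, so no slope is estimable
    return sxy / sxx


def polygenic_risk_score(allele_counts, weights):
    """Weighted sum of one individual's allele counts across SNPs."""
    if len(allele_counts) != len(weights):
        raise ValueError("need exactly one weight per SNP")
    return sum(w * g for g, w in zip(allele_counts, weights))


# Made-up training data: 4 individuals (rows) by 2 SNPs (columns).
genotypes = [
    [0, 2],
    [1, 1],
    [2, 0],
    [1, 1],
]
trait = [1.0, 2.0, 3.0, 2.0]

# Fit each SNP independently, one column at a time.
columns = list(zip(*genotypes))
weights = [marginal_effect(col, trait) for col in columns]

# Score a new, hypothetical individual using the estimated weights.
new_individual = [2, 0]
score = polygenic_risk_score(new_individual, weights)
print(weights, score)
```

In practice the weights would usually be log odds ratios from logistic regression on a binary disease outcome, and the SNPs would be filtered for independence before summing. Both steps are omitted here for brevity.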

This project will focus on developing and applying novel machine learning and AI methods to improve the construction of PRS and enhance disease prediction. Students will gain experience in:

  • Statistical genetics and GWAS methodology
  • Machine learning approaches for high-dimensional data
  • Algorithm development and evaluation
  • Applications of AI in biomedical research

This project is suitable for students with an interest in genetics, computational biology, or data science. A background in statistics, computer science, or related quantitative fields will be advantageous.

AIM 1: Improve predictive performance of PRS using machine learning and AI.

Because the predictive performance of PRS can be limited for some conditions, this project will investigate whether new machine learning and AI algorithms can improve on PRS developed using standard methods.

AIM 2: Improve cross-ancestry performance of European ancestry-derived PRS.

Since most PRS have been developed in studies of people of European ancestry, their predictive performance for people of non-European ancestry may be reduced, because the frequency and disease risk of each genetic variant may vary across ancestries. This project will borrow approaches from machine learning to adapt PRS for use in non-European ancestries.


Learn more about minimum entry requirements.