
Large-scale data mining of chemical-genetic data sets

Primary objective: to perform data mining on a real-world data set from a biology lab in the School of Biological Sciences, with the aim of extracting patterns that lead to hypotheses about the mode of action of compounds and the function of genes.

Functional genomic screens, especially in budding yeast, generate huge amounts of data. The Tyers Lab generates chemical-genetic data to test the effect of small molecules on the growth of different yeast deletion mutants. Combined with data published by other labs, this data set is large enough for advanced data mining.

Possibilities for this data analysis range from different clustering algorithms to pattern matching and association analysis. It is also possible to include structural similarity calculations between compounds. This should lead to the definition of a set of chemical-genetic signatures that are associated with specific effects on eukaryotic cells (such as novel detoxification pathways). The biologists in the lab are also looking for new hypotheses about the mode of action of compounds and about the function of yeast genes that are as yet uncharacterized (up to 1000 of the 6000 yeast genes are still uncharacterized).
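As a rough illustration of two of the measures mentioned above, the following sketch compares compounds by the correlation of their chemical-genetic profiles and by the overlap of their structural fingerprints. Everything in it is an assumption made for illustration: the compound names, the growth scores, the fingerprint bits, and the choice of Pearson correlation and the Tanimoto coefficient are not taken from the lab's data or methods.

```python
from math import sqrt

# Hypothetical toy data: growth scores of three compounds across four
# yeast deletion mutants (more negative = mutant grows worse with compound).
profiles = {
    "compound_A": [-2.0, -0.1, 0.0, -1.5],
    "compound_B": [-1.8, 0.0, 0.1, -1.4],
    "compound_C": [0.1, -2.2, -1.9, 0.0],
}

# Hypothetical structural fingerprints, stored as sets of "on" bit positions.
fingerprints = {
    "compound_A": {1, 4, 7, 9},
    "compound_B": {1, 4, 7, 12},
    "compound_C": {2, 3, 15},
}


def pearson(x, y):
    """Pearson correlation between two equally long score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant profile has no defined correlation; report 0.0 instead.
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def compare(name_1, name_2):
    """Return (profile correlation, structural similarity) for two compounds."""
    return (
        pearson(profiles[name_1], profiles[name_2]),
        tanimoto(fingerprints[name_1], fingerprints[name_2]),
    )


if __name__ == "__main__":
    for pair in [("compound_A", "compound_B"), ("compound_A", "compound_C")]:
        profile_sim, structure_sim = compare(*pair)
        print(pair, round(profile_sim, 3), round(structure_sim, 3))
```

Pairwise similarities of this kind could serve as input to a clustering step. Which distance measure and which clustering algorithm suit the real data is one of the questions the project is meant to answer.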

This project expects from you:
- To identify the data mining procedure and algorithms suitable to extract patterns from these data.
- To identify solutions to handle the large amount of data (for example, distributed computing paradigms such as MapReduce).
- To develop the data mining workflow using existing or new implementations.
- To deliver this workflow in a form that the biologists can interact with and use as a tool after the project.
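To make the MapReduce idea in the list above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases applied to chemical-genetic records. The record layout, the names, the scores, and the sensitivity cutoff are all hypothetical. A real deployment would run these phases on a distributed framework rather than in one Python process.

```python
from collections import defaultdict

# Hypothetical records: (compound, deletion mutant, growth score).
records = [
    ("compound_A", "gene1", -2.0),
    ("compound_A", "gene2", -0.1),
    ("compound_B", "gene1", -1.8),
    ("compound_C", "gene2", -2.2),
    ("compound_C", "gene1", 0.1),
]

# Assumed cutoff: scores at or below this count as "mutant is sensitive".
SENSITIVITY_CUTOFF = -1.0


def map_record(record):
    """Map phase: emit (gene, compound) for every sensitive mutant."""
    compound, gene, score = record
    if score <= SENSITIVITY_CUTOFF:
        yield gene, compound


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_gene(gene, compounds):
    """Reduce phase: collapse one gene's compounds into a sorted signature."""
    return gene, sorted(set(compounds))


def run_job(all_records):
    """Run map, shuffle, and reduce in sequence over all records."""
    # Each map_record call is independent, which is what lets a real
    # framework spread the map phase across many machines.
    mapped = (pair for record in all_records for pair in map_record(record))
    grouped = shuffle(mapped)
    return dict(reduce_gene(gene, compounds) for gene, compounds in grouped.items())


signatures = run_job(records)

if __name__ == "__main__":
    for gene, compounds in sorted(signatures.items()):
        print(gene, compounds)
```

Because the map function looks at one record at a time and the reduce function at one gene at a time, the same logic can be split across workers without changing the result.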

Project status: 
Degree level: 
Data mining / machine learning / data exploration essential. Distributed computing a major advantage. Experience with biology/bioinformatics desirable, but not essential as you can lean on the biologists' expertise.
Supervisors @ NeSC: 
Other supervisors: 
Jan Wildenhain, Tyers Lab, School of Biological Sciences
Michaela Spitzer, Tyers Lab, School of Biological Sciences
Subject areas: 
Distributed Systems
Machine Learning/Neural Networks/Connectionist Computing
WWW Tools and Programming
Student project type: