Accelerating data intensive applications using MapReduce

Hwee Yong Ong

Principal goal: through a real case study in the life sciences, this project aims to: 1) understand data-parallel processing with the MapReduce model as a way of addressing performance issues in data-intensive applications; 2) investigate how to adapt data mining algorithms to the MapReduce model; 3) prototype and compare performance with other frameworks that support data-intensive applications.

Performance is an open issue in data-intensive applications such as distributed data mining and integration. The MapReduce programming model [1], originally developed by Google, has become a popular paradigm because of its simplicity and its scalability at low cost. It parallelises data processing across large-scale data centres with thousands of computing nodes, handling data at terabyte and petabyte scales and thereby improving system performance. The model provides a simple interface of two functions with which developers parallelise data processing tasks: the map function processes input records and emits intermediate key/value pairs, which the framework groups by key, and the reduce function aggregates the values for each key into a smaller data set.

This project will apply the MapReduce model to a real data mining use case in the life sciences, EURExpress-II [2, 3], which aims to automatically annotate anatomical components in an image with the corresponding terms stored in an ontology database. Performance will be evaluated by comparing the prototype built in this project with the prototypes of the ADMIRE project [4], which is conducting research into architectures for large-scale and long-running data-intensive computations.

Through this project, the student will learn at three levels: 1) At the conceptual level, understanding the concepts behind data-parallel frameworks that support large-scale data mining and integration applications. 2) From an algorithmic perspective, investigating how data mining algorithms can be adapted to the MapReduce model. 3) From a practical point of view, gaining programming skills through the architectural implementation and learning to think critically through the comparison study.
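The two-function interface of the MapReduce model can be illustrated with a minimal single-process sketch. It is not the project's prototype and does not use any real MapReduce framework: the helper `map_reduce`, the functions `emit_terms` and `count_occurrences`, and the sample image records are all hypothetical, chosen only to show how a map step emits intermediate key/value pairs, how those pairs are grouped by key, and how a reduce step aggregates each group.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, List[str]]  # (image id, anatomical terms); hypothetical layout


def map_reduce(
    records: Iterable[Record],
    map_fn: Callable[[Record], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, group-by-key and reduce in memory (no parallelism)."""
    intermediate: Dict[str, List[int]] = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # Grouping step: collect all values that share a key.
            intermediate[key].append(value)
    # Reduce phase: aggregate the values of each key into one result.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def emit_terms(record: Record) -> Iterator[Tuple[str, int]]:
    """Map function: emit (term, 1) for every term attached to an image."""
    _image_id, terms = record
    for term in terms:
        yield term, 1


def count_occurrences(_term: str, values: List[int]) -> int:
    """Reduce function: sum the counts emitted for one term."""
    return sum(values)


# Hypothetical sample data, not taken from EURExpress-II.
annotations: List[Record] = [
    ("img-001", ["heart", "liver"]),
    ("img-002", ["heart"]),
    ("img-003", ["liver", "kidney", "heart"]),
]

term_counts = map_reduce(annotations, emit_terms, count_occurrences)
print(term_counts)  # {'heart': 3, 'liver': 2, 'kidney': 1}
```

In a distributed framework the map calls would run on many nodes in parallel and the grouping would happen over the network; the sketch keeps everything in one process so that only the programming model is visible.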

Project status: 
Degree level: 
Knowledge of Java programming, databases, data mining and integration, and distributed computing.
Supervisors @ NeSC: 
Subject areas: 
Algorithm Design
Computer Architecture
Computer Communication/Networking
Distributed Systems
Parallel Programming
Student project type: 
* [1] J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, in: Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004, pp. 137–150.
* [2] L. Han, J. I. van Hemert, R. Baldock, M. Atkinson, Automating gene expression annotation for mouse embryo, in: R. H. et al. (Ed.), Lecture Notes in Computer Science (Advanced Data Mining and Applications, ADMA 2009), Vol. LNAI 5678, 2009, pp. 469–478.
* [3] EURExpress-II
* [4] ADMIRE