A generic metadata management tool for large-scale data-intensive applications

Student: 
Pei Pei
Grade: 
first

Principal goal: to build a generic metadata management tool to support large-scale, data-intensive scientific applications in e-Science research projects.

The emergence of the grid model has made large-scale collaborations between researchers in different application domains both possible and popular. The large collections of data captured by instruments or generated by simulators during such distributed collaborations must be stored together with facts about their characteristics, so that the data can be shared across systems, organizations and sectors. Various metadata management tools already exist for dealing with such vast quantities of data in Grids [1-4]. However, from the point of view of extensibility and reusability, how to register (or publish) and discover scientific data in a generic way remains an open problem.

Registering (or publishing) data is the process by which data sets and their associated metadata are stored and made accessible to user communities. When a dataset is registered, it is given a unique logical dataset name that the user can use to refer to it. The metadata management tool keeps the association between the dataset and the files that compose it. Discovering data is the process of identifying data items of interest to the user. For generic use, the data sets are generalized into two types: file-level data sets (described by, e.g., file attributes, file size, ownership) and application-level data sets (described by, e.g., domain-oriented, content-based metadata information).
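As an illustration only, the registration step described above could be sketched in Java as follows. The class and method names (`DatasetRecord`, `DatasetRegistry`, `register`, `filesOf`) are hypothetical and are not taken from the NanoCMOS design specification; the sketch assumes an in-memory store, whereas a real tool would presumably sit on a database.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; an in-memory sketch, not the project's actual design.
final class DatasetRecord {
    final String logicalName;                    // unique name users refer to
    final List<String> files;                    // files that compose the dataset
    final Map<String, String> fileLevel;         // e.g. size, owner
    final Map<String, String> applicationLevel;  // domain-oriented attributes

    DatasetRecord(String logicalName, List<String> files,
                  Map<String, String> fileLevel,
                  Map<String, String> applicationLevel) {
        this.logicalName = logicalName;
        // Defensive copies so later changes by the caller do not alter the record.
        this.files = Collections.unmodifiableList(new ArrayList<>(files));
        this.fileLevel = Collections.unmodifiableMap(new LinkedHashMap<>(fileLevel));
        this.applicationLevel =
                Collections.unmodifiableMap(new LinkedHashMap<>(applicationLevel));
    }
}

final class DatasetRegistry {
    private final Map<String, DatasetRecord> byName = new LinkedHashMap<>();

    /** Registers a dataset; the logical name must not already be in use. */
    void register(DatasetRecord record) {
        if (byName.containsKey(record.logicalName)) {
            throw new IllegalArgumentException(
                    "Dataset name already registered: " + record.logicalName);
        }
        byName.put(record.logicalName, record);
    }

    /** Returns the files composing the named dataset, or an empty list if unknown. */
    List<String> filesOf(String logicalName) {
        DatasetRecord record = byName.get(logicalName);
        return record == null ? Collections.<String>emptyList() : record.files;
    }
}
```

Rejecting a duplicate name is one way to keep logical names unique; a real tool might instead version datasets or scope names per project.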

This project aims to build a generic metadata management tool based on the metadata management design specification developed for a currently running project called NanoCMOS. The tool will provide functions that allow a scientist to publish the results of scientific experiments together with associated metadata, both domain-independent and domain-dependent. The scientist can use the tool to annotate the data sets with their own observations and make these annotations available to communities. The tool will also provide a search function so that users can discover data sets based on the values of descriptive attributes, rather than having to know the specific names or physical locations of data items.

The implementation has three main goals:
1) Providing a generic mechanism for annotating or tagging data sets with domain-dependent or domain-independent metadata attributes;
2) Supporting simple queries on the tool's contents based on metadata attributes;
3) Testing the functionality with simple annotations and queries of file-level and application-level metadata attributes.
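
The first two goals can be pictured with a minimal Java sketch, given here under stated assumptions: attributes are plain name/value strings, queries are exact-match only, and everything is held in memory. The names `MetadataCatalogue`, `annotate` and `findByAttribute` are hypothetical and do not come from the project itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a generic attribute store keyed by dataset name.
final class MetadataCatalogue {
    // dataset name -> (attribute name -> attribute value)
    private final Map<String, Map<String, String>> annotations = new LinkedHashMap<>();

    /** Attaches (or overwrites) one attribute on a dataset; any attribute name is accepted. */
    void annotate(String datasetName, String attribute, String value) {
        annotations
                .computeIfAbsent(datasetName, k -> new LinkedHashMap<>())
                .put(attribute, value);
    }

    /** Returns the names of all datasets whose attribute equals the given value. */
    List<String> findByAttribute(String attribute, String value) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : annotations.entrySet()) {
            if (value.equals(entry.getValue().get(attribute))) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}
```

Because the store does not fix a schema, the same `annotate` call can carry a file-level attribute such as an owner or an application-level attribute such as a material, which is one simple reading of "generic" here. A caller could then ask `findByAttribute("owner", "alice")` without knowing dataset names or file locations.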

Project status: 
Finished
Degree level: 
MSc
Background: 
Knowledge of programming in Java; knowledge of Databases and Web Services
Supervisors @ NeSC: 
Subject areas: 
e-Science
Databases
Distributed Systems
Student project type: 
References: 
[1] Meta-data standards, http://metadata-standards.org
[2] Adrienne Tannenbaum, Metadata Solutions: Using Metamodels, Repositories, XML, and Enterprise Portals to Generate Information on Demand, Addison-Wesley, 2002. ISBN 0-201-71976-2
[3] David Marco, Building and Managing the Meta Data Repository: A Full Lifecycle Guide, Wiley, 2000. ISBN 0-471-35523-2
[4] David C. Hay, Data Model Patterns: A Metadata Map, Morgan Kaufmann, 2006. ISBN 0-12-088798-3