Our research throws up many questions that we cannot address immediately. Some of these are listed below. We would be delighted to hear from others who would like to join us in tackling them or already have the answers.
In the EFFORT project, as in others, different kinds of jobs must be submitted to computational resources. Depending on the characteristics of a job, the best resource may be EDIM1 (when the job must work with a huge volume of data), a conventional cluster such as ECDF (when the job requires high-performance computing), or the esciences1-8 machines (for small, quick computations).
Many applications use collective I/O operations to read and write data from disk. One of the most widely used is the Two-Phase I/O technique, extended by Thakur and Choudhary in ROMIO. Two-Phase I/O proceeds in two phases: a data-exchange phase and an I/O phase. In the first phase, small file requests are grouped into larger ones by means of inter-process communication. In the second phase, contiguous transfers are performed to or from the file system.
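The core of the exchange phase is combining many small, scattered requests into a few contiguous transfers. The following is a minimal sketch of that aggregation step in Python, not ROMIO's implementation; the `(offset, length)` request representation is an assumption for illustration.

```python
def merge_requests(requests):
    """Group small, possibly non-contiguous file requests, each given as
    (offset, length), into larger contiguous transfers, as in the
    exchange phase of Two-Phase I/O."""
    merged = []
    for offset, length in sorted(requests):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # Overlaps or abuts the previous range: extend it.
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off,
                          max(prev_off + prev_len, offset + length) - prev_off)
        else:
            merged.append((offset, length))
    return merged

# Several processes each request small pieces of the file; after merging,
# only two contiguous transfers are needed in the I/O phase.
reqs = [(0, 4), (8, 4), (4, 4), (100, 8), (108, 4)]
print(merge_requests(reqs))  # [(0, 12), (100, 12)]
```

In ROMIO the merged ranges are additionally partitioned among aggregator processes, but the gain is the same: fewer, larger, contiguous accesses to the file system.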
Message Passing Interface (MPI) is the message-passing library most widely used to provide communications in clusters. There are several MPI implementations, such as MPICH, CHIMP, LAM and OPEN MPI. We have developed a library called PRAcTICaL-MPI (PoRtable AdapTIve Compression Library) that reduces the volume of data exchanged among processes by applying lossless compression.
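The principle can be illustrated without MPI: compress a message buffer losslessly before sending and decompress it on receipt. This Python sketch uses zlib to show the idea only; it is not PRAcTICaL-MPI's API, which intercepts MPI calls transparently and chooses compressors adaptively.

```python
import zlib

def compress_message(buf: bytes) -> bytes:
    """Losslessly compress a message payload before transmission
    (illustrative only; not the PRAcTICaL-MPI interface)."""
    return zlib.compress(buf)

def decompress_message(buf: bytes) -> bytes:
    """Recover the exact original payload on the receiving side."""
    return zlib.decompress(buf)

# Redundant scientific data often compresses well, so the bytes put on
# the network can shrink substantially while the round trip stays exact.
payload = b"0.000 " * 1000
packed = compress_message(payload)
assert decompress_message(packed) == payload  # lossless round trip
print(len(payload), "->", len(packed))
```

Whether compression pays off depends on the data's compressibility and the ratio of CPU speed to network bandwidth, which is why an adaptive policy matters.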
DISPEL is a language designed for describing and organising data-intensive processing.
Cloud systems, such as OSDC and Microsoft's Azure, are intended to provide easily accessed and economical data-intensive computation.
The challenge is that DISPEL is a streaming technology that can potentially handle large volumes of data as well as continuous streams of data.
This streaming needs computational nodes that can access disks and that can communicate with one another, e.g. stream data to one another.
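The streaming model behind this can be sketched with Python generators: processing elements that consume one stream and emit another, connected so that items flow through without the whole dataset being materialised. This is only an analogy for DISPEL's execution model, not DISPEL syntax; the element names are invented for illustration.

```python
def source(values):
    # A producing element: emits a stream of data items.
    yield from values

def transform(stream):
    # A transforming element: consumes one stream and emits another,
    # one item at a time, so arbitrarily long streams fit in memory.
    for item in stream:
        yield item * item

def sink(stream):
    # A consuming element: folds the incoming stream into a result.
    return sum(stream)

# Wire the elements into a pipeline; items stream through the stages
# rather than being collected between them.
result = sink(transform(source(range(5))))
print(result)  # 30
```

In a distributed setting each element may run on a different node, which is exactly why the nodes must be able to stream data to one another.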
Brain images are used in a variety of multi-disciplinary studies, including medicine, psychology and linguistics, among others.
These studies use a range of image types generated with different equipment (PET, SPECT, EEG, MEG, MR and CT) and with different parameters, producing very different data sizes and numbers of images.
The EDIM1 data-intensive architecture is intended to accelerate data-intensive processing.
One good way to organise this processing is with a map-reduce model.
We can consider two candidate implementations: Hadoop and the Sector+Sphere combination from Grossman et al.
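The map-reduce model itself is simple enough to state in a few lines. The following Python sketch shows the two phases on a toy word-count job; it illustrates the programming model shared by Hadoop and Sphere, not either system's API.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (key, value) pair for each word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: combine all values that share a key (here, by summing).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["brain image data", "image data stream", "data"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
print(reduce_phase(pairs))
# {'brain': 1, 'image': 2, 'data': 3, 'stream': 1}
```

The appeal for EDIM1 is that both phases parallelise naturally: map tasks run independently on the node holding each data partition, and only the shuffled key groups cross the network.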
Processing large amounts of data across a set of nodes in a cluster like EDIM1 requires deploying and running a workflow, together with a set of processing elements and libraries, across all the nodes.
The complexity of the problem and the size of the data mean that the execution of the workflow is often an exploratory and iterative process.
Scientific laboratories produce large amounts of data, often stored as files in hierarchical folders. File systems do not scale well to large numbers of files. In particular, access to data becomes hard when query criteria do not match the criteria by which the files were organised.
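One common workaround is to walk the hierarchy once and build an index keyed by an attribute of interest, so that subsequent queries no longer depend on the folder layout. The sketch below indexes files by extension as a stand-in for richer metadata; the helper name and the choice of attribute are assumptions for illustration.

```python
import os
import tempfile
from collections import defaultdict

def index_by_extension(root):
    """Walk a folder hierarchy once and index files by extension, so a
    query by file type no longer depends on how folders are organised."""
    index = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1]
            index[ext].append(os.path.join(dirpath, name))
    return index

# Demo with a throwaway hierarchy: two subject folders, mixed file types.
with tempfile.TemporaryDirectory() as root:
    for sub, name in [("subj1", "scan.img"), ("subj1", "notes.txt"),
                      ("subj2", "scan.img")]:
        os.makedirs(os.path.join(root, sub), exist_ok=True)
        open(os.path.join(root, sub, name), "w").close()
    idx = index_by_extension(root)
    print(sorted(len(v) for v in idx.values()))  # [1, 2]
```

A database or catalogue service plays the same role at scale: the point is that the query index is maintained separately from the storage hierarchy.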
Data protection is a great concern when dealing with medical data because it contains sensitive personal information.
Nevertheless, medical research could greatly profit from researchers being able to share data across institutional borders in a safe way. There is a trade-off between privacy protection and research interests, which should be balanced without resorting to extreme data removal.
Data streaming is a strategy for scalable or continuous data processing. We have developed a high-level notation for describing distributed and heterogeneous data-streaming workflows called DISPEL and have a substantial body of applications described in DISPEL. An implementation based on OGSA-DAI exists and at least two other implementations are partially constructed. The Open Questions that need investigating via a series of experiments are: