Processing large amounts of data across a set of nodes in a cluster such as EDIM1 requires deploying and running a workflow, together with its processing elements and libraries, on all the nodes.
The complexity of the problem and the size of the data imply that executing the workflow is often an exploratory and iterative process.
This requires the ability to verify the results of intermediate steps and to redeploy processing elements if the results are unsatisfactory.
Moreover, some steps work exclusively on local data, while others require data from other nodes: data transfer and the execution of these steps must be coordinated to minimise idle time and thereby reduce the overall execution time.
As part of an effort to identify gene interactions during mouse embryo development, which involves processing and correlating 350,000 images, a Python framework that partially satisfies the above criteria has been developed.
Ideally, further development and evaluation with other systems could bring the framework to production level, making it easy to install and configure.
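
The coordination of data transfer with local processing described above can be illustrated with a minimal sketch. It is not the framework's actual code: the functions `fetch_remote`, `local_step`, `combine_step` and `run` are hypothetical names, and the transfer is simulated with a short sleep. The idea shown is only that transfers are started first, local-only work runs while they are in flight, and the node waits only when remote data is actually needed.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_remote(node):
    """Hypothetical stand-in for a data transfer from another node."""
    time.sleep(0.05)  # simulated network latency
    return [node * 10 + i for i in range(3)]


def local_step(data):
    """Hypothetical step that uses only local data."""
    return [x * x for x in data]


def combine_step(local_result, remote_batches):
    """Hypothetical step that needs data from other nodes."""
    return sum(local_result) + sum(sum(batch) for batch in remote_batches)


def run(local_data, remote_nodes):
    with ThreadPoolExecutor(max_workers=len(remote_nodes) or 1) as pool:
        # Start all transfers first so they proceed in the background.
        pending = [pool.submit(fetch_remote, n) for n in remote_nodes]
        # Do the local-only work while the transfers are in flight.
        local_result = local_step(local_data)
        # Block only at the point where the remote data is required.
        remote_batches = [f.result() for f in pending]
    return combine_step(local_result, remote_batches)
```

Calling `run([1, 2, 3], [1, 2])` would process the local list while two simulated transfers are running, then combine the results. In a real deployment the thread pool would presumably be replaced by whatever transfer mechanism the cluster provides.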