Locality Aware Two Phase I/O


This strategy aims to reduce the bottleneck in the I/O subsystem. Many applications use collective I/O operations to read data from and write data to disk. One of the most widely used collective I/O techniques is Two-Phase I/O, extended by Thakur and Choudhary in ROMIO.

Two-Phase I/O takes place in two phases: a data redistribution (exchange) phase and an I/O phase. In the first phase, small file requests are grouped into larger ones by means of communication. In the second phase, contiguous transfers are performed to or from the file system. Before either phase, Two-Phase I/O divides the file into equal contiguous parts, called File Domains (FDs), and assigns each FD to one of a configurable number of compute nodes, called aggregators. Each aggregator is responsible for aggregating all the data that maps into its assigned FD and for transferring that FD to or from the file system. In the default implementation of Two-Phase I/O, the assignment of aggregators (the aggregator pattern) is fixed, independent of how the data is distributed over the processes. This fixed aggregator pattern can create an I/O bottleneck, because of the multiple requests each aggregator performs to collect all the data assigned to its FD. The bottleneck is even more severe in commodity clusters, where commercial networks are usually installed, and in multi-core clusters, where the I/O bus is shared among the cores of a single node. Therefore, we replace the rigid assignment of aggregators to processes with a new one, which decides the aggregator pattern based on two aggregation criteria:

*1.1 Reduce the number of communications: This criterion assigns each aggregator to the node that holds the highest number of contiguous data blocks of the file domain associated with the aggregator.

*1.2 Reduce the volume of communications: This criterion assigns each aggregator to the node that holds the most data of the file domain associated with the aggregator.

The result is a new dynamic and adaptive I/O aggregator pattern based on the local data that each node stores. The proposed aggregator pattern is dynamic because it is calculated at runtime. It is also adaptive, because each application has its own pattern and can select the aggregation criterion that most reduces the communication phase of the Two-Phase I/O technique.
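The two aggregation criteria can be illustrated with a minimal Python sketch. It is not the ROMIO implementation: the function names (`file_domains`, `choose_aggregators`), the representation of local data as half-open byte ranges `(start, end)`, and the tie-breaking rule (lowest node id wins) are all assumptions made for illustration. The sketch also scores each file domain independently, so the same node may be chosen for more than one domain; the text above does not say how that case is resolved.

```python
def file_domains(file_size, num_aggregators):
    """Split [0, file_size) into contiguous, near-equal file domains (FDs)."""
    size = -(-file_size // num_aggregators)  # ceiling division
    return [(min(i * size, file_size), min((i + 1) * size, file_size))
            for i in range(num_aggregators)]


def overlap(block, domain):
    """Number of bytes of the half-open block that fall inside the domain."""
    return max(0, min(block[1], domain[1]) - max(block[0], domain[0]))


def choose_aggregators(node_blocks, domains, criterion="volume"):
    """Pick one aggregator node per file domain.

    node_blocks: dict mapping node id -> list of (start, end) contiguous
                 blocks that the node holds locally (hypothetical layout).
    criterion:   "count"  -> node with the most contiguous blocks in the FD
                 "volume" -> node with the most bytes in the FD
    """
    chosen = []
    for domain in domains:
        scores = {}
        for node, blocks in node_blocks.items():
            sizes = [s for s in (overlap(b, domain) for b in blocks) if s > 0]
            scores[node] = len(sizes) if criterion == "count" else sum(sizes)
        # Highest score wins; ties go to the lowest node id (an assumption).
        chosen.append(min(scores, key=lambda n: (-scores[n], n)))
    return chosen


# Node 0 holds three small blocks, node 1 holds two large ones.
node_blocks = {
    0: [(0, 5), (10, 15), (20, 25)],
    1: [(25, 50), (50, 100)],
}
domains = file_domains(100, 2)                            # [(0, 50), (50, 100)]
print(choose_aggregators(node_blocks, domains, "count"))   # [0, 1]
print(choose_aggregators(node_blocks, domains, "volume"))  # [1, 1]
```

In this example the two criteria disagree on the first file domain: node 0 owns more blocks there, while node 1 owns more bytes, which is why the choice of criterion is left to the application.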

In previous work, each aggregator was assigned according to the local data that each core stores, rather than each node. In multi-core architectures, intra-node communication is very fast because it is performed through shared memory. Therefore, to aggregate data in multi-core systems, it is more efficient to consider all the data that shares the same communication channel, that is, all the data from all the cores of the same node. For this reason, we now work with the local data that each node stores.
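The move from per-core to per-node data can be sketched as a simple grouping step. This is a hypothetical illustration, not the authors' code: `group_by_node`, the `core_to_node` mapping, and the decision to merge touching or overlapping byte ranges from cores of the same node into one contiguous block are all assumptions.

```python
from collections import defaultdict


def group_by_node(core_blocks, core_to_node):
    """Combine the (start, end) blocks held by each core into per-node lists.

    core_blocks:  dict mapping core (process) id -> list of (start, end) blocks
    core_to_node: dict mapping core (process) id -> node id
    Blocks that touch or overlap within a node are merged (an assumption).
    """
    per_node = defaultdict(list)
    for core, blocks in core_blocks.items():
        per_node[core_to_node[core]].extend(blocks)

    merged = {}
    for node, blocks in per_node.items():
        result = []
        for start, end in sorted(blocks):
            if result and start <= result[-1][1]:
                # Extend the previous block instead of starting a new one.
                result[-1] = (result[-1][0], max(result[-1][1], end))
            else:
                result.append((start, end))
        merged[node] = result
    return merged


# Four cores on two nodes: cores 0 and 1 share node 0, cores 2 and 3 share node 1.
core_blocks = {0: [(0, 10)], 1: [(10, 20)], 2: [(20, 30)], 3: [(40, 50)]}
core_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
print(group_by_node(core_blocks, core_to_node))
# {0: [(0, 20)], 1: [(20, 30), (40, 50)]}
```

Scoring nodes rather than cores means that data exchanged between cores of the same node, which travels through shared memory, no longer counts against a candidate aggregator.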