PaPy: Parallel and distributed data-processing pipelines in Python
Marcin Cieślik
mpc4p@virginia.edu - University of Virginia, U.S.
Cameron Mura
cmura@virginia.edu - University of Virginia, U.S.
PaPy, which stands for parallel pipelines in Python, is a highly flexible
framework that enables the construction of robust, scalable workflows for
either generating or processing voluminous datasets. A workflow is created from
user-written Python functions (nodes) connected by 'pipes' (edges) into a
directed acyclic graph. These functions are arbitrarily definable, and can make
use of any Python modules or external binaries. Given a user-defined topology
and collection of input data, functions are composed into nested higher-order
maps, which are transparently and robustly evaluated in parallel on a single
computer or on remote hosts. Local and remote computational resources can be
flexibly pooled and assigned to functional nodes, thereby allowing facile
load-balancing and pipeline optimization to maximize computational
throughput. Input items are processed by nodes in parallel, and traverse the
graph in batches of adjustable size - a trade-off between lazy evaluation,
parallelism, and memory consumption. The processing of a single item can be
parallelized in a scatter/gather scheme. The simplicity and flexibility of
distributed workflows built with PaPy bridge the gap between desktop and grid
computing, enabling this new computing paradigm to be leveraged in the
processing of large scientific datasets.
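The following minimal sketch is not PaPy's own API; it uses only the Python standard library to illustrate the core idea described above: a linear two-stage pipeline expressed as nested higher-order maps that are evaluated lazily and in parallel by a pool of workers. The stage functions (parse, square) and the input data are hypothetical placeholders, and the chunksize argument stands in for the adjustable batch size that trades memory consumption against parallel throughput.

from multiprocessing import Pool

def parse(item):
    # First pipeline node: turn a raw record into a number.
    return int(item)

def square(value):
    # Second pipeline node: an arbitrary per-item computation.
    return value ** 2

if __name__ == "__main__":
    raw_items = ["1", "2", "3", "4"]
    with Pool(processes=2) as pool:
        # Chaining imap calls mirrors the nested higher-order maps: each
        # stage lazily consumes the output of the previous one, and items
        # travel through the pipeline in chunks rather than all at once.
        stage1 = pool.imap(parse, raw_items, chunksize=2)
        stage2 = pool.imap(square, stage1, chunksize=2)
        print(list(stage2))  # [1, 4, 9, 16]

A scatter/gather step, as mentioned above, would additionally split a single large item across several workers and merge the partial results; PaPy's own primitives for building such graphs are described in the full paper.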
Citation
M. Cieślik, C. Mura, "PaPy: Parallel and distributed data-processing pipelines in Python," in Proceedings of the 8th Python in Science Conference (SciPy 2009), G. Varoquaux, S. van der Walt, J. Millman (Eds.), pp. 41-47.
Copyright
The content of the articles of the Proceedings of the Python in
Science Conference is copyrighted and owned by their original
authors.
Terms of use: For republication or other use of the material published, please contact the copyright owners to obtain permission.