Data are moved between systems in binary format. The medium may be a network, a file system, or something else, but the lowest layer is always a stream of bytes. You do not need to look into the data to move it. Here is some code I played with to build such a binary pipeline.
I would like to have a high-level pipeline abstraction in Python that looks like the following snippet. The code opens a remote file, computes its MD5 and SHA-1 digests, ungzips it, computes MD5 and SHA-1 again, and writes the output to S3.
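As a minimal, runnable illustration of the steps just listed, the sketch below uses in-memory bytes as a stand-in for the remote file and a `BytesIO` buffer as a stand-in for the S3 sink; the helper `checksums` and all names here are illustrative, not the actual implementation.

```python
import gzip
import hashlib
import io

def checksums(data):
    """Return (md5, sha1) hex digests of a byte string."""
    return hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()

compressed = gzip.compress(b"hello pipeline")  # stand-in for the remote file
md5_gz, sha1_gz = checksums(compressed)        # digests of the compressed stream
plain = gzip.decompress(compressed)            # the "ungzip" step
md5_plain, sha1_plain = checksums(plain)       # digests of the decompressed stream

sink = io.BytesIO()                            # stand-in for the S3 sink
sink.write(plain)
```

The real pipeline performs the same sequence of transformations, but streams chunks instead of materializing the whole payload in memory.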
The whole implementation is here, and sample output may look like this:
Each component exposes three important functions.
The pipeline runs the defined workflow by binding all its members and starting the first component. Everything is expected to work in push mode, which means that whenever we get data from the previous component, we push it to the next one as soon as possible. At the end we may flush our state.
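The binding, push, and flush behavior described above can be sketched as follows. The three method names (`bind`, `write`, `flush`) and the class names are assumptions for illustration; the linked implementation may use different names.

```python
import hashlib

class Component:
    """Base component: pushes each chunk downstream as soon as it arrives."""
    def bind(self, downstream):
        self.downstream = downstream
        return downstream

    def write(self, chunk):
        self.downstream.write(chunk)   # push mode: forward immediately

    def flush(self):
        self.downstream.flush()        # propagate end-of-stream

class Checksum(Component):
    """Pass-through component that accumulates a digest of the stream."""
    def __init__(self, name="md5"):
        self.digest = hashlib.new(name)

    def write(self, chunk):
        self.digest.update(chunk)      # update local state, then keep pushing
        self.downstream.write(chunk)

class Collect(Component):
    """Terminal component: gathers everything it receives."""
    def __init__(self):
        self.chunks = []

    def write(self, chunk):
        self.chunks.append(chunk)

    def flush(self):
        self.data = b"".join(self.chunks)

# Bind the workflow, push chunks through it, then flush.
md5 = Checksum("md5")
sink = Collect()
md5.bind(sink)
for chunk in (b"binary ", b"stream"):
    md5.write(chunk)
md5.flush()
```

Because each `write` forwards its chunk immediately, no component needs to buffer the whole stream; `flush` is the only place where deferred state (like the final digest or an S3 multipart commit) gets finalized.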