While other modern programming languages lean on lightweight threads or the async/await pattern, Python handles concurrency in a somewhat old-school way: with child processes. Here is a short introduction to the multiprocessing package, with examples.
Downloading from FTP, unpacking, and writing to S3 takes a lot of time: about 40 minutes for 35 GB of data. It should be possible to run the entire pipeline faster, shouldn't it?
The easiest way to start with concurrency is multiprocessing.Pool. It lets you process a list across multiple processes with almost no boilerplate.
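A minimal sketch of the Pool approach, assuming a hypothetical `process_file` function that stands in for the real download/unpack/upload step:

```python
from multiprocessing import Pool

def process_file(filename):
    # Placeholder for the real work: download the dump from FTP,
    # unpack it, and write the result to S3.
    return f"processed {filename}"

if __name__ == "__main__":
    # Hypothetical file names, just for illustration.
    files = ["enwiki-part1.gz", "enwiki-part2.gz", "enwiki-part3.gz"]

    # Pool spreads the list across worker processes;
    # map blocks until every item has been handled.
    with Pool(processes=4) as pool:
        results = pool.map(process_file, files)

    print(results)
```

`pool.map` keeps the results in the same order as the input list, so the rest of the pipeline does not need to care which worker handled which file.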
This approach cuts the runtime to 13 minutes (from 40 minutes when each file is downloaded sequentially). Now the bottleneck is the FTP server, which limits the number of active sessions from a single IP.
If the FTP server is the bottleneck, I need to try multiple mirrors. Let's create something like a token system. Each token contains an FTP server address together with its inner directory. When a process obtains a token, it may use that server. This way we can limit the number of active connections per host and scale the application across multiple servers.
To implement the token system between processes I decided to use a multiprocessing.Queue: get returns a token, put puts it back. And here are the changes:
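A sketch of the token idea, with placeholder mirror hostnames (the real mirror list and the actual FTP download are assumptions, not shown here):

```python
from multiprocessing import Process, Queue

# Hypothetical mirrors: each token is a (host, directory) pair.
# A host appears once per allowed session, so tokens also cap
# the number of active connections per mirror.
MIRRORS = [
    ("mirror1.example.org", "/pub/wikimedia/dumps"),
    ("mirror2.example.org", "/mirror/dumps"),
]

def worker(tokens, filename):
    # get() blocks until some mirror token is free.
    host, path = tokens.get()
    try:
        print(f"downloading {filename} from {host}{path}")
        # ... the real FTP download would happen here ...
    finally:
        # Return the token so another process can use this mirror.
        tokens.put((host, path))

if __name__ == "__main__":
    tokens = Queue()
    for mirror in MIRRORS:
        tokens.put(mirror)

    procs = [
        Process(target=worker, args=(tokens, f"part{i}.gz"))
        for i in range(4)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The try/finally is important: a token must go back on the queue even if the download fails, otherwise one crashed worker permanently removes a mirror slot.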
The run took 7 minutes, benefiting from 3 Wikipedia mirrors. If you are interested in the full implementation, you will find the code here.