The Wikipedia binary pipeline I am experimenting with works quite well on EC2 or ECS, but requires 9 CPUs to run smoothly (XML to JSON conversion). What if I distribute it and run each worker process on ECS?
I want to reduce the total execution time from 47 minutes (ECS, 4 vCPUs) and 19 minutes (c5.2xlarge, 8 vCPUs) to below 10 minutes. I expect I need to scale horizontally to benefit from many ECS tasks running concurrently.
Master / Worker
Instead of processing all files at once, I will narrow the worker's responsibility to downloading and processing just one FTP file. The master's responsibility is to loop over all available files and coordinate their execution, using multiprocessing and a token system as before.
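The master loop can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `run_worker`, the file names, and the token count are all assumed placeholders.

```python
# Sketch of the master: loop over all FTP files and fan each one out
# to a worker, with the token system modelled as a bounded process
# pool (at most MAX_TOKENS workers run at once).
from multiprocessing import Pool

MAX_TOKENS = 9  # assumed cap on concurrently running ECS tasks


def run_worker(ftp_file):
    # Placeholder: in the real pipeline this would launch one ECS task
    # that downloads and processes exactly one FTP file.
    return f"processed {ftp_file}"


def master(ftp_files):
    # Each file is handed to exactly one worker; the pool size acts as
    # the pool of available tokens.
    with Pool(processes=MAX_TOKENS) as pool:
        return pool.map(run_worker, ftp_files)
```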
The ECS task in my binary pipeline is defined as follows:
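For illustration, launching such a task from the master could look like the sketch below. The cluster name, task definition, container name, and command are all hypothetical, not the pipeline's real values; only the `run_task` parameter shape follows boto3's ECS API.

```python
# Hypothetical parameters for starting one worker as a Fargate task.
def build_run_task_params(ftp_file):
    return {
        "cluster": "wikipedia-pipeline",       # assumed cluster name
        "taskDefinition": "wikipedia-worker",  # assumed task definition
        "launchType": "FARGATE",
        "count": 1,
        "overrides": {
            "containerOverrides": [{
                "name": "worker",
                # the worker receives the single FTP file it owns
                "command": ["python", "worker.py", "--file", ftp_file],
            }]
        },
    }

# The master would then start one task per file:
#   import boto3
#   ecs = boto3.client("ecs")
#   ecs.run_task(**build_run_task_params("enwiki-part1.xml.bz2"))
```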
The execution took around 13 minutes and, as expected, used 27 containers. I run the master locally because it is not bound to heavy resources. The entire code is here.
The good news is that the bottleneck is no longer CPU but the FTP servers. Theoretically I could increase the number of connections to each FTP server (because ECS containers have different public addresses), but that would be unethical. I would rather split the FTP download from the further processing.
The master code was adjusted to call the two containers sequentially. Each call is throttled by a dedicated throttler.
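The two dedicated throttlers can be sketched with per-stage semaphores. The limits below are illustrative assumptions, not the pipeline's real settings: downloads stay narrow to be polite to the FTP servers, while processing can fan out wider.

```python
# Each stage is gated by its own semaphore (the "dedicated throttler").
from multiprocessing import Semaphore

download_tokens = Semaphore(3)   # assumed limit on concurrent FTP downloads
process_tokens = Semaphore(27)   # assumed limit on processing containers


def run_download_task(ftp_file):
    with download_tokens:
        # here the master would start the download-only ECS container
        return f"downloaded {ftp_file}"


def run_process_task(ftp_file):
    with process_tokens:
        # here the master would start the processing-only ECS container
        return f"processed {ftp_file}"


def handle_file(ftp_file):
    # the master calls the two containers sequentially for each file
    run_download_task(ftp_file)
    return run_process_task(ftp_file)
```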
The workers were split into two functions, and each function is invoked as an independent call from an ECS container.
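One way to split the worker, assuming both stages live in the same container image and the entry point picks the stage from the command line; the function names and return values are placeholders:

```python
# Split worker: download-only and process-only entry points.
def download_stage(ftp_file):
    # fetch the file from the FTP mirror and store it for later processing
    return f"{ftp_file} stored"


def process_stage(ftp_file):
    # convert the previously downloaded XML to JSON
    return f"{ftp_file} converted"


STAGES = {"download": download_stage, "process": process_stage}

# Inside the container, the entry point would dispatch on an argument:
#   import sys
#   stage, ftp_file = sys.argv[1], sys.argv[2]
#   STAGES[stage](ftp_file)
```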
According to the following dashboard, the execution took around 10 minutes and used 54 containers. The FTP download still occupied most of the time, but the processing could be done faster. The master script needs to run 15 child processes, which adds a bit of memory overhead to the driver. Additionally, spinning up containers adds extra time to my experiment.
The code is here.