I have a pure Python script that moves data from FTP, un-gzips it, converts XML to JSON, and writes the result back to S3. It runs quite fast on a single c5.2xlarge machine with multiprocessing, but it does not work on AWS Glue Python Shell (deployment issues). How about moving it to AWS ECS?
Currently my Python script uses 3 FTP servers to download *.xml.gz files. Each file is around 1 GB, and each FTP server supports up to 3 connections. The code spins up to 9 processes and uses a queue to acquire a token that grants access to an FTP server without exceeding the connection quota.
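The token queue can be sketched roughly like this: the queue is pre-loaded with 3 tokens per server (9 in total), and a worker must take a token, naming a server, before opening a connection, returning it afterwards. The host names and the `download` body are placeholders, not the real implementation.

```python
import multiprocessing as mp

# Hypothetical host names -- stand-ins for the three real FTP servers.
FTP_HOSTS = ["ftp1.example.com", "ftp2.example.com", "ftp3.example.com"]
CONNECTIONS_PER_HOST = 3  # each FTP server allows at most 3 connections

def download(host, filename):
    # Placeholder for the real work: FTP download, gunzip,
    # XML -> JSON conversion, and the S3 upload.
    return f"{host}:{filename}"

def worker(tokens, files, results):
    while True:
        filename = files.get()
        if filename is None:      # poison pill: no more work
            break
        host = tokens.get()       # blocks until a connection slot is free
        try:
            results.put(download(host, filename))
        finally:
            tokens.put(host)      # return the slot to the pool

def run(filenames):
    tokens = mp.Queue()
    for host in FTP_HOSTS:
        for _ in range(CONNECTIONS_PER_HOST):
            tokens.put(host)      # 9 tokens -> at most 9 concurrent downloads
    files, results = mp.Queue(), mp.Queue()
    n_workers = len(FTP_HOSTS) * CONNECTIONS_PER_HOST
    procs = [mp.Process(target=worker, args=(tokens, files, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for name in filenames:
        files.put(name)
    for _ in procs:
        files.put(None)
    # Drain results before joining so workers never block on a full pipe.
    out = [results.get() for _ in filenames]
    for p in procs:
        p.join()
    return out
```

With 9 tokens outstanding the workers self-limit: even if more processes were started, no server would ever see more than 3 simultaneous connections.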
I need an ECR repository, an ECS cluster, an IAM role, and a task definition.
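A minimal provisioning sketch with boto3 might look as follows. All names, the account ID, the region in the image URI, and the 4 vCPU / 8 GB Fargate sizing are assumptions, and the IAM role is assumed to exist already; this is not a complete setup (networking, log configuration, and the role itself are omitted).

```python
# Hypothetical placeholders -- replace with real account, region, and names.
REPO_NAME = "ftp-xml-to-json"
IMAGE = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/ftp-xml-to-json:latest"
ROLE_ARN = "arn:aws:iam::123456789012:role/ftp-xml-to-json-task"

def task_definition(image, role_arn):
    # A Fargate task definition sized at 4 vCPU / 8 GB (an assumption;
    # a c5.2xlarge has 8 vCPU, so the container gets less CPU here).
    return {
        "family": "ftp-xml-to-json",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": "4096",
        "memory": "8192",
        "executionRoleArn": role_arn,
        "taskRoleArn": role_arn,
        "containerDefinitions": [{
            "name": "worker",
            "image": image,
            "essential": True,
        }],
    }

def provision():
    import boto3  # AWS SDK; needs credentials configured in the environment
    boto3.client("ecr").create_repository(repositoryName=REPO_NAME)
    ecs = boto3.client("ecs")
    ecs.create_cluster(clusterName="ftp-xml-to-json")
    ecs.register_task_definition(**task_definition(IMAGE, ROLE_ARN))
```

The task role needs S3 write access for the output bucket, and the execution role needs permission to pull from ECR.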
In the image I need to copy only the two collected artifacts.
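The Dockerfile can therefore stay very small. The artifact names below are hypothetical placeholders; the point is that only the two build outputs are copied in, nothing else.

```dockerfile
FROM python:3.9-slim
WORKDIR /app
# The two collected artifacts -- names here are hypothetical.
COPY dist/deps.zip /app/
COPY main.py /app/
ENTRYPOINT ["python", "/app/main.py"]
```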
Because I am going to run a Docker container, I need to prepare the image first. The following script should do it:
I ran it and it took around 47 minutes, compared to 19 minutes on the c5.2xlarge. As expected, the CPU is the bottleneck.