An ECS task can be started from another machine and waited on until it completes or fails. The logs are streamed directly to CloudWatch, so the waiting process can poll and display them while the task runs.
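For illustration, a minimal sketch of that loop with boto3, assuming the task definition logs through the awslogs driver; the cluster, task definition, log group and stream prefix are all illustrative:

import time
import boto3

# A sketch of starting a task and tailing its CloudWatch logs while polling
# its status; every name below is illustrative.
ecs = boto3.client("ecs")
logs = boto3.client("logs")

task_arn = ecs.run_task(
    cluster="pipeline",                # hypothetical cluster
    taskDefinition="binary-pipeline",  # hypothetical task definition
)["tasks"][0]["taskArn"]

# The awslogs driver names the stream "<prefix>/<container>/<task-id>".
stream = "pipeline/app/" + task_arn.split("/")[-1]
token = None

while True:
    task = ecs.describe_tasks(cluster="pipeline", tasks=[task_arn])["tasks"][0]
    kwargs = {"logGroupName": "/ecs/pipeline", "logStreamName": stream, "startFromHead": True}
    if token:
        kwargs["nextToken"] = token
    try:
        response = logs.get_log_events(**kwargs)
        for event in response["events"]:
            print(event["message"])
        token = response["nextForwardToken"]
    except logs.exceptions.ResourceNotFoundException:
        pass  # the stream appears only after the container writes something
    if task["lastStatus"] == "STOPPED":
        break
    time.sleep(10)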

Idea

The problem occurs if I run 20 or more tasks in my binary pipeline. I know something is working and even progressing, but I do not want to see just ECS task started and completed messages. I do not want to check logs manually in AWS CloudWatch either. …


The code may behave differently than the developer expects; I mean not the correctness, but the performance. A performance bottleneck may be easy to find by running a profiler, and cProfile is the way to do it in Python.

Goal

I am going to run my binary pipeline code twice under the profiler: first on my local computer, where the main bottleneck is my internet connection, and the second time on an AWS ECS node, where the bottleneck should be the CPU.

Benchmark

I ran the script locally as I used to, but wrapped it in cProfile:

pipenv run python -m cProfile -s tottime…
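For comparison, a minimal sketch of driving the profiler programmatically with cProfile and pstats; work() is just a placeholder for the real pipeline entry point:

import cProfile
import io
import pstats

def work():
    # placeholder for the real pipeline entry point
    return sum(i * i for i in range(10**6))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(10)  # top 10 functions by own time
print(stream.getvalue())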

Handling big files without loading them into memory requires processing them in chunks. I am going to show you how to sort a 10GB JSON file in pure Python in a non-distributed environment.

Algorithm

Generally, the idea is to split the big JSON file into fixed-size chunks (512MB each) which you can sort in memory, then persist them temporarily back to S3, and finally merge them by streaming into the final big JSON file.
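A minimal sketch of that algorithm, assuming the input is in JSON Lines format and using local temporary files in place of S3 to keep it self-contained; the chunk size and the sort key are illustrative:

import heapq
import json
import tempfile

CHUNK_SIZE = 100_000  # lines per chunk; the real pipeline uses 512MB chunks

def persist(chunk):
    # sort one chunk in memory and persist it to its own temporary file
    chunk.sort(key=lambda row: row["id"])  # hypothetical sort key
    file = tempfile.TemporaryFile(mode="w+")
    file.writelines(json.dumps(row) + "\n" for row in chunk)
    file.seek(0)
    return file

def sort_chunks(lines):
    # split the input into fixed-size chunks and persist each one sorted
    chunk, files = [], []
    for line in lines:
        chunk.append(json.loads(line))
        if len(chunk) == CHUNK_SIZE:
            files.append(persist(chunk))
            chunk = []
    if chunk:
        files.append(persist(chunk))
    return files

def merge(files, output):
    # stream-merge the sorted chunks into the final file
    iterators = ((json.loads(line) for line in file) for file in files)
    for row in heapq.merge(*iterators, key=lambda row: row["id"]):
        output.write(json.dumps(row) + "\n")

with open("big.jsonl") as source, open("sorted.jsonl", "w") as output:
    merge(sort_chunks(source), output)

heapq.merge pulls one row at a time from each chunk, so the final step never holds more than one row per chunk in memory.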

Pipeline

To make things simpler for me, I am going to use my binary pipeline for the implementation. The actual code (not pseudocode) looks like the following snippet:

Components

The binary pipeline exposes…


Date or time ranges are common attributes in databases. Quite often something starts and ends. The concurrency of these entities may be defined as how many of them are happening at the same time. Let's calculate it.

Setup

There are two tables with the following columns: id, start_at and end_at. The first table contains 11M rows, the second one almost 400M.

Solution #1 — Naive

Theoretically, the problem can be reduced to discrete values and grouping may be applied. If we know all the timestamps when the value may change (when something starts or ends), we can build a timeline to use for evaluation using…
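For illustration, a minimal sketch of that timeline idea in Python (the rows and the column values are illustrative; the real solution runs against the database):

from datetime import datetime

# Every start adds one to the running count, every end subtracts one;
# sorting the events and keeping a running sum yields the concurrency timeline.
rows = [
    ("2023-01-01 10:00", "2023-01-01 12:00"),
    ("2023-01-01 11:00", "2023-01-01 13:00"),
    ("2023-01-01 11:30", "2023-01-01 11:45"),
]

events = []
for start_at, end_at in rows:
    events.append((datetime.fromisoformat(start_at), +1))
    events.append((datetime.fromisoformat(end_at), -1))

concurrency, timeline = 0, []
for timestamp, delta in sorted(events):
    concurrency += delta
    timeline.append((timestamp, concurrency))

print(max(count for _, count in timeline))  # peak concurrency: 3

Because the events are plain tuples, ends (-1) sort before starts (+1) at equal timestamps, so ranges that merely touch are not counted as overlapping.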


File systems tend to expose the creation and modification dates of each entry. AWS S3 is not a file system, but it exposes a "Last Modified" date, which is a bit confusing because an S3 object is not modifiable, though it can be overwritten.

Goal

The goal of the experiment is to figure out how it really behaves, especially in the multipart upload scenario. I have an output file which I am going to write in 128MB parts over my slow internet connection; it is going to take several minutes to upload 1GB.
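For reference, a minimal sketch of such an upload with boto3; the bucket, key and source file are illustrative:

import boto3

PART_SIZE = 128 * 1024 * 1024  # fixed 128MB parts

s3 = boto3.client("s3")
upload = s3.create_multipart_upload(Bucket="my-bucket", Key="output.json")

parts = []
with open("output.json", "rb") as source:
    for number in range(1, 10_001):  # S3 allows at most 10,000 parts
        data = source.read(PART_SIZE)
        if not data:
            break
        response = s3.upload_part(
            Bucket="my-bucket", Key="output.json",
            UploadId=upload["UploadId"], PartNumber=number, Body=data,
        )
        parts.append({"PartNumber": number, "ETag": response["ETag"]})

s3.complete_multipart_upload(
    Bucket="my-bucket", Key="output.json", UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)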

Log

The process created multipart upload at 14:00, the first part completed at 14:01…


Python's old-school route to concurrency is the multiprocessing module, which spawns new processes. The asyncio module is a step toward modernity. I will migrate my small data pipeline and compare the performance.

Before changes

The classic multiprocessing can be triggered through a Pool. You do not need to take care of anything; just pass the list of tasks you want to run concurrently.
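A minimal sketch of that shape; process_file stands in for one pipeline task and the file list is illustrative:

from multiprocessing import Pool

def process_file(name):
    # placeholder for one real pipeline task
    return f"processed {name}"

if __name__ == "__main__":
    files = ["dump-01.xml.gz", "dump-02.xml.gz", "dump-03.xml.gz"]
    with Pool(processes=3) as pool:
        for result in pool.map(process_file, files):
            print(result)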

After changes

The modern asyncio brings the same simplicity as multiprocessing: you need a pool of threads and some tasks scheduled onto it. Threads are good enough if you are going to run IO-intensive tasks.
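And a minimal sketch of the asyncio counterpart, scheduling the same tasks onto a thread pool; the names are again illustrative:

import asyncio
from concurrent.futures import ThreadPoolExecutor

def process_file(name):
    # placeholder for one real pipeline task
    return f"processed {name}"

async def main():
    files = ["dump-01.xml.gz", "dump-02.xml.gz", "dump-03.xml.gz"]
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=3) as pool:
        tasks = [loop.run_in_executor(pool, process_file, name) for name in files]
        for result in await asyncio.gather(*tasks):
            print(result)

asyncio.run(main())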

Comparison

In…


Many tools and frameworks upload files to S3 in parts to gain performance or to overcome the 5GB single-upload limitation. If the process is terminated before completion, the uploaded parts may still occupy storage space.

CLI

In bash, you can list all multipart uploads for each bucket:

aws s3api list-multipart-uploads --bucket=

You can even remove them manually by calling:

aws s3api abort-multipart-upload --bucket= --key= --upload-id=

Automation

AWS already provides a solution for this which can simply be enabled. I was always told that S3 Lifecycle Rules are for moving objects between storage classes. It turns out you can do much more with them. Let's enable incomplete multipart upload deletion.
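The same rule can also be enabled through the API; a minimal sketch with boto3, where the bucket name and the 7-day window are illustrative:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {},  # apply to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)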

Outcome

If everything went right, you will see the newly added and enabled rule in the Console.


AWS ECS offers a nice way of running Python containers for ETL work. But spinning up hundreds of containers with default settings does not use a cost-effective pricing model. For non-essential work you can start bidding on Spot capacity.

Infrastructure

The minimal infrastructure change concerns the way the ECS cluster is provisioned. Here is the version which defines only Fargate Spot as the default capacity provider.
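A minimal sketch of such provisioning with boto3, assuming the cluster is created through the API; the cluster name is illustrative:

import boto3

ecs = boto3.client("ecs")
ecs.create_cluster(
    clusterName="pipeline",
    capacityProviders=["FARGATE_SPOT"],
    defaultCapacityProviderStrategy=[
        # every task without an explicit launch type lands on Fargate Spot
        {"capacityProvider": "FARGATE_SPOT", "weight": 1}
    ],
)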

Running tasks

The most important change in the code that runs the ECS task is to not specify a launch type at all. When I left Fargate there, the task still ran, but I was not able to confirm whether Spot was used.
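A minimal sketch of starting the task without a launch type, so the cluster's default capacity provider strategy applies; the names and the subnet are illustrative:

import boto3

ecs = boto3.client("ecs")
ecs.run_task(
    cluster="pipeline",
    taskDefinition="binary-pipeline",
    count=1,
    # no launchType and no capacityProviderStrategy: the default strategy applies
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],  # hypothetical subnet
            "assignPublicIp": "ENABLED",
        }
    },
)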

Confirmation

How can you be sure it runs as expected? In the console, in the task details, you should see a new property describing the chosen capacity provider.


The Wikipedia binary pipeline I am experimenting with works quite well on EC2 or ECS, but requires 9 CPUs to run smoothly (XML to JSON conversion). What if I distribute it and run each worker process on ECS?

Goal

I want to reduce the total execution time from 47 minutes (ECS, 4 vCPUs) and 19 minutes (c5.2xlarge, 8 vCPUs) to below 10 minutes. I guess I need to scale horizontally to benefit from many ECS tasks running concurrently.

Master / Worker

Instead of running all files at once, I will make the worker responsible for downloading and processing just one FTP file. The master's responsibility is to…
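One plausible sketch of the dispatch side, assuming the master passes the file name to each worker task through an environment variable; the cluster, task definition, container name and file list are all illustrative:

import boto3

ecs = boto3.client("ecs")
files = ["enwiki-01.xml.gz", "enwiki-02.xml.gz", "enwiki-03.xml.gz"]

for name in files:
    # one worker task per FTP file, each told which file to handle
    ecs.run_task(
        cluster="pipeline",
        taskDefinition="wikipedia-worker",
        overrides={
            "containerOverrides": [
                {"name": "worker", "environment": [{"name": "FILE", "value": name}]}
            ]
        },
    )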


I have a pure Python script which moves data from FTP, ungzips it, converts XML to JSON and writes it back to S3. It runs quite fast on a single c5.2xlarge machine with multiprocessing, but does not work on AWS Glue PythonShell (deployment issues). How about moving it to AWS ECS?

Now

Currently my Python script uses 3 FTP servers to download *.xml.gz files. Each file is around 1GB and each FTP server can support up to 3 connections. …

Adrian Macal

Software Developer, Data Engineer with solid knowledge of Business Intelligence. Passionate about programming.
