The ECS Task can be started from the other machine and can be waited till completed or failed. The logs are streamed directly to CloudWatch. The waiting process can poll and show them while waiting for the task.
The problem occurs if I run 20 or more tasks in my binary pipeline. I know something is working and even progressing, but I do not want to see just ECS task started and complete messages. I don’t want to check logs manually in the AWS CloudWatch neither. …
The code may behave differently than developer expects. I mean not the correctness, but the performance. The performance bottleneck may be easy to found by running the profiler. cProfile is the way to do it in Python.
I am going to run my binary pipeline code twice using the profiler. First time in my local computer where the main bottleneck is my internet connection. The second time, on AWS ECS node, where the bottleneck should be CPU.
I ran locally the script as I used to do, but wrapped it around the cProfiler:
pipenv run python -m cProfile -s tottime…
Handling big files without loading them into memory requires processing them in chunks. I am going to show you how to sort 10GB JSON file in pure Python in non-distributed environment.
Generally the idea is to split big JSON file into consistent chunks (512MB) which you can sort in memory. Them persist them temporarily back in S3. And finally merge them by streaming into final big JSON file.
To make things simpler for me I am going to use my binary pipeline for the implementation. The actual code (not pseudo) looks like the following snippet:
The binary pipeline exposes…
Date or time ranges are common attributes in the database. Quite often something starts and ends. The concurrency may these beings be defined as how many of the are happening at the same time. Let’s calculate it.
There is two tables with following columns: id, start_at and end_at. The first table contains 11M rows, the second one almost 400M.
Theoretically the problem can be reduced to discreet values and the grouping may be applied. If we know all the timestamps when the value may change (when it starts or ends) we can build a timeline to use for evaluation using…
File systems tend to expose creation and modification dates of each entry. AWS S3 is not a file system, but exposes “Last Modified” date which is a bit confusing, because S3 object is not modifiable, but can be overwritten.
The goal of the experiment is to figure out how it really behaves, especially in the multipart upload scenario. I have an output file which I am going to write in 128MB parts using my slow internet connection. It is going to take several minutes to upload 1GB.
The process created multipart upload at 14:00, the first part completed at 14:01…
The Python’s old school way towards concurrency is multiprocessing module which spawns new processes. The asyncio module is a step to modernity. I will migrate my small data pipeline and compare the performance.
The classic multiprocessing can be triggered by the Pool. You do not need to take care of anything just to pass list of tasks you want to run concurrently.
The modern asyncio brings the same simplicity what multiprocessing. You need to have a Pool of threads and schedule some tasks. The threads are good enough if you are going to run IO intensive tasks.
Many tools and frameworks perform uploading files to S3 in parts to gain performance or overcome 5GB limitation. If the process is terminated and not completed it may still occupy storage space.
You can list all multipart uploads in bash for each bucket:
aws s3api list-multipart-uploads --bucket=
You can even remove them manually by calling:
aws s3api abort-multipart-upload --bucket= --key= --upload-id=
AWS already provides solution for it which may be just enabled. I was always told that S3 Lifecycle Rules are for moving objects between storage classes. It turns out you can do much more with it. Let’s enable incomplete multipart upload deletion.
If everything went right you will see in the Console newly added and enabled rule.
The AWS ECS offers nice way of running Python containers with ETL work. But spinning hundreds of containers with default settings does not uses cost effective pricing model. In non-essential work you can start bidding spots.
The minimal changes in the infrastructure should be done in the way how the ECS cluster is provisioned. Here is the version which defines only Fargate Spot as default capacity provider.
The most important change in the code to run ECS task is to not specify launch type at all. I left Fargate there and it still ran, but I could not be able to confirm if the spot was used.
How to be sure if it runs as expected? In the console, in the task you should see new property describing chosen capacity.
The Wikipedia binary pipeline I am experimenting with works quite good on EC2 or ECS, but requires 9 CPUs to work smoothly (XML to JSON conversion). What if I distribute it and run each worker process on ECS?
I want to reduce the total execution time from 47 minutes (ECS 4vCPUs) and 19 minutes (c5.2xlarge 8vCPUs) to below 10 minutes. I guess I need to scale horizontally to benefit from many running concurrently ECS tasks.
Instead of running all files at once, I will define worker responsibility to download and process just one FTP file. The master responsibility is to…
I have a pure Python script which moves data from FTP, ungzips it, converts XML to JSON and writes it back to S3. It runs quite fast on single c5.2xlarge machine with multiprocessing and does not work on AWS Glue PythonShell (deployments issues). How about moving it to the AWS ECS?
Currently my python script uses 3 FTP servers to download *.xml.gz files. Each file is around 1GB and each FTP can support up to 3 connections. …
Software Developer, Data Engineer with solid knowledge of Business Intelligence. Passionate about programming.