Running AWS Glue Pythonshell
AWS Glue is a managed service that covers many ETL tasks. One part of it is jobs, which let you run your own code. Here I am going to show you how to run the simplest kind of job, written in Python.
Goal
I aim to download a single Wikipedia archive file from FTP directly to S3. The file is around 65 GB and will be transferred on the fly by the Python script.
Infrastructure
We need a bucket, a Glue job, an IAM role with some permissions, a Python script to be executed by the job, and an SSM parameter pointing at our bucket.
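To make the pieces above more concrete, here is a minimal sketch of how the job could be described for boto3's `glue.create_job` and how the script could look up the bucket name through SSM. It is not necessarily how the original infrastructure was provisioned. The helper names, the capacity and the timeout are assumptions chosen for illustration.

```python
def python_shell_job_definition(name, role_arn, script_s3_uri):
    """Build keyword arguments for glue.create_job (hypothetical helper)."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",            # marks the job as a Python shell job
            "ScriptLocation": script_s3_uri,  # where the script lives in S3
            "PythonVersion": "3.9",
        },
        "MaxCapacity": 0.0625,  # assumption: the smallest Python shell capacity
        "Timeout": 120,         # minutes; assumption, must exceed the transfer time
    }


def read_bucket_name(ssm, parameter_name):
    """Return the bucket name stored in an SSM parameter (hypothetical helper)."""
    return ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"]
```

With real clients this would be used as `boto3.client("glue").create_job(**python_shell_job_definition(...))` and `read_bucket_name(boto3.client("ssm"), "...")`, with the IAM role allowing `ssm:GetParameter` and writes to the bucket.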
Python script
The code is an abstraction of samples I wrote before. The main idea is to transfer data between FTP and S3 without storing it locally.
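The original script is not reproduced here, so the following is only a sketch of one way to stream from FTP to S3 without touching local disk. It buffers the blocks that `ftplib` delivers and sends them on as parts of an S3 multipart upload. The class and function names, the part size and the anonymous login are assumptions, and empty files are not handled.

```python
from ftplib import FTP

# Assumption: 64 MiB parts. S3 requires at least 5 MiB per part (except the
# last) and at most 10,000 parts, so a 65 GB file needs roughly 1,000 parts.
PART_SIZE = 64 * 1024 * 1024


class S3PartWriter:
    """Buffers incoming bytes and forwards them as multipart-upload parts."""

    def __init__(self, s3, bucket, key, part_size=PART_SIZE):
        self.s3, self.bucket, self.key = s3, bucket, key
        self.part_size = part_size
        self.buffer = bytearray()
        self.parts = []
        response = s3.create_multipart_upload(Bucket=bucket, Key=key)
        self.upload_id = response["UploadId"]

    def write(self, data):
        # ftplib calls this for every block received from the server.
        self.buffer.extend(data)
        while len(self.buffer) >= self.part_size:
            self._flush(self.part_size)

    def _flush(self, size):
        body = bytes(self.buffer[:size])
        del self.buffer[:size]
        number = len(self.parts) + 1  # part numbers start at 1
        response = self.s3.upload_part(
            Bucket=self.bucket, Key=self.key, PartNumber=number,
            UploadId=self.upload_id, Body=body,
        )
        self.parts.append({"ETag": response["ETag"], "PartNumber": number})

    def close(self):
        # Send whatever is left as the final (possibly smaller) part.
        if self.buffer:
            self._flush(len(self.buffer))
        self.s3.complete_multipart_upload(
            Bucket=self.bucket, Key=self.key, UploadId=self.upload_id,
            MultipartUpload={"Parts": self.parts},
        )

    def abort(self):
        self.s3.abort_multipart_upload(
            Bucket=self.bucket, Key=self.key, UploadId=self.upload_id,
        )


def transfer(ftp, s3, remote_path, bucket, key, part_size=PART_SIZE):
    """Stream one FTP file into S3 and return the number of uploaded parts."""
    writer = S3PartWriter(s3, bucket, key, part_size)
    try:
        ftp.retrbinary(f"RETR {remote_path}", writer.write, blocksize=1024 * 1024)
        writer.close()
    except Exception:
        writer.abort()  # avoid leaving an unfinished upload behind
        raise
    return len(writer.parts)


def run_job(host, remote_path, bucket, key):
    # boto3 is imported here so the helpers above can be tried without it.
    import boto3
    with FTP(host) as ftp:
        ftp.login()  # assumption: the server allows anonymous login
        return transfer(ftp, boto3.client("s3"), remote_path, bucket, key)
```

Only one part is held in memory at a time, which is what lets a small Python shell job handle a file far larger than its RAM. Aborting on failure matters because parts of an unfinished multipart upload keep being billed until they are removed.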
Outcome
How long did it take? 49 minutes.
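For a rough sense of the speed, the two figures above can be turned into an average throughput. This is back-of-the-envelope arithmetic, assuming the file is exactly 65 GB in decimal units, while the real size is only "around" that.

```python
size_bytes = 65 * 10**9   # assumption: 65 GB, decimal units
duration_s = 49 * 60      # 49 minutes

mb_per_s = size_bytes / duration_s / 10**6
mbit_per_s = mb_per_s * 8

print(f"{mb_per_s:.1f} MB/s, {mbit_per_s:.0f} Mbit/s")
```

That works out to roughly 22 MB/s on average across the whole transfer.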