Running an AWS Glue Python shell job

AWS Glue is a service that covers a lot of ground around ETL. One part of it is jobs, which let you run your own code. Here I am going to show you how to run the simplest kind of job, one written in Python.

Goal

I aim to download a single Wikipedia archive file from FTP directly to S3. The file is around 65 GB and will be transferred on the fly by the Python script.

Infrastructure

We need a bucket, a Glue job, an IAM role with some permissions, a Python script for the job to execute, and an SSM parameter pointing at our bucket.
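To make these pieces more concrete, here is a minimal sketch of how the job and the SSM lookup could be expressed with boto3. It is only one possible way to set things up, not necessarily how the infrastructure in this post was created. The helper names (glue_job_definition, bucket_from_ssm, create_job) and every value passed to them are hypothetical. The boto3 calls used are create_job on the Glue client and get_parameter on the SSM client.

```python
def glue_job_definition(name, role_arn, script_location):
    """Build the arguments for a Python shell job (all values are placeholders)."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",  # job type: plain Python, no Spark
            "ScriptLocation": script_location,  # s3:// path of the script
            "PythonVersion": "3",
        },
        "MaxCapacity": 0.0625,  # smallest size; Python shell also accepts 1
    }


def bucket_from_ssm(ssm, parameter_name):
    """Read the bucket name stored in an SSM parameter."""
    return ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"]


def create_job(definition):
    """Create the Glue job from a definition built above."""
    import boto3  # imported lazily so the helpers above work without it

    return boto3.client("glue").create_job(**definition)
```

The role passed as role_arn would need, at a minimum, permission to read the script, write to the bucket and read the SSM parameter.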

Python script

The code is more or less a generalization of samples I wrote before. The main idea is to transfer data from FTP to S3 without storing it locally.
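The idea of transferring without local storage can be sketched as follows. This is not the script from this post, only an illustration under some assumptions: the data is read from the FTP data connection in small blocks, grouped into larger chunks in memory, and each chunk is sent as one part of an S3 multipart upload. The function names, the 64 MiB chunk size and the anonymous FTP login are all assumptions. S3 requires every part except the last to be at least 5 MiB and allows at most 10,000 parts, so 64 MiB chunks leave plenty of room for a 65 GB file.

```python
import ftplib


def iter_chunks(read, chunk_size):
    """Yield blocks of exactly chunk_size bytes; the last one may be shorter."""
    buffer = bytearray()
    while True:
        data = read(64 * 1024)  # small reads from the source
        if not data:
            break
        buffer.extend(data)
        while len(buffer) >= chunk_size:
            yield bytes(buffer[:chunk_size])
            del buffer[:chunk_size]
    if buffer:
        yield bytes(buffer)


def stream_to_s3(read, s3, bucket, key, chunk_size=64 * 1024 * 1024):
    """Upload everything returned by read(n) as one multipart S3 object."""
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    try:
        for number, chunk in enumerate(iter_chunks(read, chunk_size), start=1):
            response = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=number, Body=chunk,
            )
            parts.append({"ETag": response["ETag"], "PartNumber": number})
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Do not leave a half-finished upload behind.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
    return len(parts)


def ftp_to_s3(host, path, bucket, key):
    """Stream one file from an FTP server (anonymous login assumed) to S3."""
    import boto3  # imported lazily; only needed for the real transfer

    ftp = ftplib.FTP(host)
    try:
        ftp.login()
        ftp.voidcmd("TYPE I")  # binary mode
        conn = ftp.transfercmd("RETR " + path)  # raw data socket
        try:
            return stream_to_s3(conn.recv, boto3.client("s3"), bucket, key)
        finally:
            conn.close()
    finally:
        ftp.close()
```

Only one chunk is held in memory at a time, so memory use stays roughly constant regardless of the file size. A call would look like ftp_to_s3("ftp.example.org", "/path/to/archive", "demo-bucket", "archive"), where all four values are placeholders.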

Outcome

How long did it take? 49 minutes.
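For a rough sense of the speed, the average throughput can be worked out from the two figures above. This is a back-of-the-envelope sketch that assumes the file is exactly 65 GB in decimal units, which the post gives only as an approximation. The function name is hypothetical.

```python
def average_throughput_mb_per_s(size_gb, minutes):
    """Average transfer rate in megabytes per second (decimal units)."""
    return size_gb * 1000 / (minutes * 60)


rate = average_throughput_mb_per_s(65, 49)
print(f"{rate:.1f} MB/s, about {rate * 8:.0f} Mbit/s")
# prints: 22.1 MB/s, about 177 Mbit/s
```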