EMR, PySpark and Wikipedia

EMR is a set of tools for processing big data. Spark is one of them, and I am going to try it out by reading 35 GB of raw Wikipedia XML files. I recently did a comparable check with an AWS Glue job and a Snowflake database.

Local run

To start very small and to verify the code, let's try Spark and Jupyter inside a Docker image. I pull the image and start it:
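The exact image is not named in the post; a minimal sketch, assuming the common `jupyter/pyspark-notebook` image (any Spark-plus-Jupyter image works the same way):

```shell
# Assumption: jupyter/pyspark-notebook is the image used; the volume mount
# makes the local XML files visible inside the container.
docker pull jupyter/pyspark-notebook
docker run -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
```

The container prints a tokenized Jupyter URL on startup, which is the link mentioned below.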

The output gives me a link to the Jupyter notebook, where I run the following Python code. It returns 4M rows, which sounds good.
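The original notebook cell is not shown here; a minimal sketch of what it likely looked like, assuming the Databricks `spark-xml` package and a local dump file (the path and package version are my guesses):

```python
# Sketch of the local verification run, assuming spark-xml is pulled from
# Maven and the Wikipedia dump sits in the mounted work directory.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("wikipedia-local")
    # spark-xml is not bundled with Spark; fetch it as a package.
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.14.0")
    .getOrCreate()
)

pages = (
    spark.read.format("xml")
    .option("rowTag", "page")  # each Wikipedia article is a <page> element
    .load("work/enwiki-pages-articles.xml")
)

print(pages.count())  # the post reports roughly 4M rows
```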

Infrastructure

EMR requires some infrastructure to be prepared in advance, but not the cluster itself. It is mostly about networking and IAM permissions. I will reuse it later for scheduled job execution.
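The post does not list the exact IAM setup; assuming the stock EMR service roles are sufficient, the AWS CLI can create them in one step:

```shell
# Creates EMR_DefaultRole and EMR_EC2_DefaultRole (plus the instance
# profile) if they do not exist yet; a default VPC with a public subnet
# covers the networking side.
aws emr create-default-roles
```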

Job

The job is written in Python: it reads the data from S3, takes the id column, and writes it back to S3 in 27 files.
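The job script itself is not reproduced here; a sketch under assumptions: the `spark-xml` package is supplied at submit time via `--packages`, and the bucket names and paths are placeholders:

```python
# Hypothetical job.py: read the Wikipedia XML from S3, keep the id column,
# write 27 output files back to S3. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikipedia-ids").getOrCreate()

pages = (
    spark.read.format("xml")
    .option("rowTag", "page")
    .load("s3://my-bucket/wikipedia/enwiki-pages-articles.xml")
)

(
    pages.select("id")    # only the article id column
    .repartition(27)      # 27 output files, as described above
    .write.mode("overwrite")
    .csv("s3://my-bucket/wikipedia/ids/")
)

spark.stop()
```

The partition count drives the number of output files, since each partition is written by one task.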

Submit

The job needs to be submitted to EMR. I don't want a constantly running cluster; it can be created just to run my job.
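The submit command is not shown in the post; a sketch of a transient cluster that runs one Spark step and shuts itself down, with placeholder names, paths, and release label:

```shell
# Transient cluster: one m4.large master plus four m4.large core nodes,
# a single Spark step, and --auto-terminate so nothing keeps running.
aws emr create-cluster \
  --name "wikipedia-ids" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m4.large \
  --instance-count 5 \
  --use-default-roles \
  --log-uri s3://my-bucket/emr-logs/ \
  --steps 'Type=Spark,Name=job,ActionOnFailure=TERMINATE_CLUSTER,Args=[--deploy-mode,cluster,--packages,com.databricks:spark-xml_2.12:0.14.0,s3://my-bucket/job.py]' \
  --auto-terminate
```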

Execution

The entire run on one MASTER (m4.large) and four CORE (m4.large) nodes took around 2 hours! If I did not know Snowflake, I would believe that is OK.