EMR, PySpark and Wikipedia

EMR is a set of tools for processing big data. Spark is one of them, and I am going to try it out by reading 35 GB of raw Wikipedia XML files. I recently did a comparable check with an AWS Glue job and a Snowflake database.

Local run

To start very small and to verify the code, let's try Spark and Jupyter inside a Docker image. I pull the image and start it:
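The exact image is not named in the post; a minimal sketch, assuming the common `jupyter/pyspark-notebook` image (any Spark-plus-Jupyter image works the same way):

```shell
# Assumption: jupyter/pyspark-notebook is the image used; the volume mount
# makes the local XML files visible inside the container.
docker pull jupyter/pyspark-notebook
docker run -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
```

The container prints a tokenized Jupyter URL on startup, which is the link mentioned below.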

The output gives me a link to the Jupyter notebook, where I run the following Python code. It returns 4M rows, which sounds good.
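The original notebook cell is not shown here; a minimal sketch of what it likely looked like, assuming the Databricks `spark-xml` package and a local dump file (the path and package version are my guesses):

```python
# Sketch of the local verification run, assuming spark-xml is pulled from
# Maven and the Wikipedia dump sits in the mounted work directory.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("wikipedia-local")
    # spark-xml is not bundled with Spark; fetch it as a package.
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.14.0")
    .getOrCreate()
)

pages = (
    spark.read.format("xml")
    .option("rowTag", "page")  # each Wikipedia article is a <page> element
    .load("work/enwiki-pages-articles.xml")
)

print(pages.count())  # the post reports roughly 4M rows
```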

Infrastructure

EMR requires some infrastructure to be prepared in advance, but not the cluster itself. It is mostly about networking and IAM permissions. I will reuse it later for scheduled job execution.
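The post does not list the exact IAM setup; assuming the stock EMR service roles are sufficient, the AWS CLI can create them in one step:

```shell
# Creates EMR_DefaultRole and EMR_EC2_DefaultRole (plus the instance
# profile) if they do not exist yet; a default VPC with a public subnet
# covers the networking side.
aws emr create-default-roles
```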

Job

The job is written in Python: it reads the data from S3, takes the id column, and writes it back to S3 in 27 files.
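The job script itself is not reproduced here; a sketch under assumptions: the `spark-xml` package is supplied at submit time via `--packages`, and the bucket names and paths are placeholders:

```python
# Hypothetical job.py: read the Wikipedia XML from S3, keep the id column,
# write 27 output files back to S3. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikipedia-ids").getOrCreate()

pages = (
    spark.read.format("xml")
    .option("rowTag", "page")
    .load("s3://my-bucket/wikipedia/enwiki-pages-articles.xml")
)

(
    pages.select("id")    # only the article id column
    .repartition(27)      # 27 output files, as described above
    .write.mode("overwrite")
    .csv("s3://my-bucket/wikipedia/ids/")
)

spark.stop()
```

The partition count drives the number of output files, since each partition is written by one task.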

Submit

The job needs to be submitted to EMR. I don't want a constantly running cluster; it can be created just to run my job.
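The submit command is not shown in the post; a sketch of a transient cluster that runs one Spark step and shuts itself down, with placeholder names, paths, and release label:

```shell
# Transient cluster: one m4.large master plus four m4.large core nodes,
# a single Spark step, and --auto-terminate so nothing keeps running.
aws emr create-cluster \
  --name "wikipedia-ids" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m4.large \
  --instance-count 5 \
  --use-default-roles \
  --log-uri s3://my-bucket/emr-logs/ \
  --steps 'Type=Spark,Name=job,ActionOnFailure=TERMINATE_CLUSTER,Args=[--deploy-mode,cluster,--packages,com.databricks:spark-xml_2.12:0.14.0,s3://my-bucket/job.py]' \
  --auto-terminate
```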

Execution

The entire run on one MASTER (m4.large) and four CORE (m4.large) nodes took around 2 hours! If I did not know Snowflake, I would believe that is OK.