EMR, PySpark and Wikipedia
EMR is a set of tools for processing big data. Spark is one of them, and I am going to try it out by reading 35 GB of raw Wikipedia XML files. I recently did a comparable check with an AWS Glue job and a Snowflake database.
Local run
To start very small and verify the code, let's try Spark and Jupyter inside a Docker image. I pull the image and start it:
docker run -p 8888:8888 jupyter/pyspark-notebook
The output gives me a link to the Jupyter notebook, where I run the following Python code. It returns 4M rows, which sounds good.
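A minimal sketch of that local check, assuming the spark-xml package is pulled in and the dump has been downloaded locally (the file name below is a placeholder):

from pyspark.sql import SparkSession

# Local Spark session; spark-xml is fetched as a package so we can parse the dump.
spark = (SparkSession.builder
         .appName("wikipedia-local")
         .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.14.0")
         .getOrCreate())

# Each Wikipedia article sits in a <page> element.
pages = (spark.read.format("xml")
         .option("rowTag", "page")
         .load("enwiki-latest-pages-articles.xml"))

print(pages.count())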
Infrastructure
EMR requires some infrastructure to be prepared in advance, though not the cluster itself. It is mostly about networking and IAM permissions. I will reuse it later for scheduled job execution.
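On the IAM side, a quick sanity check is to confirm that the default EMR roles exist. This is only a sketch with boto3 and assumes the standard role names:

import boto3

# Check that the default EMR service role and EC2 instance role are in place.
iam = boto3.client("iam")
for role_name in ("EMR_DefaultRole", "EMR_EC2_DefaultRole"):
    print(iam.get_role(RoleName=role_name)["Role"]["Arn"])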
Job
The job is written in Python and simply reads the data from S3, takes the id column, and writes it back to S3 in 27 files.
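Roughly, it looks like the sketch below; the bucket names and prefixes are placeholders of mine:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikipedia-ids").getOrCreate()

# Read the raw Wikipedia XML from S3 (bucket and prefix are placeholders).
pages = (spark.read.format("xml")
         .option("rowTag", "page")
         .load("s3://my-bucket/wikipedia/"))

# Keep only the id column and write it back to S3 as 27 files.
(pages.select("id")
 .repartition(27)
 .write.mode("overwrite")
 .csv("s3://my-bucket/wikipedia-ids/"))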
Submit
The job needs to be submitted to EMR. I don't want a constantly running cluster; it can be created just to run my job and terminated afterwards.
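One way to do that is a transient cluster created with boto3's run_job_flow, with the job added as a single step and the cluster terminating once the step finishes. The region, release label and S3 paths below are assumptions on my side:

import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # region is a placeholder

response = emr.run_job_flow(
    Name="wikipedia-ids",
    ReleaseLabel="emr-5.29.0",  # placeholder release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m4.large", "InstanceCount": 4},
        ],
        # Terminate the cluster as soon as there are no more steps to run.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "wikipedia-ids",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/wikipedia_ids.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",
)
print(response["JobFlowId"])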
Execution
The entire run on one MASTER (m4.large) and 4 CORE (m4.large) nodes took around 2 hours! If I did not know Snowflake, I would believe that is OK.