Building a Spark/PySpark image from Spark Binaries

Atirek Kumar
Jul 2, 2021

After weeks of procrastination and a 3 am urge to finish all my back-burner tasks in one night, I finally decided to document a better approach I found for building an Apache Spark Docker image. The motivation behind this article is to document a simpler and more efficient approach than the one commonly used for writing a Dockerfile that can run Spark processes. (I am assuming here that the reader has a basic knowledge of Dockerfile commands and an overview of Spark. PS: that's all I needed to write this article.)

The Task

To iron out any ambiguities, the task here is to write a lean Dockerfile that produces an image capable of running Spark processes.

The commonly-used approach

To solve this problem, I did a simple GitHub search to see how other similar Dockerfiles run Spark. After going through many repositories, it was clear that more or less the same template was being followed: use a Linux OS as the base image and install the Spark binaries inside the container. However, this approach has some clear problems.
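
For context, a rough sketch of that common template is shown below; the base image, package names, and Spark version are purely illustrative, not taken from any particular repository:

# Typical "install Spark on top of a full OS" pattern (illustrative versions)
FROM ubuntu:20.04

# Java runtime and curl are needed to run and download Spark
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-11-jre-headless curl && \
    rm -rf /var/lib/apt/lists/*

# Download and unpack the Spark distribution into /opt
RUN curl -fsSL https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz \
    | tar -xz -C /opt && \
    ln -s /opt/spark-3.1.2-bin-hadoop3.2 /opt/spark

ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${PATH}"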

Problems with this approach

  • Using a base image of a full Linux OS like Ubuntu means including a lot of packages and tools that have nothing to do with Spark. I mean, do we really need most of what ships with a full distribution?
  • The image created is unnecessarily large because of all this extra baggage. Hence, we should try a better approach, one where anything useless is cut out.

The Efficient Solution

An efficient approach would produce an image that takes up less storage space and does not include any unnecessary functionality. Thankfully, Apache provides one. Inside the Spark binaries, in the bin folder, Apache ships a shell script for building a Docker image for Spark & PySpark. Download the Spark binaries and run the docker-image-tool.sh script. Simple as that.
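
If you want to follow along, downloading and unpacking the binaries looks something like this (Spark 3.1.2 is just the release that was current at the time of writing; any recent version ships the same script):

# Download and unpack the Spark binaries (version chosen for illustration)
curl -fsSLO https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xzf spark-3.1.2-bin-hadoop3.2.tgz
cd spark-3.1.2-bin-hadoop3.2

# The build script lives in the bin folder
ls bin/docker-image-tool.sh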

But here is the strange thing: I couldn’t find any blogs, GitHub repos, etc. discussing this approach, which means documentation for this script is very scarce (well, that, or I am bad at Googling). The only official documentation from Apache I found about this approach was inside the .sh file, as comments. Hence, the goal of this article is to provide some sort of documentation for building the image provided by Apache Spark. So let’s dive into how we can build a Spark/PySpark image with this script.

Building the image

cd into the root of the extracted Spark binaries directory.

To build a Spark image:

./bin/docker-image-tool.sh build

To build a PySpark image (this also builds the Spark image):

./bin/docker-image-tool.sh -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

The above steps are documented as comments in docker-image-tool.sh.
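Those same comments (and the script’s usage message) also describe -r and -t flags for setting a custom repository and tag, which is handy if you want to push the images to a registry. For example (the registry address and tag below are placeholders):

# Build the Spark and PySpark images under a custom repo and tag...
./bin/docker-image-tool.sh -r myregistry.example.com/spark -t v3.1.2 \
    -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# ...and push them to that registry
./bin/docker-image-tool.sh -r myregistry.example.com/spark -t v3.1.2 push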
Now you can see the images built on your system. Notice that the Spark and PySpark images take up 532 MB and 910 MB respectively, which is less than what the commonly-used approach would give you.
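
To check the sizes on your own machine, just list the local images (the exact names depend on the flags you passed to the script; the PySpark one typically shows up as spark-py):

# List the freshly built images and their sizes
docker images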

Writing a dockerfile with this image

Now that we have an image named spark on our system, let us write a Dockerfile with spark as the base image (the same steps are valid for the PySpark image); a sketch putting these steps together follows the list:

  • Set the base image as spark (or the PySpark image)
  • Set WORKDIR as /opt/spark
  • Set USER as root to be able to run apt-get
  • Install python3 (if running .py files), plus pkg-config & libcairo2-dev to fix pycairo dependency issues
  • Add to CMD whatever you wish to run, e.g. spark-submit
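
A minimal sketch of such a Dockerfile could look like the following. This assumes the locally built image is named spark, and my_app.py is a hypothetical placeholder for your own application file:

# Dockerfile sketch on top of the locally built spark image
FROM spark

WORKDIR /opt/spark

# The base image switches to a non-root user, so switch back to root for apt-get
USER root

# python3 for .py files; pkg-config and libcairo2-dev to fix pycairo dependency issues
RUN apt-get update && \
    apt-get install -y python3 python3-pip pkg-config libcairo2-dev && \
    rm -rf /var/lib/apt/lists/*

# Copy in the application and run it with spark-submit (adjust master/paths to taste)
COPY my_app.py /opt/spark/work-dir/my_app.py
CMD ["/opt/spark/bin/spark-submit", "--master", "local[*]", "/opt/spark/work-dir/my_app.py"]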

And that was it! Just run a single simple script to get a Spark image.
