Synopsis
Containerization is an important method for reproducible data analysis. This introduction shows a typical multi-stage Docker image building process.
Please make sure to install Docker Desktop before starting. If you would like to push to Docker Hub, you also need to register for a personal account.
Basics
We use the command `docker build --progress=plain -t <tag> .` to build a Docker container image. `--progress=plain` asks Docker to print detailed information for each step, while `-t <tag>` sets a tag for the image (if the build succeeds).
The tag is an identifier akin to a GitHub release, with the following structure: `{user}/{repo}:{image_tag}`. For example, `yeyuan98/repAnalysis:ggsashimi` points to my `repAnalysis` repository with image tag `ggsashimi`. If `{image_tag}` is not given, it defaults to `latest`.
The main missing piece: Docker uses a specification file to drive the build process. By default, this is a text file named `Dockerfile` in the current directory (the `.` at the end of our typical build command).
The Dockerfile is executed in sequence to build the image (documentation link). Each line defines one instruction, in the general form shown below:
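An instruction is a keyword followed by its arguments; lines starting with `#` are comments:

```dockerfile
# Comment
INSTRUCTION arguments
```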
Any instruction that modifies the container filesystem adds a new layer to the image. A container image is a stack of layers, each an immutable change to the filesystem (so deleting a file creates a new layer and the image size will NOT decrease).
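As a small illustration of this point, the following sketch writes a 100 MB file in one layer and deletes it in the next; the final image stays roughly 100 MB larger than the base because the data remains in the first layer (the file name and size are arbitrary):

```dockerfile
FROM alpine:3.19
# Layer 1: write a 100 MB file into the image.
RUN dd if=/dev/zero of=/tmp/big.bin bs=1M count=100
# Layer 2: deleting the file only adds another layer on top;
# the 100 MB still exists in the previous layer, so the image
# does NOT shrink.
RUN rm /tmp/big.bin
```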
The first instruction of a Dockerfile must specify the base image with `FROM`. After that, we are free to add instructions in sequence.
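For example, a minimal (hypothetical) Dockerfile might look like this:

```dockerfile
# The base image must come first...
FROM alpine:3.19
# ...followed by any sequence of instructions.
RUN apk add --no-cache bash
WORKDIR /data
```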
One commonly used technique is multi-stage building. Suppose:
- We would like to use Alpine Linux as a starting point (base image).
- We need compiler toolchains for compilation of certain packages.
- Compiled packages are executed in a runtime environment.
For the case of Python:
- First get a minimal image with Alpine + Python interpreter + external runtime libraries (if needed).
- From the minimal image, add compiler toolchains and dev packages and compile all packages.
- Again from the minimal image, copy only compiled packages from the compilation image to get the final container.
The above process will:
- Simplify the final container - it does not store compiler toolchains; only the runtime and the compiled packages are present.
- Modularize the building process - the compilation and endpoint images are largely independent except for sharing the same minimal base image.
- Save space on the build machine by layer sharing - the minimal image layers are shared; moreover, the compilation layers may be reused for other Python projects.
The `FROM ... AS <name>` instruction is the key to multi-stage building:
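A minimal sketch of the Python scenario above (the base image tag, the `requirements.txt` file, and the install paths are illustrative assumptions):

```dockerfile
# Stage 1: minimal base shared by the other stages.
FROM python:3.12-alpine AS base

# Stage 2: builder adds the compiler toolchain and compiles
# packages into an isolated prefix.
FROM base AS builder
RUN apk add --no-cache build-base
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# Stage 3: endpoint image copies only the compiled packages.
FROM base
COPY --from=builder /install /usr/local
```

Because both `builder` and the final stage start from the same `base`, the shared layers are stored only once on the build machine.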
Note that we use `COPY --from=builder` to copy compiled packages to the endpoint image.
Example: rebuild of a sashimi plot container
Below is the Dockerfile I used to build `yeyuan98/repAnalysis:ggsashimi`. It is a rebuild of the `guigolab/ggsashimi` image that achieved ~40% space savings by using the Alpine base image.
The minimal base image contains Python, R, and a few runtime packages. The builder stage adds compiler toolchains for R and the necessary development packages. The runtime stage copies the compiled Python and R packages, adds the ggsashimi script, and sets the entrypoint of the image.
The entrypoint is the command executed when the image is started with `docker run`. Extra arguments passed to `docker run` are forwarded to the entrypoint command as-is.
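A sketch of such a three-stage Dockerfile is shown below. It illustrates the structure just described rather than the exact recipe behind `yeyuan98/repAnalysis:ggsashimi`: the Alpine version, the R package list, and the script URL are assumptions, and the Python-side dependencies (e.g., pysam) would be compiled and copied in the same way.

```dockerfile
# ---- minimal base: Alpine + Python + R runtimes ----
FROM alpine:3.19 AS base
RUN apk add --no-cache python3 R

# ---- builder: compiler toolchains and dev headers for R packages ----
FROM base AS builder
RUN apk add --no-cache build-base R-dev
# Illustrative R package list; the real image pins specific packages.
RUN Rscript -e 'install.packages(c("ggplot2", "data.table", "gridExtra"), repos = "https://cloud.r-project.org")'

# ---- runtime: copy the compiled libraries only, add the script ----
FROM base
COPY --from=builder /usr/lib/R/library /usr/lib/R/library
# Illustrative URL; pin a release tag for reproducible builds.
ADD https://raw.githubusercontent.com/guigolab/ggsashimi/master/ggsashimi.py /usr/local/bin/ggsashimi.py
ENTRYPOINT ["python3", "/usr/local/bin/ggsashimi.py"]
```

With this entrypoint, `docker run yeyuan98/repAnalysis:ggsashimi --help` forwards `--help` to the ggsashimi script inside the container.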
A few extra notes:
- Always put the update and install commands of a package manager in the same `RUN` instruction. Otherwise the update layer will be cached and reused even when the subsequent install commands are modified (see the sketch after this list).
- It is typically a good habit to pin package versions to get reproducible builds. Note that Alpine `apk` does not allow direct installation of old package versions. You need to either compile from source or get an Alpine-tagged version of a core package if needed (e.g., start with `postgres:13.18-alpine` as the base image instead of using the latest `alpine` base).
- When copying files, it is usually better to use `ADD` and provide permanent web links to fetch the files. If you copy the files from the builder machine with `COPY`, the copied files may not be updated even if they are changed on the builder machine, because of layer caching.
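A minimal sketch of the first note, using `apk` as an example (the package choice is arbitrary):

```dockerfile
FROM alpine:3.19

# Good: index update and install share one RUN instruction, so editing
# the package list also re-runs the index update.
RUN apk update && apk add curl

# Risky: as separate instructions (shown commented out), the cached
# "apk update" layer would be reused even after the install line changes.
# RUN apk update
# RUN apk add curl
```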