Synopsis

Containerization is an important method for reproducible data analysis. This introduction walks through a typical multi-stage Docker image build.

Please make sure to install Docker Desktop before starting. If you would like to push to Docker Hub, you also need to register a personal account.

Basics

We use the docker build --progress=plain -t <tag> . command to build a Docker container image. --progress=plain asks Docker to print detailed output for each step, while -t <tag> sets a tag for the image (if the build succeeds).

The tag is an identifier akin to a GitHub release, with the structure {user}/{repo}:{image_tag}. For example, yeyuan98/repAnalysis:ggsashimi points to my repAnalysis repository with image tag ggsashimi. If {image_tag} is not given, it defaults to latest.

The main missing piece: Docker uses a specification file to perform the build. By default, it is a text file named Dockerfile in the build context (the directory given by the . at the end of our typical build command).
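As a minimal sketch (the repository name myuser/hello and the file contents are hypothetical), a Dockerfile can be as short as:

```dockerfile
# Dockerfile in the current directory
FROM alpine:3.21
# Each instruction below FROM adds content to the image
RUN echo "hello" > /greeting.txt
```

Running docker build --progress=plain -t myuser/hello . in that directory builds the image and tags it myuser/hello:latest.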

A Dockerfile is executed in sequence to build the image (documentation link). Each line defines one action of the form:

INSTRUCTION arguments

A new layer is added to the image whenever an instruction modifies the container filesystem. A container image is a stack of layers, each an immutable set of filesystem changes (so deleting a file creates a new layer, and the image size will NOT decrease).
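To illustrate (a hypothetical fragment), deleting a file in a later instruction does not reclaim the space taken by an earlier layer:

```dockerfile
FROM alpine:3.21
# Layer 1: creates a 100 MB file
RUN dd if=/dev/zero of=/big.bin bs=1M count=100
# Layer 2: records the deletion, but layer 1 still stores the 100 MB
RUN rm /big.bin
```

The final image no longer contains /big.bin, yet its size still includes the 100 MB layer. To avoid this, create and delete temporary files within a single RUN instruction.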

The first instruction of a Dockerfile must specify the base image with FROM. After that, we are free to add sequential instructions.

One commonly used technique is multi-stage building. Suppose:

  1. We would like to use Alpine linux as a starting point (base image).
  2. We need compiler toolchains for compilation of certain packages.
  3. Compiled packages are executed in a runtime environment.

For the case of Python:

  1. First get a minimal image with Alpine + Python interpreter + external runtime libraries (if needed).
  2. From the minimal image, add compiler toolchains and dev packages and compile all packages.
  3. Again from the minimal image, copy only compiled packages from the compilation image to get the final container.

The above process will:

  1. Simplify the final container - it does not store compiler toolchains. Only runtime and the compiled packages are present.
  2. Modularize the building process - compilation and endpoint images are largely independent except sharing the same minimal base image.
  3. Save space on the builder by layer sharing - minimal image layers are shared; moreover, compilation layers may be reused for other Python projects.

The FROM ... AS instruction is the key to multi-stage builds:

# Base on Alpine
FROM alpine:3.21 AS base
# ... INSTRUCTIONS FOR BASE

# Builder phase
FROM base AS builder
# ... INSTRUCTIONS FOR BUILDER

# Final phase
FROM base
# ... INSTRUCTIONS FOR ENDPOINT
COPY --from=builder <SRC-PATH_ON_BUILDER> <DEST_ON_ENDPOINT>

Note that we use COPY --from=builder to copy compiled packages to the endpoint image.

Example: rebuild of a sashimi plot container

Below is the Dockerfile I used to build yeyuan98/repAnalysis:ggsashimi. It is a rebuild of guigolab/ggsashimi that achieves ~40% space savings by using the Alpine base image.

The minimal base image contains Python, R, and a few runtime packages. The builder adds compiler toolchains for R along with the necessary development packages. The runtime stage copies the compiled Python and R packages, adds the ggsashimi script, and sets the entrypoint of the image.

The entrypoint is the command executed when the image is run by docker run. Extra arguments passed to docker run are appended to the entrypoint command as-is.
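For example (invocations are illustrative; -b and -c are ggsashimi's BAM and coordinate options):

```shell
# Arguments after the image name are appended to the entrypoint,
# so this runs /bin/ggsashimi.py --help inside the container
docker run --rm yeyuan98/repAnalysis:ggsashimi --help

# Typical use: mount the working directory, then pass real arguments
docker run --rm -v "$PWD":/data -w /data \
    yeyuan98/repAnalysis:ggsashimi -b input.bam -c chr1:1000-2000
```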

# Base on Alpine
FROM alpine:3.21 AS base

# Install Python and R
RUN apk update \
    && apk add \
        python3 \
        R \
        fontconfig \
        freetype \
        zlib

# Builder phase
FROM base AS builder

# Install compiler toolchain and development packages
RUN apk update \
    && apk add \
        fontconfig-dev \
        freetype-dev \
        R-dev \
        zlib-dev \
        g++

# Check version
RUN Rscript --version
RUN python3 --version

# Install R packages
RUN echo 'options(repos = "https://repo.miserver.it.umich.edu/cran/")' > ~/.Rprofile
ARG GGPLOT_VER=3.5.0
RUN Rscript -e 'install.packages("remotes", INSTALL_opts = "--no-docs", clean = TRUE)' && \
    Rscript -e 'remotes::install_version("ggplot2", version="'${GGPLOT_VER}'", Ncpus = 6, INSTALL_opts = "--no-docs")' && \
    Rscript -e 'remotes::install_cran(c("gridExtra", "data.table", "svglite"), Ncpus = 6, INSTALL_opts = "--no-docs")'

# Install Python packages
RUN apk update \
    && apk add \
        py3-pip \
        python3-dev
RUN pip install --break-system-packages pysam

# Report paths
RUN python3 -c 'import sys; print(sys.path)'
RUN Rscript -e 'print(.libPaths())'

# Final phase
FROM base

LABEL maintainer="Ye Yuan <yeyu@umich.edu>" \
      version="ggsashimi_v1.1.5" \
      description="Docker image for ggsashimi based on Alpine. Original at guigolab/ggsashimi."

# Copy build artefacts
COPY --from=builder /usr/lib/python3.12/site-packages /usr/lib/python3.12/site-packages
COPY --from=builder /usr/lib/python3.12/lib-dynload /usr/lib/python3.12/lib-dynload
COPY --from=builder /usr/lib/R/library /usr/lib/R/library

# Copy ggsashimi
ADD --chmod=755 https://raw.githubusercontent.com/guigolab/ggsashimi/refs/heads/master/ggsashimi.py /bin/ggsashimi.py
RUN ggsashimi.py --version

ENTRYPOINT ["/bin/ggsashimi.py"]

A few extra notes:

  1. Always put update and install commands in the same RUN instruction for package managers. Otherwise the update layer will be cached and reused even when subsequent install commands are modified.
  2. It is typically a good habit to pin package versions to get reproducible builds. Note that Alpine apk does not allow direct installation of old package versions. If you need one, you must either compile from source or start from an alpine-tagged version of a core image (e.g., use postgres:13.18-alpine as the base image instead of the latest alpine base).
  3. When fetching remote files, it is usually better to use ADD with a permanent web link. This keeps the build self-contained and reproducible: the file does not need to exist on the builder machine, and anyone can rebuild the image from the Dockerfile alone, whereas a COPY from the builder machine depends on its local state.
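As a sketch of notes 1 and 2 combined (the base image tag and package list are illustrative):

```dockerfile
# Pin the base image tag rather than tracking a moving "latest"
FROM alpine:3.21

# Keep update and install in one RUN instruction, so a cached,
# stale package index cannot be reused when the install list changes
RUN apk update \
    && apk add \
        python3 \
        R
```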