This is the first part of a tutorial series where you will create a tool called BAMStats and publish it onto Dockstore.
Getting Started with Docker¶
- Learn about Docker
- Create a Docker image for a real tool
- Create a tag locally
- Test Docker image locally
Introduction to Docker¶
Docker is a fantastic tool for creating lightweight containers to run your tools. It gives you a fast, VM-like environment for Linux where you can automatically install dependencies, make configurations, and set up your tool exactly the way you want, just as you would on a “normal” Linux host. You can then quickly and easily share these Docker images with the world using registries like Quay.io (indexed by Dockstore), Docker Hub, and GitLab.
Here we will go through a simple representative example. The end-product is a Dockerfile for a BAMStats tool stored in a supported Git repository.
Create a new repository¶
For the rest of this tutorial, you may wish to work in your own repository with your own tool or “fork” the repository above into your own GitHub account.
With a repository established in GitHub, the next step is to create the Docker image with BAMStats correctly installed.
Creating a Dockerfile¶
We will create a Docker image with BAMStats and all of its dependencies
installed. To do this we must create a
Dockerfile. Here’s my sample Dockerfile:
    #############################################################
    # Dockerfile to build a sample tool container for BAMStats
    #############################################################

    # Set the base image to Ubuntu
    FROM ubuntu:14.04

    # File Author / Maintainer
    MAINTAINER Brian OConnor <firstname.lastname@example.org>

    # Setup packages
    USER root
    RUN apt-get -m update && apt-get install -y wget unzip openjdk-7-jre zip

    # get the tool and install it in /usr/local/bin
    RUN wget -q http://downloads.sourceforge.net/project/bamstats/BAMStats-1.25.zip
    RUN unzip BAMStats-1.25.zip && \
        rm BAMStats-1.25.zip && \
        mv BAMStats-1.25 /opt/
    COPY bin/bamstats /usr/local/bin/
    RUN chmod a+x /usr/local/bin/bamstats

    # switch back to the ubuntu user so this tool (and the files written) are not owned by root
    RUN groupadd -r -g 1000 ubuntu && useradd -r -g ubuntu -u 1000 ubuntu
    USER ubuntu

    # by default /bin/bash is executed
    CMD ["/bin/bash"]
This Dockerfile has a lot going on in it. There are good tutorials online about the details of a Dockerfile and its syntax. An excellent resource is the Docker website itself, including the Best practices for writing Dockerfiles webpage. I’ll highlight some sections below:
FROM ubuntu:14.04
This uses the Ubuntu 14.04 base distribution. How do I know to use
ubuntu:14.04? You can either search Ubuntu’s home page for their
“official” Docker images, or simply go to Docker Hub or Quay and
search for whatever base image you like. You can extend anything you
find there, so if you come across an image that contains most of what
you want, you can use it as the base here. Just be aware of its source:
I tend to stick with “official”, basic images for security reasons.
MAINTAINER Brian OConnor <email@example.com>
You should include your name and contact information.
    USER root
    RUN apt-get -m update && apt-get install -y wget unzip openjdk-7-jre zip
    RUN wget -q http://downloads.sourceforge.net/project/bamstats/BAMStats-1.25.zip
    RUN unzip BAMStats-1.25.zip && \
        rm BAMStats-1.25.zip && \
        mv BAMStats-1.25 /opt/
This switches to the root user to perform software installs. It
installs the required packages, then downloads BAMStats, unzips it, and
moves it to the correct location, /opt.
This is why Docker is so powerful. On HPC systems the above process might take days or weeks of working with a sys admin to install dependencies on all compute nodes. Here I can control and install whatever I like inside my Docker image - correctly configuring the environment for my tool and avoiding the time to set up these dependencies in the places I want to run. This greatly simplifies the install process for other users that you share your tool with as well.
    COPY bin/bamstats /usr/local/bin/
    RUN chmod a+x /usr/local/bin/bamstats
This copies the local helper script bamstats from the git checkout
directory to /usr/local/bin. This is an important example; it shows
how to use COPY to copy files from the git directory structure to
inside the Docker image. After copying to /usr/local/bin, the script
is made runnable by all users.
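The directory layout that the COPY instruction relies on can be sketched as follows. This is only an illustration: the scratch directory and the empty placeholder files are assumptions standing in for a real git checkout, where the Dockerfile sits at the top level and the helper script lives in bin/.

```shell
# Hypothetical scratch directory standing in for a git checkout.
checkout="$(mktemp -d)"
mkdir -p "$checkout/bin"

# Empty placeholders; in the tutorial these are the real Dockerfile and helper script.
: > "$checkout/Dockerfile"
: > "$checkout/bin/bamstats"

# List the files relative to the top of the checkout (the build context).
layout="$(cd "$checkout" && find . -type f | LC_ALL=C sort)"
echo "$layout"
```

COPY source paths are resolved relative to the build context, so bin/bamstats in the Dockerfile refers to the bin/bamstats file next to it in the checkout.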
    RUN groupadd -r -g 1000 ubuntu && useradd -r -g ubuntu -u 1000 ubuntu
    USER ubuntu

    # by default /bin/bash is executed
    CMD ["/bin/bash"]
The user ubuntu is created and switched to in order to make file
ownership easier, and the default command for this Docker image is set to
/bin/bash, which is a typical default.
An important thing to note is that this Dockerfile only scratches
the surface. Take a look at Best practices for writing Dockerfiles
for a really terrific in-depth look at writing Dockerfiles.
Building Docker Images¶
Now that you’ve created the
Dockerfile, the next step is to build
the image. The docker command line is used for this:
$> docker build -t quay.io/collaboratory/dockstore-tool-bamstats:1.25-3 .
The . is the path to the build context, which is where Docker looks for
the Dockerfile by default; here it is the current directory. The
-t parameter is the “tag” that this Docker image will be called locally
when it’s cached on your host. A few things to point out: the
quay.io part of the tag typically denotes that this was built on
Quay.io (which we will see in a later section). I’m manually specifying
this tag so it will match the Quay.io-built version. This allows me to
build and test locally and then, eventually, switch over to the
Quay.io-built version. The next part of the tag,
collaboratory/dockstore-tool-bamstats, denotes the name of the tool,
which is derived from the organization and repository name on Quay.io.
Finally, 1.25-3 denotes a version string; typically you want to sync
that with releases on GitHub.
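To make the three parts of the tag concrete, here is a minimal sketch that splits the image name using plain shell parameter expansion. The variable names are made up for illustration, and the sketch assumes the simple registry/organization/repository:version shape used in this tutorial (for example, no port number in the registry part).

```shell
image="quay.io/collaboratory/dockstore-tool-bamstats:1.25-3"

# Everything before the first "/" is the registry host.
registry="${image%%/*}"

# Everything after the last ":" is the version string.
version="${image##*:}"

# Drop the registry, then drop the version, to get organization/repository.
path_and_tag="${image#*/}"
repository="${path_and_tag%:*}"

echo "registry:   $registry"
echo "repository: $repository"
echo "version:    $version"
```

Keeping the version part in step with your GitHub release names makes it easier to tell which source a given image was built from.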
The tool should build normally and should exit without errors. You should see something like:
Successfully built 01a7ccf55063
Check that the tool is now in your local Docker image cache:
$> docker images | grep bamstats
quay.io/collaboratory/dockstore-tool-bamstats 1.25-3 01a7ccf55063 2 minutes ago 538.3 MB
Great! This looks fine!
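If you would rather check for the image from a script than read the output by eye, one possible approach is sketched below. The function name image_in_cache is made up for this example; it relies on docker image inspect, which exits with a non-zero status when the image is not in the local cache.

```shell
# Hypothetical helper: succeeds only if docker is installed and the
# named image is present in the local image cache.
image_in_cache() {
  command -v docker >/dev/null 2>&1 &&
    docker image inspect "$1" >/dev/null 2>&1
}

if image_in_cache "quay.io/collaboratory/dockstore-tool-bamstats:1.25-3"; then
  echo "image found in local cache"
else
  echo "image not found (or docker is not available)"
fi
```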
Testing the Docker Image Locally¶
OK, so you’ve built the image and created a tag. Now what?
The next step will be to test the tool directly via Docker to ensure
that the Dockerfile is valid and correctly installed the tool. If
you were developing a new tool there might be multiple rounds of
docker build, followed by testing with docker run, before you get
your Dockerfile right. Here I’m executing the Docker image, launching it
as a container (make sure you launch on a host with at least 8GB of RAM
and dozens of GB of disk space!):
$> docker run -it -v `pwd`:/home/ubuntu --user `echo $UID`:1000 quay.io/collaboratory/dockstore-tool-bamstats:1.25-3 /bin/bash
This command expects your UID to be 1000. If it is not, you
need to add
You’ll be dropped into a bash shell which works just like the Linux
environments you normally work in. I’ll come back to what the -v
option is doing in a bit. The goal now is to exercise the tool and make
sure it works as you expect. BAMStats is a very simple tool that
generates some reports and statistics for a BAM file. Let’s run it on
some test data from the 1000 Genomes project:
    # this is inside the running Docker container
    $> cd /home/ubuntu
    $> wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
    # if the above doesn't work here's an alternative location
    $> wget https://s3.amazonaws.com/oconnor-test-bucket/sample-data/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
    $> /usr/local/bin/bamstats 4 NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
What’s really going on here? The bamstats command above is a simple
script I wrote to make it easier to call BAMStats. This is the file that
the COPY command in the Dockerfile moved into the Docker image.
Here’s the script’s contents:
    #!/bin/bash
    set -euf -o pipefail
    java -Xmx$1g -jar /opt/BAMStats-1.25/BAMStats-1.25.jar -i $2 -o bamstats_report.html -v html
    zip -r bamstats_report.zip bamstats_report.html bamstats_report.html.data
    rm -rf bamstats_report.html bamstats_report.html.data
You can see it just executes the BAMStats jar - passing in the GB of memory and the BAM file while collecting the output HTML report as a zip file followed by cleanup.
Notice how the output is written to whatever the current directory is. This is the correct directory to put your output in since the CWL tool described later assumes that outputs are all located in the current working directory that it executes your command in.
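How the wrapper script maps its two positional arguments onto the java command line can be sketched as a dry run. The function bamstats_java_cmd and the file name sample.bam are made up for illustration; the function only prints the command and does not run java.

```shell
# Hypothetical dry-run helper: prints the java command the wrapper would run.
# $1: heap size in GB, $2: path to the BAM file
bamstats_java_cmd() {
  printf 'java -Xmx%sg -jar /opt/BAMStats-1.25/BAMStats-1.25.jar -i %s -o bamstats_report.html -v html\n' "$1" "$2"
}

cmd="$(bamstats_java_cmd 4 sample.bam)"
echo "$cmd"
```

Calling the wrapper as bamstats 4 sample.bam therefore gives the JVM a 4 GB heap (-Xmx4g) and passes sample.bam as the input file.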
The -v parameter used earlier is mounting the current working
directory into /home/ubuntu, which was the directory we worked in
when running /usr/local/bin/bamstats above. The net effect is that when
you exit the Docker container (with the command exit or by pressing
ctrl + d), you’re left with a bamstats_report.zip file in the
current directory. This is a key point: it shows you how files are
retrieved from inside a Docker container.
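Back on the host, a quick way to confirm that the report made it out of the container is sketched below. The function name report_present is an assumption for this example; it simply checks the directory that was mounted with -v for the zip file.

```shell
# Hypothetical helper: succeeds if the report exists in the given host directory.
# $1: the host directory that was mounted into the container with -v
report_present() {
  [ -f "$1/bamstats_report.zip" ]
}

if report_present "$(pwd)"; then
  echo "bamstats_report.zip is available on the host"
else
  echo "no bamstats_report.zip in $(pwd)"
fi
```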
You can now unzip and examine the
bamstats_report.zip file on your
computer to see what type of reports are created by this tool.
Rather than interactively working with the image, you could also run your Docker image directly from the command line:
$> wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
$> docker run -w="/home/ubuntu" -it -v `pwd`:/home/ubuntu --user `echo $UID`:1000 quay.io/collaboratory/dockstore-tool-bamstats:1.25-3 bamstats 4 NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
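As a rough sketch of which parts of that command vary from run to run, the dry-run helper below assembles the same docker run line from a few variables and prints it without executing anything. The function docker_run_cmd and the variable names are made up for illustration, and -it is left out on the assumption that no interactive terminal is needed for a one-shot run.

```shell
image="quay.io/collaboratory/dockstore-tool-bamstats:1.25-3"
bam="NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam"
mem_gb=4

# Hypothetical dry-run helper: prints the docker run command instead of running it.
# $1: host directory to mount, $2: user ID to run as inside the container
docker_run_cmd() {
  printf 'docker run -w /home/ubuntu -v %s:/home/ubuntu --user %s:1000 %s bamstats %s %s\n' \
    "$1" "$2" "$image" "$mem_gb" "$bam"
}

cmd="$(docker_run_cmd "$(pwd)" "$(id -u)")"
echo "$cmd"
```

The memory setting and the input file are the pieces a descriptor would normally supply as parameters.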
In the next section, we will demonstrate how the command-line and input file can be parameterized and constructed via CWL.
You could stop here! However, you would lose a standardized way to describe how to run your tool. That’s what descriptor languages and Dockstore provide. We think this is valuable, and a growing number of tools and workflows are designed to work with various descriptor languages, so there are benefits to not stopping here.
There are three descriptor languages available on Dockstore. Follow the links to get an introduction.