Note
This tutorial assumes basic familiarity with Docker as its example involves a Docker image. It also assumes you have access to a system that can run Docker. You may wish to start with Getting Started With Docker if you are not familiar with it.
Getting Started with CWL
Tutorial Goals
Learn about the Common Workflow Language (CWL)
Create a basic CWL Tool which uses a Docker image
Run the Tool locally
Describe a sample parameterization of the Tool
Push the Tool onto GitHub
Describe Your Tool in CWL
The first step is to create is to create a CWL tool definition file. This YAML (Or JSON) file describes the inputs, outputs, and Docker image dependencies for your tool.
It is recommended that you have the following minimum fields:
doc: <description>
id: <id>
label: <label>
cwlVersion: v1.1
dct:creator:
foaf:name: <name>
We provide an example from the dockstore-tool-bamstats repository:
#!/usr/bin/env cwl-runner
class: CommandLineTool
id: "BAMStats"
label: "BAMStats tool"
cwlVersion: v1.1
doc: |
![build_status](https://quay.io/repository/collaboratory/dockstore-tool-bamstats/status)
A Docker container for the BAMStats command. See the [BAMStats](https://bamstats.sourceforge.net/) website for more information.
dct:creator:
"@id": "https://orcid.org/0000-0002-7681-6415"
foaf:name: Brian O'Connor
foaf:mbox: "mailto:briandoconnor@gmail.com"
requirements:
- class: DockerRequirement
dockerPull: "quay.io/collaboratory/dockstore-tool-bamstats:1.25-6"
hints:
- class: ResourceRequirement
coresMin: 1
ramMin: 4092 #"the process requires at least 4G of RAM
outdirMin: 512000
inputs:
mem_gb:
type: int
default: 4
doc: "The memory, in GB, for the reporting tool"
inputBinding:
position: 1
bam_input:
type: File
doc: "The BAM file used as input, it must be sorted."
format: "https://edamontology.org/format_2572"
inputBinding:
position: 2
outputs:
bamstats_report:
type: File
format: "https://edamontology.org/format_3615"
outputBinding:
glob: bamstats_report.zip
doc: "A zip file that contains the HTML report and various graphics."
baseCommand: ["bash", "/usr/local/bin/bamstats"]
$namespaces:
dct: https://purl.org/dc/terms/
foaf: https://xmlns.com/foaf/0.1/
Note
The sbg:draft-2 implementation of CWL is optimized for the Seven Bridges cloud-based platform and includes custom extensions. Dockstore does not support sbg:draft-2 CWL tools and workflows, and if you register one, Dockstore will mark the entry as invalid, and you will not be able to publish, run, or launch it on any cloud compute platform. However, we do support CWL v1.0, which defines similar functionality and supersedes the sbg:draft-2 extensions. Seven Bridges also provides instructions for how to transition tools and workflows developed in the Seven Bridges Software Development Kit to GitHub for publishing in Dockstore.
You can see this tool takes two inputs, a parameter to control memory usage and a BAM file (binary sequence alignment file). It produces one output, a zip file, that contains various HTML reports that BAMStats creates.
The CWL is actually recognized and parsed by Dockstore (when we register
this later). By default it recognizes Dockstore.cwl
but you can
customize this if you need to. One of the most important items below is
the CWL
version.
You should label your CWL with the version you are using so that CWL
tools that cannot run this version will error out appropriately. Our
tools have been tested with v1.0 and v1.1.
class: CommandLineTool
id: "BAMStats"
label: "BAMStats tool"
cwlVersion: v1.1
doc: |
![build_status](https://quay.io/repository/collaboratory/dockstore-tool-bamstats/status)
A Docker container for the BAMStats command. See the [BAMStats](https://bamstats.sourceforge.net/) website for more information.
In the code above you can see how to have an extended doc (description) which is quite useful.
dct:creator:
"@id": "https://orcid.org/0000-0002-7681-6415"
foaf:name: Brian O'Connor
foaf:mbox: "mailto:briandoconnor@gmail.com"
This section includes the tool author referenced by Dockstore. It is open to your interpretation whether that is the person that registers the tool, the person who made the Docker image, or the developer of the original tool. I’m biased towards the person that registers the tool since they are likely to be the primary contact when asking questions about how the tool was setup.
Dockstore uses the authorship information and description from the descriptor file to populate metadata for tools.
Note
If no description is defined in the descriptor file, the README from the corresponding Git repository is used.
You can register for an ORCID (a digital identifer for researchers) or use an email address for your id.
requirements:
- class: DockerRequirement
dockerPull: "quay.io/collaboratory/dockstore-tool-bamstats:1.25-6"
This section links the Docker image used for this CWL.
hints:
- class: ResourceRequirement
coresMin: 1
ramMin: 4092 # the process requires at least 4G of RAM
outdirMin: 512000
This may or may not be honoured by the executor calling this CWL, but at least it gives you a place to declare computational requirements.
inputs:
mem_gb:
type: int
default: 4
doc: "The memory, in GB, for the reporting tool"
inputBinding:
position: 1
bam_input:
type: File
doc: "The BAM file used as input, it must be sorted."
format: "https://edamontology.org/format_2572"
inputBinding:
position: 2
This is one of the items from the inputs section. Notice a few things:
The
bam_input:
matches withbam_input
in the sample parameterization JSON (shown in the next section assample_configs.local.json
).You can control the position of the variable.
It can have a type (int or File here), and, for tools that require a prefix (
--prefix
) before a parameter you can use theprefix: key
in the inputBindings section.I’m using the
format
field to specify a file format via the EDAM ontology.
outputs:
bamstats_report:
type: File
format: "https://edamontology.org/format_3615"
outputBinding:
glob: bamstats_report.zip
doc: "A zip file that contains the HTML report and various graphics."
Finally, the outputs section defines the output files. In this case, it
says in the current working directory there will be a file called
bamstats_report.zip
. When running this tool with CWL tools the file
will be copied out of the container to a location you specify in your
parameter JSON file. We’ll walk though an example in the next section.
Finally, the baseCommand
is the actual command that will be
executed. In this case, it’s the wrapper script I wrote for bamstats.
baseCommand: ["bash", "/usr/local/bin/bamstats"]
The CWL standard is continuing to evolve and hopefully we will see new features, like support for EDAM ontology terms, in future releases. In the mean time, the Gitter chat is an active community to help drive the development of CWL in positive directions and we recommend tool authors make their voices heard.
Testing Locally
So at this point, you’ve described how to call a Docker-based tool using CWL. Let’s test running the BAMStats using the Dockstore command line and descriptor, rather than just directly calling it via Docker. This will test that the CWL correctly describes how to run your tool.
The first thing I’ll do is setup the Dockstore CLI locally. This will have me install all of the dependencies needed to run the Dockstore CLI on my local machine. Make sure to install cwltool as well.
Next thing I’ll do is create a completely local dataset and JSON parameterization file:
$> wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/alignment/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
$> mv NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam /tmp/
This downloads to my current directory and then moves to /tmp
. I
could choose another location, it really doesn’t matter, but we need the
full path when dealing with the parameter JSON file. I’m using a sample
I checked in already: sample_configs.local.json
.
{
"bam_input": {
"class": "File",
"format": "https://edamontology.org/format_2572",
"path": "/tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam"
},
"bamstats_report": {
"class": "File",
"path": "/tmp/bamstats_report.zip"
}
}
Tip
The Dockstore CLI can handle inputs with HTTPS, FTP, and S3 URLs but that’s beyond the scope of this tutorial.
You can see in the above I give the full path to the input under
bam_input
and full path to the output bamstats_report
.
At this point, let’s run the tool with our local inputs and outputs via the JSON config file:
$> dockstore tool launch --local-entry Dockstore.cwl --json sample_configs.local.json
Creating directories for run of Dockstore launcher at: ./datastore//launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3
Provisioning your input files to your local machine
Downloading: #bam_input from /tmp/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam into directory: /media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/./datastore/launcher-9d9
c9bf1-7094-4a21-b2a3-1b3ad330a0a3/inputs/78a05989-6978-45b0-b6e9-5f81e7aa34ad
Calling out to cwltool to run your tool
Executing: cwltool --enable-dev --non-strict --outdir /media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/./datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/outputs/ --tmpdir-pre
fix /media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/./datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/tmp/ --tmp-outdir-prefix /media/dyuen/Data/large_volume/dockstore_tools
/dockstore-tool-bamstats/./datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/working/ /media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/Dockstore.cwl /media/dyuen/Data/large_vol
ume/dockstore_tools/dockstore-tool-bamstats/./datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/workflow_params.json
/usr/local/bin/cwltool 1.0.20170217172322
Resolved '/media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/Dockstore.cwl' to 'file:///media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/Dockstore.cwl'
[job Dockstore.cwl] /media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/working/BHsHWq$ docker \
run \
-i \
--volume=/media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/./datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/inputs/78a05989-6978-45b0-b6e9-5f81e7aa34ad/NA12878.chrom20.IL
LUMINA.bwa.CEU.low_coverage.20121211.bam:/var/lib/cwl/stgc0a728c7-a8c0-44d3-be58-031fd656eb96/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam:ro \
--volume=/media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/working/BHsHWq:/var/spool/cwl:rw \
--volume=/media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/tmp/Z8umDA:/tmp:rw \
--workdir=/var/spool/cwl \
--read-only=true \
--user=1001 \
--rm \
--env=TMPDIR=/tmp \
--env=HOME=/var/spool/cwl \
quay.io/collaboratory/dockstore-tool-bamstats:1.25-6_1.0 \
bash \
/usr/local/bin/bamstats \
4 \
/var/lib/cwl/stgc0a728c7-a8c0-44d3-be58-031fd656eb96/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
...
[job Dockstore.cwl] completed success
Final process status is success
Saving copy of cwltool stdout to: /media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/./datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/outputs/cwltool.stdout.txt
Saving copy of cwltool stderr to: /media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/./datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/outputs/cwltool.stderr.txt
Provisioning your output files to their final destinations
Uploading: #bamstats_report from /media/dyuen/Data/large_volume/dockstore_tools/dockstore-tool-bamstats/./datastore/launcher-9d9c9bf1-7094-4a21-b2a3-1b3ad330a0a3/outputs/bamstats_report.zip to : /tmp/bams
tats_report.zip
[##################################################] 100%
So that’s a lot of information, but you can see the process was a success. We get output from the command we ran and also see the file being moved to the correct output location:
$> ls -lth /tmp/bamstats_report.zip
-rw-rw-r-- 1 ubuntu ubuntu 32K Jun 16 02:14 /tmp/bamstats_report.zip
The output looks fine, just what we’d expect.
So what’s going on here? What’s the Dockstore CLI doing? It can best be summed up with this image:
The command line first provisions files. In our case, the files were
local so no provisioning was needed. But as the Tip above mentioned,
these can be various URLs. After provisioning the docker image is pulled
and ran via the cwltool
command line. This uses the
Dockerfile.cwl
and parameterization JSON file
(sample_configs.local.json
) to construct the underlying
docker run
command. Finally, the Dockstore CLI provisions files
back. In this case it’s just a file copy to /tmp/bamstats_report.zip
but it could copy the result to a destination in S3 for example.
Tip
You can use --debug
to get much more information during
this run, including the actual call to cwltool (which can be super
helpful in debugging).
Tip
The dockstore
CLI automatically creates a datastore
directory in the current working directory where you execute the command
and uses it for inputs/outputs. It can get quite large depending on the
tool/inputs/outputs being used. Plan accordingly, e.g. execute the
dockstore CLI in a directory located on a partition with sufficient
storage.
Adding a Test Parameter File
We are able to register the above input parameterization of the tool into Dockstore so that users can see and test an example with our tool. Users can manually add test parameter files for a given tool tag or workflow version through both the command line and the versions tab in the UI.
Tip
Make sure that any required input files are given as publically accessible URLs so that a user can run the example successfully.
Next Steps
Follow the next tutorial to create an account on Dockstore and link third party services.