Azure Batch
Azure Batch provides a simple way to run containers and commands on Azure without closely managing the underlying VM infrastructure (although knowledge of that infrastructure is always useful). While Azure Batch does not understand CWL the way a full workflow engine does, it offers a straightforward way to run a large number of Dockstore tools at scale.
Azure Batch is complemented by a client-side tool called Batch Shipyard, which offers a number of features, including a simple command-line interface for submitting batch jobs.
Keep in mind that if you know CWL and/or do not need the Dockstore command line to do file provisioning, you can decompose the tool's underlying command-line invocation and use that directly as the command for your jobs, gaining a bit of performance. This tutorial instead focuses on using cwltool and the Dockstore command line to provide an experience more akin to running Dockstore or cwltool on the command line out of the box.
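As a rough illustration of that shortcut, a task could call cwltool directly on the tool's CWL descriptor and a parameter file. This is a hypothetical sketch: the file names below are assumptions, and you would have to stage inputs and outputs yourself:

$ cwltool Dockstore.cwl md5sum.json

The rest of this tutorial skips this optimization and runs the full Dockstore launcher inside each task so that file provisioning is handled for you.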
Run through Batch Shipyard's Linux installation guide and then its quickstart guide with one of the sample recipes, such as Torch-CPU.
With the shipyard CLI set up, get the md5sum sample recipes from GitHub:
$ git clone https://github.com/dockstore/batch_wrapper.git
$ cd batch_wrapper/azure/
Fill out your config.json, credentials.json, and jobs.json in config.dockstore.md5sum. If you have trouble finding your access keys, take a look at this article. In jobs.json, note that we use AWS keys to provision and save the final output files. You will also need to modify the parameter JSON file md5sum.s3.json to reflect the location of your S3 bucket.
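One plausible arrangement for those AWS keys is as environment variables in the job specification. A minimal sketch of jobs.json, assuming the Batch Shipyard schema of this era (the job id matches the log below; angle-bracketed values and the task definition are placeholders for what the recipe actually ships):

{
    "job_specifications": [
        {
            "id": "dockstorejob",
            "environment_variables": {
                "AWS_ACCESS_KEY_ID": "<your-access-key>",
                "AWS_SECRET_ACCESS_KEY": "<your-secret-key>"
            },
            "tasks": [
                {
                    "image": "<dockstore-launcher-image>",
                    "command": "<dockstore launch command>"
                }
            ]
        }
    ]
}

The parameter file md5sum.s3.json maps the tool's input_file and output_file CWL parameters to URLs. Judging from the provisioning log below, it looks roughly like this (replace the bucket with your own):

{
    "input_file": {
        "class": "File",
        "path": "https://raw.githubusercontent.com/briandoconnor/dockstore-tool-md5sum/master/md5sum.input"
    },
    "output_file": {
        "class": "File",
        "path": "s3://<your-bucket>/md5sum.txt"
    }
}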
Create a compute pool. Note that this pool is not set up to resize automatically, and you may need to pick a larger VM size for a larger dataset.
$ ./shipyard pool add --configdir config.dockstore.md5sum
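The pool's size and VM type are controlled by pool.json in the same config directory. A rough sketch of the relevant fields, assuming the pool_specification schema Shipyard used at the time (the id is the pool name that appears in the log below; the VM size and count are illustrative, and field names vary between Shipyard releases):

{
    "pool_specification": {
        "id": "dockstore",
        "vm_size": "STANDARD_D2_V2",
        "vm_count": 1
    }
}

Since the pool does not resize automatically, raising vm_count (or choosing a larger vm_size) is how you would scale up for larger datasets.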
Submit the job and watch the output (this should take roughly a minute if the pool already exists):
$ ./shipyard jobs add --configdir config.dockstore.md5sum --tail stdout.txt
2017-05-24 14:19:21.543 INFO - Adding job dockstorejob to pool dockstore
2017-05-24 14:19:21.989 INFO - uploading file /tmp/tmp7lgz7_j7 as 'shipyardtaskrf-dockstorejob/dockertask-00012.shipyard.envlist'
2017-05-24 14:19:22.027 DEBUG - submitting 1 tasks (0 -> 0) to job dockstorejob
2017-05-24 14:19:22.090 INFO - submitted all 1 tasks to job dockstorejob
2017-05-24 14:19:22.090 DEBUG - attempting to stream file stdout.txt from job=dockstorejob task=dockertask-00012
Creating directories for run of Dockstore launcher at: ./datastore//launcher-e849c691-cc47-4bfa-a443-b8830794ae0a
Provisioning your input files to your local machine
Downloading: #input_file from https://raw.githubusercontent.com/briandoconnor/dockstore-tool-md5sum/master/md5sum.input into directory: /mnt/batch/tasks/workitems/dockstorejob/job-1/dockertask-00012/wd/./datastore/launcher-e849c691-cc47-4bfa-a443-b8830794ae0a/inputs/ce735ade-8c46-4736-a7d8-2fc0cb7d2e87
[##################################################] 100%
Calling out to cwltool to run your tool
...
Final process status is success
Saving copy of cwltool stdout to: /mnt/batch/tasks/workitems/dockstorejob/job-1/dockertask-00012/wd/./datastore/launcher-e849c691-cc47-4bfa-a443-b8830794ae0a/outputs/cwltool.stdout.txt
Saving copy of cwltool stderr to: /mnt/batch/tasks/workitems/dockstorejob/job-1/dockertask-00012/wd/./datastore/launcher-e849c691-cc47-4bfa-a443-b8830794ae0a/outputs/cwltool.stderr.txt
Provisioning your output files to their final destinations
Uploading: #output_file from /mnt/batch/tasks/workitems/dockstorejob/job-1/dockertask-00012/wd/./datastore/launcher-e849c691-cc47-4bfa-a443-b8830794ae0a/outputs/md5sum.txt to : s3://dockstore.temp/md5sum.txt
Calling on plugin io.dockstore.provision.S3Plugin$S3Provision to provision to s3://dockstore.temp/md5sum.txt
[##################################################] 100%
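When you are done, you can confirm the result landed in your bucket (assuming you have the AWS CLI configured with the same keys; the bucket name here comes from the log above) and tear down the Batch resources so idle VMs stop accruing charges. These Shipyard subcommands exist, though exact flags can vary by version:

$ aws s3 ls s3://dockstore.temp/
$ ./shipyard jobs del --configdir config.dockstore.md5sum
$ ./shipyard pool del --configdir config.dockstore.md5sum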
You can repeat the process with config.dockstore.bwa, which runs a more realistic bioinformatics workflow from the PCAWG project and takes roughly seven hours.
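The invocation pattern is the same as for the md5sum example; point the same subcommands at the bwa config directory and expect a much longer runtime:

$ ./shipyard pool add --configdir config.dockstore.bwa
$ ./shipyard jobs add --configdir config.dockstore.bwa --tail stdout.txt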