WES Tutorial for running WDL workflows through the Cromwell Engine on AWS AGC

Note

Although the process for running CWL workflows through the Toil engine on AWS AGC is very similar to the process described in this tutorial, there are a few differences. For information on how to launch a CWL workflow on AWS AGC, please consult the corresponding CWL tutorial.

Amazon Web Services’ (AWS) Amazon Genomics CLI (AGC) is a command line tool for launching cloud infrastructure within AWS accounts that can be used to execute genomics workflows. The infrastructure deployed by AGC implements the GA4GH Workflow Execution Service (WES) standard, and thus can be used directly by the Dockstore CLI.

Check out the official AGC Info page.

For developers, check out the official AGC GitHub page.

Download and Install AWS AGC

AGC provides a quick-start guide for initial setup and getting familiar with the tool. The following workflow execution tutorials will cover all steps for using AGC once it has been installed.

Tutorial Topics

The following WDL workflow tutorials will cover:

  1. Deploying an AGC project and context

  2. Configuring the Dockstore CLI to communicate with AGC infrastructure

  3. Launching a workflow

Note

  1. AWS AGC requires a special JSON file to be passed as the test parameter file for workflows. This JSON is formatted as follows:

    {
        "workflowInputs": <pathToInputJSON>
    }
    
  2. AWS AGC references descriptor files by URLs. The Dockstore CLI references the primary descriptor in an execution request as a GA4GH Tool Registry Service (TRS) URL. The AGC infrastructure will only be able to access the file referenced by the TRS URL if the workflow is published (i.e. public) on Dockstore.

  3. On AWS AGC, the test parameter file for a WDL workflow running in Cromwell must reference other files using S3 URLs. Other types of URLs, including https:, are not supported.
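
The first and third notes lend themselves to a quick check before launching. The following is a minimal sketch and not part of AGC or the Dockstore CLI: the function names write_agc_wrapper and non_s3_urls are hypothetical, and the URL pattern is a simple approximation rather than a full URL parser.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of AGC or the Dockstore CLI.

# Write the AGC wrapper JSON that points at the real input JSON attachment.
# $1 = name of the input JSON attachment, $2 = path of the wrapper file to create.
write_agc_wrapper() {
    printf '{\n    "workflowInputs": "%s"\n}\n' "$1" > "$2"
}

# Print every URL in file $1 whose scheme is not s3 (for example https://...).
# Prints nothing when all URLs in the file are S3 URLs.
non_s3_urls() {
    grep -oE '[A-Za-z][A-Za-z0-9+.-]*://[^"[:space:]]+' "$1" | grep -vE '^s3://' || true
}
```

Running non_s3_urls on an input JSON before launch should print nothing; any URL it does print would need to be replaced with an S3 URL.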

Configuring AGC and the Dockstore CLI

  1. Create a file named agc-project.yaml that contains:

    name: dockstoreAgcTutorialProject
    data:
      - location: s3://human-pangenomics
        readOnly: true
    schemaVersion: 1
    contexts:
      ctx1:
        engines:
          - type: wdl
            engine: cromwell
    

This will create an AGC project named dockstoreAgcTutorialProject, with a single context named ctx1. The section containing

data:
  - location: s3://human-pangenomics
    readOnly: true

configures the AGC contexts in this project to be able to read the AWS S3 bucket human-pangenomics. This is an open-access data bucket that will be used for one of the following example workflows.

Note

For AGC infrastructure to interact with an S3 resource, the desired S3 bucket must be specified in the project’s agc-project.yaml file and your AWS account must already have access to the S3 resource.
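
As a rough illustration of this requirement, the sketch below compares the buckets declared in agc-project.yaml with the buckets referenced by an input JSON. The function names are hypothetical and the parsing is line-based, so it assumes the YAML is laid out as in the example above; it also cannot tell whether your AWS account actually has access to a bucket.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of AGC.

# List the buckets named on "location: s3://..." lines of an agc-project.yaml file ($1).
declared_buckets() {
    sed -nE 's|^[[:space:]]*-?[[:space:]]*location:[[:space:]]*s3://([^/[:space:]]+).*|\1|p' "$1" | sort -u
}

# List the buckets referenced by s3:// URLs in an input JSON file ($1).
referenced_buckets() {
    grep -oE 's3://[^/"[:space:]]+' "$1" | sed 's|^s3://||' | sort -u
}

# Print buckets that the input JSON ($2) references but the project file ($1) does not declare.
undeclared_buckets() {
    declared=$(declared_buckets "$1")
    referenced_buckets "$2" | while IFS= read -r bucket; do
        printf '%s\n' "$declared" | grep -qxF -- "$bucket" || printf '%s\n' "$bucket"
    done
}
```

An empty result from undeclared_buckets agc-project.yaml input.json suggests that every input bucket is listed under data: in the project file.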

  2. Activate AGC on your account. If this is your first time running AGC on an account, this may take a few minutes.

    agc account activate
    
  3. Deploy an AGC context by running the command below in the same directory as agc-project.yaml. This will take approximately 10 minutes.

    agc context deploy ctx1
    
  4. Retrieve the WES endpoint created by the context. This will return a few values; the WES endpoint is the value under WESENDPOINT:

    agc context describe ctx1
    
    WESENDPOINT     https://example123.execute-api.us-west-2.amazonaws.com/prod/
    

5. Copy the WES endpoint into the Dockstore CLI config file located at ~/.dockstore/config and append ga4gh/wes/v1 to the end of the URL. Your Dockstore CLI config file should have a named AWS profile included to allow the CLI to authorize requests to AWS. The resulting config file will look similar to:

[WES]
url: https://example123.execute-api.us-west-2.amazonaws.com/prod/ga4gh/wes/v1
authorization: aws-wes-profile
type: aws

6. To verify that the Dockstore CLI is communicating with the AGC infrastructure, list the WES server info. A JSON response will be printed to your terminal with the server’s configuration.

dockstore workflow wes service-info

Note

At this point, the AGC infrastructure is deployed and the Dockstore CLI has been configured.

The AGC context and Dockstore configuration file do not need to be modified for the remainder of these examples, and will continue to function until the resources are modified and/or destroyed.
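
The step of copying the WES endpoint into the config file can be scripted. The sketch below is an assumption-laden illustration: wes_url_from_describe is a hypothetical helper, and it assumes the describe output contains a whitespace-separated WESENDPOINT line like the one shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of AGC or the Dockstore CLI.

# Read `agc context describe` output on stdin and print the WES URL that
# belongs in the [WES] section of ~/.dockstore/config.
# Trailing slashes are stripped before ga4gh/wes/v1 is appended.
wes_url_from_describe() {
    awk '$1 == "WESENDPOINT" { url = $2; sub(/\/+$/, "", url); print url "/ga4gh/wes/v1"; exit }'
}

# Usage (requires a deployed context):
#   agc context describe ctx1 | wes_url_from_describe
```

The printed value is what would go on the url: line of the config file; the authorization and type lines are unchanged.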

Hello World Workflow

The Dockstore entry associated with this workflow can be found here: agc-hello-world.

This WDL workflow prints out the string “Hello from AGC” as its output.

Dockstore.wdl

version 1.0
workflow w {
    call hello {}
}
task hello {
    command { echo "Hello from AGC" }
    runtime {
        docker: "ubuntu:latest"
    }
    output { String out = read_string( stdout() ) }
}

  1. Since this workflow is publicly posted on Dockstore.org, we can quickly launch it by passing the Dockstore CLI the entry name and its version:

    dockstore workflow wes launch --entry github.com/dockstore-testing/wes-testing/agc-hello-world:v1.12
    
  2. The above command will return a unique run ID, similar to:

    8e8e9f4b-fb1a-41df-bc37-9396d6f97db5
    

    Copy the run ID and run the following to get the workflow run logs:

    dockstore workflow wes logs --id 8e8e9f4b-fb1a-41df-bc37-9396d6f97db5
    

    The logs returned will look similar to:

    {
      "run_id" : "8e8e9f4b-fb1a-41df-bc37-9396d6f97db5",
      "request" : {
        "workflow_params" : { },
        "workflow_type" : "WDL",
        "workflow_type_version" : "1.0",
        "tags" : null,
        "workflow_engine_parameters" : null,
        "workflow_url" : null
      },
      "state" : "COMPLETE",
      "run_log" : null,
      "task_logs" : [ {
        "name" : "w.hello|e6ce6c0a-ae99-43de-accc-4e43183de73f",
        "cmd" : [ "echo \"Hello from AGC\"" ],
        "start_time" : "2022-03-04T17:19:52.341Z",
        "end_time" : "2022-03-04T17:23:17.196Z",
        "stdout" : "s3://agc-example123-us-west-2/project/dockstoreAgcTutorialProject/userid/userM2QLG/context/ctx1/cromwell-execution/w/8e8e9f4b-fb1a-41df-bc37-9396d6f97db5/call-hello/hello-stdout.log",
        "stderr" : "s3://agc-example123-us-west-2/project/dockstoreAgcTutorialProject/userid/userM2QLG/context/ctx1/cromwell-execution/w/8e8e9f4b-fb1a-41df-bc37-9396d6f97db5/call-hello/hello-stderr.log",
        "exit_code" : 0
      } ],
      "outputs" : {
        "id" : "8e8e9f4b-fb1a-41df-bc37-9396d6f97db5",
        "outputs" : {
          "w.hello.out" : "Hello from AGC"
        }
      }
    }
    

    Notice that the output for task hello of workflow w is “Hello from AGC”.
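
Because a run takes a few minutes to finish, it can be convenient to poll the logs until the run leaves its in-progress states. The following is a sketch only: run_state and wait_for_run are hypothetical helpers, the parsing assumes the logs are pretty-printed with one field per line as in the sample above, and the in-progress state names (QUEUED, INITIALIZING, RUNNING) are taken from the WES specification.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of the Dockstore CLI.

# Read a WES run log (JSON) on stdin and print the value of its "state" field.
run_state() {
    sed -nE 's/^[[:space:]]*"state"[[:space:]]*:[[:space:]]*"([^"]*)".*/\1/p' | head -n 1
}

# Poll the logs for run ID $1 every $2 seconds (default 30) until the state
# is no longer an in-progress value, then print the final state.
wait_for_run() {
    while :; do
        state=$(dockstore workflow wes logs --id "$1" | run_state)
        case "$state" in
            QUEUED|INITIALIZING|RUNNING) sleep "${2:-30}" ;;
            *) printf '%s\n' "$state"; return 0 ;;
        esac
    done
}

# Usage (requires a configured Dockstore CLI):
#   wait_for_run 8e8e9f4b-fb1a-41df-bc37-9396d6f97db5
```

Note that the loop also stops if no state can be parsed, in which case it prints an empty line.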

FastQ Read Counts Workflow

The Dockstore entry associated with this workflow can be found here: agc-fastq-read-counts.

This WDL workflow counts the total number of reads in the input FASTQ files.

Dockstore.wdl

version 1.0

workflow fastqReadCounts {

    call countFastqReads

    output {
        File totalReadsFile = countFastqReads.totalReadsFile
    }
}



task countFastqReads {

    input {
        Array[File] inputFastq

        Int memSizeGB = 4
        Int diskSizeGB = 128
        String dockerImage = "biocontainers/samtools:v1.9-4-deb_cv1"
    }

    command <<<

        set -o pipefail
        set -e
        set -u
        set -o xtrace

        READ_COUNT=0

        for fq in ~{sep=' ' inputFastq}
        do
              FILE_COUNT=$(zcat "${fq}" | wc -l )/4
              READ_COUNT=$(( $READ_COUNT + $FILE_COUNT ))
        done

        echo $READ_COUNT > total_reads.txt
    >>>

    output {

        File totalReadsFile  = "total_reads.txt"
    }

    runtime {
        memory: memSizeGB + " GB"
        disks: "local-disk " + diskSizeGB + " SSD"
        docker: dockerImage
        preemptible: 1
    }
}
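
To make the counting logic in the task concrete, here is a rough local equivalent that can be run on small gzipped FASTQ files. It is a sketch, not the workflow itself: count_fastq_reads is a hypothetical name, gzip -dc stands in for zcat, and it assumes every record occupies exactly four lines.

```shell
#!/usr/bin/env bash
# Hypothetical local equivalent of the task's counting loop.

# Print the total number of reads across the gzipped FASTQ files given as
# arguments, assuming each record occupies exactly four lines.
count_fastq_reads() {
    total=0
    for fq in "$@"; do
        lines=$(gzip -dc "$fq" | wc -l)
        total=$(( total + lines / 4 ))
    done
    printf '%s\n' "$total"
}

# Usage:
#   count_fastq_reads sample_R1.fastq.gz sample_R2.fastq.gz
```
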

  1. This workflow takes an array of files as input. Create a file named input.json in your working directory with the following contents:

    input.json

    {
        "fastqReadCounts.countFastqReads.inputFastq": ["s3://human-pangenomics/working/HPRC_PLUS/HG005/raw_data/Illumina/child/5A1-24481579/5A1_S5_L001_R1_001.fastq.gz"]
    }
    
  2. As a requirement of AGC input parsing, create a second file named agcWrapper.json in your working directory. This file indicates which WES attachment will be used as the input JSON for the workflow execution step. In this case, input.json is our input file:

    agcWrapper.json

    {
        "workflowInputs": "input.json"
    }
    
  3. Since this workflow is publicly posted on Dockstore.org, we can quickly launch it by passing the Dockstore CLI the entry and input files. File attachments can be specified with the --attach or -a switch:

    dockstore workflow wes launch --entry github.com/dockstore-testing/wes-testing/agc-fastq-read-counts:v1.12 --json agcWrapper.json -a input.json
    
  4. The above command will return a unique run ID, similar to:

    b4e86806-2dc0-4d70-b494-52651e9b3de0
    

    Copy the run ID and run the following to get the workflow run logs:

    dockstore workflow wes logs --id b4e86806-2dc0-4d70-b494-52651e9b3de0
    

    The logs returned will look similar to:

    {
      "run_id" : "b4e86806-2dc0-4d70-b494-52651e9b3de0",
      "request" : {
        "workflow_params" : { },
        "workflow_type" : "WDL",
        "workflow_type_version" : "1.0",
        "tags" : null,
        "workflow_engine_parameters" : null,
        "workflow_url" : null
      },
      "state" : "COMPLETE",
      "run_log" : null,
      "task_logs" : [ {
        "name" : "fastqReadCounts.countFastqReads|XXXXX",
        "cmd" : [ null ],
        "start_time" : "2022-03-04T19:00:15.787Z",
        "end_time" : "2022-03-04T19:00:20.185Z",
        "stdout" : "s3://agc-example123-us-west-2/project/dockstoreAgcTutorialProject/userid/righanseM2QLG/context/ctx1/cromwell-execution/fastqReadCounts/b4e86806-2dc0-4d70-b494-52651e9b3de0/call-countFastqReads/countFastqReads-stdout.log",
        "stderr" : "s3://agc-example123-us-west-2/project/dockstoreAgcTutorialProject/userid/righanseM2QLG/context/ctx1/cromwell-execution/fastqReadCounts/b4e86806-2dc0-4d70-b494-52651e9b3de0/call-countFastqReads/countFastqReads-stderr.log",
        "exit_code" : 0
      } ],
      "outputs" : {
        "id" : "b4e86806-2dc0-4d70-b494-52651e9b3de0",
        "outputs" : {
          "fastqReadCounts.totalReadsFile" : "s3://agc-example123-us-west-2/project/dockstoreAgcTutorialProject/userid/userM2LQJ/context/ctx1/cromwell-execution/fastqReadCounts/b4e86806-2dc0-4d70-b494-52651e9b3de0/call-countFastqReads/cacheCopy/total_reads.txt"
        }
      }
    }
    

5. The output of this workflow is a text file containing a read count. To retrieve the file’s contents, you can navigate to the S3 URL via the AWS console, or copy the file contents using the AWS CLI:

aws s3 cp s3://agc-example123-us-west-2/project/dockstoreAgcTutorialProject/userid/userM2LQJ/context/ctx1/cromwell-execution/fastqReadCounts/b4e86806-2dc0-4d70-b494-52651e9b3de0/call-countFastqReads/cacheCopy/total_reads.txt -

6. When you are finished running workflows on your AGC context, destroy it by running the command below in the same directory as agc-project.yaml. This will take approximately 20 minutes.

agc context destroy ctx1
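
While the context is still deployed, retrieving the output file in step 5 can also be scripted instead of copying the S3 URL by hand. This is a sketch under assumptions: output_url is a hypothetical helper, and it relies on the logs being pretty-printed with one output per line as in the sample above.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the Dockstore CLI.

# Read a WES run log (JSON) on stdin and print the S3 URL stored under the
# output name given as $1. The name is used inside a regular expression, so
# any dots in it match loosely; that is acceptable for this illustration.
output_url() {
    sed -nE 's|^[[:space:]]*"'"$1"'"[[:space:]]*:[[:space:]]*"(s3://[^"]*)".*|\1|p' | head -n 1
}

# Usage (requires a configured Dockstore CLI and AWS CLI):
#   url=$(dockstore workflow wes logs --id b4e86806-2dc0-4d70-b494-52651e9b3de0 | output_url fastqReadCounts.totalReadsFile)
#   aws s3 cp "$url" -
```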