Best Practices for Secure and FAIR Workflows

This comprehensive document contains best practices for developing secure tools or workflows that also exemplify the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles.

Version Control Best Practices

  • Host your source code, workflow descriptor file, and Dockerfile in a git repository. Dockstore currently supports GitHub, Bitbucket, and GitLab. We recommend GitHub because the GitHub App integrates easily with Dockstore. If you are new to using version control, you can start with these introductory documents:

  • Create an organization on your git hosting site and have your collaborators publish their peer-reviewed tools or workflows within the organization. (Here are instructions for GitHub).

    • Organizations can centralize your work and help to foster a culture of peer review through Pull Requests.

    • Submitting to an organization rather than hosting on an individual account provides a fallback for others if you become inactive on the git repository site.

  • Plan your repository structure

    • The repository should include the workflow language descriptor file(s), the Dockerfile used to create a custom container (if applicable), a license, and a thorough README.md.

    • Here are examples of nicely organized repositories for workflow development:

  • Use branches to separate the development of distinct features for your workflow.

    • There should always be at least one ‘main’ branch that points to the most stable copy of your workflow.

    • Any new development of features, optimizations, etc., should be created on a new branch/version that diverges from the main branch.

      • If developing multiple new features simultaneously or if multiple people are creating content, work should be split into separate branches.

      • It’s best to split into branches by independent feature units, e.g., “add-QC-before-alignment.”

      • Once your feature is stable, create a pull request to merge the branch into your main branch. Once merged, you can delete the development branch if no longer needed.

      • Note on GitHub repository and Docker image versioning: Many workflow repositories contain both a Dockerfile, with instructions for building the Docker image, and the workflow descriptor file(s) (e.g., .cwl, .wdl, .nf, etc.). This adds complexity when tags for Docker images mirror tags for the GitHub repository (as is possible using quay.io, for example). On a development branch, you may want the task to refer to a development version of the Docker image (e.g., quay.io/my_account/my_image:develop). This means that a perfectly functioning development branch commit could become “incorrect” after being merged into the main branch, because the descriptor file task(s) will still refer to the mutable development Docker image rather than an immutable version. The best current solution is to update the descriptor file just prior to (or during) the pull request so that the tasks reference the Docker image by digest (e.g., quay.io/my_account/my_image@sha256:<digest>; a full digest reference is shown in the Reusable section).

  • Publish releases of your workflow to save your work at a stable version for publication and citation. On GitHub, these are ‘tags’ (learn how to manage tags). Below, we discuss how such releases can become immutable when synced with the snapshots feature on Dockstore.

Image / Container Best Practices

  • Because anyone can publish an image in a public repository (Docker Hub, Quay, etc.), be cautious with third-party containers: they may contain malware or insecure software, or may have insecure settings. A compromised image can, for example, be used for cryptojacking (using your compute resources to mine cryptocurrency without your consent). See an example of a malicious image in this GitHub repo.

  • When creating custom images, we recommend starting with official images. This way you know that you are starting with a secure base since these images are maintained to remove vulnerabilities.

  • You may find helpful images from sources such as BioContainers, which maintains images for more than 1,000 bioinformatics tools. We cannot guarantee that BioContainers images are secure, so we recommend you scan all non-official images for vulnerabilities. Tools such as Snyk and Trivy scan containers for security concerns.

  • If you detect a vulnerability in a container you are interested in, we suggest you 1) contact the maintainer to update the image, or 2) if there is a Dockerfile, use it as a template to update the image yourself. Try inspecting the Dockerfile and only include those parts you feel are trustworthy. Consider upgrading versions of packages as they may be a source of vulnerabilities.

  • Use Dockerfiles to describe and configure images:

  • Keep images light:

    • More packages increase risks; try to avoid installing unnecessary packages in your images. That being said, starting with a very bare image (such as Alpine) may lead to a long setup or difficulties in debugging.

    • Images tagged with “-slim” contain the minimum components needed to run without being as strict as Alpine-based images. They can often provide a happy medium between a reduced size, enhanced security, and usability.

    • Some helpful starting images are suggested below:

    • A good rule of thumb is that each image should have a specific purpose. Avoid installing all of the software you need for an entire analysis in one container; instead, use multiple containers.

    • Don’t include test data inside the image. Recommendations for hosting test data alongside your workflow can be found in the section below titled Accessible.

  • Publish your pre-built image in an open-source container registry (such as DockerHub or Quay.io):

    • Automate builds using an image registry that is configured to trigger a build whenever a change is pushed to the Dockerfile source control repository.

    • Similar to our suggestion to publish your workflow under a GitHub organization, publish your images in an organization on a container registry. Additionally, this may make it easier for your institute to pay for a group plan to ensure your images never expire.

  • Limitation on and expiration of images: DockerHub has announced policies around pull limits, as well as its intention to expire DockerHub images that haven’t been pulled for some defined period of time (at the time of writing, DockerHub has delayed this policy). For example, a workflow that hasn’t been run in some time may no longer be reproducible if its image has been removed.

  • Alternative options include:

    • Hosting the image on a different registry such as Google Container Registry, Quay.io, GitHub Packages, AWS ECR, etc.

    • Using images from paid organizations on DockerHub.

    • Paying for a DockerHub account (this may be more cost-effective if you’re able to create an organization with multiple accounts).

    • Applying for an exemption: DockerHub offers exemptions to some open source projects, which you may qualify for depending on your use case.

    • Migrating images to another repository to mitigate the impact of DockerHub pull limits (see example).

Tool / Workflow Best Practices

Findable

  • Once your workflow is ready to share with the community, publish it in Dockstore.

  • When publishing on Dockstore, include robust metadata. Dockstore parses metadata that enables search capabilities for finding your tool/workflow more easily. Metadata also helps your workflow be more reusable. Essential metadata fields include:

    • Naming:

      • Keep the workflow name short.

      • Use all lowercase letters for compatibility with other platforms such as DockerHub.

    • Authorship, contact information, and description:

      • You can add author and description metadata to your descriptor file. Adding an author makes the workflow selectable in the Author facet of Dockstore’s search, and adding a description helps because the description is one of the fields that the text search covers.

    • Link GitHub repository:

      • Additionally, for workflow languages that include meta sections, you can include a URL to your original GitHub repo README in the meta section of your descriptor file(s). If there are multiple descriptor files, use the primary descriptor file to host this information. Consider doing so especially when you have additional comprehensive README files available on GitHub. If your workflow is downloaded or copied from Dockstore to be run in a different computing environment, such as a local machine or HPC, the URL will help connect it with the original source code.

    • Include Dockstore labels to enhance searchability.

  • Above, we discussed the value of organization features in version control and container registries. You can also share your workflow in a Dockstore Organization and Collection. This feature can, for example, showcase workflows that group together to make a complete analysis.
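
As a rough illustration of the authorship, description, and source-link metadata described above, a WDL descriptor might carry them in the workflow's meta section along the following lines. This is a minimal sketch, not a definitive template: it assumes Dockstore reads the author, email, and description keys from the workflow-level meta section. The workflow name, author details, and URL are placeholders, and readme_url is a hypothetical free-form key chosen for this example rather than one Dockstore is known to parse.

```wdl
version 1.0

# Sketch only: the names, email address, and URL below are placeholders.
task say_hello {
    command <<<
        echo "hello world"
    >>>
    runtime {
        # Image pinned by digest, reusing the digest example from the Reusable section.
        docker: "pkrusche/hap.py@sha256:f63e020c4062e0be8d081a50de16562f2ba161166e896655868efdb5527a8640"
    }
    output {
        String greeting = read_string(stdout())
    }
}

workflow hello_world {
    meta {
        author: "Jane Doe"
        email: "jane.doe@example.org"
        description: "Prints a greeting; stands in for a real analysis."
        # Hypothetical free-form key linking back to the source README.
        readme_url: "https://github.com/example-org/hello-world/blob/main/README.md"
    }

    call say_hello

    output {
        String greeting = say_hello.greeting
    }
}
```

Keeping this metadata in the primary descriptor means it travels with the workflow when the file is copied to another computing environment.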

Accessible

  • Publishing your tool or workflow in Dockstore promotes accessibility:

    • Dockstore does not require a user to sign in to search published content, which increases transparency and usability to a greater audience.

    • Dockstore implements its own REST API and also a standardized GA4GH API that can be used for sharing tools and workflows.

  • Use Dockstore’s snapshot feature to provide an immutable release of your workflow that can be verified.

    • Dockstore archives important metadata associated with a published and snapshotted version of a tool or workflow to ensure provenance.

    • See Dockstore’s best practices for snapshots, including adding a description and metadata to improve searchability and usability of your workflow.

  • Mint a snapshot of your workflow with a Digital Object Identifier (DOI).

    • Users can request a DOI (generated via Zenodo) for their workflow through Dockstore.

    • DOIs enhance reproducibility and make it easier to cite a specific version of your workflow in a publication.

Interoperable

  • Wrap your pipeline in one or more workflow languages supported by Dockstore:

  • Provide a parameter file (JSON or YAML) containing example parameters used for launching your workflow.

    • The parameter file is where you should link to open access test data for your tool or workflow (learn more in Reusable).

    • You can submit multiple parameter files so consider sharing one for a local run (you can use the Dockstore CLI to launch tools and workflows locally) as well as examples for a launch-with partner (such as BioData Catalyst or AnVIL).

  • Provide a checker workflow.

    • Checker workflows are additional workflows you can associate with a tool or workflow. Their purpose is to ensure that a tool or workflow, given some inputs, produces the expected outputs on a platform different from the one where you are developing.

    • Providing a checker workflow gives other researchers confidence that they can run the work on their system correctly.
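
One possible shape for a checker workflow is sketched below in WDL: run the tool, then compare a checksum of its output against a known-good value supplied in the parameter file. Everything here is illustrative. The say_hello task is a stand-in for the tool under test (a real checker would typically import the original descriptor instead), the task and workflow names are hypothetical, and the sketch assumes the container image provides the md5sum and cut utilities.

```wdl
version 1.0

# Stand-in for the tool under test; a real checker would import it instead.
task say_hello {
    command <<<
        echo "hello world" > greeting.txt
    >>>
    runtime {
        docker: "pkrusche/hap.py@sha256:f63e020c4062e0be8d081a50de16562f2ba161166e896655868efdb5527a8640"
    }
    output {
        File greeting = "greeting.txt"
    }
}

# Fails (non-zero exit) when the observed output differs from the expected checksum.
task check_md5 {
    input {
        File observed
        String expected_md5
    }
    command <<<
        set -e
        observed_md5=$(md5sum "~{observed}" | cut -d ' ' -f 1)
        if [ "$observed_md5" != "~{expected_md5}" ]; then
            echo "MD5 mismatch: got $observed_md5, expected ~{expected_md5}" >&2
            exit 1
        fi
        echo "output matches expected MD5"
    >>>
    runtime {
        # Assumes this image ships md5sum and cut.
        docker: "pkrusche/hap.py@sha256:f63e020c4062e0be8d081a50de16562f2ba161166e896655868efdb5527a8640"
    }
    output {
        String result = read_string(stdout())
    }
}

workflow check_say_hello {
    input {
        # Known-good checksum, provided through the checker's parameter file.
        String expected_md5
    }

    call say_hello

    call check_md5 {
        input:
            observed = say_hello.greeting,
            expected_md5 = expected_md5
    }

    output {
        String result = check_md5.result
    }
}
```

A checksum comparison only suits deterministic outputs; for outputs that legitimately vary between runs (timestamps, floating-point noise), a checker would need a looser comparison.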

Reusable

  • The best practice when referencing an image from an image repository is to use the digest format, which provides an immutable record of the image in the tool or workflow. Here is an example of a digest format referenced in a workflow task:

task digestDocker {
        command {
                echo "hello world"
        }
        runtime {
                docker: "pkrusche/hap.py@sha256:f63e020c4062e0be8d081a50de16562f2ba161166e896655868efdb5527a8640"
        }
}

  • The examples below show how not to reference a container in a workflow task. These formats can change and cause the workflow to no longer be reproducible.

Do not reference parameterized images:

task parameterizedDocker {
        input {
                # Avoid: the image is chosen at launch time, so runs may differ.
                String docker_image
        }
        command {
                echo "hello world"
        }
        runtime {
                docker: docker_image
        }
}

Do not reference by version tag, e.g. “v1.0”, because a tag can later be re-pointed to a different image:

task VersionDocker {
        command {
                echo "hello world"
        }
        runtime {
                docker: "pkrusche/hap.py:v1.0"
        }
}

Do not use untagged images or the “latest” tag:

task latestDocker {
        command {
                echo "hello world"
        }
        runtime {
                docker: "pkrusche/hap.py:latest"
        }
}

  • Provide open access test data with your published workflow. Test data can be shared as inputs in a JSON parameter file.

    • As mentioned in Image / Container Best Practices, test data should be hosted outside of the container.

      • GitHub can host small files such as csv or tsv (for example: trait data).

      • Broad’s Terra platform hosts multiple genomic files in this open access Google bucket.

    • Consider providing both a full sample run and a small down-sampled development test.

      • A small development dataset is necessary for checker workflows. It also helps others explore your workflow without incurring heavy resource/computational costs.

      • A full-sized sample is helpful for benchmarking your workflow and providing end-users with realistic compute and cost requirements.

  • When writing your descriptor files, do not import remote descriptors using HTTP(s), nor use scripts outside of the container as input files. These practices decrease reusability and increase security risks.

  • Provide a permissive license such as the MIT License, or choose a license that best fits your needs. It can be a text file in the git repository where the workflow is published (see this example).

  • Provide a thorough README in the git repository. Here is an example of thorough documentation.

    • We suggest including the following sections:

      • An introductory description of the goal of the analysis.

      • A pipeline summary that includes the software packages used by the pipeline.

      • A quick start guide that includes inputs and outputs and specifies which inputs are required versus optional.

      • Relevant links to external resources, such as expanded documentation.

      • Contact information for the organization or individual pipeline maintainer.

      • Any available cost or benchmarking information.

      • How to cite the use of your workflow (including references for the original software authors).

  • Note: Documentation can be housed in the metadata section of the workflow file, in the GitHub README document, or in both. On Dockstore, if a description is provided in the metadata section, it will be displayed on the INFO tab. If the metadata section is missing, Dockstore will display the README on the INFO tab.

  • More information about authorship metadata can be found here: Authorship Metadata