Notes: AWS for Bioinformatics

Bioinformatics Computing on Amazon
by Oliver; 2017-08-23
 
Tags: aws, ec2

Introduction

These notes provide some suggestions for running bioinformatics workflows on AWS, with an emphasis on the twin concerns of time and money.

AWS Services

Before we define our workflows, we need to know basic AWS terminology. The AWS ecosystem is divided into services. Two concern us here:

Service   Use Case                     Price
EC2       Cloud Computing Instances    Depends on instance. See EC2 Pricing
S3        Object Storage               Roughly $0.025 per GB per month (in 2017). See S3 Pricing

For EC2 pricing, you can also check this convenient link:

Under the EC2 umbrella, there are various entities we need to be familiar with:

Entity         Description
EC2 Instance   Cloud Computing Instance (the instance type determines computing power)
EBS Volume     Volume Storage (like an external HD)
AMI            Image (like an OS + pre-installed software)

Software on AWS

When you start an EC2 instance, you build it off an AMI (Amazon Machine Image). This is independent of the instance's computing power, which is mediated by the instance's type (e.g., t2.xlarge). How do we go about managing software on AWS? Any software in heavy rotation, like samtools or bwa, should go in your AMI. You can grab other stuff from GitHub on an as-needed basis.
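For example, on an Ubuntu-based instance, baking in the heavy-rotation tools before you save the AMI might look something like this (a minimal sketch; the apt package names and the seqtk repository are just illustrative):
# install frequently used tools so they end up inside the AMI
$ sudo apt-get update
$ sudo apt-get install -y samtools bwa
# grab occasional-use software from GitHub on an as-needed basis
$ git clone https://github.com/lh3/seqtk.git
$ cd seqtk && make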

How exactly do you build an AMI? I recommend starting with the official StarCluster AMI and then building your own on top of it. That way, you have the option to run a cluster, even if you choose not to.
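A rough sketch of that process with the StarCluster command-line tool (the instance ID and image name below are placeholders):
# list the official public StarCluster AMIs to use as a base
$ starcluster listpublic
# launch an instance from one of them, install your software, then
# save the customized instance as your own AMI
$ starcluster ebsimage i-0123456789abcdef0 my-bioinformatics-ami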

Tip: keep your AMI small and lean! Don’t put big files or genomic references into your AMI. If you do, say, put a genomic reference into your AMI and then run a cluster, this expensive storage space will get multiplied by the number of nodes in your cluster. Also, as I learned from experience, it's extraordinarily hard, if not impossible for the layman, to shrink an AMI once it's built (see StackOverflow: Decreasing Root Disk Size of an “EBS Boot” AMI on EC2).

Storing Data on AWS

So, where does data go? Project data, as well as files necessary to run bioinformatics software (reference genomes, bwa indices, blast dbs, SNP dbs, vcfs, etc.), should go on S3. (If you want archival storage for data you're rarely going to touch, there's another Amazon service called Glacier.) Here's the story of S3 vs EBS in bullet points:
  • EBS Volumes are expensive
  • S3 is cheap
  • But: you can’t compute on S3 (in general)
  • And: you can compute on EBS
  • So: pull S3 stuff onto EBS temporarily, compute, then delete the EBS volume when finished
In other words, S3 is a long-term storage solution; an EBS volume is a short-term storage solution while you're running your jobs.
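With the AWS CLI installed, that workflow looks roughly like this (the bucket name, paths, and volume ID are placeholders):
# pull inputs from S3 onto an EBS volume mounted at /data
$ aws s3 cp s3://mybucket/project/sample1.bam /data/
# ... run your jobs against /data ...
# push results back to S3
$ aws s3 cp /data/results/ s3://mybucket/project/results/ --recursive
# then, after detaching the EBS volume, delete it
$ aws ec2 delete-volume --volume-id vol-0123456789abcdef0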

Tip: To save money, don’t keep big volumes kicking around for a long time after your jobs have finished.

Tip: samtools can work directly with bam files on S3 sans download (see Biostars: Tool for random access to indexed BAM files in S3?).
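Assuming your samtools was built against an htslib with S3 support (and your AWS credentials are visible to it), something like this should work; the bucket and file are placeholders:
# read the header of a BAM sitting in S3, without downloading it first
$ samtools view -H s3://mybucket/sample1.bam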

AWS Computing Workflows

In the following sections I examine the relative merits of 3 sample AWS workflows for running parallel jobs in bioinformatics:
  1. GNU Parallel
  2. AWS Batch
  3. StarCluster

AWS Computing Workflows: GNU Parallel

As the docs say, "GNU parallel is a shell tool for executing jobs in parallel using one or more computers." This workflow is simple: fire up a big EC2 instance with many cores (vCPUs) and run your jobs in parallel with GNU parallel. For example:
$ seq 0 9 | parallel samtools flagstat /Volumes/sample{}.bam &
Advantage: This approach is easy and hassle-free.

Disadvantage: Not as parallelizable as a cluster. If you have thousands of jobs, this is probably not your best bet.

Question: How can I automatically stop my instance when my job finishes?

Suppose you want to run parallel (or some script) and then go home and sleep. You want your instance to stop automatically when your job finishes so you're not charged for it. One way to solve this problem is to kill your node from within your node. To do this, first download the following package:
$ sudo apt install ec2-api-tools
The command ec2stop will stop (not terminate) your instance. Let's say you want to stop your instance after the script shellscript.sh finishes. Then:
$ instanceid=$( curl http://169.254.169.254/latest/meta-data/instance-id )
$ ./shellscript.sh && ec2stop $instanceid -O KEY -W SECRETKEY
The URL in the above statement is the EC2 instance metadata endpoint, from which you can retrieve metadata about your instance (see the AWS documentation on instance metadata). Also, recall how the double ampersand (&&) works in bash: when two commands are separated by a double ampersand, the second will only execute if the first exits successfully. So ec2stop will run after successful execution of whatever's before the &&, which, of course, could be a command using parallel.

Another way to stop your instance is to use the AWS GUI's CloudWatch. Navigate to your EC2 Dashboard and click on your instance. Go to the Monitoring tab. Click Create Alarm:

image

In this way, you can stop or terminate your instance if it's quiescent for some time.
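If you prefer the command line to the console, roughly the same alarm can be created with the AWS CLI; the instance ID, region, and thresholds below are placeholders to adjust:
# stop the instance if average CPU stays below 5% for an hour
$ aws cloudwatch put-metric-alarm \
    --alarm-name stop-when-idle \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistic Average \
    --period 3600 \
    --evaluation-periods 1 \
    --threshold 5 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:automate:us-east-1:ec2:stop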

AWS Computing Workflows: AWS Batch

One option for running parallel workflows on AWS is AWS Batch. This approach relies on Docker and, personally, I've never been a big Docker fan. It also features AWS Step Functions "to coordinate the components of your applications using visual workflows." Do you really want to program by dragging and dropping boxes in a GUI? As AWS's comically long-winded four-part blog series on the subject unwittingly demonstrates, it's a huge pain in the ass! 😭 They require no fewer than 7 different Amazon services, hundreds of lines of code, and innumerable configuration files just to get up and running—a clear vote for tediously long and complex over simple and practical. The "Healthcare and Life Sciences Partner Solutions Architects" (suspiciously long title alert?) at AWS have built a glorious Rube Goldberg machine.

AWS Computing Workflows: StarCluster

Another solution for parallelizing jobs on AWS is running a cluster via MIT's StarCluster. Here's an example setup:

image

Each node gets booted up with your custom AMI, and all the nodes see the mounted volumes. You copy whatever you need from S3 into your volume.
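For reference, both the custom AMI and the shared volume are declared in StarCluster's config file (~/.starcluster/config). Here's a minimal sketch in which the key name, AMI ID, and volume ID are placeholders:
[cluster mycluster]
keyname = mykey
cluster_size = 5
node_instance_type = c3.4xlarge
node_image_id = ami-0123456789abcdef0
volumes = mydata

[volume mydata]
volume_id = vol-0123456789abcdef0
mount_path = /data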

Question: How many jobs can I run?

The docs say:
StarCluster by default sets up a parallel environment, called “orte”, that has been configured for OpenMPI integration within SGE and has a number of slots equal to the total number of processors in the cluster.
I.e., the number of jobs you can run simultaneously should be:
number of nodes x number of vCPUs per node
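You can sanity-check this from the master node: qhost reports each node's CPU count, and (assuming the default "orte" parallel environment) qconf shows its configured slot total:
# CPUs and load per node
$ qhost
# slot count for the "orte" parallel environment
$ qconf -sp orte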

More on StarCluster

Let's deep dive (or, at least, modest dive) into StarCluster.

Question: How do I start, stop, and ssh into the cluster?

Suppose your cluster is called mycluster. The syntax for sshing into it is:
$ starcluster sshmaster mycluster
Terminate it:
$ starcluster terminate mycluster
Stop it:
$ starcluster stop mycluster
Restart it after stopping it:
$ starcluster start -x mycluster

Question: How do I see the nodes of my cluster?

To see your nodes, type:
$ qhost
Submit a test job:
$ mkdir -p logs; qsub -e logs -o logs -N test -b y -cwd sleep 120
To see the status of your jobs, type:
$ qstat
You can play with this:
# How long is your queue? (including header)
$ qstat | wc -l
# How many jobs are in running state?
$ qstat | awk '$5=="r"' | wc -l

Question: How much does it cost?

Here's a purely anecdotal account. I used a 5 node cluster of c3.4xlarge instances. This instance type has 16 vCPUs and 30G of RAM. Thus I had 16 * 5 = 80 slots in which to run parallel jobs. I submitted about 120 array jobs, each with 25 tasks (so 3000 jobs), which took about 5 hours to run. This 5 hour runtime cost between $30 and $35. In a table:

Number of Nodes   Instance Type   Node Specs          Number of Job Slots   Runtime   Cost
5                 c3.4xlarge      30G RAM, 16 vCPUs   80                    5 hours   $30 to $35

Tip: Run a command over all nodes

Say you want to run a command on all of your nodes. Maybe you want to source a setup file. Or perhaps you want to list all the files in each node's /tmp directory. Here's an example of how to do that on a 5 node cluster:
# loop through a 5 node cluster
$ for i in master node00{1..4}; do ssh ${i} "ls /tmp"; done

Tip: Delete intermediate files as you go and compress files you want to keep

When running bioinformatics pipelines, a good policy is to delete intermediate files as you go (once you're sure everything is working) and compress files you want to keep. For example, to compress all .txt and .vcf files in myfolder, try:
# compress all text files
$ find myfolder/ -name "*.txt" | xargs gzip
# compress all vcfs
$ find myfolder/ -name "*.vcf" | xargs gzip

Tip: automatically stop instances when jobs finish

A big concern with StarCluster is stopping nodes automatically when the jobs in a batch finish. If you fail to do this, you'll have to manually monitor your jobs or you'll be paying for many idling EC2 instances. Without the ability to automatically stop nodes, StarCluster is not economical and, in my opinion, not worth it. Here are two hacks to do this.

Hack 1: The first hack assumes you have a server-like computer you're using to boot up StarCluster (i.e., a computer that remains on). At work, I have a computer that's always running, so this works well. Instead of sshing into StarCluster like this:
$ starcluster sshmaster mycluster
do it like this:
$ starcluster sshmaster mycluster && starcluster stop mycluster --confirm
So whenever you exit your ssh session, your local computer will automatically stop your cluster. Now all we have to do is automate exiting our session when our jobs finish. To do this, let's write a very simple script, jobmonitor.sh:
#!/bin/bash

while true; do
        queuelength=$( qstat | wc -l )
        date
        echo $queuelength
        if [ "$queuelength" -eq 0 ]; then echo "Queue empty - exiting"; exit; fi
        sleep 120
done
So when the queue is empty, this script will exit. Now we can run our jobs and, when we want to leave and go home, we'll run:
$ ./jobmonitor.sh > monitor.txt && echo "jobs done" && date && exit
So when our jobs finish, this command will exit our session. When that happens, our local computer will resume control and stop our cluster.

Hack 2: The second hack takes the approach we saw above in the GNU parallel section. In this case, you don't need another server-like computer; after all, your EC2 instances are themselves servers. So we can use one to kill all the rest and then commit seppuku. The first thing we do is run the following command:
# loop through all nodes
$ for i in $( qhost | sed '1,3d' | tr -s ' ' | cut -f1 -d' ' ); do
    echo -ne ${i}"\t"; ssh ${i} "curl http://169.254.169.254/latest/meta-data/instance-id 2> /dev/null; echo";
done
and save the output in a file I'll call instanceids.txt. This is simply a 2 column file where column one is the hostname (e.g., master, node001, etc.) and column two is the AWS instance id. Now we'll write a script called seppuku.sh:
#!/bin/bash

# Stop all nodes in StarCluster

# Sample Command:
# ./seppuku.sh instanceids.txt KEY SECRETKEY

# path to instance IDs file (two column file: col1 is hostname, col2 is instance id)
instanceids=$1
key=$2
secretkey=$3

# get hostname of current node
myhost=$( hostname )

echo "date: "$( date )
echo "instance file: "$instanceids
echo "host: "$myhost

# the point is to stop whatever instance happens to be running this job last

# stop all nodes except node running this job:
while read host id; do
        echo $host
        if [ "$host" = "$myhost" ]; then
                # if current host, save id but don't stop
                myid=$id
                echo "skip"
        else
                ec2stop $id -O $key -W $secretkey
        fi
done < ${instanceids}

echo $myhost;
# finally, stop node running this job:
ec2stop $myid -O $key -W $secretkey
We submit our jobs, noting the job ID of the final job. Then we qsub seppuku.sh with the -hold_jid flag so that it waits on the final job (or on all jobs, if you don't know which one will run last). It stops all of the instances, saving the instance on which it itself is running for last.
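A sketch of that submission, where the job name, script, and number of tasks are placeholders:
# submit the final batch of jobs under a known name
$ qsub -N lastbatch -t 1-25 -b y -cwd ./myjob.sh
# submit seppuku.sh so that it only starts once lastbatch has finished
$ qsub -hold_jid lastbatch -b y -cwd ./seppuku.sh instanceids.txt KEY SECRETKEY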

qsub'ing seppuku.sh can be a little finicky (I've heard bad things about trying to hold on a large number of jobs). So, to tweak this approach, it's probably easier just to run this on the master node:
$ nohup bash -c "jobmonitor.sh > monitor.txt && seppuku.sh instanceids.txt KEY SECRETKEY" &
Note: in my experience nohup doesn't work unless you use absolute paths.
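In other words, something like this (the home directory path is a placeholder for wherever the scripts and instance ID file actually live):
# use absolute paths for the scripts and output when running under nohup
$ nohup bash -c "/home/sgeadmin/jobmonitor.sh > /home/sgeadmin/monitor.txt && /home/sgeadmin/seppuku.sh /home/sgeadmin/instanceids.txt KEY SECRETKEY" &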

Tip: Use array jobs

If you're submitting many similar jobs, use array jobs rather than qsub'ing each one separately.

Suppose you're parallelizing over chromosome, denoted by the variable c. A common task is to map the array job variable, SGE_TASK_ID, to c. You can do this in bash with a case statement:
case $SGE_TASK_ID in
	1) c="1";;
	2) c="2";;
	3) c="3";;
	4) c="4";;
	5) c="5";;
	6) c="6";;
	7) c="7";;
	8) c="8";;
	9) c="9";;
	10) c="10";;
	11) c="11";;
	12) c="12";;
	13) c="13";;
	14) c="14";;
	15) c="15";;
	16) c="16";;
	17) c="17";;
	18) c="18";;
	19) c="19";;
	20) c="20";;
	21) c="21";;
	22) c="22";;
	23) c="X";;
	24) c="Y";;
	25) c="MT";;
esac
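The script containing this case statement then gets submitted once as an array job with 25 tasks; the script name here is a placeholder, and $c would be used inside it to pick out the chromosome:
# one array job, 25 tasks; SGE_TASK_ID runs from 1 to 25
$ qsub -t 1-25 -N chrjob -e logs -o logs -cwd ./per_chromosome.sh
# inside per_chromosome.sh, use $c after the case statement, e.g.:
# samtools view -b sample.bam $c > sample.chr${c}.bam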