How to Run a Cluster on the Amazon Cloud (AWS) for Bioinformatics

Bioinformatics Computing on Amazon with StarCluster
by Oliver; 2014-10-17


This article is no longer up to date

Computing with biological data on the cloud is a hot topic these days (e.g., see PLOS: Biomedical Cloud Computing With Amazon Web Services). I'm lucky. Because I work at a big university, I have access to a large computing cluster. But what if my university didn't have a cluster or I decided to open a little Mom and Pop bioinformatics shop? How could I do all the things I'm used to—run jobs on large datasets, in parallel, on powerful computers? Enter the popular Amazon Web Services (AWS)—pay-per-access computers "in the cloud"—and enter MIT's StarCluster, which makes it possible to set up a cluster of these computers.

Making a Bioinformatics AMI Tailored to Your Lab

If your lab is like ours, you have particular software tools or pipelines, Python packages, R libraries, etc. you rely on heavily. Thus, a good starting point is to make a custom Amazon Machine Image (AMI) that has all of your particular needs right at hand. The AMI is like an OS + pre-installed software which you will be able to access from any EC2 node you launch. To create one, boot off of a prexisting image (like Amazon Linux or Ubuntu), install your custom software (say, with yum or apt-get), and save the image. If you've drunk the Amazon Kool-Aid, this is the beginning of a new era where, instead of the drudgery of installing a program and all of its dependencies, you simply use somebody's fully-provisioned computer—via its AMI—to run your data.

How do we do this in practice? Navigate over to your AWS Console:


and launch an EC2 instance:
EC2 Dashboard >> INSTANCES >> Instances >> Launch Instance
For the purposes of installing stuff, a microinstance is probably sufficient. Choose an AMI. Since we're using StarCluster, we'll start with one of their sanctioned AMIs (if we don't, we'll run into problems later when we try to boot the cluster). In the first step of your instance launch progression:
Step 1: Choose an Amazon Machine Image (AMI) >> Community AMIs
search for starcluster in the Community AMIs section. We chose an Ubuntu-based starcluster image (this page describes how to list the official starcluster images).

Once you've got your instance up and running, ssh into it:
$ ssh -i /path/yourkey.pem
Become the root user:
$ sudo -i
And change to a directory where you want to install stuff:
$ cd /opt/software 
Once you're done installing stuff, click on your instance in the AWS Console and create an image:
EC2 Dashboard >> INSTANCES >> Instances >> Actions >> Create Image
It looks like this:


Check that your AMI was successfully created:
EC2 Dashboard >> IMAGES >> AMIs
If it was, you may now safely terminate your microinstance. However, let's hold off for a second until we create a volume.

Using an EBS Volume to Store Your Data

If your AMI is like an OS + custom software and your EC2 instance is ephemeral, where do you store data you want to persist? The answer is, in an AWS EBS volume. Navigate to:
EC2 Dashboard >> ELASTIC BLOCK STORE >> Volumes
And click on "Create Volumes." Presto! A new hard drive has just slid out of the birth canal. Select your volume and go to:
Actions >> Attach Volume
and attach the volume to your EC2 instance according to its Instance ID (note that the volume has to be in the same region—say, us-east-1a—as your instance). Now you have to follow the instructions given in Making an Amazon EBS Volume Available for Use. Following the docs, the first step is to list block devices:
$ lsblk
You should see something like this:
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
xvdf    202:80   0  800G  0 disk
In this case, I created an 800G volume, so our device name is xvdf. Now suppose I want to mount this volume on the path /home/ec2-user/Volumes. The protocol is:
$ sudo file -s /dev/xvdf
$ sudo mkfs -t ext4 /dev/xvdf # ONLY IF YOUR VOLUME IS EMPTY
$ mkdir /home/ec2-user/Volumes 
$ sudo mount /dev/xvdf /home/ec2-user/Volumes
Did it work? You should see the volume when you examine your disk space on the file system with a df -h command:
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.8G  1.1G  6.6G  15% /
devtmpfs        489M   60K  489M   1% /dev
tmpfs           498M     0  498M   0% /dev/shm
/dev/xvdf       788G   69M  748G   1% /home/ec2-user/Volumes
Since volumes are persistent entities, it's wise to make at least two—one small volume (say, 50G) for scripts, and one big volume for the current project's data which you may ultimately want to delete. Note that you can't attach the same volume to more than one instance at a time, although StarCluster lets you bypass this in a way.

Starting a Cluster with StarCluster

As we remarked in the introduction, StarCluster is a tool that automates the creation of a cluster outfitted with the Oracle Grid Engine (formerly known as "Sun Grid Engine" or "SGE"). The StarCluster has excellent documentation, which you should glance over before proceeding any further. To install StarCluster, use pip:
$ sudo pip install StarCluster		# use this if you're root
$ pip install --user StarCluster	# use this if you're not root
To use StarCluster, the first step is editing the configuration file:
$ vim .starcluster/config
(See full instructions at StarCluster: Quick-Start.) Set your cluster nodes to use the custom AMI you created in the first part of this tutorial with the NODE_IMAGE_ID key. And mount your scripts and data EBS volumes with the VOLUME_ID and MOUNT_PATH keys. Here's an example of what my mediumcluster template looks like:
[cluster mediumcluster]
VOLUMES=mydata, myscripts 
Which node type should you use? I would consult the pricing page: before you make a decision. I would also set ENABLE_EXPERIMENTAL=True so you can try using StarCluster's Elastic Load Balancer if you like (although this feature currently seems buggy). Once you've finished tailoring your configuration file, you're ready to start the cluster. Here are some useful commands.

Start a cluster called olivercluster:
$ starcluster start -c mediumcluster olivercluster
Once you've started the cluster, you can admire your nodes in the AWS GUI:


Start a cluster with spot instances (here using a $0.35 ceiling bid):
$ starcluster start -c mediumcluster -b 0.35 --force-spot-master olivercluster
Spot instances can save you a lot of money, but they can also die without warning. The spot price fluctuates and whatever it is—be it higher or lower than the current listed price—you pay it until it exceeds your ceiling bid. Before you decide on your ceiling, check the pricing history:
EC2 Dashboard >> INSTANCES >> Spot Requests >> Pricing History
For example, for here's today's m3.2xlarge spot price history:


The list price of this instance is around $0.50 as of this writing, so you'll save a lot of money if you run in us-east-1a during the right time window.

ssh into the head or master node:
$ starcluster sshmaster olivercluster
Kill the cluster:
$ starcluster terminate -f olivercluster
Before we get further, it's a good time to plug the AWS CLI (Command Line Interface). Install it, as the docs instruct:
$ pip install awscli --user  # the user flag installs it locally in ~/.local
and configure it:
$ aws configure
As all unix users know, the GUI is annoyingly cumbersome for some tasks. One example of this is tagging the nodes of your cluster. Tagging can be useful, for example, if multiple users are using your amazon account and you want to track who is spending what by project. Try:
Billing & Cost Management >> Cost Explorer >> Launch Cost Explorer >> Group By: Tag Key
and you can see which projects are costing what, provided you've been diligent about tagging your instances and volumes. One way to tag resources en masse is with the AWS command line tools, e.g.:
$ aws ec2 create-tags --resources i-c234172c i-c334172d i-c034172e i-be341750 i-bf341751 --tags Key=project,Value=myproject
(where I've used 5 specific Instance IDs to demonstrate this command)
Update: A better way is with jtriley's

Example 1: Running a Simple Job on StarCluster

Now that StarCluster is running, and we've sshed into the master node, how do we use it? Recall that the nodes are ephemeral and the only thing that will persist after we terminate the cluster are our volumes. So any output we want to save should be written to the volumes. However, if we're running jobs, they'll be faster if they create intermediate files in the local node storage, rather than doing a lot of wasteful file I/O operations copying things to and from the mounted volume.

Another consideration is, do we have enough space in the local node storage to work? This will depend on what you put in your configuration file. For me, the local node storage is in /mnt/sgeadmin. We can check how much space is available:
$ ssh node001 "df -h /mnt/sgeadmin/"
This is something to keep in the back of your mind as you work. You don't want this space to fill up and compromise your jobs. You can monitor the space of your nodes like so:
$ for i in {1..9}; do echo "***"node00${i}"***"; ssh node00${i} "du -sh /mnt/sgeadmin/"; done
Getting down to brass tacks, warm up by having a look at your nodes:
$ qhost
(If you need an SGE refresher, check out my wiki.) Let's try submitting a very simple script called

mylocaloutputdir=/mnt/sgeadmin  # local node storage
myoutputdir=/data/results       # global storage on mounted vol

# echo some nonsense
echo "Test $1" > ${mylocaloutputdir}/test${1}.txt

# copy file from local storage to mounted vol
cp ${mylocaloutputdir}/test${1}.txt ${myoutputdir}
Because I mounted my data volume on /data, that path is my persistent storage. Let's submit our jobs:
$ mkdir /data/results /data/logs_test 
$ for i in {1..10}; do echo $i; qsub -N job${i} -e /data/logs_test -o /data/logs_test -V -cwd -S /bin/bash ./ $i; done 
Smashing success! We can check on our temporary files with the same trick as above:
$ for i in {1..9}; do echo "***"node00${i}"***"; ssh node00${i} "ls /mnt/sgeadmin/"; done
These should be cleaned but we'll worry about that later. We can also check that the files are saved in /data/results.

Example 2: Mapping Data with Bowtie on StarCluster

We did most of the hard work of learning how to use StarCluster in the previous step. Now let's attack a more bioinformatically relevant task, and map reads with bowtie. Here's a slightly more sophisticated script, which improves on the test script above:

echo [start]
echo [pwd] `pwd`
echo [date] `date`

mate1=$1				# mate 1
mate2=$2				# mate 2
myoutputdir=$3				# global output (volume storage)
name=$4					# unique name
ref=/opt/ref/hg19/genome/bowtie/hg19	# bowtie prefix on our AMI
mylocaloutputdir=/mnt/sgeadmin		# local node storage

# map, convert sam to bam, sort bam
bowtie2 -x $ref -1 $mate1 -2 $mate2 | samtools view -bS - | samtools sort - $mylocaloutputdir/${name}.sorted

# copy from local storage to global storage
cp $mylocaloutputdir/${name}.sorted.bam $myoutputdir/

echo [completed mapping]
echo [date] `date`

# index
samtools index $mylocaloutputdir/${name}.sorted.bam
cp $mylocaloutputdir/${name}.sorted.bam.bai $myoutputdir/

echo [clean]

rm ${mylocaloutputdir}/${name}*

echo [finish]
echo [date] `date`
We can run this script as follows:
$ qsub -N ${name} -e logs_map -o logs_map -V -cwd -S /bin/bash ./ $mate1 $mate2 $outputdir $name; done
where, again, $outputdir refers to a path on our mounted volume and $mate1 and $mate2 are our fastq inputs.

Whatever you're running on the StarCluster, one question to ask yourself is, are you using the nodes efficiently? To monitor usage, try sshing into a node and running htop:


In the above screenshot, my node had 4 cores so I turned on the bowtie multi-threading option to parallelize over all of them (note each is at 100%). To tell your qsub you want to use 4 cores, the flag is:
$ qsub -pe orte 4

What I Don't Like About the AWS Cluster

Lest I sound too much like a fanboy, there are a number of things that I've found painful about running an AWS cluster in this way:
  • Every time you update your AMI's software or git repositories, you have to re-build your AMI and update the AMI ID in your star config file
  • You get billed as long as your nodes are alive, as opposed to just when they're running jobs
  • Starting and stopping new clusters, for testing, is slow (at least 5 minutes in my experience). And every time you update your AMI or want to mount a volume which all your nodes can see, you have to do this
In any case, that's the minimalist introduction to bioinformatics on the cloud—go wild!