Wiki: Bioinformatics

A Sprawling Bioinformatics Wiki Covering Software, File Formats, and Useful Commands
by Oliver; Jan. 13, 2014
   

Introduction

This is a collection of miscellaneous bioinformatics stuff for reference. So you know what to expect, this wiki is more a reminder to myself than a carefully crafted article.

Biotechnology Resources

Illumina, a leading manufacturer of sequencing machines, has an introduction to sequencing worth reading here: An Introduction to Next-Generation Sequencing Technology. I've stolen a figure from this document illustrating the next-generation sequencing workflow:

image
(Image credit: An Introduction to Next-Generation Sequencing Technology)

As the figure says, the four steps are:
  • NGS library is prepared by fragmenting a gDNA sample and ligating specialized adapters to both fragment ends.
  • Library is loaded into a flow cell and the fragments hybridize to the flow cell surface. Each bound fragment is amplified into a clonal cluster through bridge amplification.
  • Sequencing reagents, including fluorescently labeled nucleotides, are added and the first base is incorporated. The flow cell is imaged and the emission from each cluster is recorded. The emission wavelength and intensity are used to identify the base. This cycle is repeated "n" times to create a read length of “n” bases.
  • Reads are aligned to a reference sequence with bioinformatics software. After alignment, differences between the reference genome and the newly sequenced reads can be identified.
A major resource for biotechnology on the web is the National Center for Biotechnology Information.

BLAST (discussed below) is a famous sequence aligner you can use online.

The UCSC Genome Browser (discussed below) provides a graphical user interface for browsing the human genome, as well as the genomes of other species. The Exome Aggregation Consortium Browser provides a searchable database of variants from "a wide variety of large-scale sequencing projects." DAVID is a tool for analyzing lists of genes.

Biostars and SEQanswers are bioinformatics community forums where questions and answers are posted in the manner of Stack Overflow. They tend to pop up when you Google bioinformatics questions.

What Is a Read?; Mapping

Modern sequencing technologies break DNA or RNA into many shorter fragments, and it is these short fragments—reads in sequencing parlance—which are sequenced. For example, a read of 20 nucleotides might look like this:
AGCTCAGCAAAGCTTCGCTC
Typically, reads are longer than this, although the precise length is dependent on the sequencing platform. A giant file of raw reads isn't the most useful thing, so typically reads have to be processed—they are the input to bioinformatic programs or pipelines. A simple starting point is to map or align the reads, which means matching them to a pre-existing reference by sequence homology. This will tell us the read's location. For example, perhaps a hypothetical read maps to chromosome 1 of the human genome reference at position 2345. Then, after alignment, you might see something like this:
chr1 	 2345 	 AGCTCAGCAAAGCTTCGCTC
for the particular read.

Your set of reads is unlikely to agree perfectly with the reference to which you're mapping it, since members of the same species don't have identical genomes. Mapping is a way to quantify both the similarities and the differences vis-à-vis the reference. These differences can take the form of single nucleotide variants (SNVs) and insertions/deletions (indels). Larger structural rearrangements, such as copy number variations and chromosomal translocations, can be more challenging to catch.

Two important concepts related to mapping are paired-end reads and coverage depth. Coverage depth is simply the number of reads that align on a particular spot of the reference genome:
image
In this figure, for example, the coverage depth at the position of the black triangle is four.
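Once you have a sorted, indexed bam (more on that below), samtools will report this number for you. A minimal sketch, assuming a hypothetical aln.sorted.bam and the position from the read example above:
$ # depth at a single position; output columns are chromosome, position, depth
$ samtools depth -r chr1:2345-2345 aln.sorted.bam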

The Illumina sequencing introduction linked above describes paired-end reads:
A major advance in NGS technology occurred with the development of paired-end (PE) sequencing. PE sequencing involves sequencing both ends of the DNA fragments in a sequencing library and aligning the forward and reverse reads as read pairs. In addition to producing twice the number of reads for the same time and effort in library preparation, sequences aligned as read pairs enable more accurate read alignment and the ability to detect indels, which is simply not possible with single-read data. Analysis of differential read-pair spacing also allows removal of PCR duplicates, a common artifact resulting from PCR amplification during library preparation. Furthermore, paired-end sequencing produces a higher number of SNV calls following read-pair alignment. While some methods are best served by single-read sequencing, such as small RNA sequencing, most researchers currently use the paired-end approach.
Their figure:

image
(Image credit: An Introduction to Next-Generation Sequencing Technology)

We mentioned the alignment software Blast above, but for high-coverage, paired-end, short reads—i.e., what comes out of most modern sequencing machines—you'll want to avoid Blast and use a program tailor-made for this job, such as bwa.

Assembly

What if you have sequencing data for an uncharacterized bacterium or transcriptome? Or what if you want to produce your own reference genome? In all of these cases, the protocol is sequence assembly, which is the process of stitching your reads back together—i.e., reconstructing the sequences from which your reads were sheared. When two or more reads are joined, the longer sequence they produce is called a contig. Assembly is usually a very computationally intensive process and there are many programs to choose from, as listed on the Wikipedia page. After your assembly is finished, you've made a reference and other people can map their reads to it. Various big bioinformatics outfits regularly roll out new and improved reference assemblies of the human genome and those of other species (see, e.g., the Genome Reference Consortium). One side note about assembly is that there is no answer key—no way to check how close you got to the right answer.
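Two numbers people usually check first on a finished assembly are how many contigs it produced and how much total sequence they add up to. A quick sketch, assuming your assembler wrote its contigs to a hypothetical contigs.fa:
$ # number of contigs
$ grep -c "^>" contigs.fa
$ # total assembled length (sum of the lengths of all sequence lines)
$ grep -v "^>" contigs.fa | awk '{ total += length($0) } END { print total }'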

The Genome

For the purpose of getting a simple model in your head, think of the genome as having two distinct regions: genes and intergenic regions. Continuing the oversimplification, think of genes as having two kinds of components: exons, the coding part, and introns, the non-coding part. Here's a figure from Wikipedia:

image
(Image credit: Wikipedia: Exon)

Current DNA sequencing experiments often focus only on the total set of exons, known as the exome (although whole-genome sequencing is becoming more common).

DNA vs RNA Sequencing Data

The fundamental difference between DNA-seq and RNA-seq data is that the latter contains transcripts which are the product of splicing—the exons we met above can join in various combinations to form mRNA. This makes dealing with RNA data more of a challenge: trying to map it onto a genomic reference will be harder, since two contiguous exons in an RNA transcript may be separated by an intron in the genomic DNA. Furthermore, a DNA-seq dataset should have sequences representing the whole genome (although, in practice, often just the exome is captured), whereas an RNA-seq dataset will only contain the particular transcripts expressed in your population of cells (e.g., liver cells are going to look different than brain cells). Because humans are diploid, a heterozygous variant at a particular position will be present in roughly 50% of the covering reads in DNA-seq data. In RNA-seq, however, this signal will be much noisier because coverage is related to transcript expression and all the complicated regulatory machinery that goes along with it. It's easier to call variants in DNA sequencing data, but RNA sequencing data has other dimensions: you can use it for analyses of gene expression, gene fusions, and pathogen discovery. As my coworker eloquently put it, RNA is like a way station—if you want to express protein, you have no choice but to pass through it, so the genetic material of the organism as well as that of any virus infecting it will show up in the RNA.

Hat tip: Sak

Fastq and Fasta Formats

Unaligned, short reads often come in fastq ("fast Q") format. In this file format, there are 4 lines per read: read ID; sequence; a plus sign; and sequence quality. For example, the following 12 lines comprise three sequences:
@CWS01-ILLUMINA:9:FC:1:1:4115:1677 1:Y:0:GGCTAC
NGCGCCGGGCGGGGCGAAAGCGGGCGTCATGGGCGCAGCTTCAGGCGAATGTAGAGGAAATTGACCGCCGCGGCGG
+
############################################################################
@CWS01-ILLUMINA:9:FC:1:1:4153:1672 1:Y:0:GGCTAC
NGCGGAAAGCGGCTGGGGGGTGCGGGGACGGCGCGCGCACGGAGGGGGCTCGCCGGCGCCGCGGGCAGGCGGGCGC
+
############################################################################
@CWS01-ILLUMINA:9:FC:1:1:4197:1676 1:N:0:GGCTAC
NGAAAAAGGCGGCAGCAGCATGGTTGGTGGCGGGGAGCATGTGCGCCGCGGCGCACGCGCAAGGCACGGGGACGCT
+
#,.,*///./@@@@@@@2@@<<<:<@@@@@##############################################
A more general format for storing sequencing data is fasta ("fast A"). The pattern for this format is >Sequence ID followed by the sequence, spanning any number of lines. Note the identifier always starts with a > character. Compared to fastq format, fasta format is suitable for longer sequences. For example, here are two sequences in fasta format:
>c0_g1_i1
GTTTCATACTTGCTCTGAATGGCGTTGCTGAGTCCAGACCAGCATCCGCGGTGCCAGCGC
CGGGCACCGCAGGGGACCTGAGTCCTGCCCCAGTGTCAAGCCGCGGTTCTGTTATCCGTC
ACCACGACAACCAAGCACCAGACCTGCCGGGTGTTCGGCAGTGGCAGTGACAGCTCAGAC
CCGCTCAGACGCAGGGTAGTGAGCCGGGCGGCTCTGGTG
>c2_g1_i1
TTGTATTCAAATTCATTCCTAGTCCTATTGGCTCTTTTATTCTTTCCACGGAATCTTCCG
TTGTGTGTACTTTTTATCTTTTCTTTAGAATTTCTTTTGAATTTAATTAATACCATAAGG
AGCCTAGTGATTCCCAACTGTTTACAAATTTTGCCTTGTCAATGAGATTTTAAATTCTTT
TCCTAAGGACTTAGTTGGGGCATAGTAAGAAATTGTAATTGATTGATAAAAAAAAACCAT
Note that there's no sequence quality in this format. You'll usually find reference genomes or contigs produced by an assembly in fasta format, while reads fresh off the sequencing machine come in fastq.
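Since fastq devotes exactly four lines to each read and fasta marks every record with a >, counting records is a one-liner (assuming uncompressed files):
$ # number of reads in a fastq (4 lines per read)
$ cat myfile.fastq | awk 'END { print NR/4 }'
$ # number of sequences in a fasta
$ grep -c "^>" myfile.fa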

Alignment and SAM Format

As we remarked above, alignment, also called mapping, is the process of matching reads to a pre-existing reference by sequence homology. Aligned reads are stored in yet another file format known as sam (for Sequence Alignment/Map), which includes both the read and reference sequences and information about where and how they differ. The compressed form of a sam file is known as a bam file. Here's a typical workflow:
fastq file of reads --> align to reference --> bam file
A bam file can be indexed to produce a corresponding bai index file, as you can read about on BioStars.

Here's what the first 10 lines of a sam file look like:

image

Quoting directly from the man page, the columns are:
sam
 1   Query template NAME
 2   bitwise FLAG
 3   Reference sequence NAME
 4   1-based leftmost mapping POSition
 5   MAPping Quality
 6   CIGAR string
 7   Ref. name of the mate/next read
 8   Position of the mate/next read
 9   observed Template LENgth
10   segment SEQuence
11   Phred-scaled base QUALity+33
If you're mapping to the human genome, you can find this reference here. There are many programs to align reads, such as bowtie, bwa (DNA), and STAR (RNA).

Bowtie and BWA

As we remarked above, bowtie—now officially bowtie2—and bwa are two popular aligners. They excel at mapping short reads from high-throughput sequencing data.

Bowtie

To use bowtie, you must first index your reference:
$ bowtie2-build ref.fa prefix
(It will create files whose names begin with prefix.)
To map paired reads (bowtie2 version 2.1.0):
$ bowtie2 -x /path/to/ref/prefix -1 read1.fq -2 read2.fq > aln-pe.sam
To quote the manual directly, the general usage is:
Usage: 
  bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]

  <bt2-idx>  Index filename prefix (minus trailing .X.bt2).
             NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
  <m1>       Files with #1 mates, paired with files in <m2>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.
Oftentimes, your goal with mapping is to produce a sorted bam file along with its bai index. If you're smart, you can use bash pipes to do this without creating a lot of polluting intermediate files. E.g.:
$ # map, convert sam to bam, sort bam
$ bowtie2 -x $ref -1 $mate1 -2 $mate2 | samtools view -bS - | samtools sort - $mylocaloutputdir/${name}.sorted
This will produce a file called $mylocaloutputdir/${name}.sorted.bam.

BWA

Index your reference:
$ bwa index ref.fa
For single end reads:
bwa aln ref.fa short_read.fq > aln_sa.sai
bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
For paired reads:
bwa aln ref.fa read1.fq > aln_sa1.sai
bwa aln ref.fa read2.fq > aln_sa2.sai
bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
Generate a bam file, instead of a sam file, with bwa:
bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq | samtools view -bS - > aln-pe.bam
For long reads:
bwa bwasw ref.fa long_read.fq > aln.sam
Here's an example of a script to automate bwa mapping. It assumes the fastq files are zipped and uses 8 cores:
#!/bin/bash

echo [start]
echo [pwd] `pwd`
echo [date] `date`

mate1=$1
mate2=$2
myoutputdir=$3
name=$4
ref=$5

echo 1
bwa aln -t 8 $ref <( cat $mate1 | gunzip ) > ${myoutputdir}/$name.1.sai
echo 2
bwa aln -t 8 $ref <( cat $mate2 | gunzip ) > ${myoutputdir}/$name.2.sai
echo 3
bwa sampe $ref ${myoutputdir}/$name.1.sai ${myoutputdir}/$name.2.sai <( cat $mate1 | gunzip ) <( cat $mate2 | gunzip ) | samtools view -bS - > ${myoutputdir}/${name}.bam

# now sort and index bam file

echo [finish]
echo [date] `date`
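The comment near the end of the script leaves sorting and indexing as an exercise. A minimal sketch of those last two steps, using the same samtools 0.1.19 syntax as elsewhere on this page (sort takes an output prefix):
$ samtools sort ${myoutputdir}/${name}.bam ${myoutputdir}/${name}.sorted
$ samtools index ${myoutputdir}/${name}.sorted.bam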

Samtools

As we remarked above, sam format is a common file structure in which to store aligned sequence data. Samtools is a popular suite of software tools for manipulating sam and bam files. Here are some basic samtools (version 0.1.19) commands.

View the first one hundred lines of a bam file on the command line:
$ samtools view myfile.bam | head -100 | column -t | less -S
Note the:
column -t | less -S
combination, which makes viewing jagged multi-column files in the terminal easy.

Convert bam to sam:
$ samtools view myfile.bam -h > myfile.sam
The -h flag preserves the header, which is necessary for converting sam back to bam.

Get a particular region of a bam file, e.g. chr20:
$ samtools view -h myfile.bam chr20
Sam to bam:
$ samtools view -bS myfile.sam > myfile.bam
Sort bam:
$ samtools sort myfile.bam myfile.sorted
Index a bam file (to produce a .bai file):
$ samtools index myfile.sorted.bam
Get the coverage depth of a bam file:
$ samtools depth myfile.sorted.bam > myfile.sorted.depth.txt
Index a fasta file (to produce a .fai file):
$ samtools faidx myfile.fa
Pull out a subsequence from an faidx-indexed fasta file by range:
$ samtools faidx myfile.fa chr1:100000-100005
Generate a bam file such that all reads meet the criteria "the read is paired in sequencing" and "the read is mapped in a proper pair" (the two flag bits, 0x1 and 0x2, are combined into the single mask 0x3):
$ samtools view -b -h -f 0x3 myfile.bam > myfile.goodreads.bam
A one-liner to split your bam file by chromosome (the input bam must be indexed):
$ for i in {1..22} {X,Y,MT}; do echo "* "$i; mkdir -p chr/${i}; samtools view -bS <( samtools view -h in.bam $i ) > chr/${i}/out.${i}.bam; done
Samtools has an ASCII alignment viewer which allows you to see how reads mapped to the reference right in your terminal. (First make sure that you've indexed your .bam file so that the .bai file is in the same directory.) Syntax:
$ samtools tview $bam $ref
Then type g to go to a position or ? for help.

Here's an example of what it looks like:

image

View reads covering a particular position (e.g., chr9:711187):
$ samtools tview $bam $ref -p chr9:711187-711187 -d t
We'll talk about pileup format below, but here's how to generate it with samtools for a particular region (e.g., chr9:711187-711287):
$ samtools mpileup -d 100000 -L 100000 -r chr9:711187-711287 -f $ref $bams
The mpileup command has a lot of flags and you need to pay careful attention to them. Here, we're using:
  • -d INT max per-BAM depth to avoid excessive memory usage [250]
  • -L INT max per-sample depth for INDEL calling [250]
By default, the command only considers a depth of 250 reads. This is quite shallow, so if you have deep sequencing data, it's worth increasing it so you won't miss reads. In our case, 100000 is a safe number. But you might still ask, where are all my reads? Again, by default, samtools discards "anomalous read pairs"—meaning, say, one mate mapped and one mate didn't—and discards "bases with baseQ/BAQ smaller than 13." If, for some reason, you want to see all your reads, even the lousy ones, run:
$ samtools mpileup -A -Q 0 -d 100000 -L 100000 -r chr9:711187-711287 -f $ref $bams

Samtools Flags

The second column in a sam file is the flag. The flag gives you information about how your read mapped, but to decode this information you have to convert the flag from base-10 to base-2. For example, if your flag is 99, then in binary it's:
1100011
How do we read this? Quoting from the Samtools man page:

 Bit    Hexadecimal   Description
   1    0x1           template having multiple segments in sequencing (i.e., read paired)
   2    0x2           each segment properly aligned according to the aligner (i.e., read mapped in proper pair)
   4    0x4           segment unmapped (i.e., read unmapped)
   8    0x8           next segment in the template unmapped (i.e., mate unmapped)
  16    0x10          SEQ being reverse complemented
  32    0x20          SEQ of the next segment in the template being reverse complemented
  64    0x40          the first segment in the template
 128    0x80          the last segment in the template
 256    0x100         secondary alignment
 512    0x200         not passing filters, such as platform/vendor quality controls
1024    0x400         PCR or optical duplicate
2048    0x800         supplementary alignment

So, for 99 == 1100011, the 2^0, 2^1, 2^5, and 2^6 bits are all high—i.e., it's paired (1); it's mapped in a proper pair (2); its mate is on the reverse strand (32); and it's the first mate in the pair (64). As we saw in the previous section, we can filter on these qualities. For example, to get all the first-mate reads, it's:
samtools view -f 64 file.bam
To get the second mate, it's:
samtools view -f 128 file.bam
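If you'd rather not do the binary conversion in your head, here's a little bash sketch that prints which bits are set in a given flag (99 in this example); newer samtools releases (the 1.x series) also have a flags subcommand that does this translation for you:
$ flag=99; for bit in 1 2 4 8 16 32 64 128 256 512 1024 2048; do (( flag & bit )) && echo "bit $bit is set"; done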

File Formats in Which to Store Variants: mpileup, vcf, maf

If you have multiple bam files which you want to compare and all are mapped to the same reference, you can put them in mpileup format. To emphasize this point, bam files have information for a single sample, while pileup format can accommodate an arbitrary number of samples merged. A pileup file contains all the information in the bam files displayed "vertically." It gives the reference base per position as well as the corresponding read bases and read base qualities for all samples. It might look like this:

chr   pos     ref base   cov depth   read bases   read base quals
1     15026   C          6           ,$..,,,      HD?E@9
1     15027   T          6           ..,,,,       DD>A:C
1     15028   C          4           .,,,         EC;7
1     15029   G          5           .,,,,        AE?>C
1     15030   T          5           .,,,,        9CC=?
1     15031   G          5           A,,,,        5FAAE
1     15032   G          4           .,,,         >F5<
1     15033   C          5           .,,,,        BB<<D
Here's a screenshot of a two-sample mpileup file:

image

The columns for a single sample are:
mpileup
1   chromosome
2   1-based coordinate
3   reference base
4   the number of reads covering the site
5   read bases
6   base qualities
You add three columns—the number of reads covering the site, read bases and base qualities—for each additional sample. The periods and commas in the read bases strings represent reference-matching alleles, while letters like A, C, T, and G represent variants. As always, consult the man page for precise information about how to interpret these fields.
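One nice consequence of this layout is that simple filters are just awk on the depth column. For a single-sample pileup, a sketch that keeps only positions covered by at least 10 reads:
$ samtools mpileup -f ref.fa myfile.sorted.bam | awk '$4 >= 10'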

Pileup format isn't particularly human-readable, so there's one more format translation you need to do, into Variant Call Format (vcf). Here's the idea:
multiple bam files --> mpileup --> vcf
Here's what a vcf file looks like:

image

Quoting directly from here, the columns are:
vcf
1   CHROM    the chromosome
2   POS      the genome coordinate of the first base in the variant. Within a chromosome, VCF records are sorted in order of increasing position
3   ID       a semicolon-separated list of marker identifiers
4   REF      the reference allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC")
5   ALT      the alternate allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC"). If there is more than one alternate allele, the field should be a comma-separated list of alternate alleles
6   QUAL     probability that the ALT allele is incorrectly specified, expressed on the phred scale (-10log10(probability))
7   FILTER   either "PASS" or a semicolon-separated list of failed quality control filters
8   INFO     additional information (no white space, tabs, or semicolons permitted)
9   FORMAT   colon-separated list of data subfields reported for each sample
A double pound symbol (##) precedes description or comment lines in the header and a single one (#) precedes the actual column header. Here's an example of ## lines:
##fileformat=VCFv4.2
##source=pileup2multiallele_vcf
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Raw Read Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting reads">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting reads">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Quality Read Depth of bases with Phred score >= 0">
##FORMAT=<ID=FREQ,Number=1,Type=Float,Description="Variant allele frequency">
##FORMAT=<ID=RBQ,Number=1,Type=Integer,Description="Average quality of reference-supporting reads">
##FORMAT=<ID=ABQ,Number=1,Type=Integer,Description="Average quality of variant-supporting reads">
##FORMAT=<ID=RDF,Number=1,Type=Integer,Description="Depth of reference-supporting reads on forward strand">
##FORMAT=<ID=RDR,Number=1,Type=Integer,Description="Depth of reference-supporting reads on reverse strand">
##FORMAT=<ID=ADF,Number=1,Type=Integer,Description="Depth of variant-supporting reads on forward strand">
##FORMAT=<ID=ADR,Number=1,Type=Integer,Description="Depth of variant-supporting reads on reverse strand">
##SnpSiftVersion="SnpSift 4.1c (build 2015-03-29), by Pablo Cingolani"
##SnpSiftCmd="SnpSift annotate dbSnp138.vcf res/report.vcf"
##INFO=<ID=CLNDSDBID,Number=.,Type=String,Description="Variant disease database ID">
##INFO=<ID=PH3,Number=0,Type=Flag,Description="HAP_MAP Phase 3 genotyped: filtered, non-redundant">
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=G5A,Number=0,Type=Flag,Description=">5% minor allele frequency in each and all populations">
##INFO=<ID=CFL,Number=0,Type=Flag,Description="Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies.">
##INFO=<ID=KGValidated,Number=0,Type=Flag,Description="1000 Genome validated">
Whatever sub-fields are in the INFO or FORMAT field should be described here. After the nine fixed fields of the vcf, you can have an additional field for each sample, whose format is specified by the FORMAT column. The INFO sub-fields generally describe features that hold true across all samples, while the FORMAT sub-fields give you information per sample.

Here are some examples of sub-fields of the FORMAT field:
DP    Description="Quality Read Depth of bases with Phred score >= 15"
RD    Description="Depth of reference-supporting bases"
AD    Description="Depth of variant-supporting bases"
RBQ   Description="Average quality of reference-supporting bases"
ABQ   Description="Average quality of variant-supporting bases"
RDF   Description="Depth of reference-supporting bases on forward strand"
RDR   Description="Depth of reference-supporting bases on reverse strand"
ADF   Description="Depth of variant-supporting bases on forward strand"
ADR   Description="Depth of variant-supporting bases on reverse strand"
There's an alternative to vcf known as Mutation Annotation Format (maf). Why are there two formats, not one standard? Just to make your life a living hell! In my opinion, vcf format is better, cleaner, and simpler and—as a consequence—has better software tools written for it. I never use maf but I'll include it for completeness:
maf
 1   Hugo_Symbol
 2   Entrez_Gene_Id
 3   Genome center
 4   NCBI_Build
 5   Chromosome
 6   Start_Position
 7   End_Position
 8   Strand
 9   Variant_Classification
10   Variant_Type
11   Reference_Allele
12   Tumor_Seq_Allele1
13   Tumor_Seq_Allele2
14   dbSNP_RS
15   dbSNP_Val_Status
16   Tumor_Sample_Barcode
17   Matched_Norm_Sample_Barcode
18   Match_Norm_Seq_Allele1
19   Match_Norm_Seq_Allele2
20   Tumor_Validation_Allele1
21   Tumor_Validation_Allele2
22   Match_Norm_Validation_Allele1
23   Match_Norm_Validation_Allele2
24   Verification_Status
25   Validation_Status
26   Mutation_Status
27   Sequencing_Phase
28   Sequence_Source
29   Validation_Method
30   Score
31   BAM_File
32   Sequencer
33   Tumor_Sample_UUID
34   Matched_Norm_Sample_UUID
As you can see, its gzillion fields make it unwieldy and inconvenient, which is why you should stick to vcf.

Working with VCF Files

View a vcf file in the terminal (cutting off header lines):
$ cat file.vcf | sed '/##/d' | column -t | less -S
Index tab-delimited files, such as those in vcf format, with tabix for fast searching:
$ bgzip file.vcf                       # produces file.vcf.gz
$ tabix -p vcf file.vcf.gz             # index vcf file with tabix (make .tbi file)
$ # now we can grab ranges super fast, say chr 10 pos 370800 to 370900
$ tabix file.vcf.gz 10:370800-370900   
Note that this uses samtools' bgzip, not a standard unix utility like bzip2. Once you've indexed a vcf file in this way, you can easily pull out any region—say, chromosome 10, positions 370800-370900—as shown above.

There are Perl scripts for manipulating vcfs called VCFtools, but these can be slow with large vcfs. vcflib, faster than VCFtools, is "a simple C++ library for parsing and manipulating VCF files, + many command-line utilities".

Extract particular fields from the INFO-field:
$ vcfkeepinfo test.vcf DP4 MQ > test.small.vcf
Extract particular fields from the GENOTYPE-field (i.e., FORMAT-field):
$ vcfkeepgeno test.vcf RBQ ABQ SDP AD > test.small.vcf
In this particular case, we keep only the reference and variant base-quality fields (RBQ, ABQ), plus the total depth (SDP) and the variant-supporting depth (AD).
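For quick one-off checks, plain unix tools also work fine on the tab-delimited body of a vcf. For example, a sketch that counts the records passing all filters and pulls out the basic variant columns (file name hypothetical):
$ # count records whose FILTER column (column 7) is PASS
$ cat file.vcf | grep -v "^#" | awk '$7 == "PASS"' | wc -l
$ # print chromosome, position, reference allele, and alternate allele
$ cat file.vcf | grep -v "^#" | cut -f 1,2,4,5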

Calling Variants

Calling variants is a phrase which can mean two things in bioinformatics: (1) simply enumerating differences between a reference and a sample; and (2) determining which of those differences are significant. These are two very different questions. In both cases, we follow the pattern discussed above:
bam files --> mpileup file --> vcf
Enumerating the differences is straightforward; finding the significant ones is difficult. Here are three directions in which to pursue this problem.

bcftools

Call variants with mpileup + bcftools (version 0.1.19). E.g.:
$ samtools mpileup -D -S -d 5000 -r 1 -uf ref.fa 1.bam 2.bam | bcftools view -bvcg -p 2  - | bcftools view - > out.vcf
These commands are highly dependent on the flag options. Here are some of the options:
Output options:

       -d           the max per-BAM depth to avoid excessive memory usage [250]
       -D           output per-sample DP in BCF (require -g/-u)
       -g           generate BCF output (genotype likelihoods)
       -O           output base positions on reads (disabled by -g/-u)
       -s           output mapping quality (disabled by -g/-u)
       -S           output per-sample strand bias P-value in BCF (require -g/-u)
       -u           generate uncompress BCF output

VarScan

Call variants with mpileup + VarScan. E.g.:
$ samtools mpileup -D -S -d 5000 -r 1 -f ref.fa 1.bam 2.bam | java -Xmx2G -jar VarScan.v2.3.6.jar mpileup2cns --output-vcf 1 --p-value 2 > out.vcf

Problems with bcftools and VarScan

As of this writing, bcftools and VarScan have some shortcomings. You can read a full critique of them here, but in brief:
  • bcftools doesn't call multi-allelic variants
  • bcftools doesn't give per-sample depths for multiple samples
  • VarScan crashes on 0-depth positions in the pileup file
  • VarScan doesn't produce a well-formed vcf as output
  • VarScan on occasion gives depths that are simply wrong
  • VarScan mishandles multi-allelic variants
One way to avoid these problems is to use our program mpileup2vcf.

Blast

Blast, for Basic Local Alignment Search Tool, is a robust and versatile aligner which can handle short as well as very long sequences. As the name says, it's a local aligner which means that, no matter how large your query and subject (reference) sequences are, it will find all the matching subsequences. It also means the alignment can be degenerate—one piece of the query sequence can map, non-uniquely, to many different places on the reference sequence.

Blast has an online GUI but if, say, you want to make your own blast databases or run batches of jobs, it's worth downloading the command line version. Example command line blast syntax:
$ blastn -task megablast -db /db/prefix -outfmt 6 -query input.fa -word_size 28 -evalue .002
$ blastn -task megablast -db /db/prefix -outfmt 6 -query input.fa -evalue 1e-30 -max_target_seqs 5
The header is:
1    qseqid
2    sseqid
3    pident
4    length
5    mismatch
6    gapopen
7    qstart
8    qend
9    sstart
10   send
11   evalue
12   bitscore
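Since -outfmt 6 is just tab-delimited text with those twelve columns, post-processing hits is easy. A sketch, assuming you've redirected blastn's output to a hypothetical hits.tsv:
$ # keep hits with percent identity (column 3) above 90
$ cat hits.tsv | awk '$3 > 90'
$ # show the best hits by bitscore (column 12)
$ sort -k12,12gr hits.tsv | head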

Index a Fasta File to Use as a Blast Database

To make a reference for blast, you have to index it. E.g.:
$ makeblastdb -in ${input_fasta} -dbtype nucl -out ${name} -parse_seqids -hash_index

Downloading NCBI's Blast NT Database

The Blast NT database comprises a broad swath of sequence data across all life—"All GenBank + EMBL + DDBJ + PDB sequences," as the docs describe it. It's the most comprehensive repository of sequences there is and what you should blast to if you don't know what species your query sequence is. To download:
$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*
Then untar.
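The download arrives as a series of numbered tarballs (nt.00.tar.gz, nt.01.tar.gz, and so on), so untarring is a loop over them. A sketch assuming that naming:
$ for f in nt.*.tar.gz; do tar -xzf "$f"; done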

Making a Blast Subset Database

Once you've downloaded the NT database you can make a subset of it corresponding to, say, viruses or bats or anything else. To do this, you first need to get a list of the GI accessions of all the sequences in the taxa you want. Go to NCBI Nucleotide and refine your search results using the Taxonomic Groups box on the right. Download the GIs as a text file by selecting: "Send to: File" and "Format: GI List". Here's a screenshot showing the process of downloading all bat GIs:

image

It will be called something like sequence.gi.txt. Then use the blastdb_aliastool to make a subset db:
$ blastdb_aliastool -db /path/to/blast/database/nt -dbtype nucl -gilist sequence.gi.txt -title YourTitle -out YourName

Querying the NT Database from the Command Line

To see the "head" of your NT database in fasta format, use:
$ blastdbcmd -db ${ntdb} -entry all -outfmt '%f' | head
To query the taxid and description for an entry in the NT database for an individual GI accession, use:
$ blastdbcmd -db ${ntdb} -entry ${gi_accession} -outfmt '%T %t' -target_only
For a batch of GI accessions, use:
$ blastdbcmd -db ${ntdb} -entry_batch ${gi_accession_file} -outfmt '%T %t' -target_only

bl2seq

bl2seq is a command in the blast suite for pairwise blasting 2 sequences. E.g.:
$ bl2seq -p blastn -i seq1.fasta -j seq2.fasta

Blat

Blat is a BLAST-like tool developed by UCSC. It's often much faster than BLAST. Example command line blat syntax:
$ blat ref.fa query.fa output.psl
The output is in UCSC's psl format.

Get blast style output:
$ blat ref.fa query.fa output.psl -out=blast
Get tabular blast style output:
$ blat ref.fa query.fa output.psl -out=blast8

UCSC Genome Browser

UCSC Genome Browser is a very popular and rightly famous tool for visualizing reference genomes (i.e., assemblies) and all sorts of annotation overlaying them. It was written by Jim Kent. If you're ambitious, you can try installing a local copy of the browser on your own server to host your own assemblies—read about how I did it here.

SRA Toolkit

If you use NCBI's Sequence Read Archive (SRA), you can download raw sequencing data as files with an .sra extension. To convert these into fastq, download the SRA toolkit and run its fastq-dump:
$ fastq-dump myfile.sra --gzip --split-files > log.o 2> log.e
(The flags output the reads in compressed form, with one file per mate.)

IGV

Integrative Genomics Viewer (IGV) is a highly regarded and widely used tool for visualizing, among other things, reads mapped to reference. For example, you can load bam files into it and view the coverage depth of reads on the reference at any particular window on the genome.

If you have a list of variants which you believe are real, it's a good idea to look at them in IGV to make sure the supporting reads don't look funny. Here's a screenshot of two bam files mapped to the human genome (hg19) displayed in IGV. At the position of the cursor, the reference is a C but some of the reads are calling a T:

image

Bedtools

Bedtools is a suite of utilities for so-called genomic arithmetic—e.g., finding intersections between features in annotated sets of sequence data. One thing I hate about it: Bedtools follows the incredibly stupid and insanely confusing UCSC genome browser convention of representing coordinates with zero-based counting where the end coordinate is not included. Here's an example of this nonsense, using Bedtools' getfasta command, which "extracts sequences from a FASTA file for each of the intervals defined in a BED file":
$ cat junk.fa
>junk
ACTGGAACCT
$ cat junk.bed
junk    2       5
$ bedtools getfasta -fi junk.fa -bed junk.bed -fo out.fa
$ cat out.fa 
>junk:2-5
TGG
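The practical upshot of the zero-based, end-exclusive convention: a 1-based range like junk:3-5 (the style samtools faidx uses elsewhere on this page) becomes start minus one and the same end in the bed file. A small awk sketch for the conversion:
$ echo "junk:3-5" | awk -F '[:-]' '{ print $1 "\t" $2-1 "\t" $3 }'
junk    2       5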

GFF Format

GFF, or Generic Feature Format, is the standard format for describing or annotating sequence features. Read about it here.
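In brief, a GFF record is one tab-delimited line with nine columns: seqid, source, feature type, start, end, score, strand, frame, and attributes. A made-up example line (with GFF3-flavored attributes):
chr1    example_source    exon    11869    12227    .    +    .    ID=exon00001;Parent=transcript00001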

FastQC

FastQC is a quality control tool for high throughput sequence data. It provides nice graphical displays of quality metrics for your fastq files and alerts you if, for example, your reads are low quality or particular k-mers are overrepresented. For this reason, it's a good idea to run this on your fastqs before you map them to nip any potential problems in the bud. Here's the simple syntax:
$ fastqc -o outputdir -f fastq myfile.fastq
Here's one of the many graphs it produces, average per base sequence quality:

image

Cutadapt

Cutadapt is a program for cutting junk adapters out of sequencing data.
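A minimal sketch of its usage, where -a gives the 3' adapter to trim and -o the output file; the adapter sequence below is just a placeholder, so substitute whatever your library prep actually used:
$ cutadapt -a AGATCGGAAGAGC -o myfile.trimmed.fastq myfile.fastq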

Trinity

Trinity is a de novo RNA seq assembler—that is to say, you input reads and it outputs a transcriptome. One of the best things you can say about Trinity is that, unlike many other tools in the field, it's super easy to use. As with all assemblers, it takes a significant amount of time and memory for large data sets.
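A sketch of a typical paired-end run. Take the flags with a grain of salt, since they have shifted between Trinity releases (older versions, for instance, used --JM rather than --max_memory for the memory cap):
$ Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 8 --max_memory 50G --output trinity_out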

Cufflinks

In contrast to Trinity, Cufflinks is a reference-based RNA seq assembler—that is to say, you input reads plus a reference genome assembly and it outputs a transcriptome. It's the most famous program of its kind, although using it has a learning curve.
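A sketch of the basic invocation: it takes a bam of reads already aligned to the genome by a splice-aware aligner (the file name below is TopHat's conventional output, used here just as an example) and writes its transcript models to an output directory:
$ cufflinks -p 8 -o cufflinks_out accepted_hits.bam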

Command Line Bioinformatics: Pulling Out Subsequences from a Fasta File by Range

Here's how to pull out subsequences from a fasta file, such as the reference genome hg19, by range. The only trick is that you have to index your fasta with samtools faidx first. Here, we're going to look at the sequences flanking three random positions in chromosome one, 100000, 2349483, 534900, by 10 base pairs on either side (check out the cool combination of awk and xargs!):
$ echo -e "100000\n2349483\n534900" | awk '{print "chr1:"$1-10"-"$1+10}' | xargs -i echo "samtools faidx /path/hg19.fa {}" | sh        
>chr1:99990-100010
tcatcacactcactaagcaca
>chr1:2349473-2349493
TGTCCTGCCCCCAAACGCGAA
>chr1:534890-534910
gttggctaggctggtctcgaa

Command Line Bioinformatics: Converting Fastq to Fasta

$ cat myfile.fastq | awk '{if (NR%4==1) {print ">"substr($1,2)} else if (NR%4==2) {print $0}}'

Command Line Bioinformatics: Putting the Sequence Part of a Fasta File on a Single Line

$ cat myfile.fasta | awk 'BEGIN{f=0}{if ($0~/^>/) {if (f) printf "\n"; print $0; f=1} else printf $0}END{printf "\n"}'
(from my co-worker, Vladimir)

Command Line Bioinformatics: Reverse Complementing a Sequence

$ echo TTACGGT | rev | tr ATGC TACG
Courtesy of this blog.

Command Line Bioinformatics: Computing the Sequence Composition (i.e., how many A,C,G,Ts) in a Fasta File

One way:
$ for i in A C T G; do cat myfile.fasta | grep -v ">" | grep -o ${i} | wc -l; done
A better way from my co-worker, Albert:
$ cat myfile.fasta | grep -v ">" | fold -w 1 | sort | uniq -c

Command Line Bioinformatics: Generating Random Sequence Data

Via my co-worker, Albert:
$ cat /dev/urandom | tr -dc 'actg' | fold -w 50 | head

Command Line Bioinformatics: Extracting a Subset of a Fasta File from a File of IDs

Suppose we have a fasta file, test.fa:
>id1
AGTCCC
>id2
AA
>id3
TTTTC
>id4
CCCCAGTCCC
>id5
ACC
as well as a file of ID names, test.ID.txt:
id2
id5
To extract the entries of our fasta file such that the ID exists in our text file, try:
$ cat <( cat test.ID.txt | awk '{print ">"$1}' ) <( cat test.fa | paste - - )  | sort | awk '{ if (NF==1) {key=$1} else if ($1==key) {print $1"\n"$2 > "found.fa"} else {print $1"\n"$2 > "notfound.fa"}}' 
How does this work? Let's break the statement up by pipes. The first part adds the > symbol in front of the IDs from the file, puts the entries of the fasta file onto a single line, and cats them together. The next pipe sorts them, so you get something like this:
>id1    AGTCCC
>id2
>id2    AA
>id3    TTTTC
>id4    CCCCAGTCCC
>id5
>id5    ACC
The final pipe is a bit of awk, which breaks this into two files, found.fa and notfound.fa, based on whether there is a repeated ID line.

Command Line Bioinformatics: Partitioning the Genome into Ranges

The goal is to turn this file, hg19.chrom_24chrs.sizes:
chr1    249250621
chr2    243199373
.
.
.
into a series of ranges:
chr1:1-50000000"
chr1:50000001-100000000"
chr1:100000001-150000000"
chr1:150000001-200000000"
chr1:200000001-249250621"
chr2:1-50000000"
chr2:50000001-100000000"
chr2:100000001-150000000"
chr2:150000001-200000000"
chr2:200000001-243199373"
.
.
.
Here's a one liner to do it:
cat hg19.chrom_24chrs.sizes | perl -ne '{chomp($_); $myline = $_; @a=split(/\t/, $_); $mychr=$a[0]; $mydiv = 50000000; $mynum = $a[1]; $myquo = int($mynum/$mydiv); $myremainder = $mynum%$mydiv; if (1==0) {print $myline."\t".$myquo."\t".$myremainder."\n"}; $range2=0; for (my $i = 1; $i <= $myquo; $i++) { $range1=$i*$mydiv-$mydiv+1; $range2=$i*$mydiv; print $mychr.":".$range1."-".$range2."\n";} if ($myremainder>0) {$newrange1=$range2 + 1; $newrange2=$range2+$myremainder; print $mychr.":".$newrange1."-".$newrange2."\n"}}' 
Made readable:
$ cat hg19.chrom_24chrs.sizes | perl -ne '{
	chomp($_); 
	$myline = $_; 
	@a=split(/\t/, $_); 
	$mychr=$a[0]; 
	$mydiv = 50000000; 
	$mynum = $a[1]; 
	$myquo = int($mynum/$mydiv); 
	$myremainder = $mynum%$mydiv; 
	if (1==0) {print $myline."\t".$myquo."\t".$myremainder."\n"}; 
	$range2=0; 
	for (my $i = 1; $i <= $myquo; $i++) 
	{
		$range1=$i*$mydiv-$mydiv+1; $range2=$i*$mydiv; print $mychr.":".$range1."-".$range2."\n";
	} 
	if ($myremainder>0) 
	{
		$newrange1=$range2 + 1; 
		$newrange2=$range2+$myremainder; 
		print $mychr.":".$newrange1."-".$newrange2."\n"
	}
}' 