How to Install the UCSC Genome Browser Locally

A Guide to Custom Installing UCSC's Genome Browser
by Oliver; Jan. 13, 2014
         

Introduction

This article is no longer up to date: Please refer to the official documentation here


The UCSC Genome Browser is one of the best-known tools in bioinformatics, and rightly so: it's powerful, fast, and awesome. Everybody loves it, including me. Some time ago, our lab worked on the annotation of some recently assembled genomes which were not available on the UCSC Genome Browser. The Kent source code, which powers the genome browser, is free on git:
git clone git://genome-source.cse.ucsc.edu/kent.git
Hence, we reasoned, if we installed the program on our own web server we would be free to customize it as much as we liked—giving our own assembled references and annotation a permanent home on the internet. At the time, I was ignorant of the Assembly Hubs function, but that's just as well because I like to do things my own way.

I've sung the genome browser's praises but installing the damn thing is like trying to solve a Rubik's cube in a knife fight. You can install the Kent source code, but it's hard to figure out how to proceed linearly through the constellation of READMEs, which are of varying degrees of helpfulness. Half of them are in the source tree; half of them are online; and half of them don't exist. You know you're in a bit of trouble when the best documentation for the process comes from two random bloggers. To its credit, the code gives descriptive error messages in most cases, which hint at what went wrong. What follows is a lurid account of the installation process to help you if you ever have to do it. The goal is to understand it just enough to be able to upload and annotate your own reference genomes.

Overview

What's the big picture? As with a lot of fancy web programs, the genome browser runs on an unholy potpourri of programming languages: C, MySQL, HTML, JavaScript, etc. A simple sketch of the program's architecture, so far as I could decipher it, is as follows:

image

Of course, this diagram is quite wrong—there are a lot more things going on and the SQL database doesn't really feed directly back to your browser, but it's infinitely better to have a simple picture in your head than none at all.

Here's an overview: there are scripts or binaries that sit in a cgi-bin/ directory*. When a web user goes to the genome browser webpage, he's going to click a link that makes one of the binaries in cgi-bin/ run. The gateway, or controller, binary is cgi-bin/hgGateway. Taking cues from a configuration file, cgi-bin/hg.conf, it talks to a MySQL server which has many databases on it. Some databases are required by the program's machinery and some—like the database for each reference genome—are not. The gateway database, which knows about all the other databases, is called hgcentral.

Each individual genome database has tables required by the program, such as chromInfo and trackDb, as well as many tables which represent annotation and may have started their life as .bed files. While a genome's database has lots of annotation stored in tables, it does not include large monster files. Those are stored as flat files on the server, by default in the /gbdb path. For example, the fasta file for a reference genome is compressed into UCSC's 2bit file format. The database will know the path to this file rather than store the file itself.

* If you don't know what this is, this is a specially designated directory on web servers where scripts are allowed to run. Normally, when you access a webpage, it's just a text file marked up into HTML—nothing fancy. However, if the server is to permit some web user to run a script on it, it has to be done with care. A way of doing this is putting any such script in a designated directory, often named cgi-bin.

Installation and Important Directories

To install the package, refer to the README, which you can find online here or at kent/src/product/README.building.source (along with other READMEs) in the Kent source tree.

There are a couple of major directories to keep straight in your head throughout the installation process:
  • kent/src
  • cgi-bin
  • trash
  • ~/bin/$MACHTYPE (in my case: ~/bin/x86_64)
  • gbdb
  • /path/to/html
kent/src is wherever you downloaded the kent source tree. It has nothing to do with the immediate functioning of your web genome browser, so you can put this directory wherever you like. In the source tree you'll find lots of tools you'll need later. If $KENT is the full path to kent, for example, you should see directories $KENT/src/utils, $KENT/src/product, $KENT/src/hg/makeDb/hgLoadBed, and a million more.

Tip: Since there's a lot of stuff in the source tree, a particularly useful unix utility to use when following the installation instructions is find. For example:
$ find $KENT -name "ex.hg.conf" # find the example configuration file
cgi-bin is wherever your server's cgi-bin is located. In this directory, you should see binaries like hgGateway, hgTracks, hgTables, and so on. The main configuration file for the whole show resides in this directory:
cgi-bin/hg.conf
You might have to copy the example configuration file into this spot and modify it. However, you shouldn't start changing it until you've set up your MySQL databases.

trash is a directory required by the program. You must make this folder in the same folder where your cgi-bin resides and give it relaxed permissions. Image files the genome browser makes get stashed here and you have to remove the folder's contents periodically.
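A minimal sketch of that step, run in a scratch directory so nothing on a real server is touched. SERVER_DIR is a made-up name, and using mode 777 is my assumption about what "relaxed permissions" means here; your server may get by with something tighter.

```shell
# Scratch location standing in for the folder that holds your cgi-bin (hypothetical).
SERVER_DIR="$(mktemp -d)"
mkdir -p "$SERVER_DIR/cgi-bin"

# trash sits next to cgi-bin; 777 is one reading of "relaxed permissions".
mkdir -p "$SERVER_DIR/trash"
chmod 777 "$SERVER_DIR/trash"

# Show the resulting mode bits.
ls -ld "$SERVER_DIR/trash"
```

On a real server you would replace SERVER_DIR with the directory that actually contains cgi-bin.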

~/bin/$MACHTYPE is a folder in your home directory where compiled binaries from the source tree get deposited. Early in the help docs, you're told to run this command:
$ export MACHTYPE=x86_64 # for my system
$ # check what system you're using with the command "uname -a"
Once you've done this, you can go into various directories in the source tree and make utilities that will land in ~/bin/$MACHTYPE. As the docs suggest, try
$ cd $KENT/src/utils/fixCr
$ make
It should produce ~/bin/$MACHTYPE/fixCr. It's also a good idea to add ~/bin/$MACHTYPE to your PATH.
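Here is a small sketch of the MACHTYPE and PATH setup, suitable for a shell startup file. It assumes that "uname -m" prints the short machine name you want (x86_64 in my case). Note that bash sets its own MACHTYPE to a longer string (something like x86_64-pc-linux-gnu), which is why the value is overwritten rather than reused. KENT_BIN is just a name I made up for convenience.

```shell
# Set MACHTYPE to the short form, overriding whatever the shell preset.
# Assumption: "uname -m" prints the short machine name (e.g. x86_64).
MACHTYPE="$(uname -m)"
export MACHTYPE

# Directory where the kent makefiles deposit compiled binaries.
KENT_BIN="$HOME/bin/$MACHTYPE"

# Prepend the directory to PATH only if it is not already there.
case ":$PATH:" in
  *":$KENT_BIN:"*) ;;
  *) PATH="$KENT_BIN:$PATH" ;;
esac
export PATH

echo "MACHTYPE=$MACHTYPE"
echo "binaries expected in: $KENT_BIN"
```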

gbdb is where you store data for each reference genome you plan to host. For a given genome, most of the annotation will be stored in its MySQL database (more about this later). However, as we've said before, some things, like the reference genome 2bit file, will reside as flat files on your server. They go in this path, which will be coded into the MySQL database. One of the really stupid things about the genome browser is that the default location of this path is in the root directory:
/gbdb
and this is hard-coded into the MySQL databases in a number of places. Most researchers working in a shared computing environment will not have root access, so you may be asking yourself: what kind of troglodyte would hardwire an immutable path in the root directory in his software package? However, the problem can be solved by manually updating some of the tables in MySQL, most crucially hgcentral.dbDb. In fact, if you're just uploading your own references rather than setting up a mirror, it's no problem at all. We'll discuss it more below.

/path/to/html is the path to the genome browser's HTML files. You should see stuff like index.html, mirror.html, staff.html, etc. in this directory. You must set this path in the browser.documentRoot parameter of your configuration file, hg.conf (we'll discuss this file below). However, if the directory you use is not your server's root HTML directory (i.e., the one you land on when you type in your web address), the program will give you difficulties. This is another woefully stupid feature of the code—since you probably have a pre-existing index.html you don't want to overwrite with the genome browser's—but the problem can be solved by making symbolic links. We'll discuss it in the section on troubleshooting.

MySQL Databases and Users

Starting with MySQL, you need to download stuff to create various databases the genome browser needs. This was already done in my case, but some clue about how to do it may be found here. Once this is done, you should have a bunch of databases on your SQL server. For example, you might have
+--------------------+
| Database           |
+--------------------+
| information_schema |
| dm3                |
| hg19               |
| hgFixed            |
| hgcentral          |
| mm9                |
| proteins090821     |
| proteome           |
| uniProt            |
| visiGene           |
+--------------------+
Per order of the program, you also need three MySQL users (note: a MySQL user is different from a Unix user), each with a different permission spectrum:
  • ucsc_admin
  • ucsc_readwrite
  • ucsc_read
For the ucsc_admin user, you want as much privilege as possible. He will be responsible for tweaking hgcentral, adding annotation, etc. If you can, you want to do something like this:
GRANT USAGE ON *.* TO 'ucsc_admin'@'%.myserver.edu' IDENTIFIED BY 'my_password1';
GRANT ALL PRIVILEGES ON *.* TO 'ucsc_admin'@'%.myserver.edu';
For the ucsc_readwrite user, you need to enable read and write permissions on hgcentral only. Something like this:
GRANT USAGE ON `hgcentral`.*
 TO 'ucsc_readwrite'@'%.myserver.edu' IDENTIFIED BY 'my_password2';
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER ON `hgcentral`.* 
 TO 'ucsc_readwrite'@'%.myserver.edu';
Finally, for the ucsc_read user, you need to enable read permissions on all databases. Something like this:
GRANT USAGE ON *.* TO 'ucsc_read'@'%.myserver.edu' IDENTIFIED BY 'my_password3';
GRANT SELECT ON *.* TO 'ucsc_read'@'%.myserver.edu';

Updating hg.conf and .hg.conf

The next step is updating the configuration files to add these users' credentials. Updating ~/.hg.conf (create this file if it doesn't exist) is easy. It just has to know about the ucsc_admin user. It should look like this:
db.host=myserver.edu
db.user=ucsc_admin
db.password=my_password1
This file isn't used by the cgi-bin binaries. It's used by binaries in the kent source tree which do things like create MySQL databases or tables automatically. ~/.hg.conf has to have very particular permissions:
rw-------
or these binaries will complain until you set it this way.
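In shell terms, rw------- is mode 600. The sketch below writes the three lines to a scratch file and locks it down; CONF is a stand-in I made up so the demo doesn't clobber a real ~/.hg.conf. For real use, point it at "$HOME/.hg.conf".

```shell
# Demo on a scratch file; for real use, set CONF="$HOME/.hg.conf".
CONF="$(mktemp)"

# Credentials for the admin user, as described in the article.
cat > "$CONF" <<'EOF'
db.host=myserver.edu
db.user=ucsc_admin
db.password=my_password1
EOF

# Owner may read and write; group and others get nothing.
chmod 600 "$CONF"

# Capture and print the mode string for inspection.
PERMS="$(ls -l "$CONF" | cut -c1-10)"
echo "$PERMS"
```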

The config file cgi-bin/hg.conf is integral to the main show and it has to be correct if your genome browser is to function. It has to know about the ucsc_read and ucsc_readwrite users, not the admin user. There's a lot in cgi-bin/hg.conf but the only lines you need to pay attention to are:
db.host=myserver.edu
# db.user is the username is use when connecting to the specified db.host
# it needs read-only access.  The browser CGIs do not need
# read-write access to the database tables
db.user=ucsc_read
db.password=my_password3

db.port=3306
defaultGenome=Human

# trackDb table to use. A simple value of `trackDb' is normally sufficient.
# In general, the value is a comma-separated list of trackDb format tables to search.
db.trackDb=trackDb

# track group table definitions.  This is a comma-separated list similar to
# db.trackDb that defines the track group tables.
db.grp=grp

# central.host is the name of the host of the central MySQL
# database where stuff common to all versions of the genome
# and the user database is stored.
central.db=hgcentral
central.host=myserver.edu

# Be sure this user has UPDATE AND INSERT privs for hgcentral
central.user=ucsc_readwrite
central.password=my_password2
central.domain=http://mysite.edu

# Change this default documentRoot if different in your installation,
# to allow some of the browser cgi binaries to find help text files
# browser.documentRoot=/usr/local/apache/htdocs
browser.documentRoot=/path/to/html
Once you've changed this file to reflect your settings, move on to the next step.
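Since a single missing line in hg.conf can sink the whole thing, a quick check can save some head-scratching. The function below is a hypothetical helper of my own, not part of the Kent tools: it only confirms that each key listed above has an uncommented key=value line, and it knows nothing about whether the values are right. The demo file deliberately comments out central.password.

```shell
# check_hg_conf FILE: report each expected key lacking an uncommented "key=" line.
# Hypothetical helper; the key list mirrors the settings shown in the article.
check_hg_conf() {
  conf="$1"
  missing=0
  for key in db.host db.user db.password db.trackDb db.grp \
             central.db central.host central.user central.password \
             browser.documentRoot; do
    # Escape the dots so grep matches them literally, and anchor at line start.
    pattern="^$(printf '%s' "$key" | sed 's/\./\\./g')="
    if ! grep -q "$pattern" "$conf"; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}

# Demo: a scratch config with central.password commented out.
DEMO_CONF="$(mktemp)"
cat > "$DEMO_CONF" <<'EOF'
db.host=myserver.edu
db.user=ucsc_read
db.password=my_password3
db.trackDb=trackDb
db.grp=grp
central.db=hgcentral
central.host=myserver.edu
central.user=ucsc_readwrite
# central.password=my_password2
browser.documentRoot=/path/to/html
EOF

CHECK_OUT="$(check_hg_conf "$DEMO_CONF" || true)"
echo "$CHECK_OUT"
```

Against a real installation you would call it as check_hg_conf /path/to/cgi-bin/hg.conf.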

Preparing a Reference Organism to Upload

There's a good description of how to do this here. Following the suggested convention, let's call your organism's genome build abcDef1. We could redraw our imperfect schematic as:
image
(Still imperfect.)

Following the instructions in the link, we're going to work in the path /my/path/gbdb/abcDef1. Here's a digest of my commands:
$ cd /my/path/gbdb/abcDef1
$ faToTwoBit chr_all.fa abcDef1.2bit
$ hgFakeAgp -minContigGap=1 chr_all.fa abcDef1.agp
$ checkAgpAndFa abcDef1.agp abcDef1.2bit > checkagp.out &
$ twoBitInfo abcDef1.2bit stdout | sort -k2nr > chrom.sizes
$ mkdir -p bed/{chromInfo,gc5Base}
$ awk '{printf "%s\t%d\t/my/path/gbdb/abcDef1/abcDef1.2bit\n", $1, $2}' chrom.sizes \
   > bed/chromInfo/chromInfo.tab
$ hgGcPercent -wigOut -doGaps -file=stdout -win=5 -verbose=0 abcDef1 abcDef1.2bit | 
   wigEncode stdin bed/gc5Base/gc5Base.{wig,wib} &
$ hgsql abcDef1 < $KENT/src/hg/lib/grp.sql
$ hgLoadSqlTab abcDef1 chromInfo $KENT/src/hg/lib/chromInfo.sql bed/chromInfo/chromInfo.tab
$ hgGoldGapGl abcDef1 abcDef1.agp
$ mkdir html
$ mkdir wib
$ ln -s $( readlink -m bed ) wib/
$ hgLoadWiggle -pathPrefix=/my/path/gbdb/abcDef1/wib abcDef1 gc5Base bed/gc5Base/gc5Base.wig
You have to compile these programs from the source tree before you can use them. Then in MySQL we have to update a couple of tables in hgcentral. Do something like this:
mysql> use hgcentral;
mysql> INSERT INTO dbDb 
 (name, description, nibPath, organism, defaultPos, active, 
  orderKey, genome, scientificName, htmlPath, hgNearOk, hgPbOk, sourceName, taxId) 
 VALUES 
 ("abcDef1", "2013", "/my/path/gbdb/abcDef1", "M. species", "chr1:10000-11000", 1, 
  124, "M. species", "Species Name", "/my/path/gbdb/abcDef1/html/description.html", 
  0, 0, "new genome version 1.0", 1);
mysql> INSERT INTO defaultDb (genome, name) VALUES ("M. species", "abcDef1");
mysql> INSERT INTO genomeClade (genome, clade, priority) VALUES 
 ("M. species", "mammal", 124);
This ensures your new reference genome will be found.
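If the chromInfo step in the digest above looks opaque, it can be tried in isolation. The sketch below fakes a two-chromosome chrom.sizes (the sizes are invented) and builds the three-column table that gets loaded into chromInfo: chromosome name, size, and the path to the 2bit file. I pass the path in with awk -v rather than hard-coding it in the format string, which is a stylistic choice and not what the digest does.

```shell
# Scratch directory so the demo does not touch real data.
WORK="$(mktemp -d)"

# Hypothetical sizes, laid out as "twoBitInfo ... | sort -k2nr" would list them.
printf 'chr1\t5000\nchr2\t3000\n' > "$WORK/chrom.sizes"

# Path that will be stored in the chromInfo table (same placeholder as the article).
TWOBIT="/my/path/gbdb/abcDef1/abcDef1.2bit"

# Emit tab-separated rows: chrom, size, path to the 2bit file.
awk -v path="$TWOBIT" 'BEGIN { OFS = "\t" } { print $1, $2, path }' \
  "$WORK/chrom.sizes" > "$WORK/chromInfo.tab"

cat "$WORK/chromInfo.tab"
```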

A Quick Tour of the MySQL Databases

We saw some example databases above. You probably have at least:
  • hgcentral
  • hgFixed
  • hg19
The database hgcentral contains information about every other database on your server in hgcentral.dbDb.nibPath. (If it doesn't, you have to update it, as we saw in the last section.) It's the gateway or address book. Another table, genomeClade, controls what items appear in the genome pop-up menu on the main Genome Browser Gateway webpage. hgcentral looks like this:
+---------------------+
| Tables_in_hgcentral |
+---------------------+
| blatServers         |
| clade               |
| dbDb                |
| dbDbArch            |
| defaultDb           |
| gdbPdb              |
| genomeClade         |
| hubPublic           |
| hubStatus           |
| liftOverChain       |
| sessionDb           |
| targetDb            |
| userDb              |
| wikiTrack           |
+---------------------+
Every reference genome you want to display on the browser will have its own database, such as bosTau7 for the cow's 2011 assembly (note the recommended camel casing). Within each genome's database, there are required tables with particular names, as well as annotation tables that can have any name. The annotation tables are created by uploading tracks, which we'll discuss below. For your new reference genome, abcDef1, the database should look something like this:
+------------------------+
| Tables_in_abcDef1      |
+------------------------+
| Custom_Annotation_1    |
| Custom_Annotation_2    |
| chromInfo              |
| gap                    |
| gc5Base                |
| gold                   |
| grp                    |
| hgFindSpec             |
| history                |
| trackDb                |
+------------------------+
The trackDb contains information about Custom_Annotation_1, Custom_Annotation_2, and all the other tracks you make. We'll see how to make these below.

Troubleshooting the Directory Structure

I had a problem at this point, which was that the graphics in the genome browser were all out of alignment, overlapping badly, and off kilter:

image

All of this stemmed from the following issue: I had set
browser.documentRoot=/path/to/html
yet this was not the location of my HTML root directory. That was somewhere else:
/path/to/ROOT_html
This made sense, given we already had folders in ROOT_html, like style, js, images, that had the same names as folders in the genome browser's HTML directory. However, as noted before, the genome browser package wants browser.documentRoot in hg.conf to be your real HTML root. If this is not the case, you'll get the jumbled effect I was seeing. What's the solution, if browser.documentRoot isn't the real root path? The answer is that various folders must be in both places, and you do that with symbolic links:
/path/to/html/js -> /path/to/ROOT_html/js
/path/to/html/img -> /path/to/ROOT_html/img
/path/to/html/cgi-bin -> /some/path/cgi-bin
/path/to/html/trash -> /some/path/trash
So, confusingly, browser.documentRoot doesn't quite work as advertised. You can set it to be whatever you want but js/ and img/ MUST be in the root directory (the real folders, not links). Move them there. They MUST also be in the browser.documentRoot path, /path/to/html, but as links, not real folders. The folders cgi-bin and trash should also reside as links in the browser.documentRoot path. The real folders can be anywhere your server allows.
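To make the layout concrete, here is a sketch that rebuilds it in a scratch directory. All the paths (BASE, ROOT_HTML, GB_HTML, SOME) are stand-ins I invented for /path/to/ROOT_html, /path/to/html and /some/path; on a real server you would skip the mkdir of the real folders and only create the links.

```shell
# Scratch tree standing in for the server's directories (all names hypothetical).
BASE="$(mktemp -d)"
ROOT_HTML="$BASE/ROOT_html"   # the server's real HTML root
GB_HTML="$BASE/html"          # the directory named in browser.documentRoot
SOME="$BASE/some/path"        # wherever cgi-bin and trash really live

# The real folders: js/ and img/ in the true root, cgi-bin/ and trash/ elsewhere.
mkdir -p "$ROOT_HTML/js" "$ROOT_HTML/img" "$GB_HTML" "$SOME/cgi-bin" "$SOME/trash"

# Links inside browser.documentRoot that point back at the real folders.
ln -s "$ROOT_HTML/js"  "$GB_HTML/js"
ln -s "$ROOT_HTML/img" "$GB_HTML/img"
ln -s "$SOME/cgi-bin"  "$GB_HTML/cgi-bin"
ln -s "$SOME/trash"    "$GB_HTML/trash"

ls -l "$GB_HTML"
```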

Adding Annotation: Making Tracks and Configuring trackDb.ra

There's a moderately useful guide for how to do this here. To add annotation information, such as the position of exons or gene expression data, you need to
  • format your annotation data correctly
  • load it into the appropriate genome database
  • modify the trackDb.ra file and load it into the database's trackDb table
Since a .bed file is an easy place to start, let's use this format as an example. It's described here. The first three columns are standard but the rest, which are optional, are UCSC's own convention. To quote:
  1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
  5. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray).
  6. strand - Defines the strand - either '+' or '-'.
  7. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
  8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
  9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
  10. blockCount - The number of blocks (exons) in the BED line.
  11. blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
  12. blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
The real work is twisting your data into this special format. Once you've done this, you're almost there. Working in the path /my/path/gbdb/abcDef1, you want to do something like this:
$ mkdir bed_files
$ # create bed_files/My.Genes.bed
$ hgLoadBed abcDef1 My_Genes bed_files/My.Genes.bed # abcDef1 is your MySQL db
$ mkdir track_files
$ # create track_files/trackDb.ra - we discuss what to put in it below
$ hgTrackDb . abcDef1 trackDb $KENT/src/hg/lib/trackDb.sql track_files
$ # do this once:
$ hgFindSpec . abcDef1 hgFindSpec $KENT/src/hg/lib/hgFindSpec.sql track_files
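Before handing a 12-column BED file to the loader, it can be worth a quick sanity check against the column definitions quoted above. The function below is a hypothetical helper of mine, not a Kent tool, and it tests only three things: that each line has 12 tab-separated columns, that chromStart is less than chromEnd, and that the number of blockSizes matches blockCount. The two demo records are invented, and the second is wrong on purpose.

```shell
# check_bed12 FILE: print a message for each line that fails a basic check.
# Hypothetical helper; it covers only a few of the BED rules.
check_bed12() {
  awk -F '\t' '
    NF != 12 {
      printf "line %d: expected 12 columns, found %d\n", NR, NF
      bad = 1
      next
    }
    ($2 + 0) >= ($3 + 0) {
      printf "line %d: chromStart must be less than chromEnd\n", NR
      bad = 1
    }
    {
      # Tolerate a trailing comma in the blockSizes list before counting items.
      sizes = $11
      sub(/,$/, "", sizes)
      n = split(sizes, parts, ",")
      if (n != $10 + 0) {
        printf "line %d: blockCount %d but %d blockSizes\n", NR, $10, n
        bad = 1
      }
    }
    END { exit bad }
  ' "$1"
}

# Demo: line 1 is consistent, line 2 claims one block but lists two sizes.
BED_DEMO="$(mktemp)"
printf 'chr1\t100\t500\tgeneA\t0\t+\t100\t500\t0,0,0\t2\t50,100\t0,300\n'  > "$BED_DEMO"
printf 'chr1\t900\t1000\tgeneB\t0\t-\t900\t1000\t0,0,0\t1\t50,50\t0,50\n' >> "$BED_DEMO"

BED_OUT="$(check_bed12 "$BED_DEMO" || true)"
echo "$BED_OUT"
```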
The file trackDb.ra is a configuration file that has meta-data about all of your tracks (color, file format, priority, etc). Mine looks a bit like this:
track My_Genes
shortLabel My Predicted Genes
longLabel My Genes Version X Annotation
group genes
priority 4
visibility full
color 0,0,0
altColor 0,0,0
searchTable My_Genes
searchType bed
searchMethod prefix
type bed 12 .

track gc5Base
shortLabel GC Percent
type wig 0 100
longLabel GC Percent in 5-Base Windows
visibility dense
group map
priority 20
colorR 250
colorG 130
colorB 0
altColorR 128
altColorG 128
altColorB 128

track My_Tissue_FPKM
shortLabel Tissue FPKM
longLabel My TopHat Mapping, Tissue Tissue, FPKM
color 100,255,100
group regulation
priority 21
visibility hide
autoScale Off
minLimit 0
maxLimit 100
type bedGraph 4
track is the name of the annotation table in your database. shortLabel and longLabel are additional labels that are displayed on the web browser. group defines the category of the tracks (e.g., Genes and Gene Prediction Tracks or Mapping and Sequencing Tracks). visibility controls the default starting visibility of the track. color is the track's color. searchTable, searchType, searchMethod allow you to search by gene name on the web browser. type is the file format, followed by the number of columns you want to be active.
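A trackDb.ra file is just blank-line-separated stanzas of "setting value" lines, so it's easy to lint with awk. The sketch below flags any stanza that lacks one of the settings described above. Which settings count as required is my own assumption for illustration (I picked the ones my stanzas have in common); hgTrackDb may well be more or less strict. check_trackdb_ra is a made-up name, and the demo stanzas are trimmed copies of mine with the type line dropped from the second.

```shell
# check_trackdb_ra FILE: list stanzas missing a setting from an assumed required set.
# Hypothetical helper; the required list is an illustration, not a hgTrackDb rule.
check_trackdb_ra() {
  awk '
    BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one stanza per record
    {
      split("", seen)              # portable way to clear the array
      name = "(unnamed)"
      for (i = 1; i <= NF; i++) {
        n = split($i, w, " ")
        if (n == 0) continue
        seen[w[1]] = 1
        if (w[1] == "track") name = w[2]
      }
      m = split("track shortLabel longLabel group visibility type", req, " ")
      for (j = 1; j <= m; j++)
        if (!(req[j] in seen)) {
          printf "%s: missing %s\n", name, req[j]
          bad = 1
        }
    }
    END { exit bad }
  ' "$1"
}

# Demo: the second stanza has no type line.
RA_DEMO="$(mktemp)"
cat > "$RA_DEMO" <<'EOF'
track My_Genes
shortLabel My Predicted Genes
longLabel My Genes Version X Annotation
group genes
visibility full
type bed 12 .

track My_Tissue_FPKM
shortLabel Tissue FPKM
longLabel My TopHat Mapping, FPKM
group regulation
visibility hide
EOF

RA_OUT="$(check_trackdb_ra "$RA_DEMO" || true)"
echo "$RA_OUT"
```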

You can upload various (sometimes idiosyncratic) data file formats to the browser, and you use different loaders in the Kent source tree depending on the format. We've already seen that hgLoadBed is for bed format. Another useful format is bedgraph, which has four simple columns: chromosome, start position, end position, and value. To load this file format:
$ hgLoadBed abcDef1 My_Tissue_FPKM myfile.bedgraph -bedGraph=4
As the voluminous genome browser docs say, "The bigWig format is for display of dense, continuous data that will be displayed in the Genome Browser as a graph." This format is useful if you have a giant file of data, because bigWig is a compressed binary. To make a bigWig, you can start with a bedgraph and convert it:
$ bedGraphToBigWig myfile.bedgraph chrom.sizes out.bw
To load a bigWig into the genome browser, use hgBbiDbLink, which puts a link to the .bw file into your species (abcDef1) database:
$ hgBbiDbLink abcDef1 myTrackName out.bw
As always, you'll need an entry for myTrackName in the trackDb.ra file.
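One thing that can trip you up: bedGraphToBigWig wants its input sorted by chromosome and then by start position. A sketch of the sort, on an invented three-line bedGraph in a scratch directory (WORK_BG and the file names are placeholders of mine):

```shell
# Scratch directory and a deliberately unsorted, made-up bedGraph.
WORK_BG="$(mktemp -d)"
printf 'chr2\t0\t100\t1.5\nchr1\t200\t300\t4.0\nchr1\t0\t100\t2.5\n' \
  > "$WORK_BG/myfile.bedgraph"

# Sort by chromosome name (byte order), then numerically by start position.
LC_ALL=C sort -k1,1 -k2,2n "$WORK_BG/myfile.bedgraph" \
  > "$WORK_BG/myfile.sorted.bedgraph"

cat "$WORK_BG/myfile.sorted.bedgraph"

# The sorted file is what you would then convert, e.g.:
#   bedGraphToBigWig myfile.sorted.bedgraph chrom.sizes out.bw
```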

You can read more about tracks here. Once you've added all your tracks, you are done. Congratulations! :)

More Troubleshooting: Getting it Running

A great resource for troubleshooting is the Google Group. I achieved immortality there a couple of times. Meanwhile, here are some errors you might see. If you haven't loaded your tracks, you'll get:

image

If you haven't run hgFindSpec, you'll get:

image

Once this is done, you should be in clover:

image

If you're getting:

image

when you expect to be getting:

image

the problem is in your trackDb.ra file. Make sure all the columns of your bed file are "hot": use:
type bed 12 .
not:
type bed 3
Finally, if you're getting:

image

don't forget to add the search terms in your trackDb.ra file.

Taking Out the Trash

The graphics you see on the genome browser are actually .pngs it generates in:
trash/
Crazy, but this is how it works. If you don't empty out this folder it will balloon to many gigabytes. Since this needs to be done periodically, it's a job for cron. Make a cron file, mycron.txt:
30 16 * * * rm -r /path/trash/* 
And run it as:
$ crontab mycron.txt
This will empty the trash every day at 4:30 p.m.
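A gentler variation, if you'd rather not yank files out from under somebody who is using the browser at 4:30 p.m., is to delete only files that have sat around for a while. This is a sketch of an alternative to the rm line above, not something from the UCSC docs; clean_trash is a made-up helper and the one-day threshold is arbitrary.

```shell
# clean_trash DIR DAYS: delete regular files under DIR whose age in whole
# days exceeds DAYS (that is how find interprets -mtime +DAYS).
clean_trash() {
  find "$1" -type f -mtime +"$2" -exec rm -f {} +
}

# Demo in a scratch directory: one file backdated to 2000, one fresh.
TRASH_DEMO="$(mktemp -d)"
touch -t 200001010000 "$TRASH_DEMO/old.png"
touch "$TRASH_DEMO/new.png"

clean_trash "$TRASH_DEMO" 1
ls "$TRASH_DEMO"

# Equivalent crontab entry (same schedule as above):
#   30 16 * * * find /path/trash -type f -mtime +1 -exec rm -f {} +
```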

Modifying the Nav Bar

Edit the file inc/globalNavBar.inc.

Finished Product

See the finished product at: Screenshot:

image