GATOR Project

Genome Project Solutions

Genome Analysis Tools and Online Resources

The GATOR Project

GATOR ("Genome Analysis Tools and Online Resources") is a next-generation set of tools for interpreting and presenting whole-genome data, currently under development at Genome Project Solutions.

What has changed to create the need for GATOR?

1. The pace of whole-genome sequencing is increasing dramatically. Soon there will be complete genome sequences for hundreds of eukaryotes and thousands of prokaryotes from across the Tree of Life (see a list at Genomes Online). It is imperative that we quickly develop better and faster methods for interpreting these data. Further, these methods must come with an intuitive interface that enables biological discovery even for those without computer programming skills.

2. Genes can be identified much more accurately than ever before. DNA sequencing has become so fast and inexpensive that every whole eukaryotic genome sequencing project can include a very large number of ESTs. For example, a single run of the Roche "Titanium" system on cDNA can generate over one million ESTs with an average length of 450 nucleotides (~15-fold coverage of a typical transcriptome). Alternatively, a single run of an Illumina system on cDNA can generate 400 million short sequencing reads of up to 108 nucleotides (~800-fold coverage of a typical transcriptome), each of which can be mapped against the assembled genome sequence to outline exons. The result is a set of gene models based on real biological data rather than on the more highly inferential methods in common use. This removes the need for the multi-track genome browser as the organizing principle for genome data, a mode that has always been clumsy at best.

3. Gene function can be inferred much more accurately than ever before, enabling a much more "gene-centric" approach to presenting genome data. In the absence of biochemical data, the most accurate way to infer function is to assign orthologous relationships to genes across various genomes. We have developed PHRINGE ("Phylogenetic Resources for the Interpretation of Genomes"), a tool that reconstructs the evolutionary history of all gene families, sorting out their patterns of orthologous (related by lineage splitting) and paralogous (related by gene duplication) relationships. Not only is this important for understanding the patterns and processes of genome evolution, but this phylogenetic framework is also the best tool for accurate inference of gene function. This contrasts with the non-phylogenetic methods in common use, which rely on simple similarity matching and are known to make errors, including incorrectly pairing the more slowly evolving paralogs and leaving the more rapidly evolving ones unannotated. See an overview of the PHRINGE project, details of the PHRINGE pipeline, or further explanation of why it is important to sort orthologs and paralogs by evolutionary analysis.
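
The distinction between orthologs and paralogs in point 3 can be illustrated with a small sketch. This is not the PHRINGE pipeline, whose internals are not described here; it only shows the general principle that two genes are orthologs when their most recent common ancestor in the gene tree is a lineage split (speciation) and paralogs when it is a gene duplication. The `Node` class, the gene names, and the event labels are all hypothetical, and the sketch assumes each internal node has already been labeled with its event type.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Optional


@dataclass
class Node:
    """Gene-tree node. Leaves carry a gene name; internal nodes carry an event."""
    name: Optional[str] = None
    event: Optional[str] = None  # "speciation" or "duplication" (assumed pre-labeled)
    children: list = field(default_factory=list)


def leaves(node):
    """Return the gene names found at or below this node."""
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def classify_pairs(node, result=None):
    """Label every gene pair by the event at its most recent common ancestor."""
    if result is None:
        result = {}
    # Pairs drawn from different child subtrees have this node as their ancestor.
    per_child = [leaves(child) for child in node.children]
    for left, right in combinations(per_child, 2):
        for a in left:
            for b in right:
                label = "orthologs" if node.event == "speciation" else "paralogs"
                result[frozenset((a, b))] = label
    for child in node.children:
        classify_pairs(child, result)
    return result


# Hypothetical family: one ancient duplication (copies A and B),
# followed by a split between species sp1 and sp2 in each copy.
tree = Node(event="duplication", children=[
    Node(event="speciation", children=[Node("sp1_geneA"), Node("sp2_geneA")]),
    Node(event="speciation", children=[Node("sp1_geneB"), Node("sp2_geneB")]),
])

relations = classify_pairs(tree)
for pair, label in sorted(relations.items(), key=lambda kv: sorted(kv[0])):
    print(" / ".join(sorted(pair)), "->", label)
```

In this toy family, `sp1_geneA` and `sp2_geneB` come out as paralogs even though they are in different species. A method based on similarity alone can mistake such a pair for orthologs, which is the kind of error point 3 describes.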

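The coverage figures quoted in point 2 follow from simple arithmetic: fold coverage is the number of reads times the mean read length, divided by the size of the target. The sketch below works backwards from the quoted numbers. The transcriptome size is not stated in the text; it is inferred here from the Roche example, and the Illumina calculation assumes the same size, so both derived values are illustrative only.

```python
def fold_coverage(num_reads, mean_read_length, target_size):
    """Fold coverage = total sequenced nucleotides / size of the target."""
    return num_reads * mean_read_length / target_size


# Roche example from the text: one million ESTs of 450 nt at ~15-fold coverage.
# The transcriptome size this implies is an inference, not a stated figure.
implied_transcriptome_nt = 1_000_000 * 450 / 15
roche_coverage = fold_coverage(1_000_000, 450, implied_transcriptome_nt)

# Illumina example from the text: 400 million reads of up to 108 nt, ~800-fold.
# Assuming the same transcriptome size, this is the mean read length that
# would give 800-fold coverage, and the coverage if every read were 108 nt.
implied_mean_read_nt = 800 * implied_transcriptome_nt / 400_000_000
max_illumina_coverage = fold_coverage(400_000_000, 108, implied_transcriptome_nt)

print(f"Implied transcriptome size: {implied_transcriptome_nt:,.0f} nt")
print(f"Roche coverage: {roche_coverage:.0f}-fold")
print(f"Mean Illumina read length implied by 800-fold: {implied_mean_read_nt:.0f} nt")
print(f"Illumina coverage if all reads were 108 nt: {max_illumina_coverage:.0f}-fold")
```

Under these assumptions the ~800-fold figure corresponds to a mean read length well below the 108-nucleotide maximum, which fits the wording "up to 108 nucleotides".
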
What will GATOR do?

1. The Genome Project Statistics Page will contain (a) information about the organism, such as its chromosome number, genome size, locality of collection, and taxonomy, (b) information about the sequencing process, such as the methods used, the number of runs, and the number of nucleotides produced, (c) information about the genome assembly, such as the N50 and L50 measures, the distribution of scaffold sizes, the number of nucleotides contained, and the proportion of the genome represented, (d) a description of the analysis methods, and (e) an overview of the results, such as the number of genes with identified homologs, size distributions for exons and introns, repeat content, known transposons, the number of rRNA and tRNA genes, base composition, CpG content and its variation, and patterns of codon usage.

2. Users will be able to see a complete genome overview with annotations of all genes (including those for proteins, tRNAs, and rRNAs), pseudogenes, all identified SNPs, and transposons and other repeated elements. Users will be able to zoom out for a bird's-eye overview and zoom in to the nucleotide level.

3. Users will be able to search for matches to sequences they provide as input and to download individual genes, regions of specified size adjacent to specified genes or lists of genes, contigs and scaffolds, gene sets, and translations of gene products.

4. Users will be able to view and compare folded structures for tRNAs and rRNAs.

5. All genes will be assigned functions insofar as possible by transferring gene descriptions from orthologous genes, and will also be assigned to categories of biochemical function using GO and KEGG and by identifying contained Pfam domains.

6. Users will be able to sort and compare genes using a wide variety of parameters that GATOR will measure or calculate, including GO category, domain content, EC number, length, confidence measures of both gene structure and function, KEGG category, numbers of genes in gene families, number of exons, intracellular location, presence of transmembrane domains, molecular weight, isoelectric point, hydrophobicity, aliphatic index, antigenicity, SNPs contained, and whether there is a structure in PDB for this or a homologous gene.

7. Wherever there is a structure in PDB for a gene or any of its homologs, there will be a link to that structure and, if appropriate, a plot of the amino acid differences between it and the gene presented.

8. Users will be able to specify any gene in any included genome and immediately identify all orthologs and paralogs in the presented genome.

9. All genes will be compared with those of a number of other genomes using our PHRINGE pipeline. Users will be able to specify any gene and, for all other genomes included in this analysis (or any specified subset), view all genes that are similar in sequence, identify all orthologs and paralogs, compare intron-exon structures, view and download multiple sequence alignments, see evolutionary trees of all genes, and compare the relative locations of all orthologous genes mapped onto scaffolds. The PHRINGE portion of the GATOR system is already fully functional and is being used for a project on oomycete and diatom genomes (see these results here) and for the genomes of the monarch butterfly, Danaus plexippus, and the platyfish, Xiphophorus maculatus.
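
As an illustration of the assembly measures listed under point 1 above, the following is a minimal sketch of how N50 and L50 are commonly computed from a list of scaffold lengths. It uses the usual definitions: with scaffolds sorted from longest to shortest, N50 is the length of the scaffold at which the running total first reaches half of the assembly size, and L50 is the number of scaffolds needed to get there. The function name and the example lengths are hypothetical and do not come from GATOR.

```python
def n50_l50(scaffold_lengths):
    """Return (N50, L50) for a collection of scaffold lengths."""
    if not scaffold_lengths:
        raise ValueError("at least one scaffold length is required")
    ordered = sorted(scaffold_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is N50; 'count' scaffolds were needed, which is L50.
            return length, count


# Hypothetical assembly of seven scaffolds totalling 300 units.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(example_lengths)
print(f"N50 = {n50}, L50 = {l50}")  # prints "N50 = 70, L50 = 2"
```

Here the two longest scaffolds (80 + 70) already make up half of the 300-unit assembly, so N50 is 70 and L50 is 2.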

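The base composition and CpG content statistics mentioned under point 1 above can be sketched in a few lines. The example below computes GC fraction and the widely used observed/expected CpG ratio, (number of CG dinucleotides × sequence length) / (number of C × number of G). The function names and the test sequence are hypothetical, and this is an illustration of the standard calculations rather than GATOR's own code.

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return (seq.count("G") + seq.count("C")) / called


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio on one strand of the sequence."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both C and G
    observed = seq.count("CG")
    return observed * len(seq) / (c_count * g_count)


# Hypothetical 8-nt sequence with two CG dinucleotides.
example_seq = "ACGCGTTA"
example_gc = gc_fraction(example_seq)
example_cpg = cpg_obs_exp(example_seq)
print(f"GC fraction = {example_gc}, CpG obs/exp = {example_cpg}")
```

Applying the same functions to successive windows along a scaffold is one simple way to look at how these values vary across a genome.
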
(GATOR is under development.)