1. The pace of
whole genome sequencing is increasing dramatically. Soon
there will be complete genome sequences for hundreds of eukaryotes
and thousands of prokaryotes from across the Tree of Life (see
a list at Genomes
Online). It is imperative that we quickly develop better and
faster methods for interpreting these data. Further, these methods
must have an intuitive interface that enables biological discovery
even for those without computer programming skills.
2. Identifying
genes can be done much more accurately than ever before. DNA
sequencing has become so fast and inexpensive that every whole
eukaryotic genome sequencing project can have a very large number
of ESTs. For example, a single run of the Roche "Titanium"
system on cDNA can generate over one million ESTs with average
length of 450 nucleotides (~15-fold coverage on a typical transcriptome).
Alternatively, a single run of an Illumina system on cDNA can
generate 400 million short sequencing reads of up to 108 nucleotides
(~800-fold coverage on a typical transcriptome), each of which
can be mapped against the assembled genome sequence to outline
exons. The result is a set of gene models based on real biological
data rather than on the more highly inferential methods in common
use. This removes the need for the multi-track genome browser as
the organizing factor for genome data, a mode that has always
been clumsy at best.
3. Inferring gene
function can be done much more accurately than ever before, enabling
a much more "gene-centric" approach to presenting genome
data. In the absence of biochemical data, the most accurate inference
of function is by assigning orthologous relationships to genes
across various genomes. We have developed PHRINGE ("Phylogenetic
Resources for the Interpretation of Genomes"),
a tool that reconstructs the evolutionary history of all gene
families, sorting out their patterns of orthologous (related by
lineage splitting) and paralogous (related by gene duplication)
relationships. Not only is this important for understanding the
pattern and processes of genome evolution, but this phylogenetic
framework is the best tool for accurate inference of gene function.
This is in contrast with non-phylogenetic methods in common use
that rely on simple similarity matching and that are known to
make errors, including incorrectly associating pairs of more
slowly evolving paralogs and leaving more rapidly evolving genes
unannotated. See an overview of the PHRINGE project, details of
the PHRINGE pipeline, or further explanation of why it is important
to sort orthologs and paralogs by evolutionary analysis.
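The distinction between orthologs and paralogs drawn in item 3 can be illustrated with a small sketch. This is not the PHRINGE pipeline, whose internals are not described here. It uses a simple "species overlap" heuristic on a hypothetical toy gene tree: an internal node whose child subtrees share a species is treated as a duplication, otherwise as a lineage split. The tree, the gene names, and the function names are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy gene tree as nested tuples; leaves are "species|gene".
# An ancient duplication produced copies A1 and A2 before human and mouse split.
GENE_TREE = (
    ("human|A1", "mouse|A1"),
    ("human|A2", "mouse|A2"),
)


def species_of(leaf):
    """Return the species part of a 'species|gene' leaf label."""
    return leaf.split("|")[0]


def leaves(node):
    """Collect all leaf labels below a node."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def classify_pairs(node, out=None):
    """Label every gene pair by the event at its last common ancestor.

    Species-overlap heuristic (an assumption, not PHRINGE's method):
    if two child subtrees share a species, the node is a duplication
    and pairs across those subtrees are paralogs; otherwise the node
    is a lineage split and the pairs are orthologs.
    """
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    child_leaves = [leaves(child) for child in node]
    for left, right in combinations(child_leaves, 2):
        shared = {species_of(g) for g in left} & {species_of(g) for g in right}
        label = "paralogs" if shared else "orthologs"
        for a in left:
            for b in right:
                out[frozenset((a, b))] = label
    for child in node:
        classify_pairs(child, out)
    return out


pairs = classify_pairs(GENE_TREE)
for pair, label in sorted((sorted(p), l) for p, l in pairs.items()):
    print(pair, label)
```

In this toy tree, human|A1 and mouse|A2 come out as paralogs even though they sit in different species. A similarity search alone could pair them as if they were orthologs, which is the kind of error a tree-based classification is meant to avoid.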
1. The Genome Project
Statistics Page will contain (a) information about the organism,
such as its chromosome number, genome size, locality of collection,
and taxonomy, (b) information about the sequencing process, such
as the methods used, the number of runs, and the number of nucleotides
produced, (c) information about the genome assembly, such as the
N50 and L50 measures, distribution of scaffold sizes, number of
nucleotides contained, and the proportion of the genome represented,
(d) a description of the analysis methods, and (e) an overview
of the results, such as a description of the number of genes with
identified homologs, size distributions for exons and introns,
repeat content, known transposons, number of rRNA and tRNA genes,
base composition and CpG content and its variation, and patterns
of codon usage.
2. Users will be
able to see a complete genome overview with annotations of all
genes (including those for proteins, tRNAs, and rRNAs), pseudogenes,
all identified SNPs, and transposons and other repeated elements.
Users will be able to zoom out for a bird's eye overview and zoom
in to the nucleotide level.
3. Users will be
able to search for matches to sequences they provide as input
and to download individual genes, regions of specified size adjacent
to specified genes or lists of genes, contigs and scaffolds, gene
sets, and translations of gene products.
4. Users will be
able to view and compare folded structures for tRNAs and rRNAs.
5. All genes will
be assigned functions insofar as possible by transferring gene
descriptions from orthologous genes, and will also be assigned
to categories of biochemical function using GO
and KEGG
and by identifying contained Pfam
domains.
6. Users will be
able to sort and compare genes using a wide variety of parameters
that GATOR will measure or calculate, including GO category, domain
content, EC number, length, confidence measures of both gene structure
and function, KEGG category, number of genes in each gene family,
number of exons, intracellular location, presence of transmembrane
domains, molecular weight, isoelectric point, hydrophobicity,
aliphatic index, antigenicity, SNPs contained, and whether there
is a structure in PDB for the gene or a homolog.
7. Wherever there is
a structure in PDB for a gene or any of its homologs, there will
be a link to that structure and, if appropriate, a plot of the
amino acid differences between it and the gene presented.
8. Users will be
able to specify any gene in any included genome and immediately
identify all orthologs
and paralogs in the presented genome.
9. All genes will
be compared with those of a number of other genomes using our
PHRINGE pipeline.
Users will be able to specify any gene and, for all other genomes
included in this analysis (or any specified subset), view all
those that are similar in sequence, identify all orthologs and
paralogs, compare intron-exon structures, view and download multiple
sequence alignments, see evolutionary trees of all genes, and
compare the relative locations of all orthologous genes mapped
onto scaffolds. The PHRINGE portion of the GATOR system is already
fully functional and is being used for a project on oomycete and
diatom genomes (see these results here) and for the genomes of
the monarch butterfly, Danaus plexippus, and the platyfish,
Xiphophorus maculatus.
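The N50 and L50 assembly measures mentioned for the Genome Project Statistics Page can be computed from scaffold lengths alone. The sketch below uses the usual definitions: sort scaffolds from longest to shortest and add up their lengths until at least half of the total assembly is covered. N50 is the length of the last scaffold added, and L50 is the number of scaffolds added. The function name and the example lengths are hypothetical, and this is not a description of how GATOR itself computes these values.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths.

    N50: length of the scaffold at which the running total of the
         longest-first lengths first reaches half the assembly size.
    L50: how many scaffolds were needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no scaffold lengths given")


# Hypothetical scaffold lengths (for example, in kilobases).
scaffolds = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(scaffolds)
print(f"N50 = {n50}, L50 = {l50}")
```

With these example lengths the total is 300, so the halfway point of 150 is reached after the two longest scaffolds (80 + 70).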