PHRINGE Pipeline

Genome Project Solutions

Partnering for Discovery

Phylogenetic Resources for the Interpretation of Genomes


	Details of the PHRINGE Analysis Pipeline

		The PHRINGE pipeline has five stages: (1) an all-against-all BLASTP search of the complete proteomes and selection of high-scoring pairs; (2) full-length alignment of all similar pairs of genes, followed by calculation of their distances; (3) iterative, hierarchical clustering that respects the evolutionary relationships among the organisms; (4) multiple sequence alignment of all genes in each cluster; and (5) construction of an evolutionary tree for each cluster and determination of the orthologous and paralogous relationships among its genes. These are described in more detail below.

	(1) The inferred amino acid sequences of all gene models from all genomes under consideration are compared in an all-against-all BLASTP search to identify sets of similar genes.
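
As a rough illustration of stage (1), the sketch below builds the BLAST+ command lines for an all-against-all search and extracts similar pairs from tabular output. The file names, the E-value cutoff, the bit-score threshold, and the helper names `blastp_commands` and `similar_pairs` are assumptions made for this example, not PHRINGE's actual settings or code.

```python
import csv
import io

# Column order of BLAST+ tabular output (-outfmt 6) with its default fields.
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blastp_commands(fasta_path, db_name, out_path, evalue=1e-5):
    """Build the two BLAST+ command lines for an all-against-all search.

    All arguments are placeholders chosen for illustration.
    """
    make_db = ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]
    search = ["blastp", "-query", fasta_path, "-db", db_name,
              "-outfmt", "6", "-evalue", str(evalue), "-out", out_path]
    return make_db, search


def similar_pairs(tabular_text, min_bitscore=50.0):
    """Return unordered pairs of distinct proteins with a hit at or above min_bitscore."""
    pairs = set()
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=BLAST_TAB_COLUMNS, delimiter="\t")
    for row in reader:
        if row["qseqid"] == row["sseqid"]:
            continue  # a protein always hits itself; ignore self hits
        if float(row["bitscore"]) >= min_bitscore:
            pairs.add(frozenset((row["qseqid"], row["sseqid"])))
    return pairs
```

Each command list can be passed to `subprocess.run(cmd, check=True)` on a machine with BLAST+ installed. Storing pairs as `frozenset` objects means the hits A to B and B to A collapse into a single pair.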


	(2) Each BLASTP alignment reports only the highly similar portions of the two amino acid sequences, so we next create a full-length alignment for each similar protein pair and then calculate the distance between the pair using a substitution matrix.
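
To make stage (2) concrete, here is a minimal sketch of a full-length (global) alignment score computed with the Needleman-Wunsch recurrence and converted into a distance. The page does not say which alignment program, substitution matrix, gap penalties, or distance formula PHRINGE uses, so `toy_score`, the linear gap penalty of -4, and the normalization in `pair_distance` are all assumptions made for illustration.

```python
def global_alignment_score(a, b, score, gap=-4):
    """Needleman-Wunsch score for aligning a and b end to end (linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            ))
        prev = curr
    return prev[-1]


def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 4 if x == y else -1


def pair_distance(a, b, score=toy_score, gap=-4):
    """Distance in [0, 1]: 0 for identical sequences, larger for more divergent ones."""
    s_ab = global_alignment_score(a, b, score, gap)
    s_self = min(global_alignment_score(a, a, score, gap),
                 global_alignment_score(b, b, score, gap))
    if s_self <= 0:
        raise ValueError("sequences must be non-empty and score positively against themselves")
    return min(1.0, max(0.0, 1.0 - s_ab / s_self))
```

Dividing by the smaller self-alignment score keeps the measure comparable across proteins of different lengths. A real substitution matrix could replace `toy_score` without changing the rest of the sketch.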



	(3) The tree shown on the right indicates the evolutionary relationships among several hypothetical organisms: four from Clade A, two from Clade B, and one that is an outgroup. The right side of the figure illustrates a protein distance graph in which circles represent proteins, colored by organism, and the spacing between circles is proportional to their sequence distance. Clusters are created by identifying a pair of sequences (a seed) separated by the shortest distance between any Clade A protein and any Clade B protein. The cluster is then grown by adding all proteins whose distance is shorter than that of the seed, until no further additions can be made. The blue cloud represents a cluster. The clustering is hierarchical, starting at the base of the tree of the organisms and working iteratively toward the tips, each time considering the genes on each side of the ingroup-outgroup split. By doing so, we place genes in the most basal clusters to which they can be traced, while also exploiting the greater accuracy that is possible when only the more similar sets of genes are considered.
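
The seed-and-grow step described above can be sketched for a single Clade A / Clade B split as follows. This is one reading of the description, not PHRINGE's code: the function name `grow_cluster`, the dictionary layout, and the rule that a protein joins when it is closer than the seed distance to any current cluster member are assumptions. The hierarchical iteration from the base of the organism tree toward its tips is not shown.

```python
def grow_cluster(dist, clade_a, clade_b):
    """Seed a cluster with the closest cross-clade pair, then grow it.

    dist maps frozenset({p, q}) to the distance between proteins p and q;
    pairs absent from dist are treated as infinitely distant.
    clade_a and clade_b are sets of protein identifiers.
    Raises ValueError if no cross-clade pair has a recorded distance.
    """
    # Seed: the shortest distance from any Clade A protein to any Clade B protein.
    seed_d, a, b = min(
        (dist[frozenset((p, q))], p, q)
        for p in clade_a for q in clade_b
        if frozenset((p, q)) in dist
    )
    cluster = {a, b}
    candidates = (clade_a | clade_b) - cluster
    added = True
    while added:  # repeat until a full pass adds nothing
        added = False
        for p in sorted(candidates):
            if any(dist.get(frozenset((p, m)), float("inf")) < seed_d for m in cluster):
                cluster.add(p)
                candidates.discard(p)
                added = True
    return (a, b), seed_d, cluster
```

Because the seed is the shortest cross-clade distance, every later addition is joined through a within-clade link that is shorter than the seed.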

		(4) A multiple sequence alignment is created for each cluster. This alignment and the comparative exon-intron structure of the genes are presented.

		(5) Evolutionary trees are created using phylogenetic methods rather than the error-prone sequence similarity methods in common use.

	The gene tree is then reconciled with the known relationships among the organisms to determine when each duplication or loss occurred relative to lineage splitting, and thus to infer the orthologous and paralogous relationships among the genes. We link extensively to functional genomics databases (below) and provide input to the Synteny Viewer (below right), where users can compare across genomes the physical relationships of genes identified as homologs.
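
The reconciliation of a gene tree with the organism tree can be illustrated with the standard last-common-ancestor (LCA) mapping, in which a gene tree node is labeled a duplication when it maps to the same organism clade as one of its children and a speciation otherwise. This is a minimal sketch under that assumption; the page does not state which reconciliation method PHRINGE uses. Trees are written as nested tuples, all names are hypothetical, and gene losses are not inferred here.

```python
def leaf_set(tree):
    """All leaf names under a nested-tuple tree (a bare string is a leaf)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def lca_clade(species_tree, species):
    """Smallest clade of species_tree, as a leaf set, that contains all given species."""
    node = species_tree
    while not isinstance(node, str):
        for child in node:
            if species <= leaf_set(child):
                node = child  # descend while one child still contains everything
                break
        else:
            break
    return leaf_set(node)


def reconcile(gene_tree, species_tree, gene_to_species, events=None):
    """Return (clade for this node, events); events holds (genes, kind) in post-order."""
    if events is None:
        events = []
    if isinstance(gene_tree, str):
        species = frozenset([gene_to_species[gene_tree]])
        return lca_clade(species_tree, species), events
    child_maps = [reconcile(child, species_tree, gene_to_species, events)[0]
                  for child in gene_tree]
    here = lca_clade(species_tree, frozenset().union(*child_maps))
    kind = "duplication" if here in child_maps else "speciation"
    events.append((sorted(leaf_set(gene_tree)), kind))
    return here, events


def classify_pair(events, gene1, gene2):
    """Orthologs if the genes' last common gene tree node is a speciation, else paralogs."""
    for genes, kind in events:  # post-order: the first node holding both genes is their LCA
        if gene1 in genes and gene2 in genes:
            return "orthologs" if kind == "speciation" else "paralogs"
    raise KeyError("genes not found together in the gene tree")
```

For example, with the organism tree `(("A", "B"), "Out")` and the gene tree `((("a1", "b1"), ("a2", "b2")), "o1")`, the node joining the two A/B gene pairs maps to the same clade as each of its children, so it is labeled a duplication, and `a1` and `a2` are classified as paralogs.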