Opentopia Directory Encyclopedia Tools

Computational phylogenetics

Encyclopedia : C : CO : COM : Computational phylogenetics


Computational phylogenetics is the application of computational algorithms, methods and computer programs to phylogenetic analyses. Computational phylogenetics is closely related to, and makes extensive use of, sequence alignment in the process of constructing and refining phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species.

Sequence alignment and phylogenetic tree construction are intimately related due to the necessity of evaluating sequence relatedness in both processes. Progressive multiple sequence alignment (MSA) methods produce a phylogenetic tree by necessity because they incorporate new sequences into the alignment in order of relatedness. Correspondingly, some phylogenetic analysis methods construct multiple sequence alignments in the course of assigning branch roots and lengths. However, for a given gapped multiple sequence alignment, multiple rooted phylogenetic trees can be constructed that vary in their interpretations of which changes are "mutations" versus ancestral characters, and which events are insertion mutations or deletion mutations. For example, given only a pairwise alignment with a gap region, it is impossible to determine whether one sequence bears an insertion mutation or the other carries a deletion. The problem is magnified in MSAs with unaligned and nonoverlapping gapsFelsenstein J. (2004). Inferring Phylogenies Sinauer Associates: Sunderland, MA.. In practice, sizable regions of a calculated alignment may be discounted in phylogenetic tree construction to avoid integrating noisy data into the tree calculation.

Although a sequence alignment can always be constructed from a phylogenetic tree, many phylogenetics methods do not rely on the production of an initial or concurrent MSA. Such methods tend to rely on maximum parsimony or maximum likelihood analysis instead, although the three types of analysis are not mutually exclusive.

Maximum parsimony methods

Maximum parsimony (MP) is a method of identifying the potential phylogenetic tree that requires the smallest total number of evolutionary events to explain the observed sequence data. Some ways of scoring trees also include a "cost" associated with particular types of evolutionary events and attempt to locate the tree with the smallest total cost. This is a useful approach in cases where not every possible type of event is equally likely - for example, when particular nucleotides or amino acids are known to be more mutable than others.

The most naive way of identifying the most parsimonious tree is simple enumeration - considering each possible tree in succession and searching for the tree with the smallest score. However, this is only possible for a relatively small number of sequences or species because the problem of identifying the most parsimonious tree is known to be NP-hard; consequently a number of heuristic search methods for optimization have been developed to locate a highly parsimonious tree, if not the most optimal in the set. Most such methods involve a steepest descent style mechanism operating on a tree-rearrangement criterion.

Basic tree rearrangements

The simplest tree-rearrangement, known as nearest-neighbor interchange, exchanges the connectivity of four subtrees within the main tree. Because there are three possible ways of connecting four subtrees, and one is the original connectivity, each interchange creates two new trees. Exhaustively searching the possible nearest-neighbors for each possible set of subtrees is the slowest but most optimizing way of performing this search. An alternative, more wide-ranging search, subtree pruning and regrafting (SPR), selects and removes a subtree from the main tree and reinserts it elsewhere on the main tree to create a new node. Finally, tree bisection and reconnection (TBR) detaches a subtree from the main tree at an interior node and then attempts all possible connections between branches of the two trees thus created. The increasing complexity of the tree rearrangement technique correlates with increasing computational time required for the search, although not necessarily with their performanceTakahashi K, Nei M. (2000). Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol Biol Evol 17(8):1251-8..

Tree fusion

The simplest type of tree fusion begins with two trees already identified as near-optimal; thus, they most likely have the majority of their nodes correct but may fail to resolve individual tree "leaves" properly; for example, the separation ((A,B),(C,D)) at a branch tip versus ((A,C),(B,D)) may be unresolved. Tree fusion swaps these two solutions between two otherwise near-optimal trees. Variants of the method use standard genetic algorithms with a defined objective function to swap high-scoring subtrees into main trees that are high-scoring overallMatsuda H. (1996). Protein phylogenetic inference using maximum likelihood with a genetic algorithm. Pacific Symposium on Biocomputing 1996, pp512-23..

Branch and bound

The branch and bound algorithm is a general method used to increase the efficiency of searches for near-optimal solutions of NP-hard problems first applied to phylogenetics in the early 1980sHendy MD, Penny D. (1982). Branch and bound algorithms to determine minimal evolutionary trees. Math Biosci 60: 133-42.. Branch and bound is particularly well suited to phylogenetic tree construction because it inherently requires dividing a problem into a tree structure as it subdivides the problem space into smaller regions. As its name implies, it requires as input both a branching rule (in the case of phylogenetics, the addition of the next species or sequence to the tree) and a bound (a rule that excludes certain regions of the search space from consideration, assuming that the optimal solution cannot occupy that region). Identifying a good bound is the most challenging aspect of the algorithm's application to phylogenetics. A simple way of defining the bound is a maximum number of assumed evolutionary changes allowed per tree. A set of criteria known as Zharkikh's rulesRatner VA, Zharkikh AA, Kolchanov N, Rodin S, Solovyov S, Antonov AS. (1995). Molecular Evolution Biomathematics Series Vol 24. Springer-Verlag: New York, NY. severely limit the search space by defining characteristics shared by all candidate "most parsimonious" trees. The two most basic rules require the elimination of all but one redundant sequence (for cases where multiple observations have produced identical data) and the elimination of character sites at which two or more states do not occur in at least two species. Under ideal conditions these rules and their associated algorithm would completely define a tree.

Sankoff-Morel-Cedergren algorithm

The Sankoff-Morel-Cedergren algorithm was among the first published methods to simultaneously produce an MSA and a phylogenetic tree for nucleotide sequencesSankoff D, Morel C, Cedergren RJ. (1973). Evolution of 5S RNA and the non-randomness of base replacement. Nature New Biology 245:232-4.. The method uses a maximum parsimony calculation in conjunction with a scoring function that penalizes gaps and mismatches, thereby favoring the tree that introduces a minimal number of such events. The imputed sequences at the interior nodes of the tree are scored and summed over all the nodes in each possible tree. The lowest-scoring tree sum provides both an optimal tree and an optimal MSA given the scoring function. Because the method is highly computationally intensive, an approximate method in which initial guesses for the interior alignments are refined one node at a time. Both the full and the approximate version are in practice calculated by dynamic programming.

MALIGN and POY

More recent phylogenetic tree/MSA methods use heuristics to isolate high-scoring, but not necessarily optimal, trees. The MALIGN method uses a maximum-parsimony technique to compute a multiple alignment by maximizing a cladogram score, and its companion POY uses an iterative method that couples the optimization of the phylogenetic tree with improvements in the corresponding MSAWheeler WC, Gladstein DG. (1994). MALIGN: a multiple nucleic acid sequence alignment program. J Heredity 85: 417-18. However, the use of these methods in constructing evolutionary hypotheses has been criticized as biased due to the deliberate construction of trees reflecting minimal evolutionary eventsSimmons MP. (2004). Independence of alignment and tree search. Mol Phylogenet Evol 31(3):874-9.. Both programs are available for download from the [American Museum of Natural History].

Distance-matrix methods

Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore they require an MSA as an input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatchesMount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.. Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of multiple sequence alignment. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions in multiple subtrees.

Neighbor-joining

Neighbor-joining methods apply general data clustering techniques to sequence analysis using genetic distance as a clustering metric. The simple neighbor-joining method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a molecular clock) across lineages. Its relative, UPGMA (Unweighted Pair Group Method with Arithmetic mean) produces rooted trees and requires a constant-rate assumption - that is, it assumes an ultrametric tree in which the distances from the root to every branch tip are equal.

Fitch-Margoliash method

The Fitch-Margoliash method uses a weighted least squares method for clustering based on genetic distanceFitch WM, Margoliash E. (1967). Construction of phylogenetic trees. Science 155: 279-84.. Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The linearity criterion for distances requires that the expected values of the branch lengths for two individual branches must equal the expected value of the sum of the two branch distances - a property that applies to biological sequences only when they have been corrected for the possibility of back mutations at individual sites. This correction is done through the use of a substitution matrix such as that derived from the Jukes-Cantor model of DNA evolution. The distance correction is only necessary in practice when the evolution rates differ among branches.

The least-squares criterion is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the dataset can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is NP-completeDay, WHE. (1986). Computational complexity of inferring phylogenies from dissimilarity matrices. Bulletin of Mathematical Biology 49:461-7., so heuristic search methods like those used in maximum-parsimony analysis are applied to the search through tree space.

Maximum likelihood methods

The maximum likelihood method uses standard statistical techniques for inferring probability distributions to assign probabilities to particular possible phylogenetic trees. The method requires a substitution model to assess the probability of particular mutations; roughly, a tree that requires more mutations at interior nodes to explain the observed phylogeny will be assessed as having a lower probability. This is broadly similar to the maximum-parsimony method, but maximum likelihood allows additional statistical flexibility by permitting varying rates of evolution across both lineages and sites. In fact, the method requires that evolution at different sites and along different lineages must be statistically independent. Maximum likelihood is thus well suited to the analysis of distantly related sequences, but because it formally requires search of all possible combinations of tree topology and branch length, it is computationally expensive to perform on more than a few sequences.

The "pruning" algorithm, a variant of dynamic programming, is often used to reduce the search space by efficiently calculating the likelihood of subtrees. The method calculates the likelihood for each site in a "linear" manner, starting at a node whose only descendants are leaves (that is, the tips of the tree) and working backwards toward the "bottom" node in nested sets. However, the trees produced by the method are only rooted if the substitution model is irreversible, which is not generally true of biological systems. The search for the maximum-likelihood tree also includes a branch length optimization component that is difficult to improve upon algorithmically; general global optimization tools such as the Newton-Raphson method are often used. Searching tree topologies defined by likelihood has not been shown to be NP-complete, but remains extremely challenging because branch-and-bound search is not yet effective.

External links

References

[ edit]
Topics in phylogenetics
Relevant fields: phylogenetics | computational phylogenetics | molecular phylogeny | cladistics
Basic concepts: synapomorphy | phylogenetic tree | phylogenetic network | long branch attraction
Phylogeny inference methods: maximum parsimony | maximum likelihood | neighbour joining | UPGMA
Current topics: PhyloCode | DNA barcoding
List of evolutionary biology topics
[ edit]
Basic topics in evolutionary biology
Processes of evolution: adaptation - evidence - macroevolution - microevolution - speciation
Mechanisms: selection - genetic drift - gene flow - mutation
Modes: anagenesis - catagenesis - cladogenesis
History: History of evolutionary thought - Charles Darwin - The Origin of Species - modern evolutionary synthesis
Subfields: population genetics - ecological genetics - human evolution - molecular evolution - phylogenetics - systematics - evo-devo
List of evolutionary biology topics | Timeline of evolution | Timeline of human evolution

 


From Wikipedia, the Free Encyclopedia. Original article here. Support Wikipedia by contributing or donating.
All text is available under the terms of the GNU Free Documentation License See Wikipedia Copyrights for details.

Search Titles
0123456789
ABCDEFGHIJ
KLMNOPQRST
UVWXYZ?

E-mail this article to:

Personal Message: