Etetoolkit

Automate and integrate ETE Toolkit for powerful phylogenetic tree analysis and visualization

Source: K-Dense-AI/claude-scientific-skills

ETE Toolkit is a community skill for phylogenetic tree analysis and visualization using the ETE Python library, covering tree construction, annotation, comparison, manipulation, and publication-quality rendering for evolutionary biology research.

What Is This?

Overview

ETE Toolkit provides patterns for working with phylogenetic and hierarchical tree structures in Python. It covers Newick format parsing and writing for tree data exchange, tree traversal and node manipulation for structural analysis, phylogenetic tree comparison using Robinson-Foulds and other distance metrics, tree annotation with custom data attributes on nodes and branches, and high-quality tree visualization with customizable layouts and styles. The skill enables researchers to build analysis pipelines for evolutionary biology, taxonomy, and hierarchical data exploration, including workflows that integrate with sequence alignment tools and downstream statistical analysis.

Who Should Use This

This skill serves evolutionary biologists analyzing phylogenetic relationships between species or genes, bioinformaticians building tree comparison and annotation pipelines, and researchers creating publication-quality tree figures from analysis results. It is also useful for data scientists working with any hierarchical clustering output that can be expressed in Newick format.

Why Use It?

Problems It Solves

Parsing and manipulating tree structures from Newick format requires recursive algorithms that are complex to implement correctly. Comparing trees from different phylogenetic methods needs standardized distance metrics. Annotating tree nodes with metadata from external sources requires matching identifiers across datasets. Creating publication-ready tree visualizations with custom annotations demands specialized rendering capabilities.

Core Highlights

Tree objects parse Newick strings and provide traversal, search, and manipulation methods. Robinson-Foulds distance and other metrics compare tree topologies quantitatively. Node annotation attaches custom attributes like support values and species names. TreeStyle and NodeStyle classes produce customizable visualizations with faces and layouts.
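
To make concrete what a Robinson-Foulds distance counts, here is a minimal pure-Python sketch for rooted trees written as nested tuples. It is an illustration only, not ETE's implementation: the helper names `clades` and `rf_distance` are hypothetical, and the sketch assumes both trees share the same leaf set.

```python
def clades(tree):
    """Return (leaves under `tree`, set of clades defined by its internal nodes)."""
    if isinstance(tree, str):
        # A leaf defines no internal clade.
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found


def rf_distance(tree_a, tree_b):
    """Count clades present in exactly one of the two rooted trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    # The root clade is in both sets, so it cancels out of the difference.
    return len(clades_a ^ clades_b)


t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
t3 = (("A", "B"), ("D", "C"))

print(rf_distance(t1, t2))  # the two trees share no internal clade
print(rf_distance(t1, t3))  # same topology, children listed in another order
```

In practice the library's own `robinson_foulds` method should be preferred; the sketch only shows why the metric depends on topology and not on the order in which children are written.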

How to Use It?

Basic Usage

# Only Tree is used here; TreeStyle and NodeStyle need a Qt backend.
from ete3 import Tree

t = Tree("((A:0.1,B:0.2):0.3,(C:0.3,D:0.4):0.5);")
print(t.get_ascii(show_internal=True))

for node in t.traverse("postorder"):
    if node.is_leaf():
        print(f"Leaf: {node.name}, "
              f"dist: {node.dist:.2f}")
    else:
        print(f"Internal, children: "
              f"{len(node.children)}")

node_a = t.search_nodes(name="A")[0]
print(f"A parent children: "
      f"{[n.name for n in node_a.up.children]}")

print(f"Leaves: {len(t.get_leaves())}")
print(f"Total branch length: "
      f"{sum(n.dist for n in t.traverse()):.2f}")

Real-World Examples

from ete3 import Tree

class PhyloAnalyzer:
    def __init__(self, newick: str):
        self.tree = Tree(newick)

    def compare_trees(self, other_newick: str
                      ) -> dict:
        other = Tree(other_newick)
        # robinson_foulds returns a longer tuple; only the
        # first three values are needed here.
        rf, max_rf, common, *_ = (
            self.tree.robinson_foulds(other))
        return {"rf_distance": rf,
                "max_rf": max_rf,
                "normalized_rf": rf / max_rf
                    if max_rf > 0 else 0.0,
                "common_leaves": len(common)}

    def annotate_from_table(
            self, annotations: dict) -> int:
        count = 0
        for leaf in self.tree.get_leaves():
            if leaf.name in annotations:
                for key, val in (
                        annotations[leaf.name]
                        .items()):
                    leaf.add_feature(key, val)
                count += 1
        return count

    def get_clade_members(
            self, leaf_names: list[str]
            ) -> list[str]:
        ancestor = self.tree.get_common_ancestor(
            leaf_names)
        return [l.name for l in
                ancestor.get_leaves()]

    def prune(self, keep: list[str]) -> str:
        pruned = self.tree.copy()
        pruned.prune(keep)
        return pruned.write()

tree_str = "((A:0.1,B:0.2):0.3,(C:0.3,(D:0.2,E:0.1):0.4):0.5);"
analyzer = PhyloAnalyzer(tree_str)
clade = analyzer.get_clade_members(["D", "E"])
print(f"Clade: {clade}")

Advanced Tips

Use tree.copy() before destructive operations like pruning to preserve the original tree structure. For unrooted trees, apply midpoint rooting before comparison by passing the node returned by tree.get_midpoint_outgroup() to tree.set_outgroup(); set_outgroup on its own only re-roots on the node it is given. Cache Robinson-Foulds distances when comparing many tree pairs to avoid redundant computation. When annotating nodes with external metadata, use a dictionary keyed by leaf name for efficient lookup rather than iterating the annotations list for each node.

When to Use It?

Use Cases

Build a tree comparison pipeline that quantifies topological disagreement among bootstrap replicate trees by computing Robinson-Foulds distances between them. Create an annotation tool that maps species metadata onto phylogenetic tree nodes for visualization. Implement a clade extraction workflow that identifies monophyletic groups from large phylogenies.

Important Notes

Requirements

Python 3.9 or later (the examples use list[str] type hints) with the ete3 package installed. PyQt5, needed only for interactive tree visualization and image rendering. Newick-formatted tree files as input data. Understanding of basic phylogenetic concepts like rooting, support values, and branch lengths.

Usage Recommendations

Do: validate Newick strings before parsing to catch formatting errors early. Use the copy method before operations that modify tree structure. Specify traversal strategy explicitly for deterministic node ordering.
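
A cheap pre-check for the first recommendation might look like the sketch below. It is plain Python with a hypothetical helper name, `quick_newick_check`, and it only looks at the terminating semicolon and parenthesis balance; it does not understand quoted labels and is no substitute for the parser's own error handling.

```python
def quick_newick_check(newick: str) -> list[str]:
    """Return a list of problems; an empty list means the cheap checks passed."""
    problems = []
    text = newick.strip()
    if not text.endswith(";"):
        problems.append("missing terminating semicolon")
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append(
                    "closing parenthesis without a matching opening one")
                break
    if depth > 0:
        problems.append("unclosed opening parenthesis")
    return problems


print(quick_newick_check("((A,B),C);"))
print(quick_newick_check("((A,B),C"))
```

Wrapping the `Tree(...)` call in a try/except block remains the authoritative check, since the library raises an error on input it cannot parse.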

Don't: modify tree nodes during traversal without using a copied node list, which can cause iteration errors. Compare trees with different leaf sets without pruning to common taxa first. Assume branch lengths are present when they may be omitted in Newick format.

Limitations

Interactive visualization requires a display server or Qt backend that may not be available in headless environments. Very large trees with thousands of leaves can be slow to render, so consider pruning to relevant clades before generating figures. Some advanced tree comparison methods require additional phylogenetic software.
