Some thoughts on boring stuff, and bioinformatics

Why you should use ETE for tree exploration and visualisation in Python!

If you work with trees (phylogenetic or not) and you regularly use Python, you have probably used or heard about one of the following packages: Bio.Phylo, DendroPy or ETE.

While each one of those packages has its own unique strengths and weaknesses, I particularly like the ETE module. Here is why !

This post is based on one of my past presentations at monbug. I converted the IPython notebook to this markdown with nbconvert, as described by Christopher S. Corley on his blog. The config I used with nbconvert can be found here. The GitHub repository with all the original files for the presentation can be found here: monbug_ete. You can use nbviewer to view the notebook directly if you prefer.

What’s ETE?

ETE is a Python Environment for Tree Exploration created by Jaime Huerta-Cepas.

It’s a framework that assists in the manipulation of any type of hierarchical tree (i.e. reading, writing, visualisation, annotation, etc). The latest version at the time of writing is ete3.

Installation

You can install ETE with pip : pip install ete3. Check this link for more details about optional/unmet dependencies : http://etetoolkit.org/download/

Quick introduction to the API

A great in-depth tutorial for working with the tree data structure in ETE is provided by the authors : http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html. I’m only going to give a light introduction to the API here, so I really recommend reading the official docs!

Let’s take a quick glance at the tree data structures available in ETE :

In [58]:

import ete3
import inspect
print([x[0] for x in inspect.getmembers(ete3, inspect.isclass) if x[0].endswith('Tree')])
['ClusterTree', 'EvolTree', 'NexmlTree', 'PhyloTree', 'PhyloxmlTree', 'Tree']

As you can see, there is a basic tree data structure (Tree) and more specialized ones, like PhyloTree for phylogenetics.

=> ETE can read a tree from a string or a file

In [59]:

from ete3 import Tree

rand_newick = "((((a,b), c), d), (e,f));"
rand_tree = "rand_tree"
with open(rand_tree, 'w') as TREE:
    TREE.write(rand_newick)

# t1 is read from the newick string, t2 from the file written above
t1 = Tree(rand_newick)
t2 = Tree(rand_tree)

=> In ETE, a tree is a Node. This means that the root is a Node, and so are all of its descendants.

In [61]:

print(t1)
            /-a
         /-|
      /-|   \-b
     |  |
   /-|   \-c
  |  |
--|   \-d
  |
  |   /-e
   \-|
      \-f

=> You can add information to nodes by adding features

The following code will traverse the tree t1 and add a feature sexiness to each leaf.

In [62]:

from numpy import random

# Available traversal strategies : levelorder, preorder, postorder
for node in t1.traverse("levelorder"):
    if node.is_leaf():
        # add a feature (sexiness) with a random integer value in [0, 10)
        node_rand = random.randint(10)
        node.add_features(sexiness=node_rand)

=> Features are just attributes.

In [63]:

# print t1 again with features : name and sexiness
print(t1.get_ascii(attributes=['name', 'sexiness']))
            /-a, 8
         /-|
      /-|   \-b, 1
     |  |
   /-|   \-c, 9
  |  |
--|   \-d, 3
  |
  |   /-e, 9
   \-|
      \-f, 3

=> You can search by features

In [64]:

# search by features
print(t1.search_nodes(sexiness=8))
print(t1.search_nodes(name='a'))
[Tree node 'a' (-0x7ffff810443aa570)]
[Tree node 'a' (-0x7ffff810443aa570)]

=> Here is a quick list of useful functions

In [65]:

# get sister node =====> get_sisters()
sister = (t1&'a').get_sisters()
print("\nSISTERS of  a : ")
print(sister)
SISTERS of a : 
[Tree node 'b' (0x7efbbc55ab0)]

In [66]:

# get children  =====> get_children()
root_children = t1.get_children()
print("\n\nFIRST CHILD OF ROOT")
print(root_children[0])
FIRST CHILD OF ROOT

         /-a
      /-|
   /-|   \-b
  |  |
--|   \-c
  |
   \-d

In [67]:

# Get the LCA (last common ancestor) of multiple nodes ====> get_common_ancestor()
lca = t1.get_common_ancestor(['a', 'b'])
print("\n\nLCA (a, b) : ")
print(lca)
LCA (a, b) : 

   /-a
--|
   \-b

In [68]:

# RF (Robinson-Foulds) distance between t1 and t2.
# Recall that t1 and t2 have the same newick ...
rf = t1.robinson_foulds(t2)
print("\n\nRF DISTANCE between t1 and t2 :")
print(rf[0])
RF DISTANCE between t1 and t2 :
0

Introduction to tree visualization with ete

Data : a random tree with random branches

  • Tree rendering
  • Tree Style

In [71]:

from ete3 import Tree

# Generate a random tree (yule process)
t = Tree()
t.populate(8, names_library=list('ABCDEFGHIJKL'), random_branches=True)

print(t.get_ascii(attributes=['name', 'support'], show_internal=True))
               /-G, 0.47936
     /, 0.11319
    |         |          /-F, 0.53403
    |          \, 0.52094
-, 1.0                   \-E, 0.89822
    |
    |          /-L, 0.27682
     \, 0.32620
              |          /-K, 0.50173
               \, 0.07320
                        |          /-J, 0.14208
                         \, 0.93141
                                  |         /-I, 0.05555
                                   \, 0.87512
                                            \-H, 0.81088

=> Trees can be saved as images. Supported formats are png, pdf and svg.

In [74]:

t.render('tree.png', dpi=200)

[Image: tree.png]

=> You can use TreeStyle to change how the tree is displayed

In [75]:

from ete3 import TreeStyle

ts = TreeStyle()
ts.show_branch_length = True # show branch length
ts.show_branch_support = True # show support

# rotate the tree by -30 degrees
ts.rotation = -30
t.render('tree2.png', tree_style=ts)

[Image: tree2.png]

Let’s draw a circular tree now

In [76]:

ts.rotation = 0
ts.mode = "c" # use circular mode 
ts.arc_start = -180 
ts.arc_span = 180
t.render('tree3.png', tree_style=ts, w=500)

[Image: tree3.png]

=> Faces are wonderful

Faces allow you to add graphical information to a node. A face can be a simple text or an image, or something richer like a chart or sequence domains.

Here is the list of available faces :

In [77]:

# List the available faces
from ete3 import faces
from IPython.display import Image  # only needed to display the image in a notebook

print([f for f in dir(faces) if 'Face' in f])
Image('face_positions.png')
['AttrFace', 'BarChartFace', 'CircleFace', 'DynamicItemFace', 'Face', 'ImgFace', 'OLD_SequenceFace', 'PieChartFace', 'ProfileFace', 'RandomFace', 'RectFace', 'SeqMotifFace', 'SequenceFace', 'SequencePlotFace', 'StackedBarFace', 'StaticItemFace', 'TextFace', 'TreeFace']

Faces can be added at different areas around a node.

[Image: face_positions.png]

With Faces, you can actually make things like this (treeception) :

[Image: treeception example]

It’s also possible to define a layout function that will determine how a node will be rendered. Let’s see how to do that and in which cases this could be useful with the next example.

Application 1 : Duplication|Loss history of a gene family

Data : a gene tree in newick format to which I have already added a feature (states) :

  • states = 1 ==> internal node with duplication
  • states = 0 ==> internal node with speciation

In [80]:

from ete3 import Tree
t = Tree('annoted_trees', format=2)
print(t.get_ascii(show_internal=True, attributes=['name', 'states']))
      /-Dre_1, 0
   /, 0
  |  |   /-Cfa_1, 0
  |   \, 0
-, 1     \-Hsa_1, 0
  |
  |   /-Dre_2, 0
   \, 0
      \-Cfa_2, 0

In [81]:

from ete3 import Tree, faces, TreeStyle
import utils  # local helper module (not part of ete3)

# Create a layout function
def mylayout(node):
    if node.is_leaf():
        # add a face for its scientific name
        longNameFace = faces.TextFace(utils.get_scientific_name(node))
        faces.add_face_to_node(longNameFace, node, column=1)

        # add an image Face
        node.img_style["size"] = 0
        image = utils.get_image(node.name)
        faces.add_face_to_node(faces.ImgFace(image), node, column=0, aligned=True)
    
    # If node is a duplication node
    elif int(node.states) == 1:
        # Set the style as a green square
        node.img_style["size"] = 6
        node.img_style["shape"] = "square"
        node.img_style["fgcolor"] = "green"

    # If node is a speciation node
    else:
        # Set the style as a red circle
        node.img_style["size"] = 6
        node.img_style["shape"] = "circle"
        node.img_style["fgcolor"] = "red"
        

# And, finally, display the tree using the layout function
ts = TreeStyle()
ts.show_leaf_name = False
ts.layout_fn = mylayout

t.render("tree4.png", dpi=600, tree_style = ts)

[Image: tree4.png]

Application 2 : Phylogenetic tree, protein sequence and information content

Data :

  • An alignment
  • A tree constructed using that alignment

(Actually, those two were randomly generated.)

In [82]:

from ete3 import PhyloNode, SequenceFace, faces, TreeStyle
from Bio import AlignIO
from Bio import Alphabet  # note: Bio.Alphabet was removed in Biopython 1.78, so this cell needs an older release
from Bio.Align import AlignInfo
from utils import show_file

alignment = "alignment.fasta"
tree = "phylotree.nw"

# Open tree and link alignment to it
t = PhyloNode(tree)
t.link_to_alignment(alignment)
show_file(alignment)
show_file(tree)
>A
MAEIPDETIQQFMALT---SNIAVQYLSEFGDLNEALNSY
>B
MAEIPDATIQQFMALTNVSHNIAVQY--EFGDLNEALNSY
>C
MAEIPDATIQ----LTNVSHNIAVQYLSEFGDLNEALNSY
>D
MAEAPDETIQQFMALTNVSHNIAVQYLSEFGDLNEAL---

(A,(D,(B,C)));

In [83]:

# Compute Information content with Biopython
align = AlignIO.read(alignment, 'fasta', alphabet=Alphabet.Gapped(Alphabet.IUPAC.protein))
summary_info = AlignInfo.SummaryInfo(align)
total_ic_content = summary_info.information_content()
# ic_vector is filled by the information_content() call above
ic_content = list(summary_info.ic_vector.values())

# Set TreeStyle
ts = TreeStyle()
ts.branch_vertical_margin = 10
ts.allow_face_overlap = False
ts.show_scale = False
ts.show_leaf_name = False

# Align ic plot to TreeStyle header
ic_plot = faces.SequencePlotFace(ic_content, fsize=10, col_width=14, header="Information Content", kind='bar', ylabel="ic")
ts.aligned_header.add_face(ic_plot, 1) 

#t.add_face(ic_plot,1)
t.render("%%inline", tree_style=ts, dpi=300)

[Image: inline rendering of the tree with the information content plot]
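
To get a feeling for what the plotted values mean, here is a rough sketch that computes a per-column information content by hand, using only the standard library. This is not Biopython's implementation: it assumes a uniform background over 20 amino acids and simply ignores gaps, so the numbers may differ from `ic_vector`. The names `column_information` and `ic_by_hand` are mine.

```python
import math
from collections import Counter

sequences = [
    "MAEIPDETIQQFMALT---SNIAVQYLSEFGDLNEALNSY",  # A
    "MAEIPDATIQQFMALTNVSHNIAVQY--EFGDLNEALNSY",  # B
    "MAEIPDATIQ----LTNVSHNIAVQYLSEFGDLNEALNSY",  # C
    "MAEAPDETIQQFMALTNVSHNIAVQYLSEFGDLNEAL---",  # D
]

def column_information(column, alphabet_size=20):
    """Information content of one column in bits, gaps ignored."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Maximum possible entropy minus the observed entropy
    return math.log2(alphabet_size) - entropy

# zip(*sequences) iterates over the alignment column by column
ic_by_hand = [column_information(col) for col in zip(*sequences)]

print(len(ic_by_hand))
print(round(ic_by_hand[0], 3), round(ic_by_hand[6], 3))
```

A fully conserved column (like the first one, all M) gets the maximum value, while a column split between two residues (like the seventh one, E and A) loses one bit.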

You can do a lot of things with ETE if you take the time to learn how to use it. I didn’t have time to talk about ClusterNode, EvolNode or all the other great modules of ETE, but I hope this post sparked your interest and was useful to you.

Also, READ THE DOCS.
