newick's Introduction

Parsing Newick trees

The Newick tree file format represents trees in a text with nested parentheses. You can specify branch lengths and similar annotations, but here we will just assume that a tree is either a simple alpha-numeric string, which we consider a leaf, or it is a list of comma-separated trees surrounded by parentheses. Something like this:

tree = '(A, (B, C))'

In this exercise, we will parse such trees and build a tree of them, represented with classes for leaves and inner nodes:

@dataclass
class Leaf:
    """
    A leaf in a tree.

    This will just be a string for our application.
    """

    name: str


@dataclass
class Node:
    """An inner node."""

    children: list[Tree]


# A tree is either a leaf or an inner node with sub-trees
Tree = Union[Leaf, Node]

Dealing directly with raw text when you want to parse something is usually a bad idea. The first step you always do is that you translate the text into a list of “tokens”, i.e., a list of the meaningful parts of the string. While it isn’t completely safe, you can use this function for the exercise:

import re
def tokenize(tree: str) -> list[str]:
    return re.findall(r'[()]|\w+', tree)

It turns the string (A,(B,C)) into ['(', 'A', '(', 'B', 'C', ')', ')'].

Here is the idea for parsing the tree:

Whenever you see a token that isn’t ), push it on the stack.
If you see a token that is neither ( or ) it must be a leaf, so create a leaf and push it to the stack.
When you see ), iteratively pop trees from the stack and collect them in a list. When you reach the first ( you stop popping, you create the new tree, and you push it on the stack.

Exercise: Implement a Newick parser based on this idea.

Exercise: If you take the tree ((A,B),C,((D,E),F)), in which order will your algorithm created Leafs and Nodes?

Recommend Projects

birc-ctib / newick Goto Github PK

newick's Introduction

Parsing Newick trees

newick's People

Contributors

Watchers

Forkers

newick's Issues

newick

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent