Following are the two reasons to transform the trees −
The first recipe we are going to discuss here is to convert a Tree or subtree back to a sentence or chunk string. This is very simple, let us see in the following example −
from nltk.corpus import treebank_chunk tree = treebank_chunk.chunked_sents()[2] ' '.join([w for w, t in tree.leaves()])
'Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a nonexecutive director of this British industrial conglomerate .'
Deep trees of nested phrases can’t be used for training a chunk hence we must flatten them before using. In the following example, we are going to use 3rd parsed sentence, which is deep tree of nested phrases, from the treebank corpus.
To achieve this, we are defining a function named deeptree_flat() that will take a single Tree and will return a new Tree that keeps only the lowest level trees. In order to do most of the work, it uses a helper function which we named as childtree_flat().
from nltk.tree import Tree def childtree_flat(trees): children = [] for t in trees: if t.height() < 3: children.extend(t.pos()) elif t.height() == 3: children.append(Tree(t.label(), t.pos())) else: children.extend(flatten_childtrees([c for c in t])) return children def deeptree_flat(tree): return Tree(tree.label(), flatten_childtrees([c for c in tree]))
Now, let us call deeptree_flat() function on 3rd parsed sentence, which is deep tree of nested phrases, from the treebank corpus. We saved these functions in a file named deeptree.py.
from deeptree import deeptree_flat from nltk.corpus import treebank deeptree_flat(treebank.parsed_sents()[2])
Tree('S', [Tree('NP', [('Rudolph', 'NNP'), ('Agnew', 'NNP')]), (',', ','), Tree('NP', [('55', 'CD'), ('years', 'NNS')]), ('old', 'JJ'), ('and', 'CC'), Tree('NP', [('former', 'JJ'), ('chairman', 'NN')]), ('of', 'IN'), Tree('NP', [('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP')]), (',', ','), ('was', 'VBD'), ('named', 'VBN'), Tree('NP-SBJ', [('*-1', '-NONE-')]), Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN')]), ('of', 'IN'), Tree('NP', [('this', 'DT'), ('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])
In the previous section, we flatten a deep tree of nested phrases by only keeping the lowest level subtrees. In this section, we are going to keep only the highest-level subtrees i.e. to build the shallow tree. In the following example we are going to use 3rd parsed sentence, which is deep tree of nested phrases, from the treebank corpus.
To achieve this, we are defining a function named tree_shallow() that will eliminate all the nested subtrees by keeping only the top subtree labels.
from nltk.tree import Tree def tree_shallow(tree): children = [] for t in tree: if t.height() < 3: children.extend(t.pos()) else: children.append(Tree(t.label(), t.pos())) return Tree(tree.label(), children)
Now, let us call tree_shallow() function on 3rd parsed sentence, which is deep tree of nested phrases, from the treebank corpus. We saved these functions in a file named shallowtree.py.
from shallowtree import shallow_tree from nltk.corpus import treebank tree_shallow(treebank.parsed_sents()[2])
Tree('S', [Tree('NP-SBJ-1', [('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','), ('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'), ('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN'), ('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP'), (',', ',')]), Tree('VP', [('was', 'VBD'), ('named', 'VBN'), ('*-1', '-NONE-'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('this', 'DT'), ('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])
We can see the difference with the help of getting the height of the trees −
from nltk.corpus import treebank tree_shallow(treebank.parsed_sents()[2]).height()
3
from nltk.corpus import treebank treebank.parsed_sents()[2].height()
9
In parse trees there are variety of Tree label types that are not present in chunk trees. But while using parse tree to train a chunker, we would like to reduce this variety by converting some of Tree labels to more common label types. For example, we have two alternative NP subtrees namely NP-SBL and NP-TMP. We can convert both of them into NP. Let us see how to do it in the following example.
To achieve this we are defining a function named tree_convert() that takes following two arguments −
This function will return a new Tree with all matching labels replaced based on the values in the mapping.
from nltk.tree import Tree def tree_convert(tree, mapping): children = [] for t in tree: if isinstance(t, Tree): children.append(convert_tree_labels(t, mapping)) else: children.append(t) label = mapping.get(tree.label(), tree.label()) return Tree(label, children)
Now, let us call tree_convert() function on 3rd parsed sentence, which is deep tree of nested phrases, from the treebank corpus. We saved these functions in a file named converttree.py.
from converttree import tree_convert from nltk.corpus import treebank mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'} convert_tree_labels(treebank.parsed_sents()[2], mapping)
Tree('S', [Tree('NP-SBJ-1', [Tree('NP', [Tree('NNP', ['Rudolph']), Tree('NNP', ['Agnew'])]), Tree(',', [',']), Tree('UCP', [Tree('ADJP', [Tree('NP', [Tree('CD', ['55']), Tree('NNS', ['years'])]), Tree('JJ', ['old'])]), Tree('CC', ['and']), Tree('NP', [Tree('NP', [Tree('JJ', ['former']), Tree('NN', ['chairman'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NNP', ['Consolidated']), Tree('NNP', ['Gold']), Tree('NNP', ['Fields']), Tree('NNP', ['PLC'])])])])]), Tree(',', [','])]), Tree('VP', [Tree('VBD', ['was']),Tree('VP', [Tree('VBN', ['named']), Tree('S', [Tree('NP', [Tree('-NONE-', ['*-1'])]), Tree('NP-PRD', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['this']), Tree('JJ', ['British']), Tree('JJ', ['industrial']), Tree('NN', ['conglomerate'])])])])])])]), Tree('.', ['.'])])