So far we have extracted chunks or phrases from sentences, but what are we supposed to do with them? One of the important tasks is to transform them. But why? Consider the following −
Suppose you want to judge the meaning of a phrase. Many commonly used words, such as ‘the’ and ‘a’, are insignificant or useless. For example, see the following phrase −
‘The movie was good’.
Here the most significant words are ‘movie’ and ‘good’. The other words, ‘the’ and ‘was’, are useless or insignificant, because even without them we get the same meaning from the phrase: ‘good movie’.
In the following Python recipe, we will learn how to remove useless/insignificant words and keep the significant words with the help of POS tags.
First, by looking through the treebank corpus for stopwords, we need to decide which part-of-speech tags are significant and which are not. Let us see the following table of insignificant words and their tags −
Word | Tag |
---|---|
a | DT |
All | PDT |
An | DT |
And | CC |
Or | CC |
That | WDT |
The | DT |
From the above table, we can see that, other than CC, all the other tags end with DT. This means we can filter out insignificant words by looking at the tag’s suffix.
For this example, we are going to use a function named filter(), which takes a single chunk and returns a new chunk without any insignificantly tagged words. It filters out any word whose tag ends with DT or CC.
import nltk

def filter(chunk, tag_suffixes = ['DT', 'CC']):
   significant = []
   for word, tag in chunk:
      ok = True
      for suffix in tag_suffixes:
         if tag.endswith(suffix):
            ok = False
            break
      if ok:
         significant.append((word, tag))
   return significant
Now, let us use this function filter() in our Python recipe to delete insignificant words −
from chunk_parse import filter
filter([('the', 'DT'), ('good', 'JJ'), ('movie', 'NN')])
[('good', 'JJ'), ('movie', 'NN')]
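To see the whole recipe end to end without creating a separate module, here is a minimal self-contained sketch of the same filter() logic. The sentence is hand-tagged here so that no tagger or corpus download is needed; the example chunk is illustrative, not from the text above.

```python
def filter(chunk, tag_suffixes=['DT', 'CC']):
    """Return a new chunk without words whose tag ends in DT or CC.

    Note: this name shadows Python's built-in filter(), as in the recipe.
    """
    significant = []
    for word, tag in chunk:
        if not any(tag.endswith(suffix) for suffix in tag_suffixes):
            significant.append((word, tag))
    return significant

# Hand-tagged version of 'The movie was good'
tagged = [('The', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('good', 'JJ')]
print(filter(tagged))
# [('movie', 'NN'), ('was', 'VBD'), ('good', 'JJ')]
```

Note that only ‘The’ (DT) is removed here; ‘was’ (VBD) survives because its tag ends with neither DT nor CC.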
Many times, in real-world language, we see incorrect verb forms. For example, ‘is you fine?’ is not correct: the verb form should be ‘are you fine?’. NLTK provides a way to correct such mistakes by creating verb correction mappings. These mappings are applied depending on whether there is a plural or singular noun in the chunk.
To implement this Python recipe, we first need to define the verb correction mappings. Let us create two mappings as follows −
Singular to plural mappings (used when the noun is plural)
plural = {
   ('is', 'VBZ'): ('are', 'VBP'),
   ('was', 'VBD'): ('were', 'VBD')
}
Plural to singular mappings (used when the noun is singular)
singular = {
   ('are', 'VBP'): ('is', 'VBZ'),
   ('were', 'VBD'): ('was', 'VBD')
}
As seen above, each mapping maps one tagged verb to another tagged verb. The initial mappings in our example cover the basics: is to are, was to were, and vice versa.
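The lookup itself is just a dictionary .get() with the original pair as the fallback, so any verb that has no entry in the mapping passes through unchanged. A quick sketch (the verb ‘runs’ is a hypothetical example, not part of the mappings above):

```python
plural = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD'),
}

# A mapped verb is replaced by its plural form...
print(plural.get(('is', 'VBZ'), ('is', 'VBZ')))      # ('are', 'VBP')
# ...while an unmapped verb falls back to itself unchanged.
print(plural.get(('runs', 'VBZ'), ('runs', 'VBZ')))  # ('runs', 'VBZ')
```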
Next, we will define a function named verbs(), to which you can pass a chunk with an incorrect verb form and get a corrected chunk back. To do this, verbs() uses a helper function named index_chunk(), which searches the chunk for the position of the first tagged word matching a predicate.
Let us see these functions −
def index_chunk(chunk, pred, start = 0, step = 1):
   l = len(chunk)
   end = l if step > 0 else -1
   for i in range(start, end, step):
      if pred(chunk[i]):
         return i
   return None

def tag_startswith(prefix):
   def f(wt):
      return wt[1].startswith(prefix)
   return f

def verbs(chunk):
   vbidx = index_chunk(chunk, tag_startswith('VB'))
   if vbidx is None:
      return chunk
   verb, vbtag = chunk[vbidx]
   nnpred = tag_startswith('NN')
   nnidx = index_chunk(chunk, nnpred, start = vbidx + 1)
   if nnidx is None:
      nnidx = index_chunk(chunk, nnpred, start = vbidx - 1, step = -1)
   if nnidx is None:
      return chunk
   noun, nntag = chunk[nnidx]
   if nntag.endswith('S'):
      chunk[vbidx] = plural.get((verb, vbtag), (verb, vbtag))
   else:
      chunk[vbidx] = singular.get((verb, vbtag), (verb, vbtag))
   return chunk
Save these functions in a Python file in your working directory and run it. I have saved mine as verbcorrect.py.
Now, let us call the verbs() function on a POS-tagged ‘is you fine’ chunk −
from verbcorrect import verbs
verbs([('is', 'VBZ'), ('you', 'PRP$'), ('fine', 'VBG')])
[('are', 'VBP'), ('you', 'PRP$'), ('fine','VBG')]
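To see the plural-noun branch in action, here is a self-contained restatement of the functions above, applied to a chunk containing a plural noun (‘movies’, tagged NNS — an illustrative input, not from the text). Because the noun’s tag ends with ‘S’, the singular verb ‘was’ is mapped to its plural form ‘were’.

```python
plural = {('is', 'VBZ'): ('are', 'VBP'), ('was', 'VBD'): ('were', 'VBD')}
singular = {('are', 'VBP'): ('is', 'VBZ'), ('were', 'VBD'): ('was', 'VBD')}

def index_chunk(chunk, pred, start=0, step=1):
    """Return the index of the first tagged word matching pred, or None."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

def tag_startswith(prefix):
    return lambda wt: wt[1].startswith(prefix)

def verbs(chunk):
    vbidx = index_chunk(chunk, tag_startswith('VB'))
    if vbidx is None:
        return chunk
    verb, vbtag = chunk[vbidx]
    nnpred = tag_startswith('NN')
    # Look for a noun after the verb first, then before it
    nnidx = index_chunk(chunk, nnpred, start=vbidx + 1)
    if nnidx is None:
        nnidx = index_chunk(chunk, nnpred, start=vbidx - 1, step=-1)
    if nnidx is None:
        return chunk
    noun, nntag = chunk[nnidx]
    if nntag.endswith('S'):   # plural noun -> plural verb form
        chunk[vbidx] = plural.get((verb, vbtag), (verb, vbtag))
    else:                     # singular noun -> singular verb form
        chunk[vbidx] = singular.get((verb, vbtag), (verb, vbtag))
    return chunk

print(verbs([('the', 'DT'), ('movies', 'NNS'), ('was', 'VBD')]))
# [('the', 'DT'), ('movies', 'NNS'), ('were', 'VBD')]
```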
Another useful task is to eliminate passive voice from phrases. This can be done by swapping the words around a verb. For example, ‘the tutorial was great’ can be transformed into ‘the great tutorial’.
To achieve this, we define a function named eliminate_passive() that swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. To find the verb to pivot around, it also uses the index_chunk() function defined above.
def eliminate_passive(chunk):
   def vbpred(wt):
      word, tag = wt
      return tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
   vbidx = index_chunk(chunk, vbpred)
   if vbidx is None:
      return chunk
   return chunk[vbidx + 1:] + chunk[:vbidx]
Now, let us call the eliminate_passive() function on a POS-tagged ‘the tutorial was great’ chunk −
from passiveverb import eliminate_passive
eliminate_passive([('the', 'DT'), ('tutorial', 'NN'), ('was', 'VBD'), ('great', 'JJ')])
[('great', 'JJ'), ('the', 'DT'), ('tutorial', 'NN')]
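The same pivot logic can be sketched as a self-contained example, with the verb search inlined instead of calling index_chunk(). The ‘the book was amazing’ chunk is a hypothetical input for illustration; note that a gerund (VBG) is deliberately not treated as a pivot, so a chunk like ‘running fast’ is returned unchanged.

```python
def eliminate_passive(chunk):
    def vbpred(wt):
        word, tag = wt
        # Pivot on a tensed verb tag (VBD, VBZ, ...), but never on VBG,
        # and never on the bare 'VB' tag (hence len(tag) > 2)
        return tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
    for i, wt in enumerate(chunk):
        if vbpred(wt):
            # Swap the two sides of the chunk around the pivot verb,
            # dropping the verb itself
            return chunk[i + 1:] + chunk[:i]
    return chunk  # no pivot verb found

print(eliminate_passive([('the', 'DT'), ('book', 'NN'),
                         ('was', 'VBD'), ('amazing', 'JJ')]))
# [('amazing', 'JJ'), ('the', 'DT'), ('book', 'NN')]
print(eliminate_passive([('running', 'VBG'), ('fast', 'RB')]))
# [('running', 'VBG'), ('fast', 'RB')]
```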
As we know, a cardinal word such as 5 is tagged as CD in a chunk. Cardinal words often occur before or after a noun, but for normalization purposes it is useful to always put them before the noun. For example, the date January 5 can be written as 5 January. Let us understand this with the following example.
To achieve this, we define a function named swapping_cardinals() that swaps any cardinal occurring immediately after a noun with that noun, so that the cardinal comes immediately before the noun. To do an equality comparison with the given tag, it uses a helper function named tag_eql().
def tag_eql(tag):
   def f(wt):
      return wt[1] == tag
   return f
Now we can define swapping_cardinals() −
def swapping_cardinals(chunk):
   cdidx = index_chunk(chunk, tag_eql('CD'))
   if not cdidx or not chunk[cdidx - 1][1].startswith('NN'):
      return chunk
   noun, nntag = chunk[cdidx - 1]
   chunk[cdidx - 1] = chunk[cdidx]
   chunk[cdidx] = (noun, nntag)
   return chunk
Now, let us call the swapping_cardinals() function on the date “January 5” −
from Cardinals import swapping_cardinals
swapping_cardinals([('January', 'NNP'), ('5', 'CD')])
[('5', 'CD'), ('January', 'NNP')]
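As a self-contained check of the guard conditions, the following sketch (using a generator expression in place of index_chunk(), an implementation choice of this example) shows that a cardinal already at the front of the chunk is left alone: the `not cdidx` test also covers index 0, where there is no preceding noun to swap with.

```python
def swapping_cardinals(chunk):
    # Find the index of the first CD-tagged word, or None
    cdidx = next((i for i, (w, t) in enumerate(chunk) if t == 'CD'), None)
    # `not cdidx` skips both "no cardinal" (None) and "cardinal at index 0";
    # we also require the preceding word to be a noun (NN, NNS, NNP, ...)
    if not cdidx or not chunk[cdidx - 1][1].startswith('NN'):
        return chunk
    chunk[cdidx - 1], chunk[cdidx] = chunk[cdidx], chunk[cdidx - 1]
    return chunk

print(swapping_cardinals([('January', 'NNP'), ('5', 'CD')]))
# [('5', 'CD'), ('January', 'NNP')]
print(swapping_cardinals([('5', 'CD'), ('January', 'NNP')]))
# [('5', 'CD'), ('January', 'NNP')]  -- already normalized, unchanged
```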