Stemming and lemmatization can be considered as a kind of linguistic compression. In the same sense, word replacement can be thought of as text normalization or error correction.
But why do we need word replacement? Consider tokenization: it has trouble with contractions (such as can’t, won’t, etc.). To handle such cases we can use word replacement, for example by replacing contractions with their expanded forms before tokenizing.
First, we are going to replace words that match a regular expression. For this you need a basic understanding of regular expressions and of the Python re module. In the example below, we will replace contractions with their expanded forms (e.g. “can’t” will be replaced with “cannot”), using only regular expressions.
First, import the necessary package re to work with regular expressions.
import re
from nltk.corpus import wordnet
Next, define the replacement patterns of your choice as follows −
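R_patterns = [
    # Irregular contractions come first, so that the general n't rule
    # below does not turn "won't" into "wo not".
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    # General rules: group 1 captures the word before the apostrophe.
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
]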
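The order of the patterns matters, and the \g<1> backreference is what carries the captured word into the replacement. The following is a minimal sketch that applies two of the patterns by hand with re.sub; the sample sentence and the variable names are illustrative assumptions, not part of the recipe itself.

```python
import re

# Specific rule for the irregular contraction "won't".
specific = (r"won't", "will not")
# General rule: group 1 captures the word before "n't".
general = (r"(\w+)n't", r"\g<1> not")

text = "I won't go and I didn't ask"

# Specific rule first, then the general rule.
s = text
for regex, repl in (specific, general):
    s = re.sub(regex, repl, s)
right_order = s

# General rule first: "won't" is split into "wo" + "n't" before
# the specific rule ever sees it.
s = text
for regex, repl in (general, specific):
    s = re.sub(regex, repl, s)
wrong_order = s

print(right_order)  # I will not go and I did not ask
print(wrong_order)  # I wo not go and I did not ask
```

This is why the irregular contractions are listed before the general rules in the pattern list.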
Now, create a class that can be used for replacing words −
class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        # Compile each regular expression once, when the object is created
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply every pattern, in order, to the running result
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s
Save this Python program (say replacerRE.py) and run it from the Python command prompt. After that, import the REReplacer class whenever you want to replace words. Let us see how.
from replacerRE import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")
Output:
'I will not do it'
rep_word.replace("I can't do it")
Output:
'I cannot do it'
The complete program is given below for reference −
import re
from nltk.corpus import wordnet

R_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
]

class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s
Once you have saved and run the above program, you can import the class and use it as follows −
from replacerRE import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")
Output:
'I will not do it'
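One caveat: the patterns above only match the plain ASCII apostrophe ('), so text that uses the typographic apostrophe (’, U+2019) passes through unchanged. The sketch below shows one possible way to handle this by normalizing the apostrophe first. The helper names normalize_apostrophes and expand are hypothetical and are not part of the REReplacer class.

```python
import re

# Two of the patterns from the recipe, compiled up front.
patterns = [
    (re.compile(r"can't"), "cannot"),
    (re.compile(r"(\w+)n't"), r"\g<1> not"),
]

def normalize_apostrophes(text):
    # Hypothetical helper: map the typographic apostrophe to the ASCII one.
    return text.replace("\u2019", "'")

def expand(text):
    # Apply every pattern in order, as REReplacer.replace does.
    s = text
    for regex, repl in patterns:
        s = regex.sub(repl, s)
    return s

curly = "I can\u2019t do it"

unchanged = expand(curly)                       # no pattern matches
fixed = expand(normalize_apostrophes(curly))    # "can't" is now matched

print(unchanged == curly)  # True
print(fixed)               # I cannot do it
```

The same idea applies to letter case: the patterns are case-sensitive, so "Won't" at the start of a sentence is handled by the general n't rule rather than by the specific one.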
A common practice in natural language processing (NLP) is to clean up the text before processing it. The REReplacer class created in the previous example can be used for this purpose, as a preliminary step before tokenization.
from nltk.tokenize import word_tokenize
from replacerRE import REReplacer
rep_word = REReplacer()
word_tokenize("I won't be able to do this now")
Output:
['I', 'wo', "n't", 'be', 'able', 'to', 'do', 'this', 'now']
word_tokenize(rep_word.replace("I won't be able to do this now"))
Output:
['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']
The above Python recipe shows the difference clearly: without replacement the tokenizer splits “won't” into 'wo' and "n't", whereas after regular expression replacement it produces the ordinary words 'will' and 'not'.
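The same effect can be observed without NLTK. The sketch below uses a deliberately naive regular expression tokenizer (an assumption for illustration only; it is not how word_tokenize works) together with two of the replacement patterns, to show that expanding contractions first gives cleaner tokens.

```python
import re

def simple_tokenize(text):
    # Naive stand-in tokenizer: runs of word characters,
    # or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def expand(text):
    # Specific rule first, then the general n't rule.
    s = re.sub(r"won't", "will not", text)
    return re.sub(r"(\w+)n't", r"\g<1> not", s)

sentence = "I won't be able to do this now"

raw_tokens = simple_tokenize(sentence)
clean_tokens = simple_tokenize(expand(sentence))

print(raw_tokens)
# ['I', 'won', "'", 't', 'be', 'able', 'to', 'do', 'this', 'now']
print(clean_tokens)
# ['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']
```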
Are we strictly grammatical in our everyday language? No, we are not. For example, we sometimes write ‘Hiiiiiiiiiiii Mohan’ in order to emphasize the word ‘Hi’. But a computer system does not know that ‘Hiiiiiiiiiiii’ is a variation of the word ‘Hi’. In the example below, we will create a class named Rep_word_removal which can be used for removing the repeating characters in a word.
First, import the necessary packages: re to work with regular expressions and wordnet to check whether a word exists −
import re
from nltk.corpus import wordnet
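Now, create a class that can be used for removing the repeating characters −
class Rep_word_removal(object):
    def __init__(self):
        # Group 2 followed by \2 matches one character repeated twice in a row
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Keep only one copy of the repeated character
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as WordNet recognizes the word
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        # Keep removing one repeated character per call until nothing changes
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word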
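To make the recursion easier to follow, here is a minimal sketch of the same logic that records every intermediate word. It assumes a small hard-coded vocabulary, KNOWN_WORDS, as a stand-in for the WordNet lookup so that it runs without NLTK; the function name remove_repeats and the trace list are also illustrative and not part of the class above.

```python
import re

repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")

# Hypothetical stand-in for wordnet.synsets(): a tiny fixed vocabulary.
KNOWN_WORDS = {"hi", "hello"}

def remove_repeats(word, trace):
    trace.append(word)
    # Stop as soon as the word is recognized.
    if word.lower() in KNOWN_WORDS:
        return word
    # Each substitution drops one copy of a doubled character.
    shorter = repeat_regexp.sub(r"\1\2\3", word)
    if shorter != word:
        return remove_repeats(shorter, trace)
    return shorter

steps = []
result = remove_repeats("Hellooo", steps)

print(steps)   # ['Hellooo', 'Helloo', 'Hello']
print(result)  # Hello
```

Because the vocabulary check happens before each substitution, the process stops at "Hello" and the legitimate double "l" is kept; without the check it would continue down to "Helo".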
Save this Python program (say removalrepeat.py) and run it from the Python command prompt. After that, import the Rep_word_removal class whenever you want to remove repeating characters. Let us see how.
from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii")
Output:
'Hi'
rep_word.replace("Hellooooooooooooooo")
Output:
'Hello'
import re
from nltk.corpus import wordnet

class Rep_word_removal(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        if wordnet.synsets(word):
            return word
        replace_word = self.repeat_regexp.sub(self.repl, word)
        if replace_word != word:
            return self.replace(replace_word)
        else:
            return replace_word
Once you have saved and run the above program, you can import the class and use it as follows −
from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii")
Output:
'Hi'