Stemming is a technique used to extract the base form of a word by removing affixes from it. It is rather like cutting the branches of a tree back to its stem. For example, the stem of the words eating, eats, and eaten is eat.
Search engines use stemming for indexing words: rather than storing every form of a word, a search engine can store only the stems. In this way, stemming reduces the size of the index and helps a query in one form match documents that use another form of the same word.
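As a rough illustration of this idea, the sketch below builds a tiny inverted index keyed by Porter stems (the Porter stemmer is covered in the next section). The sample documents and the index-building loop are hypothetical, used only to show that different surface forms of a word end up under a single index entry −
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
documents = {1: "the cat eats", 2: "cats were eating"}   # hypothetical sample documents

# Map each stem to the set of document ids containing some form of that word.
index = {}
for doc_id, text in documents.items():
    for token in text.lower().split():
        index.setdefault(stemmer.stem(token), set()).add(doc_id)

print(index)   # 'eats' and 'eating' are both filed under the single stem 'eat'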
In NLTK, all the stemmers we are going to cover next implement the StemmerI interface, which defines the stem() method.
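For instance, a minimal custom stemmer can be written against this interface, as sketched below; the SuffixStemmer class and its suffix list are hypothetical and serve only to show that any object providing stem() satisfies the StemmerI contract −
from nltk.stem.api import StemmerI

class SuffixStemmer(StemmerI):
    # Hypothetical stemmer that strips a few hard-coded suffixes.
    def stem(self, token):
        for suffix in ('ing', 'ed', 's'):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[:-len(suffix)]
        return token

print(SuffixStemmer().stem('eating'))   # prints 'eat'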
The Porter stemming algorithm is one of the most common stemming algorithms. It is basically designed to remove and replace well-known suffixes of English words.
NLTK has a PorterStemmer class with the help of which we can easily apply the Porter Stemmer algorithm to the word we want to stem. This class knows several regular word forms and suffixes, with the help of which it can transform the input word to a final stem. The resulting stem is often a shorter word having the same root meaning. Let us see an example −
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the PorterStemmer class to implement the Porter Stemmer algorithm.
from nltk.stem import PorterStemmer
Next, create an instance of the PorterStemmer class as follows −
word_stemmer = PorterStemmer()
Now, input the word you want to stem.
word_stemmer.stem('writing')
'write'
word_stemmer.stem('eating')
'eat'
Complete implementation example −
import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('writing')
'write'
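To get a feel for how the Porter rules behave across several word forms, we can also stem a whole list of words in one go; the word list below is an arbitrary choice −
import nltk
from nltk.stem import PorterStemmer

word_stemmer = PorterStemmer()
words = ['caresses', 'ponies', 'cats', 'running', 'writing']   # arbitrary sample words
print([word_stemmer.stem(w) for w in words])
['caress', 'poni', 'cat', 'run', 'write']
Notice that a stem such as 'poni' need not be a valid English word; this is the key difference from lemmatization, discussed later.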
The Lancaster stemming algorithm was developed at Lancaster University and is another very common stemming algorithm.
NLTK has a LancasterStemmer class with the help of which we can easily apply the Lancaster Stemmer algorithm to the word we want to stem. Let us see an example −
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the LancasterStemmer class to implement the Lancaster Stemmer algorithm.
from nltk.stem import LancasterStemmer
Next, create an instance of the LancasterStemmer class as follows −
Lanc_stemmer = LancasterStemmer()
Now, input the word you want to stem.
Lanc_stemmer.stem('eats')
'eat'
Complete implementation example −
import nltk
from nltk.stem import LancasterStemmer
Lanc_stemmer = LancasterStemmer()
Lanc_stemmer.stem('eats')
'eat'
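The Lancaster algorithm is generally more aggressive than the Porter algorithm, so it can be instructive to run both stemmers side by side on the same words; the word list below is arbitrary −
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ['eating', 'running', 'believes']:   # arbitrary sample words
    # Print both stems next to each other; Lancaster tends to produce shorter stems.
    print(word, porter.stem(word), lancaster.stem(word))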
With the help of the Regular Expression stemming algorithm, we can construct our own stemmer.
NLTK has a RegexpStemmer class with the help of which we can easily implement the Regular Expression Stemmer algorithm. It takes a single regular expression and removes any prefix or suffix that matches the expression. Let us see an example −
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the RegexpStemmer class to implement the Regular Expression Stemmer algorithm.
from nltk.stem import RegexpStemmer
Next, create an instance of the RegexpStemmer class and provide the suffix or prefix you want to remove from the word as follows −
Reg_stemmer = RegexpStemmer('ing')
Now, input the word you want to stem.
Reg_stemmer.stem('eating')
'eat'
Reg_stemmer.stem('ingeat')
'eat'
Note that the stemmer removes only text that matches the regular expression, so a word such as 'eats', which does not contain 'ing', is returned unchanged −
Reg_stemmer.stem('eats')
'eats'
Complete implementation example −
import nltk
from nltk.stem import RegexpStemmer
Reg_stemmer = RegexpStemmer('ing')
Reg_stemmer.stem('ingeat')
'eat'
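Because RegexpStemmer removes any match of the expression, it is often useful to anchor the pattern to the end of the word and to set the min argument so that very short words are left alone. A possible sketch (the pattern and sample words are chosen only for illustration) −
import nltk
from nltk.stem import RegexpStemmer

# 'ing$' matches only a trailing 'ing'; words shorter than min are returned unchanged.
suffix_stemmer = RegexpStemmer('ing$', min=5)
print(suffix_stemmer.stem('eating'))   # 'eat'
print(suffix_stemmer.stem('ingeat'))   # 'ingeat' - the leading 'ing' is not touched
print(suffix_stemmer.stem('king'))     # 'king' - shorter than min, left unchanged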
The Snowball stemming algorithm is another very useful stemming algorithm.
NLTK has a SnowballStemmer class with the help of which we can easily implement the Snowball Stemmer algorithm. It supports English as well as a number of other languages. In order to use this stemming class, we need to create an instance with the name of the language we are using and then call the stem() method. Let us see an example −
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the SnowballStemmer class to implement the Snowball Stemmer algorithm.
from nltk.stem import SnowballStemmer
Let us see the languages it supports −
SnowballStemmer.languages
('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
Next, create an instance of the SnowballStemmer class with the language you want to use. Here, we are creating a stemmer for the French language.
French_stemmer = SnowballStemmer('french')
Now, call the stem() method and input the word you want to stem.
French_stemmer.stem('Bonjoura')
'bonjour'
Complete implementation example −
import nltk
from nltk.stem import SnowballStemmer
French_stemmer = SnowballStemmer('french')
French_stemmer.stem('Bonjoura')
'bonjour'
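The same class also provides an English stemmer (the Snowball English algorithm, often referred to as Porter2), and the 'porter' entry in the languages tuple gives the original Porter algorithm for comparison; the words below are arbitrary examples −
import nltk
from nltk.stem import SnowballStemmer

english_stemmer = SnowballStemmer('english')
print(english_stemmer.stem('running'))      # 'run'
print(english_stemmer.stem('generously'))   # 'generous'
print(SnowballStemmer('porter').stem('generously'))   # 'gener' - original Porter output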
The lemmatization technique is similar to stemming. The output we get after lemmatization is called a ‘lemma’, which is a root word rather than a root stem, the output of stemming. After lemmatization, we get a valid word that means the same thing.
NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the wordnet corpus. This class uses the morphy() function of the WordNet CorpusReader class to find a lemma. Let us understand it with an example −
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the WordNetLemmatizer class to implement the lemmatization technique.
from nltk.stem import WordNetLemmatizer
Next, create an instance of the WordNetLemmatizer class.
lemmatizer = WordNetLemmatizer()
Now, call the lemmatize() method and input the word whose lemma you want to find.
lemmatizer.lemmatize('eating')
'eating'
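The word comes back unchanged because lemmatize() treats its input as a noun by default; passing the part of speech as the second argument changes the result −
lemmatizer.lemmatize('eating', pos='v')
'eat'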
lemmatizer.lemmatize('books')
'book'
Complete implementation example −
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('books')
'book'
Let us understand the difference between Stemming and Lemmatization with the help of the following example −
import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('believes')
'believ'
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('believes')
'belief'
The output of both programs shows the major difference between stemming and lemmatization. The PorterStemmer class simply chops off the ‘es’ from the word, whereas the WordNetLemmatizer class finds a valid word. In simple words, the stemming technique only looks at the form of the word, whereas the lemmatization technique looks at the meaning of the word. It means that after applying lemmatization, we will always get a valid word.
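A compact way to see this difference over several words is to run both tools in a single loop; the sketch below is illustrative, with an arbitrary word list and the lemmatizer's part of speech fixed to verbs −
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

word_stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['believes', 'eating', 'crying']:   # arbitrary sample words
    # The stem may not be a dictionary word, whereas the lemma is.
    print(word, word_stemmer.stem(word), lemmatizer.lemmatize(word, pos='v'))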