Stemming & Lemmatization


Advertisements

What is Stemming?

Stemming is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the stem of the words eating, eats, eaten is eat.

Search engines use stemming for indexing the words. That’s why rather than storing all forms of a word, a search engine can store only the stems. In this way, stemming reduces the size of the index and increases retrieval accuracy.

Various Stemming algorithms

In NLTK, stemmerI, which have stem() method, interface has all the stemmers which we are going to cover next. Let us understand it with the following diagram

Stemming Algorithms

Porter stemming algorithm

It is one of the most common stemming algorithms which is basically designed to remove and replace well-known suffixes of English words.

PorterStemmer class

NLTK has PorterStemmer class with the help of which we can easily implement Porter Stemmer algorithms for the word we want to stem. This class knows several regular word forms and suffixes with the help of which it can transform the input word to a final stem. The resulting stem is often a shorter word having the same root meaning. Let us see an example −

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the PorterStemmer class to implement the Porter Stemmer algorithm.

from nltk.stem import PorterStemmer

Next, create an instance of Porter Stemmer class as follows −

word_stemmer = PorterStemmer()

Now, input the word you want to stem.

word_stemmer.stem('writing')

Output

'write'

word_stemmer.stem('eating')

Output

'eat'

Complete implementation example

import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('writing')

Output

'write'

Lancaster stemming algorithm

It was developed at Lancaster University and it is another very common stemming algorithms.

LancasterStemmer class

NLTK has LancasterStemmer class with the help of which we can easily implement Lancaster Stemmer algorithms for the word we want to stem. Let us see an example −

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the LancasterStemmer class to implement Lancaster Stemmer algorithm

from nltk.stem import LancasterStemmer

Next, create an instance of LancasterStemmer class as follows −

Lanc_stemmer = LancasterStemmer()

Now, input the word you want to stem.

Lanc_stemmer.stem('eats')

Output

'eat'

Complete implementation example

import nltk
from nltk.stem import LancatserStemmer
Lanc_stemmer = LancasterStemmer()
Lanc_stemmer.stem('eats')

Output

'eat'

Regular Expression stemming algorithm

With the help of this stemming algorithm, we can construct our own stemmer.

RegexpStemmer class

NLTK has RegexpStemmer class with the help of which we can easily implement Regular Expression Stemmer algorithms. It basically takes a single regular expression and removes any prefix or suffix that matches the expression. Let us see an example −

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the RegexpStemmer class to implement the Regular Expression Stemmer algorithm.

from nltk.stem import RegexpStemmer

Next, create an instance of RegexpStemmer class and provides the suffix or prefix you want to remove from the word as follows −

Reg_stemmer = RegexpStemmer(‘ing’)

Now, input the word you want to stem.

Reg_stemmer.stem('eating')

Output

'eat'

Reg_stemmer.stem('ingeat')

Output

'eat'
Reg_stemmer.stem('eats')

Output

'eat'

Complete implementation example

import nltk
from nltk.stem import RegexpStemmer
Reg_stemmer = RegexpStemmer()
Reg_stemmer.stem('ingeat')

Output

'eat'

Snowball stemming algorithm

It is another very useful stemming algorithm.

SnowballStemmer class

NLTK has SnowballStemmer class with the help of which we can easily implement Snowball Stemmer algorithms. It supports 15 non-English languages. In order to use this steaming class, we need to create an instance with the name of the language we are using and then call the stem() method. Let us see an example −

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the SnowballStemmer class to implement Snowball Stemmer algorithm

from nltk.stem import SnowballStemmer

Let us see the languages it supports −

SnowballStemmer.languages

Output

(
   'arabic',
   'danish',
   'dutch',
   'english',
   'finnish',
   'french',
   'german',
   'hungarian',
   'italian',
   'norwegian',
   'porter',
   'portuguese',
   'romanian',
   'russian',
   'spanish',
   'swedish'
)

Next, create an instance of SnowballStemmer class with the language you want to use. Here, we are creating the stemmer for ‘French’ language.

French_stemmer = SnowballStemmer(‘french’)

Now, call the stem() method and input the word you want to stem.

French_stemmer.stem (‘Bonjoura’)

Output

'bonjour'

Complete implementation example

import nltk
from nltk.stem import SnowballStemmer
French_stemmer = SnowballStemmer(‘french’)
French_stemmer.stem (‘Bonjoura’)

Output

'bonjour'

What is Lemmatization?

Lemmatization technique is like stemming. The output we will get after lemmatization is called ‘lemma’, which is a root word rather than root stem, the output of stemming. After lemmatization, we will be getting a valid word that means the same thing.

NLTK provides WordNetLemmatizer class which is a thin wrapper around the wordnet corpus. This class uses morphy() function to the WordNet CorpusReader class to find a lemma. Let us understand it with an example −

Example

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the WordNetLemmatizer class to implement the lemmatization technique.

from nltk.stem import WordNetLemmatizer

Next, create an instance of WordNetLemmatizer class.

lemmatizer = WordNetLemmatizer()

Now, call the lemmatize() method and input the word of which you want to find lemma.

lemmatizer.lemmatize('eating')

Output

'eating'
lemmatizer.lemmatize('books')

Output

'book'

Complete implementation example

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('books')

Output

'book'

Difference between Stemming & Lemmatization

Let us understand the difference between Stemming and Lemmatization with the help of the following example −

import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('believes')

Output

believ

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize(' believes ')

Output

believ

The output of both programs tells the major difference between stemming and lemmatization. PorterStemmer class chops off the ‘es’ from the word. On the other hand, WordNetLemmatizer class finds a valid word. In simple words, stemming technique only looks at the form of the word whereas lemmatization technique looks at the meaning of the word. It means after applying lemmatization, we will always get a valid word.

Advertisements