Tokenization is the process of breaking a piece of text into smaller parts, such as sentences and words. These smaller parts are called tokens. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.
NLP is used to build applications such as sentiment analysis, question-answering systems, language translation, smart chatbots, and voice systems. To build them, it is vital to understand the patterns in the text, and tokens are very useful for finding and understanding these patterns. Tokenization can be considered the base step for other recipes such as stemming and lemmatization.
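As a rough illustration of how tokens expose patterns in text, the sketch below counts word frequencies in a short string. It uses only Python's standard library; the regular expression is a deliberately simple stand-in for a real tokenizer, and the sample text is made up for this example.

```python
import re
from collections import Counter

# Hypothetical sample text, chosen only for illustration.
text = "The cat sat on the mat. The dog sat on the log."

# Naive word tokens: lowercase the text and keep runs of letters and apostrophes.
tokens = re.findall(r"[a-z']+", text.lower())

# Count how often each token occurs; frequent tokens hint at patterns in the text.
counts = Counter(tokens)
print(counts.most_common(3))
```

Once the text is a list of tokens, the same counting approach can feed later steps such as stemming or lemmatization.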
nltk.tokenize is the package provided by NLTK to perform tokenization.
Splitting a sentence into words, or creating a list of words from a string, is an essential part of every text processing activity. Let us understand it with the help of the various functions and classes provided by the nltk.tokenize package.
The word_tokenize function is used for basic word tokenization. The following example uses it to split a sentence into words. Note that word_tokenize relies on NLTK's Punkt sentence tokenizer data, which may need to be downloaded once with nltk.download() before the example will run.
import nltk
from nltk.tokenize import word_tokenize
word_tokenize('Howcodex.com provides high quality technical tutorials for free.')
['Howcodex.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']
The word_tokenize function used above is essentially a wrapper that calls the tokenize() method on an instance of a Treebank-style word tokenizer (the TreebankWordTokenizer class in older NLTK releases; recent releases use an improved variant of it). Using TreebankWordTokenizer directly gives the same output for this sentence. Let us see the same example implemented step by step:
First, we need to import the Natural Language Toolkit (nltk).
import nltk
Now, import the TreebankWordTokenizer class to implement the word tokenizer algorithm:
from nltk.tokenize import TreebankWordTokenizer
Next, create an instance of the TreebankWordTokenizer class as follows:
tokenizer_wrd = TreebankWordTokenizer()
Now, pass in the sentence you want to convert to tokens:
tokenizer_wrd.tokenize('Howcodex.com provides high quality technical tutorials for free.')
['Howcodex.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']
Let us see the complete implementation example below:
import nltk
from nltk.tokenize import TreebankWordTokenizer
tokenizer_wrd = TreebankWordTokenizer()
tokenizer_wrd.tokenize('Howcodex.com provides high quality technical tutorials for free.')
['Howcodex.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']
One of the most notable conventions of this tokenizer is that it separates contractions. For example, if we use word_tokenize() on a contraction, it gives the following output:
import nltk
from nltk.tokenize import word_tokenize
word_tokenize("won't")
['wo', "n't"]
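To see why the output looks like this, here is a minimal sketch of the idea using only Python's re module. It is a simplified, hypothetical stand-in (the function name split_negative_contractions is invented for this example) and not NLTK's actual implementation, which applies a larger set of Treebank-style rules.

```python
import re


def split_negative_contractions(text):
    """Simplified illustration: separate the "n't" suffix from its word.

    This is an assumption-laden sketch, not the rule set NLTK really uses.
    """
    # Insert a space before "n't" when it directly follows a letter,
    # then split the result on whitespace.
    spaced = re.sub(r"(?i)(?<=[a-z])(n't)\b", r" \1", text)
    return spaced.split()


print(split_negative_contractions("won't"))       # ['wo', "n't"]
print(split_negative_contractions("I can't go"))  # ['I', 'ca', "n't", 'go']
```

Treating "n't" as its own token lets later processing steps handle negation separately from the verb.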
This convention of TreebankWordTokenizer may not be acceptable for every application. That is why NLTK offers alternative word tokenizers such as WordPunctTokenizer; older releases also shipped PunktWordTokenizer, which is no longer available in current versions.
WordPunctTokenizer is an alternative word tokenizer that splits all punctuation into separate tokens. Let us understand it with the following simple example:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(" I can't allow you to go home early")
['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
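The behaviour above follows from the regular expression that the NLTK documentation gives for WordPunctTokenizer. The sketch below applies that pattern with Python's re module directly; it is meant to explain the output, not to replace the class.

```python
import re

# Pattern documented for WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_PUNCT_PATTERN.findall(" I can't allow you to go home early")
print(tokens)
# ['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
```

Because the apostrophe is neither a word character nor whitespace, it always becomes a token of its own, which is why "can't" is split into three pieces.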
In this section, we are going to split a text or paragraph into sentences. NLTK provides the sent_tokenize function for this purpose.
An obvious question is: when we have a word tokenizer, why do we need a sentence tokenizer? Suppose we need to count the average number of words per sentence. To accomplish this task, we need both sentence tokenization and word tokenization.
Let us understand the difference between sentence and word tokenizers with the help of the following simple example:
import nltk
from nltk.tokenize import sent_tokenize
text = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."
sent_tokenize(text)
[ "Let us understand the difference between sentence & word tokenizer.", 'It is going to be a simple example.' ]
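The average-words-per-sentence task mentioned earlier can be sketched as follows. To keep the example free of downloads, it uses naive regular-expression splitting from the standard library as a stand-in for sent_tokenize and word_tokenize; the function name average_words_per_sentence is hypothetical, and the splitting rule is an assumption that will mis-handle abbreviations such as "Dr. Smith".

```python
import re


def average_words_per_sentence(text):
    """Naive sketch: split into sentences, then into whitespace-separated words."""
    # Split after '.', '!' or '?' when followed by whitespace (a simplification).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    # Whitespace-separated chunks stand in for word tokens here.
    word_counts = [len(s.split()) for s in sentences]
    return sum(word_counts) / len(word_counts)


text = ("Let us understand the difference between sentence & word tokenizer. "
        "It is going to be a simple example.")
print(average_words_per_sentence(text))
```

With NLTK installed, the two splitting steps could be swapped for sent_tokenize and word_tokenize; the word counts would then also include punctuation tokens, so the average would differ slightly.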
If you feel that the output of the word tokenizer is unacceptable and want complete control over how the text is tokenized, you can use a regular expression to define the tokens. NLTK provides the RegexpTokenizer class for this.
Let us understand the concept with the help of two examples below.
In the first example, we will use a regular expression that matches alphanumeric tokens plus single quotes, so that we don't split contractions like "won't".
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize("won't is a contraction.")
tokenizer.tokenize("can't is a contraction.")
["won't", 'is', 'a', 'contraction']
["can't", 'is', 'a', 'contraction']
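To make the role of the single quote in the pattern concrete, the following sketch compares two patterns using Python's re module directly. It only illustrates the regular expressions themselves, on the assumption that a token-matching tokenizer returns every match of its pattern.

```python
import re

sentence = "won't is a contraction."

# Without the apostrophe in the character class, the contraction is broken up.
without_quote = re.findall(r"\w+", sentence)
print(without_quote)  # ['won', 't', 'is', 'a', 'contraction']

# Adding the apostrophe keeps the contraction together as one token.
with_quote = re.findall(r"[\w']+", sentence)
print(with_quote)     # ["won't", 'is', 'a', 'contraction']
```

In both cases the final period disappears, because it matches neither pattern.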
In the second example, we will use a regular expression to tokenize on whitespace.
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
tokenizer.tokenize("won't is a contraction.")
["won't", 'is', 'a', 'contraction.']
From the above output, we can see that the punctuation remains in the tokens. The parameter gaps=True means that the pattern identifies the gaps between tokens. On the other hand, if we use gaps=False, the pattern is used to identify the tokens themselves, as the following example shows:
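import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=False)
tokenizer.tokenize("won't is a contraction.")
[' ', ' ', ' ']
The output contains only the whitespace matched by the pattern, because with gaps=False the pattern describes the tokens rather than the separators. None of the words are returned.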
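The two modes of the gaps parameter correspond closely to two standard-library operations: splitting on the pattern, and finding all matches of the pattern. The sketch below shows that correspondence with Python's re module, assuming a whitespace pattern written as r"\s+"; it is an illustration of the idea rather than NLTK's own code.

```python
import re

text = "won't is a contraction."
pattern = re.compile(r"\s+")

# gaps=True: the pattern marks the separators, so the text is split on the
# matches and empty strings are dropped.
gap_tokens = [tok for tok in pattern.split(text) if tok]

# gaps=False: the pattern marks the tokens themselves, so the matches are returned.
match_tokens = pattern.findall(text)

print(gap_tokens)    # ["won't", 'is', 'a', 'contraction.']
print(match_tokens)  # [' ', ' ', ' ']
```

A pattern that matches separators is therefore only useful with gaps=True, while a pattern that matches the tokens, such as one built around word characters, belongs with gaps=False.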