In this chapter, let us understand how to handle CAPTCHAs during web scraping. A CAPTCHA is used to test whether a user is a human or a robot.
The full form of CAPTCHA is Completely Automated Public Turing test to tell Computers and Humans Apart, which clearly suggests that it is a test to determine whether the user is human or not.
A CAPTCHA is a distorted image that is usually hard for a computer program to read but that a human can still manage to understand. Many websites use CAPTCHAs to prevent bots from interacting with them.
Suppose we want to register on a website and the registration form includes a CAPTCHA. Before loading the CAPTCHA image, we need to know what information the form requires. The following Python script shows the requirements of the registration form on the website http://example.webscraping.com.
import lxml.html
import urllib.request as urllib2
import pprint
import http.cookiejar as cookielib

def form_parsing(html):
    """Return a dict mapping each named form input to its current value."""
    tree = lxml.html.fromstring(html)
    data = {}
    # cssselect() needs the separate 'cssselect' package to be installed
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data

REGISTER_URL = 'http://example.webscraping.com/places/default/user/register?_next=/places/default/index'

# Keep cookies between requests so the form key stays valid for the session
ckj = cookielib.CookieJar()
browser = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckj))
html = browser.open(REGISTER_URL).read()
form = form_parsing(html)
pprint.pprint(form)
In the above Python script, we first define a function that parses the form using the lxml Python module, and then print the form requirements as follows:
{'_formkey': '5e306d73-5774-4146-a94e-3541f22c95ab',
 '_formname': 'register',
 '_next': '/places/default/index',
 'email': '',
 'first_name': '',
 'last_name': '',
 'password': '',
 'password_two': '',
 'recaptcha_response_field': None}
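The difference between the empty strings and the None for recaptcha_response_field comes from the HTML itself: an input with value="" yields an empty string, while an input with no value attribute yields None. The following is a minimal sketch of that behaviour, assuming a made-up HTML fragment and using the standard library's html.parser in place of lxml so that it runs without extra packages. The names FormInputCollector, parse_form_stdlib and SAMPLE_HTML are hypothetical and are not part of the original script.

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect name/value pairs from <input> tags nested inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.in_form = True
        elif tag == 'input' and self.in_form:
            attributes = dict(attrs)
            # Skip unnamed inputs, as form_parsing() does
            if attributes.get('name'):
                # A missing value attribute is stored as None
                self.data[attributes['name']] = attributes.get('value')

    def handle_endtag(self, tag):
        if tag == 'form':
            self.in_form = False


def parse_form_stdlib(html):
    """Hypothetical stdlib-only counterpart of form_parsing()."""
    parser = FormInputCollector()
    parser.feed(html)
    return parser.data


# Made-up fragment that imitates the shape of a registration form
SAMPLE_HTML = '''
<form>
  <input name="first_name" type="text" value="">
  <input name="_formname" type="hidden" value="register">
  <input name="recaptcha_response_field" type="text">
  <input type="submit" value="Register">
</form>
'''

sample_form = parse_form_stdlib(SAMPLE_HTML)
```

Here the submit button is skipped because it has no name, and recaptcha_response_field is the only entry without a value.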
You can see from the above output that every field except recaptcha_response_field is understandable and straightforward. The question now is how to handle this field and download the CAPTCHA image. This can be done with the help of the Pillow Python library, as follows.
Pillow is a fork of the Python Imaging Library (PIL) with useful functions for manipulating images. It can be installed with the following command:
pip install pillow
In the next example we will use it to load the CAPTCHA:
from io import BytesIO
import base64
import lxml.html
from PIL import Image

def load_captcha(html):
    """Return the CAPTCHA embedded in the page as a Pillow Image."""
    tree = lxml.html.fromstring(html)
    # The CAPTCHA is embedded in the page as a base64 data URI
    img_data = tree.cssselect('div#recaptcha img')[0].get('src')
    # Drop the 'data:...;base64,' header and keep only the payload
    img_data = img_data.partition(',')[-1]
    # str.decode('base64') exists only in Python 2; use the base64 module
    binary_img_data = base64.b64decode(img_data)
    file_like = BytesIO(binary_img_data)
    img = Image.open(file_like)
    return img
The above Python script uses the Pillow package and defines a function, load_captcha(), that loads the CAPTCHA image. It works on the same HTML that is passed to form_parsing(), the function defined in the previous script for getting information about the registration form. The function returns the CAPTCHA as a Pillow Image object, from which the text can then be extracted as a string.
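The src attribute read by load_captcha() is a data URI, that is, a short header, a comma, and the image bytes encoded as base64. The sketch below isolates that decoding step with the standard base64 module, since a Python 3 str has no decode('base64') method. The helper name decode_data_uri and the sample value are assumptions made for illustration; the sample carries only the 8-byte PNG signature, not a real CAPTCHA.

```python
import base64

# Every PNG file starts with these 8 bytes
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def decode_data_uri(src):
    """Return the raw bytes carried by a base64 data URI (hypothetical helper)."""
    header, _, payload = src.partition(',')
    if not header.startswith('data:') or not header.endswith(';base64'):
        raise ValueError('not a base64 data URI')
    return base64.b64decode(payload)


# Made-up src value: the PNG signature alone, base64-encoded
sample_src = ('data:image/png;base64,'
              + base64.b64encode(PNG_SIGNATURE).decode('ascii'))
raw = decode_data_uri(sample_src)
```

The bytes returned this way are what gets wrapped in BytesIO and handed to Image.open().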
Once the CAPTCHA is loaded as an image, we can extract its text with the help of Optical Character Recognition (OCR), the process of extracting text from images. For this purpose, we are going to use the open source Tesseract OCR engine. The engine itself must be installed separately on the system; its Python wrapper, pytesseract, can be installed with the following command:
pip install pytesseract
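Here we extend the above Python script, which loaded the CAPTCHA using Pillow, as follows:
import pytesseract

# html is the registration page fetched in the first script
img = load_captcha(html)
img.save('captcha_original.png')

# Convert to grayscale ('L' mode: one 8-bit channel)
gray = img.convert('L')
gray.save('captcha_gray.png')

# Threshold: keep only fully black pixels (value 0), turn the rest white
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.save('captcha_thresholded.png')
The above Python script converts the CAPTCHA to grayscale and then thresholds it so that only completely black pixels are kept. The resulting black and white image is clear and easy to pass to Tesseract as follows: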
pytesseract.image_to_string(bw)
The call to pytesseract.image_to_string() returns the text of the registration form's CAPTCHA as a string.
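With the CAPTCHA text in hand, the remaining step is to place it in recaptcha_response_field together with the other form values. The following is only a minimal sketch of that step, not a confirmed recipe for this site: build_registration_payload() is a hypothetical helper, the personal details are made-up sample values, and the form dictionary is copied from the output shown earlier rather than fetched live.

```python
import urllib.parse


def build_registration_payload(form, captcha_text, **fields):
    """Return URL-encoded POST data for the form (hypothetical helper)."""
    payload = dict(form)  # copy so the parsed form is left untouched
    payload.update(fields)
    # OCR output may carry surrounding whitespace, so strip it
    payload['recaptcha_response_field'] = captcha_text.strip()
    # Leave out any field that still has no value
    payload = {k: v for k, v in payload.items() if v is not None}
    return urllib.parse.urlencode(payload).encode('utf-8')


# Form values as printed by the form-parsing script earlier in the chapter
parsed_form = {
    '_formkey': '5e306d73-5774-4146-a94e-3541f22c95ab',
    '_formname': 'register',
    '_next': '/places/default/index',
    'email': '',
    'first_name': '',
    'last_name': '',
    'password': '',
    'password_two': '',
    'recaptcha_response_field': None,
}

# Made-up sample values; 'strange\n' stands in for an OCR result
post_data = build_registration_payload(
    parsed_form,
    'strange\n',
    first_name='Ada',
    last_name='Lovelace',
    email='ada@example.com',
    password='example-password',
    password_two='example-password',
)
```

Passing the resulting bytes as the second argument to browser.open() makes urllib send a POST request instead of a GET. Whether the site accepts the submission depends on the OCR result being correct.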