Requests - Web Scraping using Requests



We have already seen how to get data from a given URL using the Python Requests library. We will now scrape data from the Howcodex site, available at https://www.howcodex.com/tutorialslibrary.htm, using the following −

  • Requests Library
  • Beautiful Soup library

We have already installed the Requests library; let us now install the Beautiful Soup package. The official Beautiful Soup documentation is available at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ in case you want to explore more of its functionality.

Installing Beautiful Soup

Install Beautiful Soup with pip as shown below −

E:\prequests>pip install beautifulsoup4
Collecting beautifulsoup4
Downloading https://files.pythonhosted.org/packages/3b/c8/a55eb6ea11cd7e5ac4bacdf92bac4693b90d3ba79268be16527555e186f0/beautifulsoup4-4.8.1-py3-none-any.whl (101kB)
|████████████████████████████████| 102kB 22kB/s
Collecting soupsieve>=1.2 (from beautifulsoup4)
Downloading https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2/soupsieve-1.9.5-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.1 soupsieve-1.9.5

We now have both the Python Requests library and Beautiful Soup installed.
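To confirm that both packages can be located before going further, a quick check along these lines may help. This is only a sketch: the helper name is_installed is made up for illustration, and it relies on the standard-library function importlib.util.find_spec, which reports whether a top-level module can be found without importing it. Note that Beautiful Soup is imported under the name bs4, not beautifulsoup4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


# "bs4" is the import name of the beautifulsoup4 package.
for name in ("requests", "bs4"):
    status = "installed" if is_installed(name) else "missing"
    print(name, "is", status)
```

If either line reports "missing", rerunning the pip install command for that package is the usual next step.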

Let us now write the code that will scrape the data from the given URL.

Web scraping

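import requests
from bs4 import BeautifulSoup

# Fetch the page; res.text holds the decoded HTML body.
res = requests.get('https://www.howcodex.com/tutorialslibrary.htm')
print("The status code is", res.status_code)
print("\n")
# Parse the HTML with Python's built-in parser.
soup_data = BeautifulSoup(res.text, 'html.parser')
# Print the <title> element, then a list of every <h4> element.
print(soup_data.title)
print("\n")
print(soup_data.find_all('h4'))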

The Requests library fetches the content from the given URL, and Beautiful Soup parses it so that we can extract the details we want.
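The fetch step can fail (a network error, a timeout, or an error status such as 404), and in that case there is nothing useful to parse. One possible way to guard against this is sketched below. The function name fetch_html and the 10-second default timeout are assumptions chosen for illustration; the calls requests.get, raise_for_status and requests.exceptions.RequestException are part of the Requests library.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        res = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions instead of parsing an error page.
        res.raise_for_status()
    except requests.exceptions.RequestException as err:
        # RequestException is the base class for errors raised by Requests.
        print("Request failed:", err)
        return None
    return res.text
```

With this helper, the page could be fetched with fetch_html('https://www.howcodex.com/tutorialslibrary.htm') and passed to BeautifulSoup only when the result is not None.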

Beautiful Soup can locate data by HTML tag, class, id, CSS selector, and in many other ways. The output below shows the title of the page followed by all the h4 tags on the page.

Output

E:\prequests>python makeRequest.py
The status code is 200
<title>Free Online Tutorials and Courses</title>
[<h4>Academic</h4>, <h4>Computer Science</h4>, <h4>Digital Marketing</h4>,
<h4>Monuments</h4>, <h4>Machine Learning</h4>, <h4>Mathematics</h4>,
<h4>Mobile Development</h4>, <h4>SAP</h4>,
<h4>Software Quality</h4>, <h4>Big Data & Analytics</h4>,
<h4>Databases</h4>, <h4>Engineering Tutorials</h4>,
<h4>Mainframe Development</h4>,
<h4>Microsoft Technologies</h4>, <h4>Java Technologies</h4>,
<h4>XML Technologies</h4>, <h4>Python Technologies</h4>, <h4>Sports</h4>,
<h4>Computer Programming</h4>, <h4>DevOps</h4>, <h4>Latest Technologies</h4>,
<h4>Telecom</h4>, <h4>Exams Syllabus</h4>,
<h4>UPSC IAS Exams</h4>,
<h4>Web Development</h4>,
<h4>Scripts</h4>, <h4>Management</h4>, <h4>Soft Skills</h4>,
<h4>Selected Reading</h4>, <h4>Misc</h4>]
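
The example above looks elements up by tag name only. The other lookup styles mentioned earlier (class, id and CSS selector) can be tried out on a small piece of markup without touching the network. The HTML string below is hypothetical sample data invented for this sketch, not the content of the Howcodex page; the methods find_all, find, select and get_text are part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration.
html = """
<html><head><title>Sample Page</title></head>
<body>
  <h4 class="category">Databases</h4>
  <h4 class="category" id="featured">DevOps</h4>
  <p class="note">Not a heading</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# By tag name: every <h4> element, reduced to its text.
by_tag = [h4.get_text() for h4 in soup.find_all("h4")]

# By class: class_ has a trailing underscore because class is a Python keyword.
by_class = [tag.get_text() for tag in soup.find_all(class_="note")]

# By id: find() returns the first match, or None when nothing matches.
by_id = soup.find(id="featured").get_text()

# By CSS selector: select() returns a list of matching elements.
by_css = [tag.get_text() for tag in soup.select("h4.category")]

print(by_tag)    # ['Databases', 'DevOps']
print(by_class)  # ['Not a heading']
print(by_id)     # DevOps
print(by_css)    # ['Databases', 'DevOps']
```

Calling get_text() on each element returns only the text inside the tag, which is often more convenient than the raw list of tags printed in the output above.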