Beautiful Soup - Souping the Page



In the previous code example, we parsed the document by passing a string to the BeautifulSoup constructor. Another way is to pass the document as an open filehandle.

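from bs4 import BeautifulSoup

# Parse a document from an open filehandle
with open("example.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Parse a document from a string
soup = BeautifulSoup("<html>data</html>", "html.parser")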

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

import bs4
html = '''<b>howcodex</b>, <i>&web scraping &data science;</i>'''
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup)

Output

<html><body><b>howcodex</b>, <i>&amp;web scraping &amp;data science;</i></body></html>
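To see the entity conversion more directly, here is a minimal sketch. The markup string is a made-up example, and it uses Python's built-in "html.parser" so that no extra parser needs to be installed. Named entities such as &eacute; become plain Unicode characters in the parsed text, while characters that are special in HTML are escaped again when the tree is printed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing named HTML entities
markup = "<p>Sacr&eacute; bleu &amp; caf&eacute; &lt;3</p>"
soup = BeautifulSoup(markup, "html.parser")

# The parsed string holds real Unicode characters, not entity names
text = soup.p.string
print(text)

# On output, only the characters that are special in HTML are escaped again
rendered = str(soup.p)
print(rendered)
```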

Beautiful Soup then parses the document using an HTML parser, unless you explicitly tell it to use an XML parser.
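The choice of parser is made through the second argument of the constructor. The following is a small sketch with a made-up markup string. It assumes that the "xml" feature is provided by the lxml package, so the code falls back gracefully when lxml is not installed. One visible difference is that an HTML parser lowercases tag names, while an XML parser keeps them as written.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup with mixed-case tag names
markup = "<Library><Book>Web Scraping</Book></Library>"

# Python's built-in HTML parser lowercases tag names
html_soup = BeautifulSoup(markup, "html.parser")
html_names = [tag.name for tag in html_soup.find_all(True)]
print(html_names)

# The "xml" feature needs the lxml package; skip it if lxml is missing
try:
    xml_soup = BeautifulSoup(markup, "xml")
    xml_names = [tag.name for tag in xml_soup.find_all(True)]
except FeatureNotFound:
    xml_names = None
print(xml_names)
```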

HTML Tree Structure

Before we look into the different components of an HTML page, let us first understand the HTML tree structure.


The root element of the document tree is html. Every other element can have a parent, children and siblings, which are determined by its position in the tree structure. To move among HTML elements, attributes and text, you have to move among the nodes of the tree.
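Moving among nodes can be sketched with Beautiful Soup's navigation attributes. The markup below is a made-up example, parsed with Python's built-in "html.parser". Note that in Beautiful Soup the parent of the html element is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written without whitespace between tags
markup = "<html><body><h1>Library</h1><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

first_p = soup.p  # the first <p> element in the document

# Move up to the parent
parent_name = first_p.parent.name

# Move down to the children of <body>
child_names = [child.name for child in soup.body.children]

# Move sideways to the siblings
previous_name = first_p.previous_sibling.name
next_text = first_p.next_sibling.string

# The root element hangs directly off the BeautifulSoup object
root_parent = soup.html.parent.name

print(parent_name, child_names, previous_name, next_text, root_parent)
```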

Let us suppose the webpage is as shown below −

Howcodex Online Library

This translates to an HTML document as follows −

<html><head><title>Howcodex</title></head><body><h1>Howcodex Online Library</h1><p><b>It's all Free</b></p></body></html>

This simply means that, for the above HTML document, we have an HTML tree structure as follows −

[Figure: tree structure of the HTML document]
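The same tree can also be printed as an indented outline. This is only a sketch: outline is a hypothetical helper written for this illustration, not part of Beautiful Soup, and the markup string mirrors the sample page, written out with explicit head and body elements.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The sample page, with explicit <head> and <body> elements
doc = ("<html><head><title>Howcodex</title></head>"
       "<body><h1>Howcodex Online Library</h1>"
       "<p><b>It's all Free</b></p></body></html>")

def outline(node, depth=0):
    """Return one indented line per tag or non-empty text node below node."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
        else:
            text = child.strip()
            if text:
                lines.append("  " * depth + "text: " + text)
    return lines

soup = BeautifulSoup(doc, "html.parser")
tree_lines = outline(soup)
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one step down the tree, so title sits under head, and b sits under p, which sits under body.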