library known as BeautifulSoup. Using this library, we can search for the values of HTML tags and extract specific data, such as the title of the page and the list of headers in the page.
Use the Anaconda package manager to install the required package and its dependencies.
conda install beautifulsoup4
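After installation, a quick check from Python can confirm that the package is importable. This is a minimal sketch: it assumes the package was installed into the currently active environment, and the variable name installed_version is only illustrative.

```python
# Confirm that Beautiful Soup is importable in the active environment.
import bs4

# bs4 exposes its version as a plain string.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```

If the import fails, the package was most likely installed into a different environment than the one running the interpreter.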
In the example below, we request a URL and load the response into the Python environment. We then pass the html.parser argument to BeautifulSoup to parse the entire HTML file. Finally, we print the first 225 characters of the formatted HTML page.
import urllib.request  # urllib2 was renamed to urllib.request in Python 3
from bs4 import BeautifulSoup

# Fetch the html file
response = urllib.request.urlopen('http://howcodex.com/python/python_overview.htm')
html_doc = response.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Format the parsed html file
strhtm = soup.prettify()

# Print the first few characters
print(strhtm[:225])
When we execute the above code, it produces the following result.
<!DOCTYPE html>
<!--[if IE 8]><html class="ie ie8"> <![endif]-->
<!--[if IE 9]><html class="ie ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<html>
 <!--<![endif]-->
 <head>
  <!-- Basic -->
  <meta charset="utf-8"/>
  <title>
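The parse-and-format step can also be tried without a network connection. The sketch below assumes a small hypothetical HTML string (sample_html) in place of the downloaded page; it is only meant to show that prettify() returns an indented string that can be sliced like any other.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
sample_html = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a string with one tag or text node per line.
pretty = soup.prettify()

# Slicing the string limits how much of the page is printed.
print(pretty[:60])
```

Working from a local string like this is a convenient way to experiment with the parser before pointing the code at a live URL.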
We can extract the value of a tag from its first instance in the page using the following code.
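import urllib.request  # urllib2 was renamed to urllib.request in Python 3
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://howcodex.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)         # the whole <title> tag
print(soup.title.string)  # the text inside the <title> tag
print(soup.a.string)      # the text of the first <a> tag
print(soup.b.string)      # the text of the first <b> tag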
When we execute the above code, it produces the following result.
<title>Python Overview</title>
Python Overview
None
Python is Interpreted
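The None in the output above comes from how the .string attribute behaves: it returns text only when a tag has a single child, and None when the tag contains more than one child. The following sketch illustrates this with a hypothetical HTML string (sample_html) in which the link holds both an image and text; get_text() is shown as one possible way to retrieve the text in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the link contains an image plus text.
sample_html = """
<html><head><title>Sample Page</title></head>
<body>
<a href="/home"><img src="logo.png"/> Home</a>
<b>First bold</b> <b>Second bold</b>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# <title> has a single text child, so .string returns that text.
title_text = soup.title.string

# <a> has two children (the <img> tag and the text), so .string is None.
link_string = soup.a.string

# get_text() joins the text of all descendants instead.
link_text = soup.a.get_text(strip=True)

# Attribute access such as soup.b returns only the first matching tag.
first_bold = soup.b.string

print(title_text, link_string, link_text, first_bold)
```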
We can extract the values of all instances of a tag using the following code.
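import urllib.request  # urllib2 was renamed to urllib.request in Python 3
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://howcodex.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

# find_all returns every <b> tag in the page, in document order
for x in soup.find_all('b'):
    print(x.string)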
When we execute the above code, it produces the following result.
Python is Interpreted
Python is Interactive
Python is Object-Oriented
Python is a Beginner's Language
Easy-to-learn
Easy-to-read
Easy-to-maintain
A broad standard library
Interactive Mode
Portable
Extendable
Databases
GUI Programming
Scalable
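The same find_all call can be used to collect the list of headers in a page, since it also accepts a list of tag names. The sketch below is only an illustration on a hypothetical HTML string (sample_html); it uses get_text() rather than .string so that tags with nested markup still yield their text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with headers and bold text.
sample_html = """
<html><body>
<h1>Overview</h1>
<p><b>Portable</b> and <b>Scalable</b></p>
<h2>Features</h2>
<p><b>Easy-to-<i>learn</i></b></p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect the text of every <b> tag, including one with a nested <i> tag.
bold_texts = [tag.get_text(strip=True) for tag in soup.find_all("b")]

# Passing a list of names matches any of them, in document order.
headers = [tag.get_text(strip=True) for tag in soup.find_all(["h1", "h2"])]

print(bold_texts)
print(headers)
```

Building lists this way keeps the extracted values available for further processing instead of only printing them.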