Beautiful Soup - Searching the tree


Advertisements

There are many Beautifulsoup methods, which allows us to search a parse tree. The two most common and used methods are find() and find_all().

Before talking about find() and find_all(), let us see some examples of different filters you can pass into these methods.

Kinds of Filters

We have different filters which we can pass into these methods and understanding of these filters is crucial as these filters used again and again, throughout the search API. We can use these filters based on tag’s name, on its attributes, on the text of a string, or mixed of these.

A string

One of the simplest types of filter is a string. Passing a string to the search method and Beautifulsoup will perform a match against that exact string.

Below code will find all the <p> tags in the document −

>>> markup = BeautifulSoup('<p>Top Three</p><p><pre>Programming Languages are:</pre></p><p><b>Java, Python, Cplusplus</b></p>')
>>> markup.find_all('p')
[<p>Top Three</p>, <p></p>, <p><b>Java, Python, Cplusplus</b></p>]

Regular Expression

You can find all tags starting with a given string/tag. Before that we need to import the re module to use regular expression.

>>> import re
>>> markup = BeautifulSoup('<p>Top Three</p><p><pre>Programming Languages are:</pre></p><p><b>Java, Python, Cplusplus</b></p>')
>>>
>>> markup.find_all(re.compile('^p'))
[<p>Top Three</p>, <p></p>, <pre>Programming Languages are:</pre>, <p><b>Java, Python, Cplusplus</b></p>]

List

You can pass multiple tags to find by providing a list. Below code finds all the <b> and <pre> tags −

>>> markup.find_all(['pre', 'b'])
[<pre>Programming Languages are:</pre>, <b>Java, Python, Cplusplus</b>]

True

True will return all tags that it can find, but no strings on their own −

>>> markup.find_all(True)
[<html><body><p>Top Three</p><p></p><pre>Programming Languages are:</pre>
<p><b>Java, Python, Cplusplus</b> </p> </body></html>, 
<body><p>Top Three</p><p></p><pre> Programming Languages are:</pre><p><b>Java, Python, Cplusplus</b></p>
</body>, 
<p>Top Three</p>, <p></p>, <pre>Programming Languages are:</pre>, <p><b>Java, Python, Cplusplus</b></p>, <b>Java, Python, Cplusplus</b>]

To return only the tags from the above soup −

>>> for tag in markup.find_all(True):
(tag.name)
'html'
'body'
'p'
'p'
'pre'
'p'
'b'

find_all()

You can use find_all to extract all the occurrences of a particular tag from the page response as −

Syntax

find_all(name, attrs, recursive, string, limit, **kwargs)

Let us extract some interesting data from IMDB-“Top rated movies” of all time.

>>> url="https://www.imdb.com/chart/top/?ref_=nv_mv_250"
>>> content = requests.get(url)
>>> soup = BeautifulSoup(content.text, 'html.parser')
#Extract title Page
>>> print(soup.find('title'))
<title>IMDb Top 250 - IMDb</title>

#Extracting main heading
>>> for heading in soup.find_all('h1'):
   print(heading.text)
Top Rated Movies

#Extracting sub-heading
>>> for heading in soup.find_all('h3'):
   print(heading.text)
   
IMDb Charts
You Have Seen
   IMDb Charts
   Top India Charts
Top Rated Movies by Genre
Recently Viewed

From above, we can see find_all will give us all the items matching the search criteria we define. All the filters we can use with find_all() can be used with find() and other searching methods too like find_parents() or find_siblings().

find()

We have seen above, find_all() is used to scan the entire document to find all the contents but something, the requirement is to find only one result. If you know that the document contains only one <body> tag, it is waste of time to search the entire document. One way is to call find_all() with limit=1 every time or else we can use find() method to do the same −

Syntax

find(name, attrs, recursive, string, **kwargs)

So below two different methods gives the same output −

>>> soup.find_all('title',limit=1)
[<title>IMDb Top 250 - IMDb</title>]
>>>
>>> soup.find('title')
<title>IMDb Top 250 - IMDb</title>

In the above outputs, we can see the find_all() method returns a list containing single item whereas find() method returns single result.

Another difference between find() and find_all() method is −

>>> soup.find_all('h2')
[]
>>>
>>> soup.find('h2')

If soup.find_all() method can’t find anything, it returns empty list whereas find() returns None.

find_parents() and find_parent()

Unlike the find_all() and find() methods which traverse the tree, looking at tag’s descendents, find_parents() and find_parents methods() do the opposite, they traverse the tree upwards and look at a tag’s (or a string’s) parents.

Syntax

find_parents(name, attrs, string, limit, **kwargs)
find_parent(name, attrs, string, **kwargs)

>>> a_string = soup.find(string="The Godfather")
>>> a_string
'The Godfather'
>>> a_string.find_parents('a')
[<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>]
>>> a_string.find_parent('a')
<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
>>> a_string.find_parent('tr')
<tr>

<td class="posterColumn">
<span data-value="2" name="rk"></span>
<span data-value="9.149038526210072" name="ir"></span>
<span data-value="6.93792E10" name="us"></span>
<span data-value="1485540" name="nv"></span>
<span data-value="-1.850961473789928" name="ur"></span>
<a href="/title/tt0068646/"> <img alt="The Godfather" height="67" src="https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY67_CR1,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
2.
<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
<span class="secondaryInfo">(1972)</span>
</td>
<td class="ratingColumn imdbRating">
<strong title="9.1 based on 1,485,540 user ratings">9.1</strong>
</td>
<td class="ratingColumn">
<div class="seen-widget seen-widget-tt0068646 pending" data-titleid="tt0068646">
<div class="boundary">
<div class="popover">
<span class="delete"> </span><ol><li>1<li>2<li>3<li>4<li>5<li>6<li>7<li>8<li>9<li>10</li>0</li></li></li></li&td;</li></li></li></li></li></ol> </div>
</div>
<div class="inline">
<div class="pending"></div>
<div class="unseeable">NOT YET RELEASED</div>
<div class="unseen"> </div>
<div class="rating"></div>
<div class="seen">Seen</div>
</div>
</div>
</td>
<td class="watchlistColumn">

<div class="wlb_ribbon" data-recordmetrics="true" data-tconst="tt0068646"></div>
</td>
</tr>
>>>
>>> a_string.find_parents('td')
[<td class="titleColumn">
2.
<a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
<span class="secondaryInfo">(1972)</span>
</td>]

There are eight other similar methods −

find_next_siblings(name, attrs, string, limit, **kwargs)
find_next_sibling(name, attrs, string, **kwargs)

find_previous_siblings(name, attrs, string, limit, **kwargs)
find_previous_sibling(name, attrs, string, **kwargs)

find_all_next(name, attrs, string, limit, **kwargs)
find_next(name, attrs, string, **kwargs)

find_all_previous(name, attrs, string, limit, **kwargs)
find_previous(name, attrs, string, **kwargs)

Where,

find_next_siblings() and find_next_sibling() methods will iterate over all the siblings of the element that come after the current one.

find_previous_siblings() and find_previous_sibling() methods will iterate over all the siblings that come before the current element.

find_all_next() and find_next() methods will iterate over all the tags and strings that come after the current element.

find_all_previous and find_previous() methods will iterate over all the tags and strings that come before the current element.

CSS selectors

The BeautifulSoup library to support the most commonly-used CSS selectors. You can search for elements using CSS selectors with the help of the select() method.

Here are some examples −

>>> soup.select('title')
[<title>IMDb Top 250 - IMDb</title>, <title>IMDb Top Rated Movies</title>]
>>>
>>> soup.select("p:nth-of-type(1)")
[<p>The Top Rated Movie list only includes theatrical features.</p>, <p> class="imdb-footer__copyright _2-iNNCFskmr4l2OFN2DRsf">© 1990-2019 by IMDb.com, Inc.</p>]
>>> len(soup.select("p:nth-of-type(1)"))
2
>>> len(soup.select("a"))
609
>>> len(soup.select("p"))
2

>>> soup.select("html head title")
[<title>IMDb Top 250 - IMDb</title>, <title>IMDb Top Rated Movies</title>]
>>> soup.select("head > title")
[<title>IMDb Top 250 - IMDb</title>]

#print HTML code of the tenth li elemnet
>>> soup.select("li:nth-of-type(10)")
[<li class="subnav_item_main">
<a href="/search/title?genres=film_noir&sort=user_rating,desc&title_type=feature&num_votes=25000,">Film-Noir
</a> </li>]
Advertisements