Beautiful Soup - Modifying the tree



One of the important aspects of Beautiful Soup is searching the parse tree, and it also lets you change the web document to suit your requirements. You can change a tag’s properties through attributes such as .name and .string, or add content with the .append() method. You can add new strings and tags to an existing tag with the help of the .new_string() and .new_tag() methods. Other methods, such as .insert(), .insert_before() and .insert_after(), let you make further modifications to your HTML or XML document.

Changing tag names and attributes

Once you have created the soup, it is easy to make modifications such as renaming a tag, changing its attributes, adding new attributes and deleting attributes.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="bolder">Very Bold</b>')
>>> tag = soup.b

Attributes are modified and added as follows −

>>> tag.name = 'Blockquote'
>>> tag['class'] = 'Bolder'
>>> tag['id'] = 1.1
>>> tag
<Blockquote class="Bolder" id="1.1">Very Bold</Blockquote>

Attributes are deleted as follows −

>>> del tag['class']
>>> tag
<Blockquote id="1.1">Very Bold</Blockquote>
>>> del tag['id']
>>> tag
<Blockquote>Very Bold</Blockquote>
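
The same operations can be collected into a short script. This is a minimal sketch, assuming the built-in "html.parser" parser; the attribute names "id" and "title" and their values are made up for illustration:

```python
from bs4 import BeautifulSoup

# "html.parser" ships with Python, so no extra parser is needed.
soup = BeautifulSoup('<b class="bolder" id="x1">Very Bold</b>', "html.parser")
tag = soup.b

tag.name = "strong"                  # rename the tag
tag["title"] = "emphasis"            # add a new attribute
removed = tag.attrs.pop("id", None)  # delete an attribute without raising if it is absent

print(tag)            # <strong class="bolder" title="emphasis">Very Bold</strong>
print(removed)        # x1
print(tag.get("id"))  # None
```

tag.attrs behaves like a dictionary, so pop() with a default is a convenient alternative to del when the attribute may not exist.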

Modifying .string

You can easily modify the tag’s .string attribute −

>>> markup = '<a href="https://www.howcodex.com/index.htm">Must for every <i>Learner</i></a>'
>>> Bsoup = BeautifulSoup(markup)
>>> tag = Bsoup.a
>>> tag.string = "My Favourite spot."
>>> tag
<a href="https://www.howcodex.com/index.htm">My Favourite spot.</a>

As shown above, if the tag contains any other tags, they and all their contents are replaced by the new data.
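
To see that the child tag really disappears, you can count the tag's children before and after the assignment. A small sketch, assuming the built-in "html.parser" parser:

```python
from bs4 import BeautifulSoup

markup = '<a href="https://www.howcodex.com/index.htm">Must for every <i>Learner</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

children_before = len(tag.contents)  # a text node plus the <i> tag
tag.string = "My Favourite spot."
children_after = len(tag.contents)   # only the new text node remains

print(children_before, children_after)  # 2 1
print(soup.i)                           # None, the <i> tag is no longer in the tree
```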

append()

New content is added to an existing tag with the tag.append() method. It works much like the append() method of a Python list.

>>> markup = '<a href="https://www.howcodex.com/index.htm">Must for every <i>Learner</i></a>'
>>> Bsoup = BeautifulSoup(markup)
>>> Bsoup.a.append(" Really Liked it")
>>> Bsoup
<html><body><a href="https://www.howcodex.com/index.htm">Must for every <i>Learner</i> Really Liked it</a></body></html>
>>> Bsoup.a.contents
['Must for every ', <i>Learner</i>, ' Really Liked it']
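
Because append() works like its list counterpart, it can be called in a loop. The sketch below assumes the built-in "html.parser" parser and uses made-up sample text; note that each appended string stays a separate child:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Languages:</p>", "html.parser")

# Each call adds one more child at the end of the <p> tag.
for name in ["Python", "Go"]:
    soup.p.append(" " + name)

print(soup.p)                # <p>Languages: Python Go</p>
print(len(soup.p.contents))  # 3
```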

NavigableString() and .new_tag()

If you want to add a string to a document, you can do so easily with append() or with the NavigableString() constructor −

>>> soup = BeautifulSoup("<b></b>")
>>> tag = soup.b
>>> tag.append("Start")
>>>
>>> new_string = NavigableString(" Your")
>>> tag.append(new_string)
>>> tag
<b>Start Your</b>
>>> tag.contents
['Start', ' Your']

Note: If you get a NameError while calling the NavigableString() constructor, as follows −

NameError: name 'NavigableString' is not defined

Just import NavigableString directly from the bs4 package −

>>> from bs4 import NavigableString

This resolves the above error.

You can also add a comment to an existing tag, or an instance of some other subclass of NavigableString; just call the constructor.

>>> from bs4 import Comment
>>> adding_comment = Comment("Always Learn something Good!")
>>> tag.append(adding_comment)
>>> tag
<b>Start Your<!--Always Learn something Good!--></b>
>>> tag.contents
['Start', ' Your', 'Always Learn something Good!']

Adding a whole new tag (rather than appending a string to an existing tag) can be done with Beautiful Soup’s built-in method, BeautifulSoup.new_tag() −

>>> soup = BeautifulSoup("<b></b>")
>>> Otag = soup.b
>>>
>>> Newtag = soup.new_tag("a", href="https://www.howcodex.com")
>>> Otag.append(Newtag)
>>> Otag
<b><a href="https://www.howcodex.com"></a></b>

Only the first argument, the tag name, is required.
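
new_tag() and append() together are enough to build nested markup. The following is a minimal sketch, assuming the built-in "html.parser" parser; the labels and example.com URLs are placeholders, not part of the original tutorial:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul></ul>", "html.parser")

# Hypothetical link data, used only for illustration.
links = [("Docs", "https://example.com/docs"), ("Home", "https://example.com/")]

for label, url in links:
    item = soup.new_tag("li")
    link = soup.new_tag("a", href=url)  # keyword arguments become attributes
    link.string = label
    item.append(link)
    soup.ul.append(item)

print(soup.ul)
# <ul><li><a href="https://example.com/docs">Docs</a></li><li><a href="https://example.com/">Home</a></li></ul>
```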

insert()

Similar to the .insert() method of a Python list, tag.insert() inserts a new element. However, unlike tag.append(), the new element doesn’t necessarily go at the end of its parent’s contents; it can be added at any position.

>>> markup = '<a href="https://www.djangoproject.com/community/">Django Official website <i>Huge Community base</i></a>'
>>> soup = BeautifulSoup(markup)
>>> tag = soup.a
>>>
>>> tag.insert(1, "Love this framework ")
>>> tag
<a href="https://www.djangoproject.com/community/">Django Official website Love this framework <i>Huge Community base</i></a>
>>> tag.contents
['Django Official website ', 'Love this framework ', <i>Huge Community base</i>]
>>>
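
The position argument counts the tag's direct children, starting from 0. A short sketch, assuming the built-in "html.parser" parser and made-up sample text, inserts at both ends:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one <b>two</b></p>", "html.parser")
paragraph = soup.p

paragraph.insert(0, "zero ")                        # becomes the first child
paragraph.insert(len(paragraph.contents), " three")  # same effect as append()

print(paragraph)                # <p>zero one <b>two</b> three</p>
print(len(paragraph.contents))  # 4
```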

insert_before() and insert_after()

To insert some tag or string just before something in the parse tree, we use insert_before() −

>>> soup = BeautifulSoup("<b>Brave</b>")
>>> tag = soup.new_tag("i")
>>> tag.string = "Be"
>>>
>>> soup.b.string.insert_before(tag)
>>> soup.b
<b><i>Be</i>Brave</b>

Similarly, to insert a tag or string just after something in the parse tree, use insert_after().

>>> soup.b.i.insert_after(soup.new_string(" Always "))
>>> soup.b
<b><i>Be</i> Always Brave</b>
>>> soup.b.contents
[<i>Be</i>, ' Always ', 'Brave']
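
Both methods also work on whole tags, in which case the new element becomes a sibling of that tag. A small sketch, assuming the built-in "html.parser" parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Brave</b></p>", "html.parser")

label = soup.new_tag("i")
label.string = "Be"

soup.b.insert_before(label)                # new sibling placed before <b>
soup.b.insert_after(soup.new_string("!"))  # new sibling placed after <b>

print(soup.p)  # <p><i>Be</i><b>Brave</b>!</p>
```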

clear()

To remove the contents of a tag, use tag.clear() −

>>> markup = '<a href="https://www.howcodex.com/index.htm">For <i>technical & Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup)
>>> tag = soup.a
>>> tag
<a href="https://www.howcodex.com/index.htm">For <i>technical &amp; Non-technical</i> Contents</a>
>>>
>>> tag.clear()
>>> tag
<a href="https://www.howcodex.com/index.htm"></a>
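
clear() empties a tag but keeps the tag itself and its attributes, so it can be refilled afterwards. A minimal sketch, assuming the built-in "html.parser" parser and a made-up id value:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="box"><p>a</p><p>b</p></div>', "html.parser")
box = soup.div

box.clear()
emptied = str(box)  # the attributes survive, the children do not

box.append("fresh")

print(emptied)  # <div id="box"></div>
print(box)      # <div id="box">fresh</div>
```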

extract()

To remove a tag or string from the tree, use PageElement.extract(). It returns the element that was removed.

>>> markup = '<a href="https://www.howcodex.com/index.htm">For <i>technical & Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup)
>>> a_tag = soup.a
>>>
>>> i_tag = soup.i.extract()
>>>
>>> a_tag
<a href="https://www.howcodex.com/index.htm">For  Contents</a>
>>>
>>> i_tag
<i>technical &amp; Non-technical</i>
>>>
>>> print(i_tag.parent)
None
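
Because extract() hands back the removed element, that element can be attached somewhere else in the tree. A short sketch, assuming the built-in "html.parser" parser and made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><h1>Title</h1></div><footer></footer>", "html.parser")

heading = soup.h1.extract()           # detach <h1> from its parent
detached = heading.parent is None     # True while it is outside the tree

soup.footer.append(heading)           # re-attach it in a new place

print(detached)  # True
print(soup)      # <div></div><footer><h1>Title</h1></footer>
```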

decompose()

tag.decompose() removes a tag from the tree and then destroys it together with all its contents.

>>> markup = '<a href="https://www.howcodex.com/index.htm">For <i>technical & Non-technical</i> Contents</a>'
>>> soup = BeautifulSoup(markup)
>>> a_tag = soup.a
>>> a_tag
<a href="https://www.howcodex.com/index.htm">For <i>technical &amp; Non-technical</i> Contents</a>
>>>
>>> soup.i.decompose()
>>> a_tag
<a href="https://www.howcodex.com/index.htm">For  Contents</a>
>>>
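
A typical use of decompose() is dropping every tag of an unwanted kind. The sketch below assumes the built-in "html.parser" parser; the markup and the choice of script tags are only an illustration:

```python
from bs4 import BeautifulSoup

html = "<div><script>track()</script><p>Keep me</p><script>more()</script></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so removing tags while looping over it is safe.
for script in soup.find_all("script"):
    script.decompose()

print(soup)  # <div><p>Keep me</p></div>
```

Use extract() instead when you still need the removed element afterwards.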

replace_with()

As the name suggests, PageElement.replace_with() replaces the old tag or string with a new tag or string in the tree −

>>> markup = '<a href="https://www.howcodex.com/index.htm">Complete Python <i>Material</i></a>'
>>> soup = BeautifulSoup(markup)
>>> a_tag = soup.a
>>>
>>> new_tag = soup.new_tag("Official_site")
>>> new_tag.string = "https://www.python.org/"
>>> a_tag.i.replace_with(new_tag)
<i>Material</i>
>>>
>>> a_tag
<a href="https://www.howcodex.com/index.htm">Complete Python <Official_site>https://www.python.org/</Official_site></a>

As the output above shows, replace_with() returns the tag or string that was replaced (<i>Material</i> in our case), so you can examine it or add it back to another part of the tree.
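
The returned element can be reused directly. In this minimal sketch, which assumes the built-in "html.parser" parser and made-up markup, a tag is replaced by a plain string and then moved into another tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <i>world</i></p><div></div>", "html.parser")

old = soup.i.replace_with("there")  # a plain string is accepted as the replacement
soup.div.append(old)                # put the replaced tag somewhere else

print(soup)  # <p>Hello there</p><div><i>world</i></div>
```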

wrap()

PageElement.wrap() encloses an element in the tag you specify and returns the new wrapper −

>>> soup = BeautifulSoup("<p>howcodex.com</p>")
>>> soup.p.string.wrap(soup.new_tag("b"))
<b>howcodex.com</b>
>>>
>>> soup.p.wrap(soup.new_tag("Div"))
<Div><p><b>howcodex.com</b></p></Div>
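
wrap() needs a fresh wrapper tag for every element it encloses. A small sketch, assuming the built-in "html.parser" parser; the class name "item" is made up for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

for paragraph in soup.find_all("p"):
    wrapper = soup.new_tag("div")  # create a new wrapper for each paragraph
    wrapper["class"] = "item"
    paragraph.wrap(wrapper)

print(soup)
# <div class="item"><p>one</p></div><div class="item"><p>two</p></div>
```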

unwrap()

tag.unwrap() is the opposite of wrap(): it replaces a tag with whatever is inside that tag.

>>> soup = BeautifulSoup('<a href="https://www.howcodex.com/">I liked <i>howcodex</i></a>')
>>> a_tag = soup.a
>>>
>>> a_tag.i.unwrap()
<i></i>
>>> a_tag
<a href="https://www.howcodex.com/">I liked howcodex</a>

As shown above, like replace_with(), unwrap() returns the tag that was replaced.

Below is one more example of unwrap() to understand it better −

>>> soup = BeautifulSoup("<p>I <strong>AM</strong> a <i>text</i>.</p>")
>>> soup.i.unwrap()
<i></i>
>>> soup
<html><body><p>I <strong>AM</strong> a text.</p></body></html>

unwrap() is good for stripping out markup.
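
For example, all inline formatting can be stripped from a paragraph in one loop. This is a sketch under the assumption that the built-in "html.parser" parser is used and that only strong and i tags should be removed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I <strong>AM</strong> a <i>text</i>.</p>", "html.parser")

# Passing a list to find_all() matches any of the listed tag names.
for formatting_tag in soup.find_all(["strong", "i"]):
    formatting_tag.unwrap()

print(soup)  # <p>I AM a text.</p>
```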
