jsoup is a Java-based library for working with HTML content. It provides a convenient API to extract and manipulate data, using the best of DOM, CSS, and jQuery-like methods. It implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.
The jsoup library provides the following functionality.
Multiple Read Support − It reads and parses HTML using URL, file, or string.
CSS Selectors − It can find and extract data, using DOM traversal or CSS selectors.
DOM Manipulation − It can manipulate the HTML elements, attributes, and text.
Prevent XSS attacks − It can clean user-submitted content against a given safe white-list, to prevent XSS attacks.
Tidy − It outputs tidy HTML.
Handles invalid data − jsoup can handle unclosed tags, implicit tags and can reliably create the document structure.
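To make the last point concrete, here is a minimal sketch of jsoup repairing markup whose p tags are never closed. The class name LenientParseDemo and its helper method are illustrative names that do not appear elsewhere in this tutorial, and the snippet assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical demo class, not part of jsoup itself.
class LenientParseDemo {

    // Parses the given markup and returns every <p> element jsoup found.
    static Elements paragraphsOf(String html) {
        Document document = Jsoup.parse(html);
        return document.select("p");
    }

    public static void main(String[] args) {
        // Neither paragraph is closed, and html/head/body are missing entirely.
        Elements paragraphs = paragraphsOf("<p>One<p>Two");
        System.out.println(paragraphs.size());
        System.out.println(paragraphs.get(0).text());
        System.out.println(paragraphs.get(1).text());
    }
}
```

Running main should print 2, then One, then Two, because the parser closes the first paragraph when the second one starts and wraps both in an implied body element.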
jsoup is a library for Java, so the very first requirement is to have the JDK installed on your machine.
Requirement | Description |
---|---|
JDK | 1.5 or above. |
Memory | No minimum requirement. |
Disk Space | No minimum requirement. |
Operating System | No minimum requirement. |
First of all, open the console and execute a java command based on the operating system you are working on.
OS | Task | Command |
---|---|---|
Windows | Open Command Console | c:\> java -version |
Linux | Open Command Terminal | $ java -version |
Mac | Open Terminal | machine:~ joseph$ java -version |
Let's verify the output for all the operating systems −
OS | Output |
---|---|
Windows | java version "1.6.0_21" Java(TM) SE Runtime Environment (build 1.6.0_21-b07) Java HotSpot(TM) Client VM (build 17.0-b17, mixed mode, sharing) |
Linux | java version "1.6.0_21" Java(TM) SE Runtime Environment (build 1.6.0_21-b07) Java HotSpot(TM) Client VM (build 17.0-b17, mixed mode, sharing) |
Mac | java version "1.6.0_21" Java(TM) SE Runtime Environment (build 1.6.0_21-b07) Java HotSpot(TM)64-Bit Server VM (build 17.0-b17, mixed mode, sharing) |
If you do not have Java installed on your system, then download the Java Software Development Kit (SDK) from the following link https://www.oracle.com. We are assuming Java 1.6.0_21 as the installed version for this tutorial.
Set the JAVA_HOME environment variable to point to the base directory where Java is installed on your machine. For example:
OS | Action |
---|---|
Windows | Set the environment variable JAVA_HOME to C:\Program Files\Java\jdk1.6.0_21 |
Linux | export JAVA_HOME=/usr/local/java-current |
Mac | export JAVA_HOME=/Library/Java/Home |
Append the Java compiler location to the system path.
OS | Action |
---|---|
Windows | Append the string C:\Program Files\Java\jdk1.6.0_21\bin at the end of the system variable, Path. |
Linux | export PATH=$PATH:$JAVA_HOME/bin/ |
Mac | Not required |
Verify Java installation using the command java -version as explained above.
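The same check can also be made from inside a Java program. The sketch below uses only the standard library; JavaEnvCheck is a hypothetical class name chosen for illustration, and the values it prints depend entirely on your machine.

```java
// Hypothetical helper for inspecting the Java setup described above.
class JavaEnvCheck {

    // Version of the JVM that is running this program.
    static String javaVersion() {
        return System.getProperty("java.version");
    }

    // Value of the JAVA_HOME environment variable, or a marker if it is unset.
    static String javaHomeEnv() {
        String value = System.getenv("JAVA_HOME");
        return value == null ? "(not set)" : value;
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + javaVersion());
        System.out.println("JAVA_HOME    = " + javaHomeEnv());
    }
}
```

If JAVA_HOME shows as "(not set)", the export or Windows setting above was probably made in a different shell session than the one running the program.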
Download the latest version of the jsoup jar file from the Maven Repository. At the time of writing this tutorial, we downloaded jsoup-1.8.3.jar and copied it into the C:\jsoup folder.
OS | Archive name |
---|---|
Windows | jsoup-1.8.3.jar |
Linux | jsoup-1.8.3.jar |
Mac | jsoup-1.8.3.jar |
Set the JSOUP_HOME environment variable to point to the base directory where the jsoup jar is stored on your machine. Let's assume we've stored jsoup-1.8.3.jar in the JSOUP folder.
Sr.No | OS & Description |
---|---|
1 | Windows Set the environment variable JSOUP_HOME to C:\JSOUP |
2 | Linux export JSOUP_HOME=/usr/local/JSOUP |
3 | Mac export JSOUP_HOME=/Library/JSOUP |
Set the CLASSPATH environment variable to point to the JSOUP jar location.
Sr.No | OS & Description |
---|---|
1 | Windows Set the environment variable CLASSPATH to %CLASSPATH%;%JSOUP_HOME%\jsoup-1.8.3.jar;.; |
2 | Linux export CLASSPATH=$CLASSPATH:$JSOUP_HOME/jsoup-1.8.3.jar:. |
3 | Mac export CLASSPATH=$CLASSPATH:$JSOUP_HOME/jsoup-1.8.3.jar:. |
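One hedged way to confirm that the CLASSPATH setting took effect is to ask the JVM whether it can locate jsoup's entry class, org.jsoup.Jsoup. The sketch below uses only the standard library; ClasspathCheck and isOnClasspath are illustrative names, not part of jsoup.

```java
// Hypothetical helper that reports whether a class can be found on the classpath.
class ClasspathCheck {

    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // org.jsoup.Jsoup is the class used throughout this tutorial.
        System.out.println("jsoup on classpath: " + isOnClasspath("org.jsoup.Jsoup"));
    }
}
```

If this prints false, check that JSOUP_HOME points to the folder holding the jar and that the jar name in CLASSPATH matches the downloaded file.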
The following example showcases parsing an HTML String into a Document object.

    Document document = Jsoup.parse(html);

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML String.
html − the HTML String.
The parse(String html) method parses the input HTML into a new Document. This Document object can be used to traverse the HTML DOM and read its details.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<html><head><title>Sample Title</title></head>"
                + "<body><p>Sample Content</p></body></html>";
            Document document = Jsoup.parse(html);
            System.out.println(document.title());

            Elements paragraphs = document.getElementsByTag("p");
            for (Element paragraph : paragraphs) {
                System.out.println(paragraph.text());
            }
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Sample Title
    Sample Content

The following example showcases parsing an HTML fragment String into an Element object that represents the HTML body.

    Document document = Jsoup.parseBodyFragment(html);
    Element body = document.body();

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML String.
html − the HTML fragment String.
body − the document's body element, which contains the parsed fragment. It is the same element that document.getElementsByTag("body") selects.
The parseBodyFragment(String html) method parses the input HTML fragment into a new Document, placing the fragment inside the body element. This Document object can be used to traverse the HTML body fragment and read its details.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<div><p>Sample Content</p>";
            Document document = Jsoup.parseBodyFragment(html);
            Element body = document.body();

            Elements paragraphs = body.getElementsByTag("p");
            for (Element paragraph : paragraphs) {
                System.out.println(paragraph.text());
            }
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Sample Content

The following example showcases fetching HTML from the web using a URL and then reading its data.

    String url = "http://www.google.com";
    Document document = Jsoup.connect(url).get();

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to connect to the URL and fetch the HTML.
url − the URL of the HTML page to load.
The connect(url) method creates a connection to the URL, and the get() method fetches the page and parses it into a Document.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupTester {
        public static void main(String[] args) throws IOException {
            String url = "http://www.google.com";
            Document document = Jsoup.connect(url).get();
            System.out.println(document.title());
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.
The following example showcases loading HTML from the disk using a file and then reading its data.

    URL path = ClassLoader.getSystemResource("test.htm");
    File input = new File(path.toURI());
    Document document = Jsoup.parse(input, "UTF-8");

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML file.
path − the URL of the test.htm file, looked up on the classpath.
input − the File object that points to the HTML file to load.
The parse(File in, String charsetName) method reads the file using the given character set and parses its content into a new Document.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import java.io.File;
    import java.io.IOException;
    import java.net.URISyntaxException;
    import java.net.URL;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupTester {
        public static void main(String[] args) throws IOException, URISyntaxException {
            // test.htm must be reachable from the classpath (here, the current folder).
            URL path = ClassLoader.getSystemResource("test.htm");
            File input = new File(path.toURI());
            Document document = Jsoup.parse(input, "UTF-8");
            System.out.println(document.title());
        }
    }

test.htm
Create the following test.htm file in the C:\jsoup folder.

    <html>
       <head>
          <title>Sample Title</title>
       </head>
       <body>
          <p>Sample Content</p>
       </body>
    </html>

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Sample Title

The following example showcases the use of DOM-like methods after parsing an HTML String into a Document object.

    Document document = Jsoup.parse(html);
    Element sampleDiv = document.getElementById("sampleDiv");
    Elements links = sampleDiv.getElementsByTag("a");

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML String.
html − the HTML String.
sampleDiv − the Element object that represents the HTML element identified by id "sampleDiv".
links − the Elements object that represents the elements identified by tag "a".
The parse(String html) method parses the input HTML into a new Document. This Document object can be used to traverse the HTML DOM and read its details.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<html><head><title>Sample Title</title></head>"
                + "<body>"
                + "<p>Sample Content</p>"
                + "<div id='sampleDiv'><a href='www.google.com'>Google</a></div>"
                + "</body></html>";
            Document document = Jsoup.parse(html);
            System.out.println(document.title());

            Elements paragraphs = document.getElementsByTag("p");
            for (Element paragraph : paragraphs) {
                System.out.println(paragraph.text());
            }

            Element sampleDiv = document.getElementById("sampleDiv");
            System.out.println("Data: " + sampleDiv.text());

            Elements links = sampleDiv.getElementsByTag("a");
            for (Element link : links) {
                System.out.println("Href: " + link.attr("href"));
                System.out.println("Text: " + link.text());
            }
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Sample Title
    Sample Content
    Data: Google
    Href: www.google.com
    Text: Google

The following example showcases the use of selector methods after parsing an HTML String into a Document object. jsoup supports selectors similar to CSS selectors.

    Document document = Jsoup.parse(html);
    Elements links = document.select("a[href]");

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML String.
html − the HTML String.
links − the Elements object that holds every element matched by the selector, here all a elements that have an href attribute.
The document.select(expression) method parses the given CSS selector expression and returns the matching HTML DOM elements.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<html><head><title>Sample Title</title></head>"
                + "<body>"
                + "<p>Sample Content</p>"
                + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
                + "<h3><a>Sample</a></h3>"
                + "</div>"
                + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
                + "<img name='yahoo' src='yahoo.jpg' />"
                + "</div>"
                + "</body></html>";
            Document document = Jsoup.parse(html);

            // a with href
            Elements links = document.select("a[href]");
            for (Element link : links) {
                System.out.println("Href: " + link.attr("href"));
                System.out.println("Text: " + link.text());
            }

            // img with src ending .png
            Elements pngs = document.select("img[src$=.png]");
            for (Element png : pngs) {
                System.out.println("Name: " + png.attr("name"));
            }

            // div with class=header
            Element headerDiv = document.select("div.header").first();
            System.out.println("Id: " + headerDiv.id());

            // a that is a direct child of h3
            Elements sampleLinks = document.select("h3 > a");
            for (Element link : sampleLinks) {
                System.out.println("Text: " + link.text());
            }
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Href: www.google.com
    Text: Google
    Name: google
    Id: imageDiv
    Text: Sample

The following example showcases the method used to get an attribute of a DOM element after parsing an HTML String into a Document object.

    Document document = Jsoup.parse(html);
    Element link = document.select("a").first();
    System.out.println("Href: " + link.attr("href"));

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML String.
html − the HTML String.
link − the Element object that represents the anchor tag.
link.attr() − the attr(attribute) method retrieves the value of the element attribute.
The Element object represents a DOM element and provides various methods to get the attributes of a DOM element.
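Beyond attr(attribute), the Element class also offers hasAttr(attribute) and attributes(). The following is a small sketch of how they might be combined; AttributeDemo and firstLink are illustrative names, and the HTML string is made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of jsoup itself.
class AttributeDemo {

    // Parses the markup and returns its first anchor element.
    static Element firstLink(String html) {
        return Jsoup.parse(html).select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink("<a id='googleA' href='www.google.com'>Google</a>");

        // hasAttr tells a missing attribute apart from an empty one.
        System.out.println("Has href: " + link.hasAttr("href"));
        System.out.println("Has title: " + link.hasAttr("title"));

        // attr returns an empty String, not null, when the attribute is absent.
        System.out.println("Title is empty: " + link.attr("title").isEmpty());

        // attributes() exposes every attribute as a key/value pair.
        for (Attribute attribute : link.attributes()) {
            System.out.println(attribute.getKey() + " = " + attribute.getValue());
        }
    }
}
```

Checking hasAttr first is useful when an empty value and a missing attribute need different handling.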
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<html><head><title>Sample Title</title></head>"
                + "<body>"
                + "<p>Sample Content</p>"
                + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
                + "<h3><a>Sample</a></h3>"
                + "</div>"
                + "</body></html>";
            Document document = Jsoup.parse(html);

            // first a element in the document
            Element link = document.select("a").first();
            System.out.println("Href: " + link.attr("href"));
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Href: www.google.com

The following example showcases the method used to get the text of a DOM element after parsing an HTML String into a Document object.

    Document document = Jsoup.parse(html);
    Element link = document.select("a").first();
    System.out.println("Text: " + link.text());

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML String.
html − the HTML String.
link − the Element object that represents the anchor tag.
link.text() − the text() method retrieves the text of the element.
The Element object represents a DOM element and provides various methods to get the text of a DOM element.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<html><head><title>Sample Title</title></head>"
                + "<body>"
                + "<p>Sample Content</p>"
                + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
                + "<h3><a>Sample</a></h3>"
                + "</div>"
                + "</body></html>";
            Document document = Jsoup.parse(html);

            // first a element in the document
            Element link = document.select("a").first();
            System.out.println("Text: " + link.text());
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Text: Google

The following example showcases the methods used to get the inner HTML and outer HTML of a DOM element after parsing an HTML String into a Document object.

    Document document = Jsoup.parse(html);
    Element link = document.select("a").first();
    System.out.println("Outer HTML: " + link.outerHtml());
    System.out.println("Inner HTML: " + link.html());

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML String.
html − the HTML String.
link − the Element object that represents the anchor tag.
link.outerHtml() − the outerHtml() method retrieves the complete HTML of the element, including its own tag.
link.html() − the html() method retrieves the inner HTML of the element.
The Element object represents a DOM element and provides various methods to get the HTML of a DOM element.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<html><head><title>Sample Title</title></head>"
                + "<body>"
                + "<p>Sample Content</p>"
                + "<div id='sampleDiv'><a href='www.google.com'>Google</a>"
                + "<h3><a>Sample</a></h3>"
                + "</div>"
                + "</body></html>";
            Document document = Jsoup.parse(html);

            // first a element in the document
            Element link = document.select("a").first();
            System.out.println("Outer HTML: " + link.outerHtml());
            System.out.println("Inner HTML: " + link.html());
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Outer HTML: <a href="www.google.com">Google</a>
    Inner HTML: Google

The following example showcases the methods that provide the relative as well as the absolute URLs present in an HTML page.

    String url = "http://www.howcodex.com/";
    Document document = Jsoup.connect(url).get();
    Element link = document.select("a").first();
    System.out.println("Relative Link: " + link.attr("href"));
    System.out.println("Absolute Link: " + link.attr("abs:href"));
    System.out.println("Absolute Link: " + link.absUrl("href"));

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to connect to a URL and get the HTML content.
link − the Element object that represents the anchor tag.
link.attr("href") − provides the value of href exactly as written in the anchor tag. It may be relative or absolute.
link.attr("abs:href") − provides the absolute URL after resolving the href against the document's base URI.
link.absUrl("href") − provides the same absolute URL, also resolved against the document's base URI.
The Element object represents a DOM element and provides methods to get the relative as well as the absolute URLs present in the HTML page.
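To illustrate what "resolving against the document's base URI" means, here is a sketch that uses the standard java.net.URI class rather than jsoup. It is not jsoup's own implementation, only a stand-in that applies the same idea; UrlResolveDemo and resolve are hypothetical names.

```java
import java.net.URI;

// Hypothetical demo of relative-URL resolution using only the JDK.
class UrlResolveDemo {

    // Resolves href against baseUri; an already absolute href is returned unchanged.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Relative link on the site's home page.
        System.out.println(resolve("https://www.howcodex.com/", "index.htm"));
        // Relative link on a page inside a folder.
        System.out.println(resolve("https://www.howcodex.com/a/b.htm", "c.htm"));
        // Absolute link: the base URI is ignored.
        System.out.println(resolve("https://www.howcodex.com/", "http://example.com/x"));
    }
}
```

The first call yields https://www.howcodex.com/index.htm, matching the shape of the absolute link that the jsoup example reports for the relative href index.htm.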
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
        public static void main(String[] args) throws IOException {
            String url = "http://www.howcodex.com/";
            Document document = Jsoup.connect(url).get();

            Element link = document.select("a").first();
            System.out.println("Relative Link: " + link.attr("href"));
            System.out.println("Absolute Link: " + link.attr("abs:href"));
            System.out.println("Absolute Link: " + link.absUrl("href"));
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Relative Link: index.htm
    Absolute Link: https://www.howcodex.com/index.htm
    Absolute Link: https://www.howcodex.com/index.htm

The following example showcases the methods used to set attributes of a DOM element, perform bulk updates, and add or remove classes after parsing an HTML String into a Document object.

    Document document = Jsoup.parse(html);
    Element link = document.select("a").first();
    link.attr("href", "www.yahoo.com");
    link.addClass("header");
    link.removeClass("header");

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML String.
html − the HTML String.
link − the Element object that represents the anchor tag.
link.attr() − the attr(attribute, value) method sets the element attribute to the given value.
link.addClass() − the addClass(class) method adds the class to the element's class attribute.
link.removeClass() − the removeClass(class) method removes the class from the element's class attribute.
The Element object represents a DOM element and provides various methods to set its attributes and classes. The Elements object applies the same attr(attribute, value) call to every element it holds, which gives a bulk update.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<html><head><title>Sample Title</title></head>"
                + "<body>"
                + "<p>Sample Content</p>"
                + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
                + "<div class='comments'><a href='www.sample1.com'>Sample1</a>"
                + "<a href='www.sample2.com'>Sample2</a>"
                + "<a href='www.sample3.com'>Sample3</a>"
                + "</div>"
                + "<div id='imageDiv' class='header'><img name='google' src='google.png' />"
                + "<img name='yahoo' src='yahoo.jpg' />"
                + "</div>"
                + "</body></html>";
            Document document = Jsoup.parse(html);

            // Example: set attribute
            Element link = document.getElementById("googleA");
            System.out.println("Outer HTML Before Modification :" + link.outerHtml());
            link.attr("href", "www.yahoo.com");
            System.out.println("Outer HTML After Modification :" + link.outerHtml());
            System.out.println("---");

            // Example: add class to the link, then print its parent div
            Element div = document.getElementById("sampleDiv");
            System.out.println("Outer HTML Before Modification :" + div.outerHtml());
            link.addClass("header");
            System.out.println("Outer HTML After Modification :" + div.outerHtml());
            System.out.println("---");

            // Example: remove class
            Element div1 = document.getElementById("imageDiv");
            System.out.println("Outer HTML Before Modification :" + div1.outerHtml());
            div1.removeClass("header");
            System.out.println("Outer HTML After Modification :" + div1.outerHtml());
            System.out.println("---");

            // Example: bulk update of every link inside div.comments
            Elements links = document.select("div.comments a");
            System.out.println("Outer HTML Before Modification :" + links.outerHtml());
            links.attr("rel", "nofollow");
            System.out.println("Outer HTML After Modification :" + links.outerHtml());
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Outer HTML Before Modification :<a id="googleA" href="www.google.com">Google</a>
    Outer HTML After Modification :<a id="googleA" href="www.yahoo.com">Google</a>
    ---
    Outer HTML Before Modification :<div id="sampleDiv">
     <a id="googleA" href="www.yahoo.com">Google</a>
    </div>
    Outer HTML After Modification :<div id="sampleDiv">
     <a id="googleA" href="www.yahoo.com" class="header">Google</a>
    </div>
    ---
    Outer HTML Before Modification :<div id="imageDiv" class="header">
     <img name="google" src="google.png">
     <img name="yahoo" src="yahoo.jpg">
    </div>
    Outer HTML After Modification :<div id="imageDiv" class="">
     <img name="google" src="google.png">
     <img name="yahoo" src="yahoo.jpg">
    </div>
    ---
    Outer HTML Before Modification :<a href="www.sample1.com">Sample1</a>
    <a href="www.sample2.com">Sample2</a>
    <a href="www.sample3.com">Sample3</a>
    Outer HTML After Modification :<a href="www.sample1.com" rel="nofollow">Sample1</a>
    <a href="www.sample2.com" rel="nofollow">Sample2</a>
    <a href="www.sample3.com" rel="nofollow">Sample3</a>

The following example showcases the methods used to set, prepend, or append HTML to a DOM element after parsing an HTML String into a Document object.

    Document document = Jsoup.parse(html);
    Element div = document.getElementById("sampleDiv");
    div.html("<p>This is a sample content.</p>");
    div.prepend("<p>Initial Text</p>");
    div.append("<p>End Text</p>");

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML String.
html − the HTML String.
div − the Element object that represents the div element identified by id "sampleDiv".
div.html() − the html(content) method replaces the element's inner HTML with the given value.
div.prepend() − the prepend(content) method adds the content at the start of the element's inner HTML.
div.append() − the append(content) method adds the content at the end of the element's inner HTML.
The Element object represents a DOM element and provides various methods to set, prepend, or append HTML to a DOM element.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<html><head><title>Sample Title</title></head>"
                + "<body>"
                + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
                + "</body></html>";
            Document document = Jsoup.parse(html);

            Element div = document.getElementById("sampleDiv");
            System.out.println("Outer HTML Before Modification :\n" + div.outerHtml());

            // Replace the inner HTML of the div.
            div.html("<p>This is a sample content.</p>");
            System.out.println("Outer HTML After Modification :\n" + div.outerHtml());

            // Add a paragraph before the existing content.
            div.prepend("<p>Initial Text</p>");
            System.out.println("After Prepend :\n" + div.outerHtml());

            // Add a paragraph after the existing content.
            div.append("<p>End Text</p>");
            System.out.println("After Append :\n" + div.outerHtml());
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Outer HTML Before Modification :
    <div id="sampleDiv">
     <a id="googleA" href="www.google.com">Google</a>
    </div>
    Outer HTML After Modification :
    <div id="sampleDiv">
     <p>This is a sample content.</p>
    </div>
    After Prepend :
    <div id="sampleDiv">
     <p>Initial Text</p>
     <p>This is a sample content.</p>
    </div>
    After Append :
    <div id="sampleDiv">
     <p>Initial Text</p>
     <p>This is a sample content.</p>
     <p>End Text</p>
    </div>

The following example showcases the methods used to set, prepend, or append text to a DOM element after parsing an HTML String into a Document object.

    Document document = Jsoup.parse(html);
    Element div = document.getElementById("sampleDiv");
    div.text("This is a sample content.");
    div.prepend("Initial Text.");
    div.append("End Text.");

Where
document − the Document object that represents the HTML DOM.
Jsoup − the main class used to parse the given HTML String.
html − the HTML String.
div − the Element object that represents the div element identified by id "sampleDiv".
div.text() − the text(content) method replaces the element's content with the given text.
div.prepend() − the prepend(content) method adds the content at the start of the element's inner HTML; here the content is plain text.
div.append() − the append(content) method adds the content at the end of the element's inner HTML; here the content is plain text.
The Element object represents a DOM element and provides various methods to set, prepend, or append text to a DOM element.
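One detail worth spelling out is that text(content) treats its argument as literal text, whereas append(content) parses it as HTML. The sketch below contrasts the two, and also uses appendText(content), which adds literal text without parsing. TextVersusHtmlDemo and sampleDiv are illustrative names, and the input strings are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of jsoup itself.
class TextVersusHtmlDemo {

    // Builds a fresh, empty div so each experiment starts from the same state.
    static Element sampleDiv() {
        Document document = Jsoup.parse("<div id='sampleDiv'></div>");
        return document.getElementById("sampleDiv");
    }

    public static void main(String[] args) {
        // text(...) stores the markup as literal characters, escaped on output.
        Element asText = sampleDiv();
        asText.text("<b>bold</b>");
        System.out.println("text():       " + asText.html());

        // append(...) parses the String, so a real <b> element is created.
        Element asHtml = sampleDiv();
        asHtml.append("<b>bold</b>");
        System.out.println("append():     " + asHtml.html());

        // appendText(...) adds literal text, like text(...) but without replacing.
        Element asAppendedText = sampleDiv();
        asAppendedText.appendText("<b>bold</b>");
        System.out.println("appendText(): " + asAppendedText.html());
    }
}
```

When the content comes from a user, text(content) or appendText(content) is usually the safer choice, since any markup in it is kept as inert text.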
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<html><head><title>Sample Title</title></head>"
                + "<body>"
                + "<div id='sampleDiv'><a id='googleA' href='www.google.com'>Google</a></div>"
                + "</body></html>";
            Document document = Jsoup.parse(html);

            Element div = document.getElementById("sampleDiv");
            System.out.println("Outer HTML Before Modification :\n" + div.outerHtml());

            // Replace the content of the div with plain text.
            div.text("This is a sample content.");
            System.out.println("Outer HTML After Modification :\n" + div.outerHtml());

            // Add text before the existing content.
            div.prepend("Initial Text.");
            System.out.println("After Prepend :\n" + div.outerHtml());

            // Add text after the existing content.
            div.append("End Text.");
            System.out.println("After Append :\n" + div.outerHtml());
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Outer HTML Before Modification :
    <div id="sampleDiv">
     <a id="googleA" href="www.google.com">Google</a>
    </div>
    Outer HTML After Modification :
    <div id="sampleDiv">
     This is a sample content.
    </div>
    After Prepend :
    <div id="sampleDiv">
     Initial Text.This is a sample content.
    </div>
    After Append :
    <div id="sampleDiv">
     Initial Text.This is a sample content.End Text.
    </div>

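The following example showcases the prevention of cross-site scripting (XSS) attacks.

    String safeHtml = Jsoup.clean(html, Whitelist.basic());

Where
Jsoup − the main class used to parse the given HTML String.
html − the initial HTML String.
safeHtml − the cleaned HTML.
Whitelist − the object that provides default configurations for sanitizing HTML.
clean() − cleans the HTML using the given Whitelist.
The Jsoup class sanitizes HTML according to the given Whitelist configuration. Whitelist.basic() permits a set of simple text-formatting tags and links, removes everything else (such as the onclick attribute), and enforces rel="nofollow" on links. Newer jsoup releases rename this class to Safelist; the examples here use the jsoup-1.8.3 API.
Create the following Java program using any editor of your choice in, say, C:\jsoup.
JsoupTester.java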

    import org.jsoup.Jsoup;
    import org.jsoup.safety.Whitelist;

    public class JsoupTester {
        public static void main(String[] args) {
            String html = "<p><a href='http://example.com/'"
                + " onclick='checkData()'>Link</a></p>";
            System.out.println("Initial HTML: " + html);

            String safeHtml = Jsoup.clean(html, Whitelist.basic());
            System.out.println("Cleaned HTML: " + safeHtml);
        }
    }

Compile the class using javac compiler as follows:
C:\jsoup>javac JsoupTester.java
Now run the JsoupTester to see the result.
C:\jsoup>java JsoupTester
See the result.

    Initial HTML: <p><a href='http://example.com/' onclick='checkData()'>Link</a></p>
    Cleaned HTML: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
