Jsoup Tutorial with Examples

This Jsoup tutorial with examples will help you understand how to use Jsoup in an easy way. In it, I will show you through examples how Jsoup makes web scraping easier than ever. Jsoup is an open-source Java library for parsing HTML content and web scraping, distributed under the MIT license. That means you are free to download, use, and distribute it.

Why should you use Jsoup instead of regular expressions for web scraping?

Real-world HTML content is often not well-formed. For example, some programmers write <br> while others prefer <br /> for line breaks in HTML pages. In this situation, parsing the HTML with regular expressions either does not yield the desired results or becomes too complicated. Plus, writing patterns for all such combinations is very error-prone and resource-intensive.

All these problems can be easily avoided by using an HTML parser like Jsoup instead of trying to parse the content using regular expressions.

Below are some of the main capabilities of the Jsoup parser.

  • Jsoup can parse HTML directly from a URL, from a file, or even from a String variable.
  • Jsoup allows HTML element structure manipulation like adding, changing or removing elements. It also allows adding and removing attributes easily.
  • Finding data in elements or attributes is very easy using Jsoup.
  • Jsoup supports basic authentication using a user name and password.
  • If you are behind a proxy, no problem! Jsoup works with proxies as well.
  • Jsoup supports cleaning the HTML. You can specify what tags you want to retain in the parsed HTML using a whitelist (named Safelist in newer Jsoup versions).
  • Jsoup can output tidy HTML from the parsed HTML.

These are some of the main features of Jsoup. It provides many other features that are very useful in real-world scenarios. Plus, selecting an element from Jsoup-parsed HTML is very easy because it supports jQuery-style selectors. For example, to select all td elements from all the table rows of an HTML document, you can write a selector like document.select("table tr td"), which returns all the matching td elements.

How to download and use Jsoup in your project?

You can download the binary distribution (Jsoup jar file) directly from the download section of the Jsoup website. Once you download the library, put it in your build path to start using it. If you use Maven in your project, add the following Jsoup Maven dependency.

Jsoup Maven:

Jsoup Gradle:

Jsoup does not have any other dependencies.

How to parse HTML using Jsoup?

Jsoup is capable of scraping and parsing HTML content from a file, a URL, or string. I will show you each one.

How to parse HTML from a URL using Jsoup?

Use the connect method of the Jsoup class to connect to a URL, and the get method to fetch and parse the HTML from that URL.
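
The two calls can be sketched as follows. This is only a minimal illustration: the class name FetchPage and the helper connectTo are my own, and http://www.example.com is a placeholder URL.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchPage {
    // Builds the connection; nothing is sent until get() is called.
    static Connection connectTo(String url) {
        return Jsoup.connect(url);
    }

    public static void main(String[] args) throws IOException {
        // get() sends an HTTP GET request and parses the response body.
        Document doc = connectTo("http://www.example.com").get();
        System.out.println(doc.title());
        System.out.println(doc.outerHtml());
    }
}
```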

Output

I have truncated the above given output.

How to parse HTML from a file (local file)?

If you have a local file containing the HTML and you want to parse it, you can use the parse method of the Jsoup class.

The parse method above uses the location of the file to resolve any relative URLs in the HTML file. For example, suppose you have downloaded the HTML content of the domain http://www.example.com and it references an image using a relative URL like ‘/favicon.ico’. If you want to download this image while parsing the HTML, you need the absolute URL of the image. If you have already downloaded that image and placed it at the same location as the HTML file, the default behavior is fine.

However, if you downloaded only the HTML file and need to fetch all the other resources from the domain, use the overloaded parse method that takes a baseURI parameter.

Now all the relative URLs found in the HTML document will be considered relative to the mentioned baseURI.
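
Here is a small sketch of both file overloads. The temporary file and its one-line HTML are stand-ins I made up so the example runs anywhere, and the class and method names are illustrative.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseLocalFile {
    // Parses the file and resolves the link's href against the given base URI.
    static String faviconUrl(File htmlFile, String baseUri) throws IOException {
        Document doc = Jsoup.parse(htmlFile, "UTF-8", baseUri);
        return doc.select("link").first().absUrl("href");
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny stand-in page to a temporary file.
        File file = File.createTempFile("page", ".html");
        file.deleteOnExit();
        String html = "<html><head><link rel=\"icon\" href=\"/favicon.ico\"></head><body></body></html>";
        Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));

        // Two-argument form: the file location is used as the base URI.
        Document local = Jsoup.parse(file, "UTF-8");
        System.out.println(local.select("link").attr("href")); // /favicon.ico

        // Three-argument form: relative URLs resolve against the site.
        System.out.println(faviconUrl(file, "http://www.example.com/")); // http://www.example.com/favicon.ico
    }
}
```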

How to parse HTML from a String?

If you want to parse HTML from a Java String, use the parse method having a String argument.
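
A minimal sketch, using a short HTML string of my own as sample input:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseString {
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello, <b>Jsoup</b>!</p></body></html>";
        Document doc = parse(html);
        System.out.println(doc.title());       // Sample page
        System.out.println(doc.body().text()); // Hello, Jsoup!
    }
}
```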

Output

Again, as given above, you can use the overloaded parse method having string content and baseURI parameters to resolve any relative URLs given in the string HTML.

Understanding the Jsoup Connection, Request, and Response

The Connection interface of the Jsoup package provides methods for connecting and fetching URLs, executing GET and POST requests, and getting the Request and Response objects. All the configuration related to HTTP requests needs to be configured using the Connection.

Below given are some of the basic HTTP configurations you can do with the Jsoup Connection.

How to follow server redirects?

If you request a webpage that has been moved to another location, the server sends an HTTP 301 or 302 redirect response specifying the new location of the webpage. The Jsoup connection follows server redirects by default and fetches the requested document from the new URL. If you want to turn this off, use the followRedirects method and pass false.

Example:
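
A minimal sketch of turning redirects off; the class name and URL are placeholders.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class NoRedirects {
    static Connection build(String url) {
        return Jsoup.connect(url).followRedirects(false);
    }

    public static void main(String[] args) throws IOException {
        Connection.Response res = build("http://www.example.com").execute();
        // With redirects off, a 301/302 response is returned as-is and the
        // Location header (if any) holds the new address.
        System.out.println(res.statusCode() + " " + res.header("Location"));
    }
}
```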

How to send request headers?

Use the header method to set the request header.

Example:

The above example sends header1 and header2 headers while requesting the URL. If you have multiple headers stored in a Map object, you can use the headers method to specify all the headers at once instead of invoking the header method multiple times as given below.
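
Both variants might look like the sketch below. The header values and the class name are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class RequestHeaders {
    // One header() call per header.
    static Connection withSingleHeaders(String url) {
        return Jsoup.connect(url)
                .header("header1", "value1")
                .header("header2", "value2");
    }

    // All headers at once from a Map.
    static Connection withHeaderMap(String url) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("header1", "value1");
        headers.put("header2", "value2");
        return Jsoup.connect(url).headers(headers);
    }

    public static void main(String[] args) {
        Connection conn = withHeaderMap("http://www.example.com");
        // Nothing is sent yet; call get() or execute() to perform the request.
        System.out.println(conn.request().header("header1"));
        System.out.println(conn.request().header("header2"));
    }
}
```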

How to ignore the document’s Content-type?

By default, Jsoup checks the content type of the response and throws an IOException (UnsupportedMimeTypeException) if it is not a type that Jsoup can parse as HTML or XML. If you want to parse the response regardless of the document’s content type, use the ignoreContentType method and pass true (default is false).

Example:
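
A minimal sketch; the URL http://www.example.com/data.json is a hypothetical address used only to suggest a non-HTML resource.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class IgnoreContentType {
    static Connection build(String url) {
        return Jsoup.connect(url).ignoreContentType(true);
    }

    public static void main(String[] args) {
        Connection conn = build("http://www.example.com/data.json");
        // Nothing is sent yet; conn.execute().body() would return the raw
        // response text whatever its content type is.
        System.out.println(conn.request().ignoreContentType()); // true
    }
}
```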

How to ignore HTTP error codes while making a connection?

Jsoup throws an IOException if the request results in an HTTP error like “404 – Not found”, “5xx – Internal server error”, or any other HTTP error. If you want to ignore these HTTP errors, use the ignoreHttpErrors method and pass true.

Example:

This will cause the response to be populated with the error body, and connection status will reflect the error if the connection results in any of the errors mentioned above.
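
A minimal sketch of reading the status code instead of catching an exception. The path /no-such-page is hypothetical and only meant to provoke an error response.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class IgnoreErrors {
    static Connection build(String url) {
        return Jsoup.connect(url).ignoreHttpErrors(true);
    }

    public static void main(String[] args) throws IOException {
        Connection.Response res = build("http://www.example.com/no-such-page").execute();
        // No exception for 4xx/5xx responses; inspect the status yourself.
        System.out.println(res.statusCode() + " " + res.statusMessage());
    }
}
```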

How to set the proxy for Jsoup connection?

If you connect to the internet through a proxy server, the Jsoup connection needs to be configured to use that proxy too. There are several ways to configure the proxy for Jsoup, but the simplest one is to use the built-in proxy method as given below.

This method sets the specified host and port as a proxy for the current request.

Example:
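
A minimal sketch, assuming a proxy listening on 127.0.0.1 port 8080. Both values are placeholders; replace them with your own proxy host and port.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class ProxyConnection {
    static Connection build(String url, String proxyHost, int proxyPort) {
        return Jsoup.connect(url).proxy(proxyHost, proxyPort);
    }

    public static void main(String[] args) {
        Connection conn = build("http://www.example.com", "127.0.0.1", 8080);
        // Nothing is sent yet; get() or execute() would go through the proxy.
        System.out.println(conn.request().proxy());
    }
}
```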

Visit the full how to set Jsoup proxy example to know about different options.

How to set the request referrer (referer) header?

Many web servers check the request referrer before serving the content. If the referer header is missing, they may send an error instead of the requested HTML document. In this case, you can send the referer header along with the request using the referrer method.

This method sets the referer header with the given string value.
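
A minimal sketch using the two URLs from this section; the class name is my own.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class ReferrerHeader {
    static Connection build(String url, String referrer) {
        return Jsoup.connect(url).referrer(referrer);
    }

    public static void main(String[] args) {
        Connection conn = build("http://www.example.com/page1", "http://www.example.com");
        // referrer() stores the value in the "Referer" request header.
        System.out.println(conn.request().header("Referer")); // http://www.example.com
    }
}
```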

The above example sets the HTTP referer header as “http://www.example.com” while requesting the “http://www.example.com/page1” HTML page. Refer to the full example of how to set Jsoup referer to know more.

How to set the user-agent header?

Just like the referrer, many web servers send back a 403 Forbidden error or a 5xx internal server error if the HTTP request does not contain a valid user agent. This also happens if the user-agent header is empty, if it matches a known spam bot, or if the server detects a machine-generated request.

You can set the user-agent header for the request using the userAgent method as given below.
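
A minimal sketch; the user-agent string below is an invented example, not a recommended value.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class UserAgentHeader {
    // Hypothetical user-agent string used only for illustration.
    static final String AGENT = "Mozilla/5.0 (compatible; MyScraper/1.0)";

    static Connection build(String url) {
        return Jsoup.connect(url).userAgent(AGENT);
    }

    public static void main(String[] args) {
        Connection conn = build("http://www.example.com");
        // userAgent() stores the value in the "User-Agent" request header.
        System.out.println(conn.request().header("User-Agent"));
    }
}
```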

Refer to the full example of how to set user agent using Jsoup to know more.

Tip: Always set the HTTP referrer and user-agent headers when web scraping to avoid forbidden and internal server error responses. Plus, always make sure to wait for at least a couple of seconds before making consecutive requests. Please refer to the example on how to fix 403 – Forbidden error while using Jsoup.

How to set the request timeout?

The default request timeout for Jsoup is 30 seconds. It means Jsoup will wait up to 30 seconds for the response before throwing a SocketTimeoutException. If you want to specify a custom duration, use the timeout method.
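
A minimal sketch that sets a 10 second timeout; the duration and class name are arbitrary choices for the example.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class RequestTimeout {
    static Connection build(String url, int seconds) {
        // timeout() expects milliseconds.
        return Jsoup.connect(url).timeout(seconds * 1000);
    }

    public static void main(String[] args) {
        Connection conn = build("http://www.example.com", 10);
        System.out.println(conn.request().timeout()); // 10000
    }
}
```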

Note that the timeout is in milliseconds. Please refer to the example on how to fix ConnectionError: UnsupportedMimeTypeException and how to fix SocketTimeOutException while using the Jsoup to know more.

How to get a Response object from the connection?

The Response object gives access to information about the response received from the Jsoup connection, such as the response body, cookies, etc. The execute method of the Connection executes the request and returns a Response as given below.

How to get cookies from the response?

Web servers often send cookies back to the browser in response to the HTTP requests, for example, login cookie or a cookie containing the last visited page. You can get these cookies using the cookies method of the Response class as given below.
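
The sketch below executes a request and prints the cookies from the response. The format helper and the class name are my own additions, and http://www.example.com is a placeholder that may not set any cookies.

```java
import java.io.IOException;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class ResponseCookies {
    // Formats cookies as one "name=value" line per cookie.
    static String format(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> c : cookies.entrySet()) {
            sb.append(c.getKey()).append('=').append(c.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // execute() sends the request and returns the raw response.
        Connection.Response res = Jsoup.connect("http://www.example.com").execute();
        System.out.println(res.statusCode() + " " + res.statusMessage());
        System.out.print(format(res.cookies()));
    }
}
```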

Output

How to send cookies in a request?

If you want to send the cookie along with the HTTP request, use the cookie method of the Connection.

Jsoup request cookie example:

The above example will send cookie “mycookie” along with the request. If you have multiple cookies, you can store them in a Map object and send it in the HTTP request using the cookies method as given below.
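
Both variants are sketched here. The cookie values and the second cookie name are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class RequestCookies {
    // A single cookie.
    static Connection withOneCookie(String url) {
        return Jsoup.connect(url).cookie("mycookie", "value");
    }

    // Several cookies at once from a Map.
    static Connection withCookieMap(String url) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("mycookie", "value");
        cookies.put("othercookie", "othervalue");
        return Jsoup.connect(url).cookies(cookies);
    }

    public static void main(String[] args) {
        Connection conn = withCookieMap("http://www.example.com");
        // Nothing is sent yet; the cookies go out with get() or execute().
        System.out.println(conn.request().cookies());
    }
}
```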

How to set the request method (GET or POST)?

If you do not set an HTTP method for the Jsoup connection, the request uses GET by default. If you want to set the HTTP method explicitly, use the method method of the Connection.

Connection.Method is an enum that defines one constant for each supported HTTP method: GET, POST, PUT, DELETE, PATCH, HEAD, OPTIONS, and TRACE.

The below given example will send a HTTP POST request to the given URL.
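
A minimal sketch of selecting POST explicitly; the class name and URL are placeholders.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class PostRequest {
    static Connection build(String url) {
        return Jsoup.connect(url).method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        Connection conn = build("http://www.example.com");
        // Nothing is sent yet; execute() would send the request as POST.
        System.out.println(conn.request().method()); // POST
    }
}
```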

How to send GET or POST request parameters?

One of the most common things you need to do while scraping websites is to pass request parameters. If the HTTP request method is GET, the parameters are appended to the URL like “http://www.example.com?param1=val1&param2=val2”. Here the question mark (?) separates the URL from the GET parameters, and the individual parameters are separated by an ampersand sign (&). The whole string “param1=val1&param2=val2” containing the parameters and their values is called the query string, which is visible in the browser’s URL bar.

If the request method is POST, the parameters are sent in the request body and are not visible in the URL bar of the browser.

Jsoup supports sending the URL parameters regardless of the method being used. Use the data method of the Connection to send the parameter name-value pairs.

This method adds a request parameter to the current HTTP request.

You can also use a Map object containing all parameter name and values with overloaded data method to send all parameters at once as given below.
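
Both forms of the data method are sketched below, reusing the param1 and param2 names from the query string example; the class name is my own.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class RequestParameters {
    // One data() call per name-value pair.
    static Connection withPairs(String url) {
        return Jsoup.connect(url)
                .data("param1", "val1")
                .data("param2", "val2");
    }

    // All parameters at once from a Map.
    static Connection withMap(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("param1", "val1");
        params.put("param2", "val2");
        return Jsoup.connect(url).data(params);
    }

    public static void main(String[] args) {
        Connection conn = withMap("http://www.example.com");
        // Nothing is sent yet; get() or post() would send these parameters.
        for (Connection.KeyVal kv : conn.request().data()) {
            System.out.println(kv.key() + "=" + kv.value());
        }
    }
}
```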

Please refer to the full example of how to post form data using Jsoup example to know more.

Putting it all together

Most of the Connection methods mentioned above return the Connection object, so we can chain them together in a single call. This is more or less what your connection code should look like, depending on your requirements.
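
A chained call could be sketched like this. Every value in it (user agent, header, cookie, parameter, timeout) is an invented placeholder, so keep only the calls you actually need.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ChainedConnection {
    static Connection build(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; MyScraper/1.0)")
                .referrer("http://www.example.com")
                .header("Accept-Language", "en-US")
                .cookie("mycookie", "value")
                .data("param1", "val1")
                .timeout(10 * 1000)
                .followRedirects(true)
                .ignoreHttpErrors(true)
                .method(Connection.Method.GET);
    }

    public static void main(String[] args) throws IOException {
        Connection.Response res = build("http://www.example.com/page1").execute();
        if (res.statusCode() == 200) {
            // parse() turns the response body into a Document.
            Document doc = res.parse();
            System.out.println(doc.title());
        } else {
            System.out.println("Request failed: " + res.statusCode());
        }
    }
}
```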

Special cases while connecting to a URL:

  1. If the webpage you want to scrape needs basic authentication using a username and password, please refer to how to do basic authentication using Jsoup example.
  2. If the website you want to scrape needs login, please refer to how to login to a website using Jsoup example.

Understanding the Attribute, Node, Element, and Document classes

Now that we have seen how to connect to a URL and get a response using the Jsoup, in this part of the Jsoup tutorial I will show you how to parse the response and extract data from the HTML.

There are 4 main Jsoup classes we need to understand for scraping a webpage and extracting data from it: Attribute, Node, Element, and Document. Here is their class hierarchy.

[Figure: Jsoup class hierarchy. Document extends Element, which extends Node; Attribute represents a single key-value pair of an element.]

Once you get the Document object from the response, Jsoup provides DOM-like methods, for example, getElementById or getElementsByTag, to extract the data from the HTML. Jsoup also supports simple yet more powerful jQuery- or CSS-like selectors for the same purpose. I will show how to use both of them.

How to navigate the HTML document and find elements using Jsoup?

I will be using the below given example HTML code to extract the data for the rest of the tutorial. I have saved this file at the E:/example-html-file.html location.

I am loading the local HTML file using the code given below and will be using the same Document object for extracting the data from it.

How to select HTML tags by id?
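
A minimal sketch of both styles. The inline HTML is a small stand-in of my own (not the example file used in this tutorial) so that the snippet runs by itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectById {
    static final String HTML = "<div id=\"content\"><p id=\"intro\">Welcome</p></div>";

    static String textOf(String id) {
        Document doc = Jsoup.parse(HTML);
        // getElementById returns null when no element has this id.
        Element el = doc.getElementById(id);
        return el == null ? null : el.text();
    }

    public static void main(String[] args) {
        System.out.println(textOf("intro")); // Welcome
        // The same lookup with a CSS selector.
        System.out.println(Jsoup.parse(HTML).select("#intro").text()); // Welcome
    }
}
```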

Output

How to select HTML tags by name?
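
A minimal sketch using getElementsByTag on a small inline HTML string of my own:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectByTag {
    static final String HTML = "<a href=\"/one\">One</a><a href=\"/two\">Two</a><p>No link</p>";

    static Elements links() {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsByTag("a");
    }

    public static void main(String[] args) {
        for (Element link : links()) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```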

Output

How to select HTML tags by CSS class name?

Output

You can also specify multiple class names while extracting the data using the Jsoup as given below.
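
A minimal sketch covering a single class name and a combination of two. The class names menu and active and the inline HTML are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectByClass {
    static final String HTML = "<ul><li class=\"menu active\">Home</li>"
            + "<li class=\"menu\">About</li><li>Contact</li></ul>";

    static Document sample() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = sample();
        // All elements carrying the class "menu".
        System.out.println(doc.getElementsByClass("menu").size()); // 2
        // Only elements carrying both "menu" and "active".
        System.out.println(doc.select(".menu.active").text());     // Home
    }
}
```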

Output

You can also fetch all the HTML tags having a specified class name or specified id.

Output

How to select HTML elements by attributes?

Select elements having specified attribute:

Output

You can also select a specified element having a specified attribute as given below.

Output

Select elements having an attribute name starting with a text:

Output

Select elements having specified attribute with given value:

Output

Select elements having a specified attribute with a matching value:
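
The attribute lookups listed above can be sketched together as follows. The inline HTML and the counts helper are my own; each DOM-style call also has a selector equivalent, noted in the comments.

```java
import java.util.Arrays;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectByAttribute {
    static final String HTML = "<a href=\"/home\" title=\"Home\">Home</a>"
            + "<a href=\"http://www.example.com/about\">About</a>"
            + "<img src=\"logo.png\" data-id=\"7\">"
            + "<p title=\"note\">A note</p>";

    // Returns the number of matches for each kind of attribute lookup.
    static int[] counts() {
        Document doc = Jsoup.parse(HTML);
        return new int[] {
            // Any element with a title attribute; selector: [title]
            doc.getElementsByAttribute("title").size(),
            // Only a elements with a title attribute
            doc.select("a[title]").size(),
            // Attribute name starting with "data-"; selector: [^data-]
            doc.getElementsByAttributeStarting("data-").size(),
            // Attribute with an exact value; selector: [title=note]
            doc.getElementsByAttributeValue("title", "note").size(),
            // Attribute value matching a regex; selector: [href~=^https?://]
            doc.getElementsByAttributeValueMatching("href", "^https?://").size()
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(counts())); // [2, 1, 1, 1, 1]
    }
}
```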

Output

How to get children of the HTML elements using the Jsoup?

Output

The below given code will select all img child elements of the body element.

Output

If you want to select only direct child elements, use the following syntax.
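
The three cases (all children, all nested img elements, direct img children only) can be sketched like this, using a small inline HTML string of my own:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectChildren {
    static final String HTML = "<img src=\"a.png\"><div><img src=\"b.png\"></div><p>Some text</p>";

    static Document sample() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = sample();
        // Direct child elements of body: img, div, and p.
        System.out.println(doc.body().children().size());  // 3
        // img elements anywhere inside body.
        System.out.println(doc.select("body img").size());   // 2
        // img elements that are direct children of body.
        System.out.println(doc.select("body > img").size()); // 1
    }
}
```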

Output

How to get siblings of HTML elements using Jsoup?
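
A minimal sketch of the sibling methods on a three-item list; the HTML and the neighbours helper are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class Siblings {
    static final String HTML = "<ul><li>One</li><li id=\"two\">Two</li><li>Three</li></ul>";

    // Returns the text of the previous and next sibling, separated by "|".
    static String neighbours(String id) {
        Element el = Jsoup.parse(HTML).getElementById(id);
        return el.previousElementSibling().text() + "|" + el.nextElementSibling().text();
    }

    public static void main(String[] args) {
        Element two = Jsoup.parse(HTML).getElementById("two");
        // All siblings of the element, excluding the element itself.
        System.out.println(two.siblingElements().size()); // 2
        System.out.println(neighbours("two"));            // One|Three
    }
}
```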

Output

How to select a parent element of an element?
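
A minimal sketch using the parent method; the inline HTML is a stand-in of my own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ParentElement {
    static final String HTML = "<div id=\"box\"><ul><li id=\"two\">Two</li></ul></div>";

    static Element parentOf(String id) {
        return Jsoup.parse(HTML).getElementById(id).parent();
    }

    public static void main(String[] args) {
        Element parent = parentOf("two");
        System.out.println(parent.tagName());     // ul
        // parent() can be chained to walk further up the tree.
        System.out.println(parent.parent().id()); // box
    }
}
```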

Output

Advanced Pseudo Selectors

The Jsoup selector syntax offers advanced pseudo selectors to find elements. Finding these elements with the DOM-style methods is either not possible or not easy.

How to select elements with sibling index less than, greater than or equal to the given index?
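
A minimal sketch of the :lt, :gt, and :eq pseudo selectors. Sibling indexes start at 0; the list below is sample HTML of my own.

```java
import org.jsoup.Jsoup;

class IndexSelectors {
    static final String HTML = "<ul><li>One</li><li>Two</li><li>Three</li></ul>";

    // Returns the combined text of all elements matching the selector.
    static String textOf(String css) {
        return Jsoup.parse(HTML).select(css).text();
    }

    public static void main(String[] args) {
        System.out.println(textOf("li:lt(1)")); // One   (index less than 1)
        System.out.println(textOf("li:gt(1)")); // Three (index greater than 1)
        System.out.println(textOf("li:eq(1)")); // Two   (index equal to 1)
    }
}
```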

Output

How to find elements containing specified other elements?

The below example shows how to find elements containing a specific element, for example, all link elements containing images.
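
A minimal sketch of the :has pseudo selector on two sample links of my own, only one of which wraps an image:

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

class HasSelector {
    static final String HTML = "<a href=\"/1\"><img src=\"x.png\"></a><a href=\"/2\">Text link</a>";

    static Elements linksWithImages() {
        return Jsoup.parse(HTML).select("a:has(img)");
    }

    public static void main(String[] args) {
        Elements links = linksWithImages();
        System.out.println(links.size());       // 1
        System.out.println(links.attr("href")); // /1
    }
}
```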

Output

How to find elements not matching the specified selector?
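
A minimal sketch of the :not pseudo selector; the class name note and the HTML are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

class NotSelector {
    static final String HTML = "<p class=\"note\">Skip me</p><p>Keep me</p>";

    // Selects p elements that do not carry the "note" class.
    static Elements plainParagraphs() {
        return Jsoup.parse(HTML).select("p:not(.note)");
    }

    public static void main(String[] args) {
        System.out.println(plainParagraphs().text()); // Keep me
    }
}
```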

Output

How to find elements containing specified text?

Output

The :contains selector above also returns an element when the matching text is in any of its child elements. If you want to search only the element’s own text, excluding the child element text, use the :containsOwn selector instead of the :contains selector.
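
The difference between the two can be sketched on a div that wraps a paragraph. The HTML and the count helper are my own.

```java
import org.jsoup.Jsoup;

class ContainsSelectors {
    static final String HTML = "<div><p>Jsoup tutorial</p></div><div>Plain text</div>";

    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    public static void main(String[] args) {
        // The first div matches because its child p holds the text.
        System.out.println(count("div:contains(tutorial)"));    // 1
        // No div holds the text directly.
        System.out.println(count("div:containsOwn(tutorial)")); // 0
        // The p element holds the text itself.
        System.out.println(count("p:containsOwn(tutorial)"));   // 1
    }
}
```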

How to find elements containing text matching with regex?

Output

Use the :matchesOwn to match the text of the given element only, excluding the text of the child elements.
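
A minimal sketch of :matches and :matchesOwn with a regular expression for digits; the HTML and the count helper are invented for the example.

```java
import org.jsoup.Jsoup;

class MatchesSelectors {
    static final String HTML = "<div><p>Order 12345</p><p>No digits here</p></div>";

    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    public static void main(String[] args) {
        // Only the first p contains digits.
        System.out.println(count("p:matches([0-9]+)"));      // 1
        // The div matches through the text of its child.
        System.out.println(count("div:matches([0-9]+)"));    // 1
        // The div has no digits in its own text.
        System.out.println(count("div:matchesOwn([0-9]+)")); // 0
    }
}
```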

There are many more interesting selectors which I am skipping to keep the length of this tutorial reasonable. You can refer to them at Jsoup selector syntax page.

How to extract data from HTML using the Jsoup?

Once you have found the elements you want to extract the data from, it’s a fairly easy task to extract the data.

How to get the id of an element?

Use the id method to get the id attribute of the HTML element.

Output

How to get the tag name of an element?

Use the tagName method to get the tag name of the element.
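
A minimal sketch of the id and tagName methods on a sample element of my own:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class IdAndTagName {
    static final String HTML = "<div id=\"main\"><p>Hi</p></div>";

    static Element firstDiv() {
        return Jsoup.parse(HTML).select("div").first();
    }

    public static void main(String[] args) {
        Element div = firstDiv();
        System.out.println(div.id());      // main
        System.out.println(div.tagName()); // div
    }
}
```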

Output

How to get CSS class names of an element?

Use the className method to get the value of the class attribute of the element. If the element has multiple classes, they are returned in space separated format.

Output

As you can see from the output, in the case of multiple classes, the class names are returned in the same string separated by space. If you want the individual class names, use the classNames method as given below.

Output

The classNames method returns a Set of String elements containing individual class names. If the element contains duplicate class names in the class attribute, they will be removed (because the Set does not allow duplicate elements).
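
The sketch below shows both methods on an element whose class attribute repeats a name; the class names box and wide are invented for the example.

```java
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ClassNames {
    static final String HTML = "<div id=\"main\" class=\"box wide box\">Content</div>";

    static Element sample() {
        return Jsoup.parse(HTML).getElementById("main");
    }

    public static void main(String[] args) {
        Element div = sample();
        // The class attribute as written in the HTML.
        System.out.println(div.className()); // box wide box
        // Individual class names; the duplicate "box" appears only once.
        Set<String> names = div.classNames();
        System.out.println(names.size());    // 2
    }
}
```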

How to get the text of an element?

Use the text method to get the element text.

Output

How to get the inner HTML of an element?

Use the html method to get the element’s inner HTML code.

Output

How to get the outer HTML of an element?

Use the outerHtml method to get the element’s outer HTML code.

Output

Similarly, you can use the toString method to get the outer HTML of the element(s).
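
The text, html, and outerHtml methods can be compared side by side on one sample paragraph of my own:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class TextAndHtml {
    static final String HTML = "<p id=\"msg\">Hello <b>world</b></p>";

    static Element sample() {
        return Jsoup.parse(HTML).getElementById("msg");
    }

    public static void main(String[] args) {
        Element p = sample();
        // Text only, with all tags removed.
        System.out.println(p.text()); // Hello world
        // Markup inside the element, without the p tag itself.
        System.out.println(p.html());
        // Markup including the p tag itself.
        System.out.println(p.outerHtml());
    }
}
```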

How to get the attribute value of a specific attribute of any element?

Use the attr method to get the value of the specified attribute of the given element.

Output

How to get all attributes of an element?

Use the attributes method to get all the attributes of an element.
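
A minimal sketch of reading one attribute with attr and listing all of them with attributes; the link below is sample HTML of my own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

class ElementAttributes {
    static final String HTML = "<a href=\"/home\" title=\"Home page\">Home</a>";

    static Element link() {
        return Jsoup.parse(HTML).select("a").first();
    }

    public static void main(String[] args) {
        Element a = link();
        System.out.println(a.attr("href")); // /home
        // attr returns an empty string when the attribute is absent.
        System.out.println(a.attr("missing").isEmpty()); // true
        // Iterate over every attribute of the element.
        for (Attribute attribute : a.attributes()) {
            System.out.println(attribute.getKey() + " = " + attribute.getValue());
        }
    }
}
```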

Output

Apart from these methods to extract the data from HTML elements, Jsoup also provides methods to manipulate or change the DOM, but those methods are beyond the scope of this tutorial. You can learn about them on the Jsoup site.

Below are some additional Jsoup examples which cover the individual topics in more detail.

Jsoup Examples

Please let me know if you liked the Jsoup tutorial with examples in the comments section below.

About the author

RahimV

My name is RahimV and I have over 16 years of experience in designing and developing Java applications. Over the years I have worked with many fortune 500 companies as an eCommerce Architect. My goal is to provide high quality but simple to understand Java tutorials and examples for free. If you like my website, follow me on Facebook and Twitter.
