Jsoup

Jsoup is a Java library for parsing HTML content or pages. It is open source and distributed under the MIT license, which means you are free to download, use, and distribute it.

Real-world HTML is often not well formed. For example, some programmers write <br> while others prefer <br /> for line breaks. In this situation, parsing the HTML with regular expressions or other ad hoc means may not yield the desired results. Writing patterns that cover every such variation is also error prone and resource intensive.
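To make the problem concrete, here is a minimal sketch that uses only the standard java.util.regex package. The class name, the two patterns, and the sample markup are made up for illustration. The sample contains four line breaks, each written in a different way.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineBreakRegexDemo {
    // Naive pattern: matches only the exact text "<br>".
    static final Pattern NAIVE = Pattern.compile("<br>");

    // Broader pattern: optional whitespace and slash, any letter case.
    static final Pattern BROADER =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);

    // Counts how many times the pattern matches in the given HTML text.
    static int count(Pattern pattern, String html) {
        Matcher matcher = pattern.matcher(html);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        // Four line breaks, each written in a different style.
        String html = "one<br>two<br />three<BR/>four<br class=\"x\">five";
        System.out.println(count(NAIVE, html));   // prints 1
        System.out.println(count(BROADER, html)); // prints 3
    }
}
```

Even the broader pattern misses the tag that carries an attribute, and each new variation needs yet another change to the expression.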

These problems can be avoided by using an HTML parser instead of regular expressions. Jsoup is one such parser.

Below are some of the main capabilities of the Jsoup parser.

1) Jsoup can parse HTML directly from a URL, from a file, or from a String.
2) Jsoup lets you manipulate the document structure by adding, changing, or removing elements. You can also add or remove attributes easily.
3) Finding data in elements or attributes is very easy using Jsoup.
4) Jsoup supports basic authentication using a user name and password.
5) If you are behind a proxy, no problem! Jsoup works through a proxy as well.
6) Jsoup supports cleaning HTML. You can specify which tags to retain in the parsed HTML using a whitelist (named Safelist in recent Jsoup versions).
7) Jsoup can output tidy HTML from the parsed HTML.
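
A few of the capabilities listed above can be sketched as follows. This is only an illustration, not a complete tour. The class name, method names, and the "span.ad" selector are hypothetical. The sketch assumes the Jsoup jar is on the classpath and that the Jsoup release is recent enough to name the cleaner class Safelist (older releases call it Whitelist).

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

class JsoupFeaturesSketch {
    // Capability 1: fetch and parse a page directly from a URL.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    // Capabilities 1 and 3: parse HTML held in a String and read an attribute.
    static String firstLinkHref(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.selectFirst("a");
        return link == null ? null : link.attr("href");
    }

    // Capability 2: set an attribute on every link and remove some elements.
    static String rewrite(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a").attr("rel", "nofollow");
        doc.select("span.ad").remove(); // "span.ad" is a made-up example selector
        return doc.body().html();
    }

    // Capability 6: keep only the tags allowed by the basic safelist.
    static String clean(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }
}
```

For instance, clean("<p>Hi<script>alert(1)</script> <b>there</b></p>") should drop the script element and keep the p and b tags.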

These are some of the main features of Jsoup. It provides many other features that are useful in real-world scenarios. Selecting elements from a parsed document is also easy because Jsoup supports jQuery-style selectors. For example, to select the td elements inside table rows, you can write document.select("table tr td"), which returns all the matching td elements.
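
The selector call mentioned above can be tried with a small sketch like the one below. The class name, method name, and sample table are made up for illustration, and the Jsoup jar is assumed to be on the classpath.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class TableCellSelectSketch {
    // Returns the text of every td element found inside a table row.
    static List<String> cellTexts(String html) {
        Document document = Jsoup.parse(html);
        Elements cells = document.select("table tr td");
        return cells.eachText();
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Java</td><td>Jsoup</td></tr></table>";
        // Prints the text of each matching cell.
        System.out.println(cellTexts(html));
    }
}
```

The selector is a descendant selector, so it still matches when the parser inserts a tbody element between the table and its rows.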

You can download the Jsoup library from the Jsoup site.

Below are some Jsoup examples that show how to use Jsoup to parse HTML in Java.

Jsoup Examples
