Jsoup clean HTML example

This Jsoup clean HTML example shows how to clean HTML using Jsoup. It also shows how to remove HTML tags from a string and how to retain specific tags by using a whitelist.

How to remove HTML tags by cleaning the HTML using Jsoup?

You can remove HTML tags from a string using the clean method of the Jsoup class.
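As a minimal sketch of that call, the class below strips every tag from a string by passing the none whitelist to Jsoup.clean. The class name StripHtmlTags and the helper stripTags are made up for illustration. It assumes a jsoup version that still ships org.jsoup.safety.Whitelist; newer releases rename the class to Safelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class StripHtmlTags {

    // Hypothetical helper: removes every HTML tag and keeps only the text.
    static String stripTags(String html) {
        // Whitelist.none() allows no tags at all, so only text nodes survive.
        return Jsoup.clean(html, Whitelist.none());
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>World</b></p>";
        System.out.println(stripTags(html));
    }
}
```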

The clean method removes all HTML tags from the HTML string except those included in the specified whitelist. Jsoup provides the following whitelists out of the box.

1) none
All HTML tags are removed; only the text nodes are kept.

2) simpleText
This whitelist allows only text formatting HTML tags b, em, i, strong and u. All other tags are removed.

3) basic
The basic whitelist allows a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul tags. All other tags are removed. It does not allow images.

4) basicWithImages
As the name suggests, this whitelist allows all tags included in the basic whitelist plus images (the img tag).

5) relaxed
This is the most accommodating whitelist which allows a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul tags.

How to clean HTML using a whitelist?

Create the appropriate whitelist object and pass it to the clean method along with the HTML. Tags included in the whitelist are retained; all other tags are removed.
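The sketch below runs the same input through several of the built-in whitelists so the difference is visible side by side. The class name WhitelistComparison, the sample HTML and the example.com link are illustrative assumptions, and the code assumes a jsoup version that still provides org.jsoup.safety.Whitelist (later releases call it Safelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class WhitelistComparison {

    // Sample input (an assumption for this sketch): a div, a paragraph,
    // a link and bold text.
    static final String HTML =
        "<div><p>Visit <a href=\"https://example.com\">our <b>site</b></a></p></div>";

    // Cleans the sample HTML with the given whitelist.
    static String clean(Whitelist whitelist) {
        return Jsoup.clean(HTML, whitelist);
    }

    public static void main(String[] args) {
        // Text only, no tags.
        System.out.println("none: " + clean(Whitelist.none()));
        // Keeps b, drops a, p and div.
        System.out.println("simpleText: " + clean(Whitelist.simpleText()));
        // Keeps a, b and p, drops div.
        System.out.println("basic: " + clean(Whitelist.basic()));
        // Keeps div as well.
        System.out.println("relaxed: " + clean(Whitelist.relaxed()));
    }
}
```

Jsoup pretty-prints the cleaned output by default, so block-level tags such as div and p may appear on separate lines.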

How to retain specific tags while cleaning the HTML document?

The default whitelists come with preconfigured tags. What if you want to retain only particular tags and remove all other HTML tags? The Whitelist class provides the addTags method, which accepts any number of tag names to retain.

The addTags method adds the given HTML tags to the whitelist and returns the whitelist, so calls can be chained.

For example, to retain only <div> tags and remove all other HTML tags from the HTML string, start from an empty whitelist and add the div tag to it.
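One way to sketch this is to start from Whitelist.none() and add div with addTags. The class name RetainDivOnly, the helper keepOnlyDivs and the sample HTML are hypothetical, and the code assumes a jsoup version that still includes org.jsoup.safety.Whitelist (renamed to Safelist in newer releases).

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class RetainDivOnly {

    // Hypothetical helper: keeps div tags and removes every other tag.
    static String keepOnlyDivs(String html) {
        // Start with an empty whitelist and allow only the div tag.
        // No attributes are whitelisted, so attributes on div are removed too.
        Whitelist divOnly = Whitelist.none().addTags("div");
        return Jsoup.clean(html, divOnly);
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"box\"><p>Hello <b>World</b></p></div><span>Bye</span>";
        System.out.println(keepOnlyDivs(html));
    }
}
```

The text inside removed tags such as p, b and span is kept; only the tags themselves are dropped.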

Please also see how to remove HTML tags from a string in Java using the Jsoup example.

This example is a part of the Jsoup tutorial with examples.

Please let me know your views in the comments section below.

1 comment

  1. Hi,
    Is there a solution to remove elements in a given context : bold in bold for example ?

    Example : if I have :
    <b>text <b>1</b><b> text</b> <b>2</b></b>

    The result after cleaning should be :
    <b>text 1 text 2</b>
