Jsoup preserve new lines example

This example shows how to preserve new lines while using Jsoup to parse HTML. It also shows how to keep the line breaks that come from "\n" characters and from <br> and <p> tags.

How to preserve newlines while using Jsoup?

By default, Jsoup does not keep the newline character "\n" when it outputs HTML: it normalizes the newline to a space. It does not retain the new lines created by <br> or <p> tags either. Consider the example given below.
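A minimal sketch of the default behavior follows. It is an illustration, not necessarily the listing this article originally used. The class name, method name, and sample string are made up for the example. It also assumes a recent Jsoup release, where the class this article calls Whitelist is named Safelist; on older releases, substitute Whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class DefaultNewlineDemo {

    // Removes all tags using Jsoup's default output settings (pretty print on).
    static String stripTags(String html) {
        // Safelist.none() allows no tags, so only the text survives.
        return Jsoup.clean(html, Safelist.none());
    }

    public static void main(String[] args) {
        String html = "Line one\nLine two";
        // With pretty print on, the newline is normalized to a single space.
        System.out.println(stripTags(html)); // prints: Line one Line two
    }
}
```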

Output

As you can see from the output, Jsoup replaced "\n" with a space character. To prevent Jsoup from removing the new line characters, create a Document.OutputSettings object with pretty print turned off and pass it to the clean method, as given below.
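The pretty print setting described above can be sketched as follows. The class name, method name, and sample string are hypothetical, and Safelist again stands in for Whitelist on the assumption of a recent Jsoup release. The empty string passed as the second argument is the base URI, which this example does not need.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

class PreserveNewlineDemo {

    // Removes all tags but leaves whitespace in the text untouched.
    static String stripTagsKeepNewlines(String html) {
        // Pretty print off: Jsoup no longer normalizes whitespace on output.
        Document.OutputSettings settings =
                new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(html, "", Safelist.none(), settings);
    }

    public static void main(String[] args) {
        String html = "Line one\nLine two";
        // The newline between the two lines is kept as-is.
        System.out.println(stripTagsKeepNewlines(html));
    }
}
```

Note that the clean method returns HTML, so characters such as & and < in the text come back escaped as entities.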

Output

We cleaned the input HTML using the clean method (see the clean HTML full example). We provided none as the whitelist, so it removed all the HTML tags from the HTML string. In the OutputSettings object we set pretty print to false, which prevented Jsoup from removing the new line characters.

How to retain new lines created by <br> and <p> tags?

Often, new lines in the rendered HTML are created by <br> and <p> tags rather than by "\n" characters. When you clean the HTML with the clean method, Jsoup removes these tags, and the new lines they produced are lost. The example given below shows how to retain such new lines.
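One possible way to do this, sketched under assumptions, is to insert a real newline text node after every <br> and <p> element before stripping the tags. This is not necessarily the approach the article's original listing took. The class name, method name, and sample HTML are hypothetical, and Safelist is used in place of Whitelist on the assumption of a recent Jsoup release.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.safety.Safelist;

class TagNewlineDemo {

    // Converts <br> and <p> boundaries into "\n" and then strips all tags.
    static String textWithLineBreaks(String html) {
        Document doc = Jsoup.parse(html);
        Document.OutputSettings settings =
                new Document.OutputSettings().prettyPrint(false);
        doc.outputSettings(settings);

        // Add a newline text node right after each <br> element.
        for (Element br : doc.select("br")) {
            br.after(new TextNode("\n"));
        }
        // Add a newline text node right after each <p> element.
        for (Element p : doc.select("p")) {
            p.after(new TextNode("\n"));
        }

        // Serialize without pretty print so the inserted newlines survive,
        // then remove every tag while keeping the text.
        String marked = doc.body().html();
        return Jsoup.clean(marked, "", Safelist.none(), settings).trim();
    }

    public static void main(String[] args) {
        String html = "<p>First paragraph</p><p>Second<br>line</p>";
        System.out.println(textWithLineBreaks(html));
    }
}
```

The final trim() only drops the newline added after the last paragraph. If blank lines between paragraphs are wanted, the text node added after each <p> could hold two newlines instead of one.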

Output

If you want to learn more about the Jsoup library, please visit the Jsoup tutorial with examples.

Please let me know your views in the comments section below.


6 comments

  1. Hi Rahim,
    It is a nice way to deal with tags to extract info!
    However, when I take this approach, I lose my HTML DOM structure once Whitelist.none() is applied.
    Could you suggest how to retain the DOM structure and still accomplish this?

    Thanks!

    1. Hi Sridhar,

      Thanks for the comment.

      The clean method removes all the HTML tags except for the tags allowed by the whitelist. Since you provided none as the whitelist, the clean method removed all the HTML tags.

      Use the approach given in this example when you want to extract the text of the whole HTML document. If you want to maintain the DOM structure and are interested in extracting text from specific tags, you can turn pretty print off, traverse to the specific element using the DOM, and then extract the text. Visit the https://www.javacodeexamples.com/jsoup-tutorial-with-examples/1628 page to learn how to do that.

      I hope this solves your problem.
