Jsoup preserve new lines example

This Jsoup preserve new lines example shows how to preserve new lines while using Jsoup to parse HTML. The example also shows how to preserve line breaks created by the “\n” character as well as by <br> and <p> tags.

How to preserve newlines while using Jsoup?

Jsoup removes the newline character “\n” from the HTML by default. It also does not retain new lines created by “<br>” or “<p>” tags. Consider the example given below.

String strHTML = "<html><body>Hello\nworld</body></html>";
String str = Jsoup.parse(strHTML).text();
System.out.println(str);

Output

Hello world

As you can see from the output, Jsoup replaced “\n” with a space character. To prevent Jsoup from removing the newline characters, create an OutputSettings object, turn pretty print off, and pass it to the clean method as given below.

OutputSettings settings = new OutputSettings();
settings.prettyPrint(false);
String str = Jsoup.clean(strHTML, "", Whitelist.none(), settings);
System.out.println(str);

Output

Hello
world

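We cleaned the input HTML using the clean method (see the clean HTML full example). Because we passed Whitelist.none(), it removed all the HTML tags from the HTML string. In the OutputSettings object we set pretty print to false, which prevented Jsoup from removing the newline characters. Note that newer Jsoup releases rename the Whitelist class to Safelist, so adjust the class name if your version does not have Whitelist.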
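The difference between the two outputs above comes down to whitespace normalization. As a rough plain-Java analogy (this is not Jsoup's implementation, and `collapseWhitespace` is a hypothetical helper name used only for illustration), collapsing every run of whitespace into a single space gives the first result, while leaving the string untouched gives the second:

```java
class WhitespaceSketch {

    // Hypothetical helper: approximates whitespace normalization by
    // replacing each run of whitespace with one space and trimming the ends.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "Hello\nworld";

        // Normalized: the newline is turned into a single space
        System.out.println(collapseWhitespace(raw));

        // Untouched: the newline is kept, so the text prints on two lines
        System.out.println(raw);
    }
}
```

This only mirrors the visible effect, so treat it as a mental model rather than a replacement for the OutputSettings approach.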

How to retain new lines created by <br> and <p> tags?

Many times, new lines in the HTML output are created by <br> and <p> tags. When cleaning the HTML with the clean method, Jsoup removes such new lines. The example given below shows how to retain them.

package com.javacodeexamples.libraries.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Document.OutputSettings;
import org.jsoup.safety.Whitelist;

public class JsoupPreserveLineBreaksExample {

    public static void main(String[] args) {

        String strHTML = "<html><body>" +
                "Hello\nworld" +
                "<br>" +
                "<p>HiWorld</p>" +
                "<p>Paragraph completed</p>" +
                "</body></html>";

        //create Jsoup document from HTML
        Document jsoupDoc = Jsoup.parse(strHTML);

        //set pretty print to false, so \n is not removed
        jsoupDoc.outputSettings(new OutputSettings().prettyPrint(false));

        //select all <br> tags and append the \n placeholder after them
        jsoupDoc.select("br").after("\\n");

        //select all <p> tags and prepend the \n placeholder before them
        jsoupDoc.select("p").before("\\n");

        //get the HTML from the document and turn the placeholders into real new lines
        String str = jsoupDoc.html().replaceAll("\\\\n", "\n");

        //remove all the tags while keeping the new lines
        String strWithNewLines =
                Jsoup.clean(str, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

        System.out.println(strWithNewLines);
    }
}

Output

Hello

world

HiWorld

Paragraph completed
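The escaping in the example above is easy to misread, so here is a small standalone sketch of just that step using only the Java standard library. The class and the `restoreNewlines` helper are hypothetical names introduced for illustration. In Java source, "\\n" is the two-character text backslash + n (the placeholder), while the regular expression "\\\\n" matches exactly that two-character text.

```java
class PlaceholderSketch {

    // Hypothetical helper: turns the two-character placeholder (backslash + n)
    // into a real line feed, mirroring the replaceAll call in the example.
    static String restoreNewlines(String html) {
        // The Java literal "\\\\n" is the regex \\n, which matches a
        // literal backslash followed by the letter n.
        return html.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        // Contains backslash + n as plain text, not a line feed
        String marked = "Hello<br>\\nworld";

        // After the replacement the text contains a real line feed
        System.out.println(restoreNewlines(marked));
    }
}
```

Since the placeholder is a fixed piece of text rather than a pattern, `String.replace("\\n", "\n")` would likely do the same job without any regular expression escaping.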

If you want to learn more about the Jsoup library, please visit the Jsoup tutorial with examples.

Please let me know your views in the comments section below.


6 comments

Ihor Klimov December 19, 2017 at 4:25 pm


Wow thank you so much!
1. rahimv March 18, 2018 at 9:27 pm
  
  
  I am glad it helped. Thanks.
Vashish April 9, 2019 at 1:00 pm


What version of Jsoup was used?
1. rahimv April 13, 2019 at 12:15 pm
  
  
  Hi Vashish,
  JSoup 1.8.3 version was used in this example.
Sridhar September 15, 2020 at 3:04 am


Hi Rahim,
It is a nice way to deal with tags to extract info!
However, when I take this approach, I lose my HTML DOM structure once Whitelist.none() is applied.
Please suggest how to retain the DOM structure and still accomplish this.

Thanks!
1. RahimV September 16, 2020 at 5:35 pm
  
  
  Hi Sridhar,
  
  Thanks for the comment.
  
  The clean method removes all the HTML tags except for the tags mentioned in the whitelist. Since you provided none as the whitelist, the clean method removed all the HTML tags.
  
  Use the approach given in this example when you want to extract the text of the whole HTML document. If you want to maintain the DOM structure and are interested in extracting text from specific tags, you can turn pretty print off, traverse to the specific element using the DOM, and then extract the text. Visit the https://www.javacodeexamples.com/jsoup-tutorial-with-examples/1628 page to learn how to do that.
  
  I hope this clears your problem.
