This Jsoup example shows how to preserve new lines while parsing HTML with Jsoup. It covers line breaks that come from \n characters as well as those created by <br> and <p> tags.
How to preserve newlines while using Jsoup?
By default, Jsoup's text() method replaces the newline character “\n” in the HTML with a space. It does not retain line breaks created by “<br>” or “<p>” tags either. Consider the example given below.
    String strHTML = "<html><body>Hello\nworld</body></html>";

    String str = Jsoup.parse(strHTML).text();

    System.out.println(str);
Output
    Hello world
As you can see from the output, Jsoup replaced “\n” with a space character. To prevent Jsoup from removing the newline characters, you can change the OutputSettings and turn pretty print off, as given below.
    OutputSettings settings = new OutputSettings();
    settings.prettyPrint(false);

    String str = Jsoup.clean(strHTML, "", Whitelist.none(), settings);

    System.out.println(str);
Output
    Hello
    world
We cleaned the input HTML using the clean method (see the clean HTML full example). We passed Whitelist.none() as the whitelist, so it removed all the HTML tags from the HTML string. In the OutputSettings we set pretty print to false, which prevented Jsoup from removing the newline characters.
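The difference between the two outputs above comes down to whitespace normalization. The following is a minimal plain-Java sketch of that idea, not Jsoup's actual implementation: the class `WhitespaceModel` and the method `normalized` are hypothetical names, and the regex only approximates what text() does by collapsing each run of whitespace into a single space.

```java
class WhitespaceModel {

    // Rough model of text(): collapse every run of whitespace
    // (spaces, tabs, line breaks) into one space and trim the ends.
    // Illustrative stand-in only, not Jsoup's own code.
    static String normalized(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "Hello\nworld";

        // Normalized: the line break becomes a space
        System.out.println(normalized(raw));

        // Not normalized: the line break is kept, which is the
        // effect of turning pretty print off in the example above
        System.out.println(raw);
    }
}
```

With pretty print turned off, the text is written out without this kind of normalization, which is why the “\n” survives in the second example.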
How to retain new lines created by <br> and <p> tags?
Often, line breaks in rendered HTML are created by <br> and <p> tags. Cleaning the HTML with Jsoup's clean method removes such line breaks. The example given below shows how to retain them.
    package com.javacodeexamples.libraries.jsoup;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Document.OutputSettings;
    import org.jsoup.safety.Whitelist;

    public class JsoupPreserveLineBreaksExample {

        public static void main(String[] args) {

            String strHTML = "<html><body>"
                    + "Hello\nworld"
                    + "<br>"
                    + "HiWorld"
                    + "<p>Paragraph</p> completed"
                    + "</body></html>";

            //create Jsoup document from HTML
            Document jsoupDoc = Jsoup.parse(strHTML);

            //set pretty print to false, so \n is not removed
            jsoupDoc.outputSettings(new OutputSettings().prettyPrint(false));

            //select all <br> tags and append \n after that
            jsoupDoc.select("br").after("\\n");

            //select all <p> tags and prepend \n before that
            jsoupDoc.select("p").before("\\n");

            //get the HTML from the document, and retaining original new lines
            String str = jsoupDoc.html().replaceAll("\\\\n", "\n");

            String strWithNewLines =
                    Jsoup.clean(str, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

            System.out.println(strWithNewLines);
        }
    }
Output
    Hello
    world
    HiWorld
    Paragraph completed
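The escaping in the example is easy to misread, so here is a small plain-Java sketch of just that step. It assumes a hand-written string standing in for the output of jsoupDoc.html(); the class `MarkerEscapingDemo` and the method `markersToNewlines` are hypothetical names used only for illustration. In Java source, "\\n" is a two-character marker (a backslash followed by n), while "\n" is a real line feed.

```java
class MarkerEscapingDemo {

    // Replaces the literal two-character marker (backslash + 'n')
    // with a real line feed, like the replaceAll call in the example.
    static String markersToNewlines(String html) {
        // Java literal "\\\\n" is the regex \\n, which matches
        // a backslash followed by the letter n.
        return html.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        // Hand-written stand-in for the serialized document.
        // "\\n" here is marker text, not a line break.
        String serialized = "Hello<br>\\nHiWorld\\n<p>Paragraph</p>";

        System.out.println("\\n".length()); // 2: backslash + n
        System.out.println("\n".length());  // 1: line feed

        // Each marker is now a real line break
        System.out.println(markersToNewlines(serialized));
    }
}
```

One consequence of this design is that any backslash-n sequence already present as literal text in the page would be converted to a line break as well, so a more distinctive marker may be safer for arbitrary input.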
If you want to learn more about the Jsoup library, please visit the Jsoup tutorial with examples.
Please let me know your views in the comments section below.
Wow thank you so much!
I am glad it helped. Thanks.
What version of jsoup was used?
Hi Vashish,
Jsoup version 1.8.3 was used in this example.
Hi Rahim,
It is a nice way to deal with tags to extract info!
However, when I take this approach, I lose my HTML DOM structure once Whitelist.none() is applied.
Could you suggest how to retain the DOM structure and still accomplish this?
Thanks!
Hi Sridhar,
Thanks for the comment.
The clean method removes all the HTML tags except for the tags mentioned in the whitelist. Since you provided none as the whitelist, the clean method removed all the HTML tags.
Use the approach given in this example when you want to extract the text of the whole HTML document. If you want to maintain the DOM structure and are interested in extracting text from specific tags, you can turn pretty print off, traverse to the specific element using the DOM, and then extract the text. Visit the https://www.javacodeexamples.com/jsoup-tutorial-with-examples/1628 page to see how to do that.
I hope this solves your problem.