Remove HTML tags from String in Java example

August 3, 2021

This example shows how to remove HTML tags from a String in Java using a regular expression and the Jsoup library.

How to remove HTML tags from String in Java?

You can remove simple HTML tags from a string using a regular expression. Usually, HTML tags are enclosed in “<” and “>” brackets, so we are going to use the "<[^>]*>" pattern to match anything between these brackets and replace them with the empty string to remove them.

< - start bracket

[^>] - followed by any character which is not closing bracket ">"

* - zero or more times

> - followed by closing bracket
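Before replacing anything, it can help to see exactly what the pattern matches. The following is a minimal sketch, not part of the original example: the class name TagMatchLister and the method findTags are hypothetical names chosen for illustration. It compiles the same pattern once and collects every match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate what the pattern matches.
class TagMatchLister {

    // Same pattern as described above: "<", any run of characters other than ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns every substring the pattern matches, in order of appearance.
    static List<String> findTags(String html) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [<a href="#">, </a>]
        System.out.println(findTags("<a href=\"#\">HTML Link</a>"));
    }
}
```

Only the two tags are matched; the text "HTML Link" between them is left alone, which is why replacing the matches with an empty string keeps it.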

Example

package com.javacodeexamples.stringexamples;

public class RemoveHTMLTagsFromStringExample {

    public static void main(String[] args) {

        String[] strHTMLTexts = {
            "<a href=\"#\">HTML Link</a>",
            "<table><tr><td>column1</td></tr></table>",
            "<script>alert('javascript');</script>",
            "<br />< BR >line break<bR/><br>",
            "<b>bold text</b>",
            "&nbsp;&nbsp;Jack &amp; Jones",
            "&lt;script&gt;"
        };

        //match HTML tags
        String strRegEx = "<[^>]*>";

        //replace them with empty string to remove them
        for(String str : strHTMLTexts){
            System.out.println( str.replaceAll(strRegEx, "") );
        }
    }
}

Output

HTML Link

column1

alert('javascript');

line break

bold text

&nbsp;&nbsp;Jack &amp; Jones

&lt;script&gt;

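The above regular expression worked fine, except that it did not handle HTML entities like “&nbsp;” and “&amp;”. Depending on the requirement, you can either replace them with the equivalent characters one by one or remove them all using the "&.*?;" pattern. Note that this pattern also removes ordinary text that happens to sit between a literal "&" and a later ";".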

& - & character

.*? - followed by any characters, as few as possible (reluctant match)

; - followed by semicolon

String[] strHTMLTexts = {
    "<a href=\"#\">HTML Link</a>",
    "<table><tr><td>column1</td></tr></table>",
    "<script>alert('javascript');</script>",
    "<br />< BR >line break<bR/><br>",
    "<b>bold text</b>",
    "&nbsp;&nbsp;Jack &amp; Jones",
    "&lt;script&gt;"
};

String strRegEx = "<[^>]*>";

for(String str : strHTMLTexts){

    str = str.replaceAll(strRegEx, "");

    //replace &nbsp; with space
    str = str.replace("&nbsp;", " ");

    //replace &amp; with &
    str = str.replace("&amp;", "&");

    //OR remove all HTML entities
    str = str.replaceAll("&.*?;", "");

    System.out.println(str);
}

Output

HTML Link

column1

alert('javascript');

line break

bold text

Jack & Jones

script
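If you would rather decode entities than delete them, replacing them one by one can be generalized with a small lookup table. The following is a rough sketch under stated assumptions: the class SimpleEntityDecoder and its method names are hypothetical, the table covers only five named entities (HTML defines many more), and &nbsp; is mapped to a regular space as in the example above, although the real character is U+00A0.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; a sketch, not a complete HTML entity decoder.
class SimpleEntityDecoder {

    // Small illustrative subset of named entities (assumption: enough for the demo).
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("nbsp", " "); // regular space, as in the article; really U+00A0
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    // Matches &name; or &#123; or &#x1F; and captures the part between "&" and ";".
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);");

    static String decode(String text) {
        Matcher matcher = ENTITY.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String body = matcher.group(1);
            String replacement;
            if (body.startsWith("#")) {
                replacement = decodeNumeric(body.substring(1), matcher.group());
            } else {
                // Names missing from the table are left untouched.
                replacement = NAMED.getOrDefault(body, matcher.group());
            }
            // quoteReplacement keeps "$" and "\" in the replacement literal.
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    // Converts "2346" or "x92A" to the character it denotes; falls back to the original text.
    private static String decodeNumeric(String digits, String original) {
        try {
            boolean hex = digits.startsWith("x") || digits.startsWith("X");
            int codePoint = hex
                    ? Integer.parseInt(digits.substring(1), 16)
                    : Integer.parseInt(digits, 10);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : original;
        } catch (NumberFormatException e) {
            return original; // number too large to parse
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("&nbsp;&nbsp;Jack &amp; Jones"));
    }
}
```

Because the text is scanned in a single pass, input such as "&amp;lt;" becomes "&lt;" and is not decoded a second time. Unlike the "&.*?;" pattern, this sketch only touches text that actually looks like an entity.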

How to remove specific HTML tags from the String?

What if you want to remove only a specific HTML tag from the String? You can do that using a regular expression too. Suppose you want to remove the “a” tag from the String "<a href='#'>HTML<b>Bold</b>link</a>". You can use the "<[/]?a[^>]*>" pattern to do that.

< - opening bracket

[/]? - followed by zero or one “/” to match closing tag

a - followed by “a” character

[^>] - followed by any character which is not closing bracket ">"

* - zero or more times

> - followed by closing bracket ">"

String strHtml = "<a href='#'>HTML<b>Bold</b>link</a>";
String strRegEx = "<[/]?a[^>]*>";
System.out.println( strHtml.replaceAll(strRegEx, "") );

Output

HTML<b>Bold</b>link

Let’s run some more tests to make sure that the pattern works.

String[] strHtmlLinks = {
    "<a href='#'>HTML<b>Bold</b>link</a>",
    "<A href=''></A>",
    "< a href='#'></ a >",
    "< a href='#'>< / a >"
};

String strRegEx = "<[/]?a[^>]*>";

for(String html : strHtmlLinks)
    System.out.println(html.replaceAll(strRegEx, ""));

Output

HTML<b>Bold</b>link

<A href=''></A>

< a href='#'></ a >

< a href='#'>< / a >

HTML is not a strict language. As you can see from the output, our pattern failed when the tag was written in upper case or contained extra spaces. Let’s modify the pattern to "(?i)<[\\s]*[/]?[\\s]*a[^>]*>" to cover these scenarios.

(?i) - case insensitive comparison

< - opening bracket "<"

[\\s]* - followed by zero or more spaces

[/]? - followed by zero or one "/"

[\\s]* - followed by zero or more spaces

a - followed by "a"

[^>] - followed by any character which is not closing bracket ">"

* - zero or more times

> - followed by closing bracket ">"

Example

String[] strHtmlLinks = {
    "<a href='#'>HTML<b>Bold</b>link</a>",
    "<A href=''>Link</A>",
    "< a href='#'>Link</ a >",
    "< a href='#'>Link< / a >"
};

String strRegEx = "(?i)<[\\s]*[/]?[\\s]*a[^>]*>";

for(String html : strHtmlLinks)
    System.out.println( html.replaceAll(strRegEx, "") );

Output

HTML<b>Bold</b>link

Link

Link

Link
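One caveat with the case-insensitive pattern above: because "a" is followed directly by [^>]*, it also matches any other tag whose name merely starts with "a", such as <abbr> or <article>. A possible refinement, sketched below, adds a word boundary (\b) after the tag name. The class SpecificTagRemover and its methods are hypothetical names for illustration, and the sketch inherits every other limitation of regex-based tag removal.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; a sketch of a word-boundary variant of the pattern above.
class SpecificTagRemover {

    // Builds a pattern for the opening and closing forms of one tag name.
    // The \b after the name stops "a" from also matching "abbr" or "article".
    static Pattern tagPattern(String tagName) {
        return Pattern.compile(
                "<\\s*/?\\s*" + Pattern.quote(tagName) + "\\b[^>]*>",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes the given tag (but not its text content) from the input.
    static String removeTag(String html, String tagName) {
        return tagPattern(tagName).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<abbr title='x'>HTML</abbr> <A href='#'>Link</ a >";
        // prints <abbr title='x'>HTML</abbr> Link
        System.out.println(removeTag(html, "a"));
    }
}
```

Passing the tag name through Pattern.quote keeps it literal, so the same helper can be reused for other tag names without hand-editing the regular expression.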

Is it recommended to use a regular expression to remove HTML tags from String?

The short answer is NO. So far we have only seen happy-path scenarios. Consider the example HTML string given below.

String strHtml = "<b Very important text</b>Gone!";
String strRegEx = "<[^>]*>";
System.out.println( strHtml.replaceAll( strRegEx, "") );

Output

Gone!

Our important text was removed by the regular expression because the HTML was not well-formed. It is very common to encounter such malformed HTML, which a regular expression cannot handle. Consider another example.

String strHtml = "<strong>Maths: a < b & b > c</strong>";
String strRegEx = "<[^>]*>";
System.out.println( strHtml.replaceAll(strRegEx, "") );

Output

Maths: a  c
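A third failure case, also raised by a reader in the comments, is a ">" character inside an attribute value. The sketch below is only an illustration (the class AttributeBracketDemo and the sample markup are made up for this purpose): the pattern stops at the first ">", so the rest of the attribute leaks into the output.

```java
// Hypothetical demo class showing the ">"-in-attribute problem.
class AttributeBracketDemo {

    // Same tag-stripping approach as in the examples above.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The ">" inside the title attribute ends the match too early.
        String html = "<a title=\"x > y\">link</a>";
        // prints  y">link  (with a leading space)
        System.out.println(stripTags(html));
    }
}
```

The pattern has no notion of quoted attribute values, so it cannot tell this ">" apart from the one that really closes the tag.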

What should I use to remove the HTML tags?

If you are removing a tag or two from the string and you are absolutely certain that the input HTML is well-formed, using a regular expression is OK. In all other scenarios, using an HTML parser is the way to go.

One such parser is Jsoup. Here is how you can remove the HTML elements from the string using Jsoup.

//needs import org.jsoup.Jsoup; and the Jsoup library on the classpath
String strHtml = "<strong>Maths: a < b & b > c</strong>";
String text = Jsoup.parse(strHtml).text();
System.out.println(text);

Output

Maths: a < b & b > c

The Jsoup library even allows you to whitelist elements in case you want to retain some tags while clearing all others. This is done with the Jsoup.clean method and a Safelist (named Whitelist in older Jsoup versions).

This example is a part of the Java String tutorial, Java RegEx Tutorial, and Jsoup Tutorial.

Please let me know your views in the comments section below.

9 comments

SANJAY GUPTA July 9, 2016 at 11:16 am

In this program it is possible to remove the tags from an HTML string, but when I extract HTML from a website using Jsoup, I am not able to apply the pattern "&.*?;". Please see the following program and tell me how I can fulfill my requirement.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class JavaSoup1
{
    public static void main(String[] args) throws Exception
    {
        Document doc = Jsoup.connect("http://www.iitbhu.ac.in/").get();

        String content = doc.text();

        content = content.replaceAll("<[^>]*>", "");

        content = content.replaceAll("&.*?;", "");

        System.out.println(content);
    }
}

Here all HTML tags are replaced by an empty string, but the "&.*?;" pattern does not work.

Please tell me how to resolve this problem.

SANJAY KUMAR GUPTA

IIT BHU VARANASI
  rahimv July 12, 2016 at 2:47 pm

  Hi,
  Can you please explain what exact problem you are facing? Is it throwing an exception, or does it not replace HTML entities like &amp; in your source? The pattern "&.*?;" is meant to remove HTML entities from the source. If the source does not contain any, it simply does nothing and should not throw an exception.
    SANJAY GUPTA July 13, 2016 at 12:43 pm

    I am not asking about an exception.

    See the program I posted above.

    In this program the HTML tags are removed, but "&#1276" is not removed by content=content.replaceAll("&.*?;","");

    SANJAY GUPTA July 13, 2016 at 12:44 pm

    In this program the HTML tags are removed, but "&#…." is not removed by content=content.replaceAll("&.*?;","");
      rahimv July 30, 2016 at 12:26 pm

      I don’t see a reason why it should not replace the entities. It does, at least in my tests. Maybe your input is wrong?

      String str = "<b>&nbsp;&nbsp;demo string &#2346; </b>";
      System.out.println(str.replaceAll("<[^>]*>", "").replaceAll("&.*?;", ""));

      Output
      demo string
    SANJAY GUPTA July 13, 2016 at 12:49 pm

    The program is not throwing an exception, but it is not removing text like "&#…." from the source code.
SANJAY GUPTA July 9, 2016 at 12:21 pm

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class JavaSoup1
{
    public static void main(String[] args) throws Exception
    {
        Document doc = Jsoup.connect("http://www.iitbhu.ac.in/").get();

        String content = doc.html();

        content = content.replaceAll("<[^>]*>", "");

        content = content.replaceAll("&.*?;", "");

        System.out.println(content);
    }
}

This program works for replaceAll("<[^>]*>","") but does not work for replaceAll("&.*?;","").

Can you suggest why?
Daniel July 23, 2019 at 6:09 pm

Good luck when you have > in a HTML attribute like class, id or data ;D

  RahimV November 13, 2019 at 9:52 pm

  Hi Daniel,

  Yes, there are many more hidden surprises like the one you mentioned 🙂
  That is why regex is never a good idea to parse the HTML. The Jsoup library has never failed for me.
