
Remove HTML tags from String in Java example

This example shows how to remove HTML tags from a String in Java using a regular expression and the Jsoup library.

How to remove HTML tags from String in Java?

You can remove simple HTML tags from a string using a regular expression. HTML tags are enclosed in “<” and “>” brackets, so we can use the "<[^>]*>" pattern to match everything between a pair of these brackets, including the brackets themselves, and replace each match with an empty string.
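A minimal sketch of this approach is shown below. The class name, method name, and sample input are illustrative assumptions, not taken from the original listing.

```java
public class RemoveHtmlTags {

    // Removes every "<...>" sequence; entities and plain text are left untouched.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>World</b>!</p>"; // hypothetical sample input
        System.out.println(stripTags(html)); // prints: Hello World!
    }
}
```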


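The above regular expression works for tags, but it does not handle HTML entities such as “&nbsp;” and “&amp;”. Depending on the requirement, you can either replace them with their equivalent characters one by one or remove them using the "&.*?;" pattern. Note that this pattern removes everything from any “&” up to the next “;”, so it can also delete ordinary text that happens to contain both characters.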
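Both options can be sketched as follows. The class, the method names, and the sample strings are assumptions made for illustration, and only two entities are decoded in the second method.

```java
public class RemoveHtmlEntities {

    // Option 1: strip the tags, then drop every "&...;" sequence.
    static String stripTagsAndEntities(String html) {
        return html.replaceAll("<[^>]*>", "").replaceAll("&.*?;", "");
    }

    // Option 2: strip the tags, then replace a few known entities one by one.
    // "&amp;" is handled last so that its "&" is not decoded a second time.
    static String stripTagsDecodeEntities(String html) {
        return html.replaceAll("<[^>]*>", "")
                   .replace("&nbsp;", " ")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String html = "<p>Tom&nbsp;&amp;&nbsp;Jerry</p>"; // hypothetical sample input
        System.out.println(stripTagsAndEntities(html));    // prints: TomJerry
        System.out.println(stripTagsDecodeEntities(html)); // prints: Tom & Jerry

        // The removal pattern is greedy about plain text too: "& b;" is deleted here.
        System.out.println(stripTagsAndEntities("a & b; c"));
    }
}
```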


How to remove specific HTML tags from the String?

What if you want to remove only a specific HTML tag from the String? You can do that with a regular expression too. Suppose you want to remove the “a” tag from the String “<a href='#'>HTML<b>Bold</b>link</a>”. You can use the "<[/]?a[^>]*>" pattern, which matches both the opening and the closing “a” tag.
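A short sketch of this pattern applied to the string from the paragraph above (the class and method names are illustrative):

```java
public class RemoveAnchorTag {

    // Removes opening and closing "a" tags but keeps their content and other tags.
    static String removeAnchorTags(String html) {
        return html.replaceAll("<[/]?a[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<a href='#'>HTML<b>Bold</b>link</a>";
        System.out.println(removeAnchorTags(html)); // prints: HTML<b>Bold</b>link
    }
}
```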


Let’s run some more tests to make sure that the pattern works.
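The inputs below are assumed test cases, chosen to exercise upper-case tag names and extra whitespace; they are not necessarily the ones used in the original listing.

```java
public class RemoveAnchorTagTests {

    // Same case-sensitive pattern as before, with no allowance for whitespace.
    static String removeAnchorTags(String html) {
        return html.replaceAll("<[/]?a[^>]*>", "");
    }

    public static void main(String[] args) {
        String[] inputs = {
            "<a>link</a>",                // lower case, no spaces: tags are removed
            "<A href='#'>link</A>",       // upper case: tags are left in place
            "< a href='#'>link</ a>"      // spaces inside the brackets: tags are left in place
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + removeAnchorTags(input));
        }
    }
}
```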


HTML is not a strict language. As you can see from the output, our pattern failed when the HTML tag was written in upper case or contained extra spaces. Let’s modify the pattern to "(?i)<[\\s]*[/]?[\\s]*a[^>]*>" to cover these scenarios: “(?i)” makes the match case-insensitive and “[\\s]*” allows optional whitespace around the slash.
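The modified pattern can be tried against the same kinds of input. This is a sketch with assumed sample strings and illustrative names.

```java
public class RemoveAnchorTagLenient {

    // Case-insensitive, and tolerant of whitespace before and after the slash.
    static String removeAnchorTags(String html) {
        return html.replaceAll("(?i)<[\\s]*[/]?[\\s]*a[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(removeAnchorTags("<A href='#'>link</A>"));    // prints: link
        System.out.println(removeAnchorTags("< a href='#'>link</ a>"));  // prints: link
        System.out.println(removeAnchorTags("<a>HTML<b>Bold</b></a>"));  // prints: HTML<b>Bold</b>
    }
}
```

One thing to keep in mind is that "a[^>]*" also matches tags whose names merely start with “a”, such as “<abbr>”. A possible tightening, not part of the original article, is to add a word boundary after the tag name: "(?i)<\\s*/?\\s*a\\b[^>]*>".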


Is it recommended to use a regular expression to remove HTML tags from String?

The short answer is no. So far we have only seen happy-path scenarios. Consider the example HTML string given below.
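As an illustration, here is a sketch with a hypothetical input in which “<” and “>” appear as unescaped text. The string is an assumption made for this sketch and may differ from the article's own example.

```java
public class MalformedHtmlExample {

    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The "<" and ">" in the text are not escaped as &lt; and &gt;.
        String html = "<p>if a < b then IMPORTANT and c > d</p>";

        // The pattern treats "< b then IMPORTANT and c >" as one tag and removes it.
        System.out.println(stripTags(html)); // prints: if a  d
    }
}
```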


Our important text was removed by the regular expression because the HTML was not well-formed. Such malformed HTML is very common, and a regular expression cannot handle it reliably. Consider another example.
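One possible second case, again a hypothetical input rather than the article's own, is a “>” character inside an attribute value. The pattern stops at the first “>” it finds and leaves part of the tag behind.

```java
public class AttributeValueExample {

    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The title attribute contains a ">" character.
        String html = "<a href=\"#\" title=\"next > page\">link</a>";

        // Only the part up to the first ">" is removed, so a tag fragment
        // remains in front of the text "link".
        System.out.println(stripTags(html));
    }
}
```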


What should I use to remove the HTML tags?

If you are removing a tag or two from the string and you are absolutely certain that the input HTML is well-formed, using a regular expression is OK. In all other scenarios, using an HTML parser is the way to go.

One such parser is Jsoup. Here is how you can remove the HTML elements from a string using Jsoup.
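A minimal sketch, assuming the jsoup library is on the classpath. The class name, method name, and sample input are illustrative. Jsoup.parse builds a document from the string, and text() returns its text content with the tags removed and the entities decoded.

```java
import org.jsoup.Jsoup;

public class JsoupRemoveTags {

    // Parses the HTML and returns only its text content.
    static String toPlainText(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>World</b> &amp; friends</p>"; // hypothetical sample input
        System.out.println(toPlainText(html)); // prints: Hello World & friends
    }
}
```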


The Jsoup library even allows you to whitelist elements in case you want to retain some tags while clearing all others.
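A sketch of that idea, assuming a recent jsoup release in which the class is named Safelist (older releases call it Whitelist). The choice to keep only the “b” tag is an arbitrary example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupKeepSomeTags {

    // Keeps <b> elements; all other tags are removed while their text is kept.
    static String keepBoldOnly(String html) {
        return Jsoup.clean(html, Safelist.none().addTags("b"));
    }

    public static void main(String[] args) {
        String html = "<a href='#'>HTML <b>Bold</b> link</a>"; // hypothetical sample input
        System.out.println(keepBoldOnly(html));
    }
}
```

Safelist.none() starts from a list that allows no tags at all, and addTags adds the ones to retain.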

This example is a part of the Java String tutorial, Java RegEx Tutorial, and Jsoup Tutorial.

Please let me know your views in the comments section below.


9 comments

  1. With this program I can remove the tags from an HTML string, but when I fetch HTML from a website using Jsoup, I am not able to apply the pattern "&.*?;". Please see the following program and tell me how I can fulfill my requirement.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    class JavaSoup1
    {
    public static void main(String []args)throws Exception
    {
    Document doc = Jsoup.connect("http://www.iitbhu.ac.in/").get();

    String content=doc.text();

    content=content.replaceAll("<[^>]*>","");

    content=content.replaceAll("&.*?;","");

    System.out.println(content);
    }
    }

    Here all the HTML tags are replaced by an empty string, but the "&.*?;" pattern does not work.

    Please tell me how to resolve this problem.

    SANJAY KUMAR GUPTA

    IIT BHU VARANASI

    1. Hi,
      Can you please explain what exact problem you are facing? Is it throwing an exception, or does it not replace HTML entities like &amp; in your source? The pattern "&.*?;" is meant to remove HTML entities from the source. If the source does not contain any, it simply does nothing and should not throw an exception.

      1. I am not asking about an exception.

        Please see the program I posted above.

        In that program the HTML tags are removed, but “&#1276” is not removed by content=content.replaceAll("&.*?;","");

        1. In that program the HTML tags are removed, but “&#….” is not removed by content=content.replaceAll("&.*?;","");

          1. I don’t see a reason why it should not replace the entities. It does, at least in my tests. Maybe your input is wrong?

            String str = "<b>&nbsp; demo string &#2346; </b>";
            System.out.println(str.replaceAll("<[^>]*>", "").replaceAll("&.*?;", ""));

            Output
            demo string

  2. import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    class JavaSoup1
    {
    public static void main(String []args)throws Exception
    {
    Document doc = Jsoup.connect("http://www.iitbhu.ac.in/").get();

    String content=doc.html();

    content=content.replaceAll("<[^>]*>","");

    content=content.replaceAll("&.*?;","");

    System.out.println(content);
    }
    }

    This program works for replaceAll("<[^>]*>","") but not for replaceAll("&.*?;","").

    Can you suggest why?

    1. Hi Daniel,

      Yes, there are many more hidden surprises like the one you mentioned 🙂
      That is why regex is never a good idea to parse the HTML. The Jsoup library has never failed for me.
