Remove HTML tags from String in Java example

This example shows how to remove HTML tags from a String in Java using a regular expression. It also shows how to remove HTML tags from a String using the Jsoup library.

How to remove HTML tags from String in Java?

You can remove simple HTML tags from a String using a regular expression. HTML tags are enclosed in “<” and “>” brackets, so we are going to use the "<[^>]*>" pattern to match each bracket pair together with everything between them, and replace every match with the empty string.
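
A minimal sketch of this approach, using only String.replaceAll from the standard library. The class name TagStripper, the method name stripTags and the sample input are illustrative assumptions, not the original listing.

```java
// Hypothetical helper; the names and the sample input are illustrative only.
class TagStripper {

    // Removes every "<...>" sequence from the input.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>Some <b>bold</b> text &amp; more</p>";
        System.out.println(stripTags(html));
        // prints: TitleSome bold text &amp; more
    }
}
```

Note that the entity “&amp;” is still present in the result, because the pattern only targets the brackets and what lies between them.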

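The regular expression above removes the tags, but it does not handle HTML entities like “&nbsp;” and “&amp;”. Depending on the requirement, you can either replace them with the equivalent characters one by one, or remove them using the "&.*?;" pattern, where “&” matches the ampersand, “.*?” reluctantly matches any characters that follow, and “;” matches the terminating semicolon.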
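
The two replacements can be chained, as in the sketch below. The class name EntityStripper, the method name and the sample input are assumptions made for illustration.

```java
// Hypothetical helper; the names and the sample input are illustrative only.
class EntityStripper {

    // Removes tags first, then anything that looks like "&...;".
    static String stripTagsAndEntities(String html) {
        return html.replaceAll("<[^>]*>", "").replaceAll("&.*?;", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTagsAndEntities("<p>Fish&nbsp;&amp;&nbsp;chips</p>"));
        // prints: Fishchips
    }
}
```

The "&.*?;" pattern is broad: in text such as "Tom & Jerry; done" it also removes "& Jerry;". A narrower pattern like "&#?\\w+;" (a suggestion, not part of the original article) limits the match to named and numeric entities.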

How to remove a specific HTML tag from a String in Java?

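What if you want to remove only a specific HTML tag from a String? You can do that using a regular expression too. Suppose you want to remove the “a” tag from the String “<a href='#'>HTML<b>Bold</b>link</a>”. You can use the "<[/]?a[^>]*>" pattern, where “<” matches the opening bracket, “[/]?” matches an optional slash (present in the closing tag), “a” matches the tag name, “[^>]*” matches everything up to the closing bracket, and “>” matches the closing bracket itself.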
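
A short sketch of this pattern applied to the String from the paragraph above. The class and method names are hypothetical.

```java
// Hypothetical helper; the names are illustrative only.
class AnchorStripper {

    // Removes opening and closing "a" tags but keeps their content.
    static String stripAnchorTags(String html) {
        return html.replaceAll("<[/]?a[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAnchorTags("<a href='#'>HTML<b>Bold</b>link</a>"));
        // prints: HTML<b>Bold</b>link
    }
}
```

One caveat: the pattern also matches any other tag whose name starts with “a”, such as “<abbr>”. Adding a word boundary after the tag name, as in "<[/]?a\\b[^>]*>", is one possible way to avoid that.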

Let’s run some more tests to make sure that the pattern works.
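
The sketch below runs the same pattern against a few variations of the tag. The inputs are assumptions chosen for illustration, not the original test data.

```java
// Hypothetical test harness; the inputs are illustrative only.
class AnchorPatternTests {

    static String stripAnchorTags(String html) {
        return html.replaceAll("<[/]?a[^>]*>", "");
    }

    public static void main(String[] args) {
        String[] inputs = {
            "<a href='#'>link</a>",      // lower case, no extra spaces
            "<A HREF='#'>link</A>",      // upper-case tag name
            "< a href='#'>link< / a>"    // spaces inside the brackets
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + stripAnchorTags(input));
        }
    }
}
```

Only the first input is reduced to "link". The other two come back unchanged.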

HTML is not a strict language. Our pattern fails when the tag name is written in upper case or when the tag contains extra spaces. Let’s modify the pattern to "(?i)<[\\s]*[/]?[\\s]*a[^>]*>" to cover these scenarios, where “(?i)” makes the match case-insensitive and each “[\\s]*” allows optional whitespace before and after the slash.
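
A sketch of the modified pattern, with the same kind of hypothetical inputs: one in upper case and one with spaces inside the brackets. The class and method names are assumptions.

```java
// Hypothetical helper; the names and inputs are illustrative only.
class LenientAnchorStripper {

    // (?i) ignores case; [\\s]* tolerates whitespace around the optional slash.
    static String stripAnchorTags(String html) {
        return html.replaceAll("(?i)<[\\s]*[/]?[\\s]*a[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAnchorTags("<A HREF='#'>link</A>"));    // prints: link
        System.out.println(stripAnchorTags("< a href='#'>link< / a>")); // prints: link
    }
}
```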

Is it recommended to use regular expression to remove HTML tags from String?

The short answer is no. So far we have only seen the happy scenarios. Consider an HTML string that is not well-formed.
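
As an illustration, the sketch below uses a hypothetical input (not the article's original one) in which a bare “<” appears in the text instead of “&lt;”.

```java
// Hypothetical demo; the input string is an assumption made for illustration.
class MalformedHtmlDemo {

    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The bare "<" before 100 should have been written as "&lt;".
        String html = "<p>Price < 100 and important text <b>bold</b></p>";
        System.out.println(stripTags(html));
        // prints: Price bold
    }
}
```

The pattern treats everything from the bare “<” up to the next “>” as a single tag, so the words in between disappear along with it.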

Our important text was removed by the regular expression because the HTML was not well-formed. Malformed HTML like this is very common, and a regular expression cannot handle it reliably. Consider another example.
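
One further case, again with a hypothetical input rather than the article's own: a “>” character inside a quoted attribute value. The pattern has no notion of quoting, so it ends the tag too early.

```java
// Hypothetical demo; the input string is an assumption made for illustration.
class AttributeValueDemo {

    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The ">" inside the title attribute ends the match prematurely.
        String html = "<a title=\"a > b\" href=\"#\">link</a>";
        System.out.println(stripTags(html));
    }
}
```

The match stops at the first “>”, so the rest of the opening tag (b" href="#">) is left in the output in front of the link text.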

What should I use to remove HTML tags?

If you are removing a tag or two from the String and you are absolutely certain that the input HTML is well-formed, using a regular expression is OK. In all other scenarios, using an HTML parser is the way to go.

One such parser is Jsoup. It parses the HTML into a document, from which you can extract the plain text without any tags.
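
A minimal sketch using Jsoup.parse and Element.text, assuming the Jsoup library is on the classpath. The class name, method name and sample input are illustrative assumptions.

```java
import org.jsoup.Jsoup;

// Hypothetical helper; requires the third-party Jsoup library.
class JsoupTextExtractor {

    // Parses the markup and returns only the text, with entities decoded.
    static String toPlainText(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>Tom &amp; <b>Jerry</b></p>"));
        // prints: Tom & Jerry
    }
}
```

Because the parser decodes entities while building the document, no separate entity pattern is needed afterwards.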

About the author

rahimv

rahimv has over 15 years of experience in designing and developing Java applications. His areas of expertise are J2EE and eCommerce.

  • SANJAY GUPTA

    With this program it is possible to remove the tags from an HTML string, but when I extract HTML from a web page using Jsoup, I am not able to apply the pattern "&.*?;". Please see the following program and tell me how I can fulfill my requirement.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    class JavaSoup1
    {
        public static void main(String[] args) throws Exception
        {
            Document doc = Jsoup.connect("http://www.iitbhu.ac.in/").get();

            String content = doc.text();

            content = content.replaceAll("<[^>]*>", "");

            content = content.replaceAll("&.*?;", "");

            System.out.println(content);
        }
    }

    Here all HTML tags are replaced by an empty string, but the "&.*?;" pattern does not work.

    Please tell me how to resolve this problem.

    SANJAY KUMAR GUPTA

    IIT BHU VARANASI

    • rahimv

      Hi,
      Can you please explain the exact problem you are facing? Is it throwing an exception, or does it not replace HTML entities like &amp; in your source? The pattern "&.*?;" is meant to remove HTML entities from the source. If the source does not contain any, it simply does nothing and should not throw an exception.

      • SANJAY GUPTA

        I am not asking about an exception.

        See the program I posted above.

        In this program the HTML tags are removed, but “&#1276” is not removed by this statement: content = content.replaceAll("&.*?;", "");

        • SANJAY GUPTA

          In this program the HTML tags are removed, but “&#….” is not removed by this statement: content = content.replaceAll("&.*?;", "");

          • rahimv

            I don’t see a reason why it should not replace the entities. It does, at least in my tests. Maybe your input is wrong?

            String str = "<b>&nbsp; demo string &#2346; </b>";
            System.out.println(str.replaceAll("<[^>]*>", "").replaceAll("&.*?;", ""));

            Output
            demo string

      • SANJAY GUPTA

        The program is not throwing an exception, but it is not removing text like “&#….” from the source code.

  • SANJAY GUPTA

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    class JavaSoup1
    {
        public static void main(String[] args) throws Exception
        {
            Document doc = Jsoup.connect("http://www.iitbhu.ac.in/").get();

            String content = doc.html();

            content = content.replaceAll("<[^>]*>", "");

            content = content.replaceAll("&.*?;", "");

            System.out.println(content);
        }
    }

    This program works for replaceAll("<[^>]*>", "") but not for replaceAll("&.*?;", "").

    Can you suggest why?
