Java Regular Expression Tutorial with Examples (RegEx)

This Java regular expression (regex) tutorial with examples will help you understand how to use regular expressions in Java. Java regular expressions, often called Java regex, are a powerful way to find, match, and extract data from a character sequence.

There are two main classes in the java.util.regex package: Pattern and Matcher. A Pattern represents a compiled regular expression, while a Matcher is the engine that matches a character sequence against the pattern. If the regular expression has a syntax error, a PatternSyntaxException is thrown.

I have divided this Java regular expression tutorial into three parts: understanding the Pattern class, understanding the Matcher class, and creating the actual regular expression patterns to find and extract data.

1. The Pattern class

The Pattern represents a compiled regular expression. The string containing regular expression must be compiled to the instance of the Pattern class. Once we have the instance of the Pattern class, we can then create a Matcher object to match the character sequence against this pattern.

How to compile a regular expression pattern using the compile method?

The static compile method of the Pattern class compiles the given string regular expression into a pattern.
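
As a minimal sketch of the compile method, the class below compiles a regex for one or more digits. The class name CompileDemo and the sample regex are my own illustrative choices.

```java
import java.util.regex.Pattern;

class CompileDemo {
    // Compiles the regex \d+ (one or more digits).
    // In Java source the backslash is doubled inside the string literal.
    static Pattern compileDigits() {
        return Pattern.compile("\\d+");
    }

    public static void main(String[] args) {
        Pattern pattern = compileDigits();
        System.out.println(pattern.pattern()); // prints \d+
    }
}
```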

The above code compiles a regular expression into a pattern.

There is an overloaded compile method that accepts special flags along with the regex.

Here are some of the important static int fields defined by the Pattern class that can be passed as flags.

Pattern.UNIX_LINES This flag enables Unix line mode where only ‘\n’ is recognized as a line terminator.
Pattern.CASE_INSENSITIVE The default pattern matching is case sensitive. This flag enables case insensitive matching.
Pattern.COMMENTS This flag allows white spaces and comments in patterns.
Pattern.MULTILINE This flag enables the multiline mode, in which ^ and $ match at the start and end of each line instead of only at the start and end of the entire input.
Pattern.LITERAL This flag enables literal parsing of the pattern. Any metacharacters or escape sequences are treated as literal characters and lose their special meanings.
Pattern.DOTALL This flag enables the “dotall” mode where the “.” (dot) character matches any character including the line terminator. By default, the line terminator is not matched.

The Pattern flags method returns the flags that were used to create this pattern object.

How to get the regex pattern used to create the Pattern object?

The Pattern pattern method returns the string containing the regular expression which is used to create this pattern object. We can also use the toString method to get the string representing the regular expression which was used to create this pattern object.
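
Here is a small sketch that ties the flag-related methods together. It assumes an illustrative regex “^hello” and a helper named hasFlag, neither of which comes from the original text. Multiple flags are combined with the bitwise OR operator.

```java
import java.util.regex.Pattern;

class PatternFlagsDemo {
    // Several flags are combined with the bitwise OR operator.
    static Pattern compileWithFlags() {
        return Pattern.compile("^hello", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    }

    // Checks a single flag bit in the value returned by flags().
    static boolean hasFlag(Pattern pattern, int flag) {
        return (pattern.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        Pattern pattern = compileWithFlags();
        System.out.println(hasFlag(pattern, Pattern.CASE_INSENSITIVE)); // true
        System.out.println(hasFlag(pattern, Pattern.DOTALL));           // false
        System.out.println(pattern.pattern());                          // ^hello
        System.out.println(pattern.toString());                         // ^hello
        // MULTILINE lets ^ match after the line break; CASE_INSENSITIVE lets "HELLO" match.
        System.out.println(pattern.matcher("bye\nHELLO").find());       // true
    }
}
```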

How to create a matcher from the pattern?

The Pattern matcher method creates and returns a Matcher object that will match the specified character sequence against this pattern.
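
A minimal sketch of creating a matcher, assuming an illustrative digits-only pattern and a helper named isAllDigits:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherCreationDemo {
    // Creates a Matcher for the input and checks whether the whole input is digits.
    static boolean isAllDigits(CharSequence input) {
        Pattern pattern = Pattern.compile("\\d+");
        Matcher matcher = pattern.matcher(input);
        return matcher.matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("12345")); // true
        System.out.println(isAllDigits("12a45")); // false
    }
}
```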

How to match a pattern in a literal way?

Consider matching the input string “|” against the pattern “|”, for example by calling Pattern.matches("|", "|"). The call returns false.

The string “|” did not match the pattern “|”, even though one might expect it to. This is because several characters have special meanings in regular expressions. These characters are called metacharacters. The pipe character (|) is a metacharacter that means OR, so the pattern “|” is read as “nothing OR nothing” and can only match an empty string.

If you want to match the pattern literally you can use the quote method of the Pattern class. It returns a literal pattern string from the specified regular expression string.
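
The difference between the two approaches can be sketched as follows. The helper names matchesAsRegex and matchesLiterally are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Treats the first argument as a regular expression.
    static boolean matchesAsRegex(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Quotes the first argument so every character is matched literally.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        System.out.println(matchesAsRegex("|", "|"));   // false
        System.out.println(matchesLiterally("|", "|")); // true
        System.out.println(Pattern.quote("|"));         // \Q|\E
    }
}
```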

How to split a string using the split method of the Pattern class?

The split method of the Pattern class splits the given character sequence around the matches of this pattern. It returns an array of strings containing the parts.
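
As a rough sketch, the example below splits a comma-separated string while ignoring optional whitespace around each comma. The regex and the input are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Splits around commas, swallowing any whitespace next to each comma.
    static String[] splitOnCommas(String input) {
        return Pattern.compile("\\s*,\\s*").split(input);
    }

    public static void main(String[] args) {
        String[] parts = splitOnCommas("red, green ,blue");
        System.out.println(Arrays.toString(parts)); // [red, green, blue]
    }
}
```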

Please visit the full how to split a string example to know more.

2. The Matcher class

The Matcher class is a regex engine that matches the pattern with the given input string sequence. A matcher object can be created using the matcher method of the Pattern class. Once created, the matcher object can be used to match the input string with the pattern.

The Matcher class provides 3 types of matching operations through matches, find, and lookingAt methods.

How to match the entire string with a pattern using the matches method?

The Matcher matches method returns true if and only if the entire input sequence matches the pattern.

Calling the matches method of the Matcher class gives the same result as calling the matches method of the String class or the static matches method of the Pattern class with the same regular expression and input.

Please note that the matches method matches the pattern against the entire input sequence. It returns false for the partial matches or no matches.
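
The following sketch shows a full match, a partial match, and the two equivalent shortcuts. The helper name viaMatcher and the inputs are illustrative assumptions.

```java
import java.util.regex.Pattern;

class FullMatchDemo {
    // Returns true only when the whole input matches the regex.
    static boolean viaMatcher(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaMatcher("\\d+", "2024"));      // true
        System.out.println(viaMatcher("\\d+", "year 2024")); // false, only part of the input matches
        // Equivalent shortcuts on String and Pattern:
        System.out.println("2024".matches("\\d+"));          // true
        System.out.println(Pattern.matches("\\d+", "2024")); // true
    }
}
```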

How to match a partial input string with a pattern using the find method?

The Matcher find method finds the next subsequence or substring within the input string that matches the pattern.

The find method starts matching the pattern at the beginning of the input string/region or at the first character that was not matched if the previous call to this method was successful.

If you want to find the subsequence starting from the specific position of the input string, you can use the overloaded find method.

It first resets the matcher and then behaves like the find method without parameters, except that it starts matching from the specified index instead of the beginning of the input string/region.

Unlike the matches method, the find method returns true for the partial matches.

If the find method is able to match the pattern with the input sequence, we can get more information regarding the match using below given methods.

The Matcher start method returns the starting index of the previous match.

The Matcher end method returns the offset after the last character matched.

The Matcher group method returns the substring that matched in the previous match.

This overloaded group method accepts the group number and returns the substring captured by the specified group during the last match. Group 0 represents the entire pattern so calling group(0) is the same as calling the group method without any parameters.

Let’s now see how to use these methods to extract the relevant information from the match operation.
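
A minimal sketch of a find loop that records the matched text with its start and end offsets. The helper names and the input “a12b345c” are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Collects every match as "text@start-end".
    static List<String> describeMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return result;
    }

    // Uses find(int) to start searching at a given index; returns null when nothing matches.
    static String firstMatchFrom(String regex, String input, int from) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find(from) ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("\\d+", "a12b345c"));   // [12@1-3, 345@4-7]
        System.out.println(firstMatchFrom("\\d+", "a12b345c", 3)); // 345
    }
}
```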

As we can see from the output, the find method starts matching at the beginning of the input string initially. When we call it again after the successful match, it starts matching from the end of the previous match to find more matches.

There are also overloaded start, end, and group methods that accept a group number. I will explain these methods later in this tutorial under the capturing group section.

How to match the pattern with the input string using the lookingAt method?

The Matcher lookingAt method matches the pattern with the input sequence/region at the beginning.

The lookingAt method starts matching at the beginning of the input/region but does not require the entire input/region to be matched with the pattern.

Differences between the matches, find and lookingAt methods

The matches and lookingAt methods always start matching at the beginning of the input or region while the find method does not.

The lookingAt and find methods match a subsequence or substring within the input or region, while the matches method requires the entire input to be matched with the pattern.

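The find method can be called repeatedly to match multiple substrings or subsequences of the input, with each call continuing after the previous match. The matches and lookingAt methods attempt a single match anchored at the start of the input or region.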
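
These differences can be sketched by running all three methods with the same pattern on different inputs. The helper name modes and the sample inputs are illustrative assumptions.

```java
import java.util.regex.Pattern;

class MatchModesDemo {
    // Runs matches, lookingAt and find (each on a fresh matcher) and reports the results.
    static String modes(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);
        return "matches=" + pattern.matcher(input).matches()
             + " lookingAt=" + pattern.matcher(input).lookingAt()
             + " find=" + pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(modes("dog", "dog"));       // matches=true lookingAt=true find=true
        System.out.println(modes("dog", "dog house")); // matches=false lookingAt=true find=true
        System.out.println(modes("dog", "hot dog"));   // matches=false lookingAt=false find=true
    }
}
```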

How to reuse the matcher using the reset method?

The Matcher reset method resets this matcher object.

It discards all the state information and resets the position to 0. There is also an overloaded reset method that accepts a new character sequence.

This method resets this matcher object with the new input sequence. We can use it when we want to match multiple input sequences with the same compiled pattern as given below.
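
As a sketch of this reuse, the example below creates one matcher and resets it with each new input. The helper name matchAll and the inputs are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    // Reuses a single Matcher for several inputs instead of creating one per input.
    static List<Boolean> matchAll(String regex, String... inputs) {
        Matcher matcher = Pattern.compile(regex).matcher("");
        List<Boolean> results = new ArrayList<>();
        for (String input : inputs) {
            matcher.reset(input); // discard old state and switch to the new input
            results.add(matcher.matches());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(matchAll("\\d+", "123", "abc", "42")); // [true, false, true]
    }
}
```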

How to create a search region of the input string to match using the region method?

The Matcher region method creates a region within the input sequence.

In other words, the region method sets the boundary for the matcher to find the matches within the input sequence. Calling this method will reset this matcher object and then set the region.
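
A minimal sketch of a region, assuming the illustrative input “dog cat dog cat dog” and the region from index 4 (inclusive) to 15 (exclusive). The helper name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionDemo {
    // Returns the start index of every match found inside [start, end).
    static List<Integer> matchStartsInRegion(String regex, String input, int start, int end) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        matcher.region(start, end);
        List<Integer> starts = new ArrayList<>();
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "dog cat dog cat dog";
        // Only the "dog" at index 8 lies inside the region; those at 0 and 16 are outside.
        System.out.println(matchStartsInRegion("dog", input, 4, 15)); // [8]

        Matcher matcher = Pattern.compile("dog").matcher(input);
        matcher.region(4, 15);
        System.out.println(matcher.regionStart() + ".." + matcher.regionEnd()); // 4..15
    }
}
```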

As we can see from the output, after setting the region, the find method found only the match inside the specified region of the input string. This is useful when we want to find a match within a specific area of the input sequence.

We can get the start and end index of the region using the regionStart and regionEnd methods.

The regionStart method returns the start index of this matcher object’s region.

The regionEnd method returns the end index of this matcher object’s region.

How to change the pattern of the matcher object using the usePattern method?

The Matcher usePattern method changes the pattern that this matcher object uses to find matches.

Please note that this matcher’s position in the input sequence is maintained as-is. In other words, the new pattern finds the matches after the current position of this matcher object. To understand this, let’s see an example.
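
Here is a sketch of that behavior. It assumes the input “dog cat dog” and the patterns “cat” and “dog”, chosen to be consistent with the description but not necessarily identical to the original example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UsePatternDemo {
    // Finds "cat" first, switches the pattern to "dog", then returns
    // the start index of the next match (or -1 if there is none).
    static int startAfterSwitch(String input) {
        Matcher matcher = Pattern.compile("cat").matcher(input);
        if (!matcher.find()) {
            return -1;
        }
        matcher.usePattern(Pattern.compile("dog")); // position in the input is kept
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        // "cat" is found at index 4, so the search for "dog" continues from index 7.
        System.out.println(startAfterSwitch("dog cat dog")); // 8
    }
}
```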

As we can see from the output, changing the pattern of the matcher object did not match the first substring “dog” that starts at index 0 but it searched the new pattern from the current position onwards.

Apart from these methods, the Matcher class also provides the replaceAll and replaceFirst methods that I have covered in the Java String tutorial.

3. Understanding the Regular Expression Patterns

In the first two parts, I have shown you how to create a pattern and obtain a matcher object to do various types of matching. In this part, we will dive deep into how to create different types of patterns to find and extract the data from the input string.

How to match a string literal?

The most basic type of matching is matching a string with another string. For example, the pattern “hello” matches the input string “hello”.

Remember that the matches method matches the whole string against the pattern, so the pattern “hello” will not match the input string “hello world”.

What are the metacharacters?

The metacharacters are the characters with a special meaning in the regular expressions. Below given are the metacharacters supported by Java regex API.

^ Matches the start of the input string
$ Matches the end of the input string
. Matches any character except for the new line character
* Matches the previous character or expression zero or more times
+ Matches the previous character or expression one or more times
? Matches the previous character or expression zero or one time
() Match grouping
[] Defines a character class
\ Used to escape metacharacters to make it a literal character for matching
| OR operator
{} The number of times the previous pattern needs to be matched.
(?=RegEx) Positive lookahead matching
(?!RegEx) Negative lookahead matching
(?<=RegEx) Positive lookbehind matching
(?<!RegEx) Negative lookbehind matching
\t Matches the tab character ‘\u0009’
\n Matches the new line character ‘\u000A’
\r Matches the carriage return character ‘\u000D’
\f Matches the form feed character ‘\u000C’
\a Matches the alert character ‘\u0007’
\e Matches the escape character ‘\u001B’
\cx Matches the control character corresponding to x
\\ Matches the backslash character
\Q Starts a quoted (literal) section
\E Ends the quoted section started by \Q

How to match at the start and end of the input string?

The ^ metacharacter matches at the start of the input string while the $ metacharacter matches at the end of the input string as given below.
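
A small sketch of both anchors, using hypothetical helper names and sample inputs. The find method is used so that only the anchored end of the input is checked.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // ^ anchors the match to the start of the input.
    static boolean startsWithHello(String input) {
        return Pattern.compile("^Hello").matcher(input).find();
    }

    // $ anchors the match to the end of the input.
    static boolean endsWithWorld(String input) {
        return Pattern.compile("World$").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithHello("Hello World"));      // true
        System.out.println(endsWithWorld("Hello World"));        // true
        System.out.println(startsWithHello("Say Hello World!")); // false
        System.out.println(endsWithWorld("Say Hello World!"));   // false
    }
}
```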

How to match any number of characters?

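There are four metacharacters commonly used to match an arbitrary number of characters: “.”, “?”, “+”, and “*”. The dot metacharacter matches any single character in the input string (except line terminators by default), while the other three are quantifiers that control how many times the preceding character or expression may repeat.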

The “?” metacharacter matches the previous character or expression zero or one time.

The “+” metacharacter matches the previous character or expression one or more times.

The “*” metacharacter matches the previous character or expression zero or more times.

We can also combine the “.” with “?”, “+” and “*” metacharacters to match any character any number of times as given below.
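
The four metacharacters can be tried out with a sketch like this one. The patterns and inputs are illustrative assumptions, and the helper simply wraps Pattern.matches.

```java
import java.util.regex.Pattern;

class RepetitionDemo {
    // Whole-input match, so each result depends only on the quantifier being shown.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("colou?r", "color"));   // true, "u" appears zero times
        System.out.println(matches("colou?r", "colouur")); // false, "u" appears twice
        System.out.println(matches("ab+c", "ac"));         // false, "+" needs at least one "b"
        System.out.println(matches("ab+c", "abbbc"));      // true
        System.out.println(matches("ab*c", "ac"));         // true, "*" allows zero "b"
        System.out.println(matches("a.c", "abc"));         // true, "." matches exactly one character
        System.out.println(matches("a.c", "ac"));          // false
        System.out.println(matches("a.*c", "a123c"));      // true, ".*" matches "123"
    }
}
```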

How to match using the OR (|) operator?

The “|” (pipe) metacharacter is used to denote the OR condition in the regular expressions.
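
A minimal sketch of the OR operator, with illustrative patterns. Wrapping the alternatives in a group limits the OR to part of the pattern.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("cat|dog", "dog"));   // true
        System.out.println(matches("cat|dog", "bird"));  // false
        // The group restricts the alternatives to the single letter between "gr" and "y".
        System.out.println(matches("gr(a|e)y", "grey")); // true
        System.out.println(matches("gr(a|e)y", "gray")); // true
    }
}
```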

How to escape metacharacters in the regular expression?

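A backslash (\) placed before a metacharacter escapes it so that it is matched as a regular character. Because the backslash is itself an escape character in Java string literals, it must be written as a double backslash (\\) in Java source code.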
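
The effect of escaping can be sketched as follows, using illustrative patterns. Without the escape, “+” and “.” keep their special meanings.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped: "1+" means one or more "1" characters, so the literal plus sign does not match.
        System.out.println(matches("1+1=2", "1+1=2"));   // false
        // Escaped: "\\+" in Java source is the regex \+ which matches a literal plus sign.
        System.out.println(matches("1\\+1=2", "1+1=2")); // true
        System.out.println(matches("a\\.b", "a.b"));     // true
        System.out.println(matches("a\\.b", "axb"));     // false
    }
}
```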

What are the character classes?

We can define a character class using the square brackets “[” and “]”. We can use any of the below given syntaxes to define a character class as per our requirement.

[123] Matches 1, 2, or 3
[^123] Matches any character except 1, 2, or 3 (not 1, 2, and 3)
[0-9] Matches any digit in the range of 0 to 9.
[a-z] Matches any character between a to z.
[a-zA-Z] Matches any character from a to z or A to Z.
[a-eF-H] Matches any character from a to e or F to H.
[1-3[5-8]] Matches any digit from 1 to 3 or 5 to 8 (union). Same as [1-35-8].
[a-e&&[bc]] Matches any character between a to e AND b or c so effectively matches only b or c character.
[a-e&&[^bc]] Matches any character between a to e except for b and c.
[a-z&&[^b-e]] Matches any character between a to z except for characters between b to e.
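
To see what each of these character classes accepts, the sketch below keeps only the characters of a sample string that match a given class. The helper name keepMatching and the sample input are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Concatenates every character of the input that the character class matches.
    static String keepMatching(String charClass, String input) {
        StringBuilder kept = new StringBuilder();
        Matcher matcher = Pattern.compile(charClass).matcher(input);
        while (matcher.find()) {
            kept.append(matcher.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String input = "abcdefXYZ0123456789";
        System.out.println(keepMatching("[123]", input));         // 123
        System.out.println(keepMatching("[^123]", input));        // abcdefXYZ0456789
        System.out.println(keepMatching("[1-3[5-8]]", input));    // 1235678
        System.out.println(keepMatching("[a-e&&[bc]]", input));   // bc
        System.out.println(keepMatching("[a-e&&[^bc]]", input));  // ade
        System.out.println(keepMatching("[a-z&&[^b-e]]", input)); // af
    }
}
```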

Apart from the user-defined character classes, the regex API provides several predefined character classes for our convenience.

. Matches any character (does not match the line terminators if DOTALL mode is not enabled)
\d Matches any digit. Same as [0-9]
\D Matches any non-digit character. Same as [^0-9].
\s Matches any whitespace character. Same as [ \r\t\n\x0B\f]
\S Matches any non-whitespace character. Same as [^\s]
\w Matches any word character. Same as [a-zA-Z_0-9] i.e. any character between a to z, A to Z, an underscore (_), or 0 to 9
\W Matches any non-word character. Same as [^\w]
\h Matches horizontal whitespace character, for example, space or a tab character. Same as [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000].
\H Matches non-horizontal whitespace character. Same as [^\h]
\v Matches vertical whitespace character, for example, a new line character. Same as [\n\x0B\f\r\x85\u2028\u2029]
\V Matches non-vertical whitespace character. Same as [^\v]

We can use them in the same way we used the user-defined character classes above.
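
As a quick sketch, the example below removes every character matched by a predefined class from a sample string, which shows what each class covers. The helper name strip and the input are illustrative assumptions.

```java
class PredefinedClassDemo {
    // Removes every character that the given regex matches.
    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String input = "Order #42:\t3 items";
        System.out.println(strip("\\D", input)); // 423 (only digits remain)
        System.out.println(strip("\\s", input)); // Order#42:3items (whitespace removed)
        System.out.println(strip("\\W", input)); // Order423items (only word characters remain)
    }
}
```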

Apart from these character classes, Java supports below given POSIX character classes (matches only US ASCII).

\p{Lower} Matches a lower case alphabetic character. Same as [a-z]
\p{Upper} Matches an upper case alphabetic character. Same as [A-Z]
\p{Alpha} Matches all alphabetic characters. Same as [\p{Upper}\p{Lower}]
\p{Digit} Matches a digit. Same as [0-9]
\p{Alnum} Matches an alphanumeric character. Same as [\p{Alpha}\p{Digit}]
\p{Punct} Matches a punctuation mark. One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} Matches a visible character. Same as [\p{Alnum}\p{Punct}]
\p{Print} Matches a printable character. Same as [\p{Graph}\x20]
\p{Cntrl} Matches a control character. Same as [\x00-\x1F\x7F]
\p{XDigit} Matches a hexadecimal digit. Same as [0-9a-fA-F]
\p{Blank} Matches a space or a tab. Same as [ \t]
\p{Space} Matches a whitespace character. Same as [ \t\n\x0B\f\r]
\p{ASCII} Matches all ASCII characters. Same as [\x00-\x7F]

Below given character classes are also supported by Java regular expressions.

\p{javaLowerCase} Same as java.lang.Character.isLowerCase()
\p{javaUpperCase} Same as java.lang.Character.isUpperCase()
\p{javaWhitespace} Same as java.lang.Character.isWhitespace()
\p{javaMirrored} Same as java.lang.Character.isMirrored()

How to control the number of times a pattern should match?

We can use the curly brackets {} to control how many times a pattern should match with a given input string.

{x} The previous character or a pattern should match exactly x number of times.
{x,} The previous pattern should match at least x number of times or minimum x number of times.
{x,y} The previous character or a pattern should match a minimum of x number of times and the maximum of y number of times.

Tip: You cannot skip the first part before the comma (,) while mentioning the range. If you do, PatternSyntaxException will be thrown. For example, {,4} will be an invalid pattern. If you just want to specify the maximum number of matches, you can specify the minimum as 0, for example {0,4}.
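
The curly bracket forms and the tip above can be sketched like this. The helper name isValidRegex is hypothetical, and the patterns are chosen only for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RepeatCountDemo {
    // Returns false when the regex cannot be compiled.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("\\d{3}", "123"));     // true, exactly 3 digits
        System.out.println(Pattern.matches("\\d{3}", "1234"));    // false
        System.out.println(Pattern.matches("\\d{2,}", "1"));      // false, at least 2 digits needed
        System.out.println(Pattern.matches("\\d{0,4}", "12345")); // false, at most 4 digits allowed
        System.out.println(Pattern.matches("\\d{0,4}", ""));      // true, zero digits is allowed
        System.out.println(isValidRegex("\\d{,4}"));              // false, the minimum is missing
    }
}
```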

Let’s take a break and test the knowledge

Now we have got a basic understanding of how the regular expression works. Let’s create a complex pattern using the knowledge we have gained until now. The example I want to cover is to validate date syntax in dd-mm-yyyy format.

The first part of the pattern is dd i.e. the day. The day of the month should be two digits and between 1 to 31. Breaking it up will give us the days from 01 to 09, 10 to 19, 20 to 29, 30, and 31.

So the pattern will be “0[1-9]|[12][0-9]|3[01]”. The whole pattern means 0 followed by any digits between 1 to 9 OR 1 or 2 followed by any digits between 0 to 9 OR 3 followed by either 0 or 1. It should cover all the days between 1 to 31.

Let’s test this part.

The second part of the pattern is the month part, which should be two digits and must be between 1 to 12. So the pattern will be “0[1-9]|1[012]” means 0 followed by any digit between 1 to 9 (to cover months between 1 to 9) OR 1 followed by 0, 1 or 2 (to cover months between 10 to 12). Let’s test it.

Now let’s see the year part. The year could be anything but it must be exactly 4 digits long which is fairly easy to convert to a pattern. The pattern to check the year will be “[0-9]{4}” and it means any digit between 0 to 9 four times.

Let’s test it.

Now let’s put them all together in one pattern to validate date syntax in the “dd-mm-yyyy” format. The whole pattern will be like “(0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[012])-([0-9]{4})”. We just clubbed all the individual parts using groups and put a “-” between them.

Let’s try out the whole pattern against the example dates.
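
A minimal sketch of such a test, using the complete pattern from above. The class name, the helper name, and the sample dates are my own assumptions.

```java
import java.util.regex.Pattern;

class DateSyntaxDemo {
    // dd-mm-yyyy: day 01-31, month 01-12, any four-digit year.
    private static final Pattern DATE =
            Pattern.compile("(0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[012])-([0-9]{4})");

    // Checks the syntax only; it does not know how many days a month has.
    static boolean hasValidSyntax(String date) {
        return DATE.matcher(date).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasValidSyntax("25-12-2020")); // true
        System.out.println(hasValidSyntax("32-01-2020")); // false, day out of range
        System.out.println(hasValidSyntax("01-13-2020")); // false, month out of range
        System.out.println(hasValidSyntax("1-1-2020"));   // false, day and month need two digits
        System.out.println(hasValidSyntax("31-02-2004")); // true, syntactically valid only
    }
}
```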

The above example is just meant to show how to create a pattern for your requirements. Please note that the regex can only validate the date syntax, not the actual date. For example, our pattern will accept “31-02-2004” as valid even though it is not a real date (because February does not have 31 days). Please visit how to validate a date example to understand the right way.

How to match the boundaries using the regular expression?

The Java regular expressions support below given boundary matchers.

\b Matches with the word boundary
\B Matches with the non-word boundary
^ In single-line mode, it matches at the start of the input. In the multiline mode, it matches at the start of each line.
\A Matches at the start of the input. Regardless of a single line or multiline mode, it matches exactly once, i.e. at the start of the input.
$ In single-line mode, it matches at the end of the input. In the multiline mode, it matches at the end of each line.
\Z Matches at the end of the input, or just before the final line terminator if present, regardless of the line mode.
\z Matches at the end of the input. Regardless of a single line or multiline mode, it matches exactly once, i.e. at the end of the input.
\G Matches at the end of the previous match

The boundary matchers are very useful to indicate the position where we want to match in the input string.

Consider searching for the pattern “dog” in an input string that contains both the word “dog” and the word “hotdog”.

The pattern “dog” not only matched with the word “dog” in the sentence but it also matched the string “dog” within the “hotdog”. If you want to match the whole word only, you can use the word boundary as given below.

The pattern “\\bdog\\b” means a word boundary, followed by characters “dog”, followed by a word boundary, so it matched with a whole word only.
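
The difference between the two patterns can be sketched by counting their matches. The input sentence and the helper name countMatches are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Counts how many times the regex is found in the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "My dog ate a hotdog";
        System.out.println(countMatches("dog", input));       // 2, also matches inside "hotdog"
        System.out.println(countMatches("\\bdog\\b", input)); // 1, whole word only
    }
}
```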

What are capturing groups and how to use them in regular expressions?

A group in a regular expression is treated as a single unit and is created using the opening and closing parentheses “(” and “)”. The part of the input string captured by the group is stored for later reference.

Groups are extremely useful when we want to extract some information from the input text.

The below given example extracts the number from the string using the pattern “(\\d+)” which means a group of one or more digits.
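
A minimal sketch of such an extraction, assuming an illustrative input string and a helper named firstNumber:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractNumberDemo {
    // Returns the first run of digits captured by group 1, or null if there is none.
    static String firstNumber(String input) {
        Matcher matcher = Pattern.compile("(\\d+)").matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("Order 66 shipped")); // 66
        System.out.println(firstNumber("no digits here"));   // null
    }
}
```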

We can have more than one group in our regex including the nested groups. The groupCount method of the Matcher class returns the total number of groups in the pattern.

The capturing groups are numbered by counting the opening parentheses from left to right. For example, the expression (x(y(z))) contains the following 3 groups in the given sequence.

  1. (x(y(z)))
  2. (y(z))
  3. (z)

There is also a special group, group 0, that represents the entire expression and is not included in the total number of groups obtained using the groupCount method. We can get the matched group by specifying the group number in the group method of the Matcher class.

It returns the input string captured by the given group number.

We can also get the starting and ending index of the individual groups using the start and end methods of the Matcher class respectively.
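
The group numbering and the group-specific start and end methods can be sketched with the nested expression (x(y(z))) matched against the input “xyz”. The helper name describeGroups and its output format are my own assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Lists every group as "number:text[start,end)"; group 0 is the whole match.
    static List<String> describeGroups(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (matcher.matches()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                groups.add(i + ":" + matcher.group(i)
                        + "[" + matcher.start(i) + "," + matcher.end(i) + ")");
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // groupCount() is 3 here; group 0 is listed in addition to those three.
        System.out.println(describeGroups("(x(y(z)))", "xyz"));
        // [0:xyz[0,3), 1:xyz[0,3), 2:yz[1,3), 3:z[2,3)]
    }
}
```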

The captured group is stored for the later reference which we can refer to using the “\” followed by the group number.
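
A short sketch of a backreference. The second pattern, which finds a doubled character, is an extra illustration that is not part of the original text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {
    // Returns the first doubled word character (for example "ll"), or null if there is none.
    static String firstDoubledChar(String input) {
        Matcher matcher = Pattern.compile("(\\w)\\1").matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // \1 must repeat exactly what group 1 captured.
        System.out.println(Pattern.matches("(ab)\\1", "abab")); // true
        System.out.println(Pattern.matches("(ab)\\1", "abba")); // false
        System.out.println(firstDoubledChar("hello"));          // ll
    }
}
```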

The above example captures the group of character “a” followed by “b” and then later it is referenced using the group number 1 in the regular expression pattern. The pattern “(ab)\\1” is equivalent to the pattern “abab”.

Additionally, you can refer to these groups while replacing the values in the string using the “$” sign followed by the group number as given below.
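
A minimal sketch of group references in a replacement string. The helper names are hypothetical, and the name-swapping example is an additional illustration of mine.

```java
class GroupReplaceDemo {
    // Appends "c" after every captured "ab".
    static String appendC(String input) {
        return input.replaceAll("(ab)", "$1c");
    }

    // Swaps two words by referring to both groups in the replacement.
    static String swapNames(String fullName) {
        return fullName.replaceAll("(\\w+) (\\w+)", "$2, $1");
    }

    public static void main(String[] args) {
        System.out.println(appendC("abab1234abab")); // abcabc1234abcabc
        System.out.println(swapNames("John Smith")); // Smith, John
    }
}
```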

The above given code finds the group of characters “a” followed by “b” in the input string “abab1234abab”. It then replaces the group using the $1 followed by character “c”. Effectively, it replaces all matches of “ab” with “abc”.

What are the quantifiers and how to use them in regular expressions?

The Java RegEx API provides three types of quantifiers namely greedy, reluctant, and possessive as given below. The default quantifier is greedy.

Greedy Reluctant Possessive Meaning
* *? *+ Matches 0 or more times
+ +? ++ Matches one or more times
? ?? ?+ Matches 0 or one time
{x} {x}? {x}+ Matches exactly x times
{x,} {x,}? {x,}+ Matches minimum x times
{x,y} {x,y}? {x,y}+ Matches minimum x times and maximum y times

At first glance, they all look the same. For example, “*”, “*?”, and “*+” all match 0 or more times. But there is a difference between them.

The greedy quantifier consumes or reads the whole input string first even before attempting to match. If the match is not found, the regex engine backtracks the input string by one character at a time to match until the match is found or there are no more characters to backtrack from. It tries to make the longest match possible.

The reluctant quantifier is exactly the opposite of the greedy quantifier. It tries the shortest match possible and starts to match at the beginning of the input string. If the match is not found, it consumes one character at a time until the match is found or it has consumed all the characters of the input string.

The possessive quantifier is similar to the greedy quantifier as it also consumes the whole input first. But unlike the greedy quantifier, the possessive quantifier does not backtrack the characters if the match is not found. It eats the entire input string and tries to match only once.

Let’s see an example of what all this means. The below given example tries to match any character zero or more times followed by 12 against an input string.

The greedy “.*” first consumes the whole input string “1234512”, but then the overall expression fails because no characters are left to match the trailing 12. So the regex engine backtracks by one character, leaving “.*” matching “123451”, but the remaining “2” still cannot match 12.

So it backtracks one more character, leaving “.*” matching “12345”, and the remaining “12” now matches the rest of the pattern. At this point, the match is found. In other words, the greedy quantifier tries the longest match first.

Now let’s see what happens when we use the pattern with reluctant *? to match against the same input string.

In this case, the engine starts at the beginning of the input string and lets .*? match as few characters as possible, which is zero characters. The 12 that follows in the pattern matches immediately at index 0, so the first match is just “12”.

The second match starts at index 2, where the first one ended. Here .*? again starts with zero characters, but “34” is not 12, so it consumes one character at a time (“3”, “34”, “345”) until the trailing 12 matches. The second match is therefore “34512”. So, basically, it starts matching at the start of the string and tries to match as few characters as possible.

Now let’s try the possessive quantifier .*+ with the same input string and pattern.

It could not find any match. The possessive quantifier first consumes the whole string to match against .*+ and tries to match against the overall pattern. Since there are no more characters left in the input string to match with the ending 12, the match fails.

Unlike the greedy quantifier, the possessive quantifier does not backtrack the characters to match, so it tries to match only once with the entire input string.

You may ask why do we have the possessive quantifier then? Well, it offers better performance as compared to greedy quantifiers in the cases where we need to match against the entire input. Since it does not backtrack, fewer matches need to be done.
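
The three behaviors described above can be compared side by side with a sketch like this one, which runs each variant against the input “1234512”. The helper name findAll is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects the text of every match found in the input.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "1234512";
        System.out.println(findAll(".*12", input));  // [1234512]    greedy: longest match
        System.out.println(findAll(".*?12", input)); // [12, 34512]  reluctant: shortest matches
        System.out.println(findAll(".*+12", input)); // []           possessive: never gives characters back
    }
}
```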

What are the negative and positive lookahead and lookbehind expressions?

The lookahead and lookbehind expressions are collectively referred to as lookaround expressions. As the names suggest, they look around left or right from the current position. The lookaround expression matches the character sequence without actually including them in the match. It does not consume the input sequence, it just tests whether the given pattern can match or not.

For example, if we want to search for a digit “9” followed by the digit “1”, we could use something like this using the character class.

We wanted to match the digit “9” that is followed by “1”. But what we got is “91”, not “9”. This is where the lookaround expression comes in handy. It checks whether the input string has a digit “9” that is followed by “1” without consuming the digit “1”.

Positive lookahead expression

The positive lookahead expression is used when we want to find something that is immediately followed by something else. It is created using the "(?=RegEx)" syntax and is successful only when the given regex matches. I will use the same example given above to find a digit “9” that is followed by “1”.

So now we got only 9. It just tested that there is a 9 at index 0 that is followed by the digit “1” at index 1 but did not consume the digit “1” in the match.

Negative lookahead expression

It is similar to the positive lookahead expression but it is used when we want something that is not immediately followed by something else. The negative lookahead expression is created using the "(?!RegEx)" syntax. This expression is successful only when the given regex fails to match.

For example, the below given expression finds the digit “9” that is not followed by digit “1”.

The first “9” at index 0 did not match because it is followed by digit “1” while the second “9” at index 2 of the input string did as it is not followed by the digit “1” but “2”.
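
Both lookahead forms can be sketched against the input “9192”, which I have assumed because it is consistent with the indexes mentioned above. The plain pattern “91” is included for comparison, and the helper name findAll is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Collects every match as "text@startIndex".
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group() + "@" + matcher.start());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "9192";
        System.out.println(findAll("91", input));     // [91@0]  the "1" is part of the match
        System.out.println(findAll("9(?=1)", input)); // [9@0]   "9" followed by "1", "1" not consumed
        System.out.println(findAll("9(?!1)", input)); // [9@2]   "9" not followed by "1"
    }
}
```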

Positive lookbehind expression

The positive lookbehind expression looks in a backward direction. It is used when we want something that is preceded by something else or in other words to check if something else comes before something in the input string.

The positive lookbehind expression is created using the “(?<=RegEx)” syntax.

The below given example tries to match a digit “1” that is preceded by a digit “9” in the input string (or in other words, we want to find a digit “1” having “9” as the previous character). Keep in mind that we do not want “91” as a match, we just want “1” that is preceded by a digit “9”.

The digit “1” at the index 1 is matched because the previous digit is “9” at index 0. However, the digit “1” at index 2 did not match because it did not have the digit “9” before it.

Negative lookbehind expression

The negative lookbehind expression is similar to the positive lookbehind expression but it is used to check whether something is not preceded by something else. It is created using the "(?<!RegEx)" syntax.

The below given example tries to match a digit “1” that is not preceded by the digit “9” (in other words a digit “1” whose previous digit is not “9”).

Here the first “1” at index 1 did not match because its previous digit was “9”. However, the digit “1” at index 2 matched because the previous digit of it was not “9”.
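
Both lookbehind forms can be sketched in the same way. The input “911” is an assumption chosen to be consistent with the indexes described above, and the helper name findAll is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Collects every match as "text@startIndex".
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group() + "@" + matcher.start());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "911";
        System.out.println(findAll("(?<=9)1", input)); // [1@1]  "1" preceded by "9"
        System.out.println(findAll("(?<!9)1", input)); // [1@2]  "1" not preceded by "9"
    }
}
```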
