Java split String by words example

This Java example shows how to split a String into words, that is, how to break a sentence into its individual words using the split method of the String class.

How to split String by words?

The simplest approach is to call the split method on the sentence with a single space as the delimiter.
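A minimal sketch of this approach follows. The class name, the helper method splitIntoWords, and the test sentence are assumptions chosen for illustration, not taken from the article.

```java
class SplitBySpaceExample {

    // Splits a sentence into words, using a single space as the delimiter.
    static String[] splitIntoWords(String sentence) {
        return sentence.split(" ");
    }

    public static void main(String[] args) {
        // Hypothetical test sentence, chosen only for illustration.
        String sentence = "Java split String by words example";

        // Prints each of the six words on its own line:
        // Java, split, String, by, words, example
        for (String word : splitIntoWords(sentence)) {
            System.out.println(word);
        }
    }
}
```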

Output

As you can see from the output, it worked for the test sentence. The sentence is broken into words by splitting it on the space character.

Let’s try some other not-so-simple sentences.
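As a rough illustration, the same space-based split can be run on a sentence that contains punctuation. The sentences and names below are hypothetical; the brackets in the printed output only make the word boundaries visible.

```java
class SplitBySpaceWithPunctuation {

    // Same approach as before: a single space is the only delimiter.
    static String[] splitIntoWords(String sentence) {
        return sentence.split(" ");
    }

    public static void main(String[] args) {
        // Hypothetical sentence containing commas and other punctuation.
        String sentence = "Hello, world! How are you?";

        // Prints: [Hello,] [world!] [How] [are] [you?]
        // The punctuation marks stay attached to the words.
        for (String word : splitIntoWords(sentence)) {
            System.out.print("[" + word + "] ");
        }
        System.out.println();

        // Words joined only by a comma are not separated at all.
        System.out.println(splitIntoWords("one,two").length); // prints 1
    }
}
```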

Output

As you can see from the output, our code did not work as expected. The reason is that splitting on space alone is not enough to separate the words in a string. Words may also be separated by punctuation marks such as periods, commas, and question marks, and these stay attached to the neighboring words.

To make the code handle punctuation and symbols, we will change the regular expression from a single space to a character class that lists the space together with the punctuation marks and symbols.
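The sketch below applies the character class quoted in this article to a hypothetical sentence. The class name, the constant DELIMITERS, and the sample sentences are assumptions made for illustration.

```java
class SplitByPunctuationExample {

    // Character class quoted in the article: a space plus common ASCII
    // punctuation marks and symbols, repeated one or more times.
    static final String DELIMITERS = "[ !\"\\#$%&'()*+,-./:;<=>?@\\[\\]^_`{|}~]+";

    static String[] splitIntoWords(String sentence) {
        return sentence.split(DELIMITERS);
    }

    public static void main(String[] args) {
        // Hypothetical sentence containing commas and other punctuation.
        String sentence = "Hello, world! How are you?";

        // Prints: [Hello] [world] [How] [are] [you]
        for (String word : splitIntoWords(sentence)) {
            System.out.print("[" + word + "] ");
        }
        System.out.println();
    }
}
```

The trailing question mark does not produce an extra empty word, because split drops trailing empty strings by default.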

Output

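This time we got the output we wanted. The pattern [ !\"\\#$%&'()*+,-./:;<=>?@\\[\\]^_`{|}~]+ includes almost all the punctuation marks and symbols that can appear in a sentence, including space. The + at the end matches one or more consecutive delimiters, so a run such as a comma followed by a space is treated as a single separator and does not produce empty words. Note that a sentence that begins with a delimiter still yields one empty string as the first element.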

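Instead of this pattern, you can also split on \\P{L}+ to extract words from a sentence. Here \\p{L} is the Unicode property class that matches any letter, and the uppercase \\P negates it, so \\P{L} matches any character that is not a letter. The + is needed so that consecutive non-letters act as a single delimiter; without it, a comma followed by a space produces an empty string between them. Digits are not letters, so they are treated as separators too. Only the line with the split method needs to change, for example sentence.split("\\P{L}+").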

Please note that \\P{L} works for both ASCII and non-ASCII letters, so words containing accented characters, such as “café” or “kākā”, are kept intact.
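A minimal sketch of the \\P{L}+ variant, assuming a hypothetical sentence with accented letters. The non-ASCII letters are written as Unicode escapes so that the source compiles regardless of the file encoding.

```java
class SplitByNonLetterExample {

    // \P{L} matches any character that is not a Unicode letter;
    // the + treats a run of such characters as one delimiter.
    static String[] splitIntoWords(String sentence) {
        return sentence.split("\\P{L}+");
    }

    public static void main(String[] args) {
        // Hypothetical sentence: "Café or kākā? Both, please!"
        // (U+00E9 is e with acute, U+0101 is a with macron).
        String sentence = "Caf\u00e9 or k\u0101k\u0101? Both, please!";

        // Yields five words; the accented words are not broken apart.
        for (String word : splitIntoWords(sentence)) {
            System.out.println(word);
        }

        // Digits are not letters, so they act as delimiters as well.
        System.out.println(splitIntoWords("Java 8 rocks").length); // prints 2
    }
}
```

If numbers should count as words, this pattern is not a good fit as written; that trade-off is worth checking against your own input.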

About the author

rahimv

rahimv has over 15 years of experience in designing and developing Java applications. His areas of expertise are J2EE and eCommerce.
