Table of Contents
- Introduction
- Use Cases
- Predefined Character Classes
- Custom Character Classes
- Greedy Counts
- Additional Keywords
- Logical Operators
- Invoking
- Capturing Groups
- Helpful Flags
- Examples
- Summary
Introduction
When consuming String inputs, we often need to check whether all or part of a string matches a pattern, either to confirm that it is valid or to extract or replace the part(s) that matched. Is the email address or phone number in a valid format? Does the URL in my configuration file use an acceptable protocol (HTTP, HTTPS, FTPS, etc.)? These are good use cases for regular expressions.
Use Cases
Here are some possible use cases and additional considerations:
- Input validation – format, allowed characters, etc. Be careful: this may not be enough to ensure safe inputs. For example, a string supplied in a different character encoding can mask the actual input values. Validation is a start and is helpful in many cases, but it does not ensure safety in all scenarios.
- Parsing inputs out of a more complicated String input – Whenever possible, also consider separating the inputs into different fields on the input structure. Parsing inputs out of a larger string can be brittle and problematic.
Predefined Character Classes
In Java, a String is a grouping of individual characters. It is similar to an array of primitive char elements (char[]). A few of the already defined groupings of those characters available for use in our regular expressions are:
| Character Class Identifier | Description |
|---|---|
| . | The period (.) can match any character. Whether this includes line terminators or not depends upon the flag(s) used when compiling the pattern. |
| \d | A digit character: [0-9]. |
| \D | A non-digit character. |
| \w | A word character: [a-zA-Z_0-9]. |
| \W | A character that does not match the word character set above. |
| \s | A whitespace character, such as space, tab, newline (line feed), carriage return, or form feed. |
| \S | A character that does not match the whitespace characters. |
There are many more character groupings. You can find them in the Javadoc for java.util.regex.Pattern (Pattern.html).
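As a minimal sketch of the predefined classes in the table above, the following uses \d, \w, and \s from Java code. The class and method names are hypothetical and exist only for illustration. Note that each backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class PredefinedClassesDemo {

    // True when the whole input is exactly one digit.
    static boolean isSingleDigit(String input) {
        return Pattern.matches("\\d", input);
    }

    // True when the whole input is one or more word characters.
    static boolean isWord(String input) {
        return Pattern.matches("\\w+", input);
    }

    // Replaces every run of whitespace with a single space.
    static String normalizeWhitespace(String input) {
        return input.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigit("7"));
        System.out.println(isWord("user_42"));
        System.out.println(normalizeWhitespace("a \t\n b"));
    }
}
```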
Custom Character Classes
If the predefined character classes do not meet our needs, then we can define our own. For example, to accept the characters a, b, c, or d, we could specify a character group as: [a-d] or we could also specify the individual characters, like [abcd].
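The two spellings of the custom class can be compared with a short sketch. The class and method names below are hypothetical, and the + quantifier is added only so that inputs longer than one character can be tested.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class CustomClassDemo {
    // A range and an explicit list that describe the same four characters.
    private static final Pattern RANGE = Pattern.compile("[a-d]+");
    private static final Pattern LISTED = Pattern.compile("[abcd]+");

    // True when every character of the input is a, b, c, or d (range form).
    static boolean matchesRange(String input) {
        return RANGE.matcher(input).matches();
    }

    // Same check using the explicitly listed characters.
    static boolean matchesListed(String input) {
        return LISTED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesRange("abba"));
        System.out.println(matchesListed("abba"));
        System.out.println(matchesRange("abe"));
    }
}
```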
Greedy Counts
These counts will match as many characters as possible:
| Format | Comments |
|---|---|
| ? | Zero or one matches. |
| * | Zero or more matches. |
| + | One or more matches. |
| {n} | Exactly n matches. |
| {n,} | At least n matches, but possibly more. |
| {n,m} | At least n matches, but up to m matches. |
Each of the items above also has a reluctant counterpart, which matches as few characters as possible. It is written by adding a ? to the end of the contents in the Format column, for example *? or {n,m}?.
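The difference between greedy and reluctant counts is easiest to see on an input with more than one possible end point. The sketch below is an illustration only; the helper name firstMatch and the sample input are assumptions, not part of the original article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class GreedyVsReluctantDemo {

    // Returns the first substring that matches the regex, or null if none does.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b>";
        // Greedy: .+ runs to the last '>' it can reach.
        System.out.println(firstMatch("<.+>", input));  // <b>bold</b>
        // Reluctant: .+? stops at the first '>' that completes a match.
        System.out.println(firstMatch("<.+?>", input)); // <b>
    }
}
```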
Additional Keywords
| Format | Comment |
|---|---|
| . | Any character. Whether or not end-of-line characters match depends upon the mode. |
| ^ | Beginning of the line or input String, depending upon the mode. |
| $ | End of the line or input String, depending upon the mode. |
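As a small sketch of the ^ and $ anchors in their default mode (no flags), the following checks whether an input begins or ends with a fixed word. The class name, method names, and the word "cat" are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class AnchorDemo {
    private static final Pattern STARTS_WITH_CAT = Pattern.compile("^cat");
    private static final Pattern ENDS_WITH_CAT = Pattern.compile("cat$");

    // find() searches anywhere in the input, but ^ pins the match to the start.
    static boolean startsWithCat(String input) {
        return STARTS_WITH_CAT.matcher(input).find();
    }

    // $ pins the match to the end of the input.
    static boolean endsWithCat(String input) {
        return ENDS_WITH_CAT.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithCat("cat nap"));
        System.out.println(endsWithCat("tomcat"));
    }
}
```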
Logical Operators
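We can use a pipe (|) character to indicate that either option is acceptable.
A|B matches either A or B.
The pipe applies to everything on each side of it, so cat|dog already matches either whole word. To use alternatives as one part of a larger pattern, we need to wrap them in a group, so the choices are noted as related, for example gr(a|e)y.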
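A brief sketch of how a group limits the scope of the pipe: the pattern gr(a|e)y and the sample words are assumptions chosen for illustration, as are the class and method names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class AlternationDemo {

    // With a group, the choice is only between "a" and "e" inside "gr?y".
    static boolean isGrayOrGrey(String input) {
        return Pattern.matches("gr(a|e)y", input);
    }

    // Without a group, the choice is between the whole left side "gra"
    // and the whole right side "ey".
    static boolean isGraOrEy(String input) {
        return Pattern.matches("gra|ey", input);
    }

    public static void main(String[] args) {
        System.out.println(isGrayOrGrey("grey"));
        System.out.println(isGraOrEy("grey"));
    }
}
```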
Invoking
If you do not need to pass flags for the pattern as you compile it and don’t need to compile the pattern multiple times, you can invoke:
Pattern.matches(regularExpressionPattern, inputString);
If you need to pass some flag(s) when compiling the pattern, or if you need to reuse the pattern with multiple inputs, you will need to call multiple methods. Multiple flags can be combined with the bitwise OR operator (the pipe (|) character).
Pattern.compile(regularExpressionPattern, Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(inputString).matches();
The matches() method succeeds only when the entire input matches the pattern. If you want to match part of the string instead, you can call the find() method, repeatedly if you want every occurrence.
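A minimal sketch of calling find() repeatedly with a pattern that was compiled once: the helper name findAll and the sample input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class FindAllDemo {
    // Compiled once so it can be reused for many inputs.
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collects every non-overlapping match, in the order it appears.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String input = "order 12, item 345, qty 6";
        System.out.println(findAll(DIGITS, input));             // [12, 345, 6]
        // matches() is false here because the whole input is not digits.
        System.out.println(DIGITS.matcher(input).matches());    // false
    }
}
```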
Capturing Groups
If you need to capture a piece of the matched String, then you can use a capturing group: a matching pair of open and close parentheses () around the part of the pattern that you wish to capture. Capture groups are numbered from left to right, based upon the location of the opening (left) parenthesis.
The quick brown (fox|cat|bear) jumped over the (fence|rock).
Group # 0 always holds the entire string (or sub-string) that matched the pattern. In this case, fox or cat or bear will be stored in capture group # 1, and fence or rock will be stored in group # 2.
Capturing groups are read via the group(int group) method of the Matcher object. This is available once the pattern is compiled, the input String has been provided to retrieve a Matcher, and a match has been attempted successfully with matches() or find().
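The sentence pattern above can be exercised with a short sketch. The class and method names are hypothetical, and the trailing period is escaped here (an adjustment for this example) so that it matches only a literal period.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class CaptureGroupDemo {
    private static final Pattern SENTENCE = Pattern.compile(
            "The quick brown (fox|cat|bear) jumped over the (fence|rock)\\.");

    // Returns {group 1, group 2}, or null when the input does not match.
    static String[] extract(String input) {
        Matcher matcher = SENTENCE.matcher(input);
        if (!matcher.matches()) {
            return null; // group() may only be called after a successful match
        }
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = extract("The quick brown bear jumped over the fence.");
        System.out.println(parts[0] + " / " + parts[1]); // bear / fence
    }
}
```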
Helpful Flags
You can change the behavior (mode) of the regular expression pattern with some flags during the call to the compile method.
| Pattern Flag Constant | Description |
|---|---|
| CASE_INSENSITIVE | Upper and lower case characters will both match, regardless of which case is used in the pattern. |
| DOTALL | By default, the . character will not match line terminating characters. This flag allows it to match any character, including end of line characters. |
| MULTILINE | ^ and $ match at start and end of lines, rather than only the start and end of the entire input string. |
| COMMENTS | Allow comments in the pattern. The # character can be used to indicate a comment through the end of the line. Whitespace is ignored in this mode. |
Additional flags are available as constants in the Pattern Javadoc documentation.
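As a sketch of how these flags change matching behavior, the example below combines CASE_INSENSITIVE with MULTILINE and contrasts the default mode with DOTALL. The class name, method names, and sample log text are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class FlagsDemo {
    // CASE_INSENSITIVE lets "error" match "ERROR" or "Error".
    // MULTILINE lets ^ and $ anchor at each line, not only the whole input.
    private static final Pattern ERROR_LINE =
            Pattern.compile("^error.*$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Counts the lines that begin with "error" in any letter case.
    static int countErrorLines(String text) {
        Matcher matcher = ERROR_LINE.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Checks whether "a.b" matches "a\nb" with or without DOTALL.
    static boolean dotMatchesNewline(boolean dotAll) {
        Pattern pattern = dotAll
                ? Pattern.compile("a.b", Pattern.DOTALL)
                : Pattern.compile("a.b");
        return pattern.matcher("a\nb").matches();
    }

    public static void main(String[] args) {
        System.out.println(countErrorLines("ERROR disk full\ninfo ok\nError: timeout")); // 2
        System.out.println(dotMatchesNewline(false)); // false
        System.out.println(dotMatchesNewline(true));  // true
    }
}
```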
Examples
Below are some overly simplified examples of regular expressions and what they may validate. A production usage will likely require a much more extensive approach. For example, are there customers outside of the US who require a different phone or zip code format? Are special characters allowed in the email address values? Do we want to support more than just the one domain name?
| Example | Comments |
|---|---|
| \d{3}-\d{3}-\d{4} | A US phone number format. For example: 312-555-1234 |
| \d{5}(-\d{4})? | US Zip Code with optional Zip + 4 extension. |
| \w+@mydomainname\.com | Very (overly) simple email validation if you own the domain name provided. The dot is escaped so that it matches only a literal period. |
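The simplified patterns in the table can be wired into validation helpers along these lines. This is a sketch only: the class and method names are hypothetical, the phone pattern uses \d{4} for the final four digits, and the dot in the email pattern is escaped so it matches a literal period. It is not intended as production-grade validation.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class SimpleValidators {
    // Compiled once and reused; backslashes are doubled in Java string literals.
    private static final Pattern US_PHONE = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");
    private static final Pattern US_ZIP = Pattern.compile("\\d{5}(-\\d{4})?");
    private static final Pattern EMAIL = Pattern.compile("\\w+@mydomainname\\.com");

    static boolean isUsPhone(String input) {
        return US_PHONE.matcher(input).matches();
    }

    static boolean isUsZip(String input) {
        return US_ZIP.matcher(input).matches();
    }

    static boolean isCompanyEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUsPhone("312-555-1234"));
        System.out.println(isUsZip("60601-1234"));
        System.out.println(isCompanyEmail("jane_doe@mydomainname.com"));
    }
}
```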
Summary
Regular expressions can be used to parse or validate String-based inputs. Regular expressions help, but do not ensure safe inputs in all cases. We can change the behavior of the pattern by using flags when compiling it. Capturing groups can be used to retrieve part or all of the matched input String. If we do not need to provide any flags while compiling a pattern, or to reuse the pattern with multiple inputs, we can skip the explicit compile step and call the static Pattern.matches method directly.
