January 16, 2009 (Lecture 3)

What Are Regular Expressions? Why Do We Care?

In many cases, when processing text or other types of data, we want to do pattern matching. This is probably familiar to many of you in the simple form of using an *-asterisk as a wildcard. But, we might also want to search for far more complicated patterns, for example dates, phone numbers, room numbers, times of day, dollar amounts, &c. Another application might be to find any data that doesn't match a particular pattern -- for example, to discover corrupted records.
Regular Expressions, which are supported in many languages and tools, including Java, are a very flexible way of describing such patterns. Regular expressions are essentially a really compact language for describing patterns. Sometimes people call regular expressions RegEx for short.

The Building Blocks

Regular expressions are composed of literals and metacharacters. Most symbols used within regular expressions are literals, which match exactly themselves. Literals include, in most contexts, all of the letters and numbers. Metacharacters are symbols that have special meanings. Some of the more common metacharacters are listed below:

Symbol Meaning

. Matches any single character

^ Matches the beginning of a line

$ Matches the end of a line

[] Matches any one of the symbols within the brackets; hyphen (-) notation for ranges is okay

[^] Matches things not within the list

* Matches 0 or more of the pattern to the immediate left

+ Matches 1 or more of the pattern to the immediate left

? Matches 0 or 1 of the pattern to the immediate left

{x,y} Matches between x and y of the pattern to the immediate left, inclusive

(x) Save a match of pattern x to a register, also known as "marking" an expression.

\n register #n (there are registers 1-9)

& The matched string. In other words, the literal string that matched the pattern you supplied

<word> Matches word only as a whole word, not as a substring of a longer word

x|y Matches either pattern x or pattern y

Java also defines some predefined character classes, which are basically shortcuts for describing common set of characters. Consider the following:

Character class Meaning

\d any digit character

\D any character except a digit character

\s any whitespace character

\S any character except a whitespace character

\w a word character (letter or number or _-underscore)

\W anything but a word character

Additionally, Java recognizes some common escape characters. A few examples are below:

Escaped character Meaning

\t tab

\n newline

\r carriage return
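
As a small illustration of how the building blocks in the tables above combine, the sketch below checks a few strings against a pattern for one phone-number layout. The class name PhoneCheck, the method looksLikePhone, and the phone-number layout itself are assumptions made up for this example; they are not part of the lecture.

```java
import java.util.regex.Pattern;

class PhoneCheck {
  // Assumed layout: three digits, a dash, three digits, a dash, four digits.
  // \d is the predefined digit class; {3} means exactly three, like {3,3}.
  // Each backslash is doubled because this is a Java string literal.
  static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");

  // True only if the whole string has the assumed layout.
  static boolean looksLikePhone(String s) {
    return PHONE.matcher(s).matches();
  }

  public static void main(String[] args) {
    System.out.println(looksLikePhone("412-555-1234")); // true
    System.out.println(looksLikePhone("412-555-123"));  // false: last group too short
    System.out.println(looksLikePhone("412 555 1234")); // false: spaces, not dashes
  }
}
```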

A Note On Escaping

What if we want to match a *-asterisk? Or a ^-caret? In order to match the literal value of a character that happens to be a metacharacter, it is necessary to "escape" it. This is done by placing a \-backslash before the character, for example, "\.". And, if one wants a literal backslash -- well, backslash-backslash, "\\", of course.
One thing that often confuses those new to Java is that, in order to pass a regex escape such as \d through to the library, the \ itself must be escaped in the Java string literal -- "\\d". The reason for this is that the Java compiler processes backslash escapes in string literals before the string ever makes its way into the library function. An unescaped "\t" is turned into a real tab character by the compiler (which still happens to match a tab), while "\d" is not a legal Java string escape and will not compile at all. The safe habit is to double every backslash that is meant for the regex engine (\\t, \\d, \\r, etc).
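
To make the double escaping concrete, here is a minimal sketch that compares an unescaped and an escaped period, and matches a single literal backslash. The class name EscapeDemo and its method names are invented for illustration.

```java
import java.util.regex.Pattern;

class EscapeDemo {
  // Unescaped: '.' is a metacharacter and matches any single character.
  static boolean wildcardDot(String s) {
    return Pattern.matches("a.b", s);
  }

  // Escaped: the Java literal "a\\.b" is the four-character regex a\.b,
  // which matches only a literal period between 'a' and 'b'.
  static boolean literalDot(String s) {
    return Pattern.matches("a\\.b", s);
  }

  // Matching one literal backslash needs the regex \\,
  // which has to be written as "\\\\" in Java source.
  static boolean singleBackslash(String s) {
    return Pattern.matches("\\\\", s);
  }

  public static void main(String[] args) {
    System.out.println(wildcardDot("axb"));    // true
    System.out.println(literalDot("axb"));     // false
    System.out.println(literalDot("a.b"));     // true
    System.out.println(singleBackslash("\\")); // true: the input is one backslash
  }
}
```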

Greedy vs Reluctant vs Possessive

Let's consider the quantifiers: ?, *, {x,y}, and +. If we construct an expression that contains pairs of these back-to-back, we can run into interesting questions. For example, please consider the following regular expression:
  (.*)([a-zA-Z]*)(.*)
  
And, the following input strings:
  12345abcDEF67890
  abcdefghijklmnop
  
In the first case, it seems fairly clear that "12345" should be matched against the initial ".*". But, beyond that, it becomes less clear. Since the ".*" specifies, in effect, 0 or more of anything, it can eat the entire input. Should this happen, the subsequent "[a-zA-Z]*" and ".*", which request zero or more, can be satisfied with nothing. Or, as another example, the first group could match "12345abcD", leaving "E" for the second group, which could subsequently leave "F67890" for the third group. How are these conflicts resolved?
As in most other environments, Java's quantifiers are greedy by default. This means that, moving from left to right, each greedy quantifier will match as much as it can -- without breaking the rest of the expression. Consider the following example:
    Pattern p = Pattern.compile ("(.*)([0-9]+)(.*)");
    Matcher m = p.matcher("abcdef123456ghijkl");

    m.matches();

    System.out.println (m.groupCount() + " matches:");

    for (int index=1; index <= m.groupCount(); index++)
      System.out.println ("Match: " + m.group(index));

  // The code above outputs the following:
  // 3 matches:
  // Match: abcdef12345
  // Match: 6
  // Match: ghijkl
  
We can change the default behavior of the quantifiers (?, +, *, and {x,y}) to so-called reluctant quantifiers, by appending a ?-mark, e.g., "??", "*?", "+?", or "{x,y}?". Reluctant quantifiers match no more than is necessary to make the expression match. In other words, they match as little as possible. Let's consider a revised example:
  Pattern p = Pattern.compile ("(.*?)([0-9]+?)(.*?)");
  Matcher m = p.matcher("abcdef123456ghijkl");

  m.matches();

  System.out.println (m.groupCount() + " matches:");

  for (int index=1; index <= m.groupCount(); index++) 
    System.out.println ("Match: " + m.group(index));

  // The code above outputs the following:
  // 3 matches:
  // Match: abcdef
  // Match: 1
  // Match: 23456ghijkl
  
Lastly, Java allows quantifiers to be annotated as possessive by adding a "+", e.g., "?+", "*+", "++", and "{x,y}+". These are nasty, selfish quantifiers. As before, they are processed from left-to-right, but they will eat as much as they can -- even if that leaves nothing to satisfy parts of the expression to the right. Let's consider another revised example:
  Pattern p = Pattern.compile ("(.*+)([0-9]++)(.*+)");
  Matcher m = p.matcher("abcdef123456ghijkl");

  m.matches();

  System.out.println ("Matching expression? " + m.matches());


  for (int index=1; index <= m.groupCount(); index++) 
    System.out.println ("Match: " + m.group(index));


   // The code above dies as follows:
   // Matching expression? false
   // Exception in thread "main" java.lang.IllegalStateException: No match found
   //      at java.util.regex.Matcher.group(Matcher.java:468)
   //      at Quick.main(Quick.java:14)

   
The reason that the string above fails to match the prescribed pattern is that the first group possessively matches the entire string, starving the second group, which needs at least one digit. Since the second group cannot match, the expression, as a whole, doesn't match. Possessive quantifiers do not always break an expression -- but they only allow a match if the text the possessive quantifier consumes is not also needed by the pattern that follows it.
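
The sketch below shows a possessive quantifier that does no harm, a greedy quantifier that backtracks, and a possessive one that cannot. It also guards the group() calls behind matches() so that a failed match does not throw. The class name PossessiveDemo and the helper fullMatch are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {
  // True only if the entire input matches the regex.
  static boolean fullMatch(String regex, String input) {
    return Pattern.compile(regex).matcher(input).matches();
  }

  public static void main(String[] args) {
    // Digits and letters do not overlap, so being possessive costs nothing.
    System.out.println(fullMatch("([0-9]++)([a-z]+)", "123abc")); // true

    // Greedy: [0-9]* first takes "123", then gives back the "3".
    System.out.println(fullMatch("([0-9]*)3", "123"));  // true

    // Possessive: [0-9]*+ takes "123" and never gives the "3" back.
    System.out.println(fullMatch("([0-9]*+)3", "123")); // false

    // Checking matches() first avoids the IllegalStateException from group().
    Matcher m = Pattern.compile("(.*+)([0-9]++)(.*+)").matcher("abcdef123456ghijkl");
    if (m.matches()) {
      for (int index = 1; index <= m.groupCount(); index++)
        System.out.println("Match: " + m.group(index));
    } else {
      System.out.println("No match, so there are no groups to print");
    }
  }
}
```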

Examples of Regular Expressions

^#.* any line beginning with a #
datafile[0-9]+\.txt datafileN.txt, datafileNN.txt, &c
-{0,1}[0-9]+ a positive or negative integer number
-{0,1}[0-9]*\.[0-9]+ a positive or negative floating point number
^Greg Greg, but only at the beginning of a line
^Greg$ Greg, but only if it is the only thing on the line
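^\(.*\)|.*|\1$ Assume that this is a database with a |-pipe as the field separator. This matches if the first and last fields are the same. (This one is written in the basic, grep-style syntax, where \( \) marks a group and | is a literal; in Java syntax it would be ^(.*)\|.*\|\1$)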
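
Several of these examples can be tried directly in Java. The sketch below is one way to do so; the class name ExampleCheck, the helper whole, and the test strings are invented for illustration. The field-comparison pattern is written in Java syntax, with parentheses for the group and escaped pipes, which is one reading of the description above rather than a quotation of it.

```java
import java.util.regex.Pattern;

class ExampleCheck {
  // True only if the entire input matches the regex.
  static boolean whole(String regex, String input) {
    return Pattern.matches(regex, input);
  }

  public static void main(String[] args) {
    // datafileN.txt, datafileNN.txt, and so on
    System.out.println(whole("datafile[0-9]+\\.txt", "datafile12.txt")); // true
    System.out.println(whole("datafile[0-9]+\\.txt", "datafile.txt"));   // false

    // A positive or negative integer
    System.out.println(whole("-{0,1}[0-9]+", "-42")); // true
    System.out.println(whole("-{0,1}[0-9]+", "4-2")); // false

    // A positive or negative floating point number
    System.out.println(whole("-{0,1}[0-9]*\\.[0-9]+", ".5")); // true
    System.out.println(whole("-{0,1}[0-9]*\\.[0-9]+", "42")); // false

    // First and last pipe-separated fields are the same
    System.out.println(whole("^(.*)\\|.*\\|\\1$", "abc|xyz|abc")); // true
    System.out.println(whole("^(.*)\\|.*\\|\\1$", "abc|xyz|abd")); // false
  }
}
```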

Why Are They Called "Regular" Expressions? And, Are They Really Regular?

"Regular expressions" are so-known because, as you'll learn in a later course, regular expressions actually describe languages. You can imagine that any string matched by a regular expression is a valid sentence in that expression's language.
Any language recognizable by a true regular expression falls into a category of languages known as regular languages. You guys are probably familiar with Finite State Machines (FSMs). Any language recognizable by an FSM is a regular language. Similarly, an FSM can be constructed to accept any regular expression.
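
To make the FSM connection concrete, here is a hand-built machine for one small, truly regular pattern: an optional minus sign followed by one or more digits (the integer pattern -{0,1}[0-9]+). This is only a sketch; the class name IntegerFsm, the state names, and the choice of pattern are assumptions made for this example.

```java
class IntegerFsm {
  // One state per "how far along are we"; there is no other memory.
  enum State { START, SIGN, DIGITS, REJECT }

  // Feeds the input through the machine one character at a time.
  static boolean accepts(String input) {
    State state = State.START;
    for (char c : input.toCharArray()) {
      boolean digit = c >= '0' && c <= '9';
      switch (state) {
        case START:
          // A digit or a single leading minus sign is allowed here.
          state = digit ? State.DIGITS : (c == '-' ? State.SIGN : State.REJECT);
          break;
        case SIGN:
        case DIGITS:
          // From here on, only digits are allowed.
          state = digit ? State.DIGITS : State.REJECT;
          break;
        default:
          state = State.REJECT; // once rejected, always rejected
      }
    }
    // DIGITS is the only accepting state: at least one digit was seen.
    return state == State.DIGITS;
  }

  public static void main(String[] args) {
    System.out.println(accepts("-42")); // true
    System.out.println(accepts("42"));  // true
    System.out.println(accepts("-"));   // false
    System.out.println(accepts("4-2")); // false
  }
}
```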
It is interesting to note that FSMs have no memory beyond their states and transitions. There is no way of remembering an expression for use later. This would only be possible if there existed a unique state for each remembered expression. And, that isn't possible if expressions can have an infinite length -- an infinite number of states would be required.
The practical consequence of this is that some of the so-called regular expressions that we'll see aren't really regular, because they require remembering things. Instead, they are sometimes called "Extended regular expressions". Although including such patterns makes pattern-matching much more powerful -- it can also affect the design of the underlying recognizer, making it operate much more slowly.
In general, the "original" regular expressions, as accepted by early UNIX tools were all truly regular. They are now sometimes known as "Basic Regular Expressions". The expanded language presently supported in many UNIX tools, Java, Perl, Python, and other environments is often known as "Extended Regular Expressions".
Some of these new tools aren't really "regular" as the patterns they describe can't be recognized by an FSM. Consider, for example, the ability to capture groups and use the captured group within the same expression, e.g. "^([0-9]+)[a-zA-Z]*\1", which matches a string that contains the same number on the left and the right, with a string of letters sandwiched in between. An FSM has no way of remembering this initial number of unknowable length.
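
A back-reference of this kind can be tried in Java as follows. The sketch uses a variant of the pattern above, with [a-zA-Z]* for the middle and a trailing $ added so the whole string must match. The class name BackrefDemo and the method name are invented for this example.

```java
import java.util.regex.Pattern;

class BackrefDemo {
  // Group 1 captures the leading number; \1 demands the same text again.
  static final Pattern SAME_NUMBER =
      Pattern.compile("^([0-9]+)[a-zA-Z]*\\1$");

  static boolean sameNumberBothEnds(String s) {
    return SAME_NUMBER.matcher(s).matches();
  }

  public static void main(String[] args) {
    System.out.println(sameNumberBothEnds("42abc42")); // true
    System.out.println(sameNumberBothEnds("42abc43")); // false
    System.out.println(sameNumberBothEnds("7x7"));     // true
  }
}
```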

Using Regular Expressions in Java

Using regular expressions in Java takes just a few easy steps:

Represent the regular expression in a String, escaping metacharacters used as literals (e.g., "\\$") and predefined character classes (e.g., "\\d") as necessary.

Create a new Pattern object from this String to use for matching within the program.

Make use of the newly created Pattern by using it to create a new Matcher. The Matcher, born of a particular Pattern and a particular input/target String, is used to examine the input string to see if it matches, as a whole, or in part, the Pattern.

"Step 1" is straight-forward. Just express your idea as a regular expression, either as a string literal or via a reference variable. "Step 2" is also straight-forward. We use Pattern's compile() method to create the new pattern. An example is below:
  String datePatternString = 
          "(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])";
  Pattern datePattern = Pattern.compile(datePatternString);
  
The first half of "Step 3", creating the Matcher, is also very straight-forward. We give the Pattern a string and ask it to create an instance of the Matcher object that can apply the prescribed pattern to the supplied string. An example follows:
  String datePatternString = 
          "(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])";
  Pattern datePattern = Pattern.compile(datePatternString);

  Matcher m = datePattern.matcher("Today's date is 2008-06-01.");
  
Using the Matcher, the rest of "Step 3", is where things get interesting. The Matcher class is very rich, but, let's look at a couple of examples. The first is the matches() method. It returns true or false, depending on whether or not the entire string matches the prescribed pattern:
  String datePatternString = 
          "(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])";
  Pattern datePattern = Pattern.compile(datePatternString);

  // Notice that this string contains a date, but, when taken as a whole
  // is not entirely and exactly a date.
  Matcher m = datePattern.matcher("Today's date is 2008-01-16.");

  if (m.matches())
    System.out.println ("The string is exactly a date"); // Won't happen
  else
    System.out.println ("The string is NOT exactly a date"); // BINGO
  
We can also walk through a string and extract from it successive substrings that match a particular pattern. We might, for example, walk through text and extract each date we find. To do this, we make use of the find() and group() methods in much the same way as we would use the hasNext() and next() methods of an iterator. The find() method returns true if, and only if, it finds another matching substring, which the next call to group() then returns.
The example below illustrates this technique:
  String datePatternString = 
          "(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])";
  Pattern datePattern = Pattern.compile(datePatternString);

  Matcher m = datePattern.matcher
              ("Today's date is 2008-01-16. Yesterday's date was 2008-01-15.");

  while (m.find()) {
    String dateString = m.group(); // This extracts the next date

    System.out.println ("Found date: " + dateString); 
  }

  // The loop above prints:
  // Found date: 2008-01-16
  // Found date: 2008-01-15
  
The Matcher class is very rich. It can be limited to looking within only certain regions, and allows either the underlying pattern, or the input string to be changed. We discussed in class the idea of using one pattern to find an appendix within a book and then, once there, using another pattern to extract data from the tables. Regardless, take a look at the JavaDoc for the Matcher class for more details about the various capabilities of this class of objects.
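
The appendix idea can be sketched with Matcher's usePattern() and region() methods. In the sketch below, one pattern locates the word "Appendix", and a second pattern then extracts runs of digits from only the text that follows it. The class name AppendixScan, the method name, and the sample text are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendixScan {
  // Returns every run of digits that appears after the first "Appendix".
  static List<String> numbersInAppendix(String text) {
    List<String> found = new ArrayList<>();

    Matcher m = Pattern.compile("Appendix").matcher(text);
    if (!m.find()) {
      return found; // no appendix, so nothing to extract
    }
    int start = m.end(); // first index after the word "Appendix"

    // Swap in the second pattern, then limit matching to the appendix.
    m.usePattern(Pattern.compile("\\d+"));
    m.region(start, text.length());

    while (m.find()) {
      found.add(m.group());
    }
    return found;
  }

  public static void main(String[] args) {
    String text = "Chapter 1 has 3 sections. Appendix A: 10 20 30";
    System.out.println(numbersInAppendix(text)); // [10, 20, 30]
  }
}
```

Note that the digits in "Chapter 1 has 3 sections" are skipped because they fall outside the region.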

An Interesting Student Question

Student question: Why is Pattern designed such that new instances are created via the compile() method rather than a traditional constructor?
Answer: I don't actually know, for sure, what the designers of this class were thinking. But, there are two common reasons why a class's design might opt for a static instantiation method as opposed to a traditional constructor.
In this case it is important to note that the String passed in as a representation of the regex could be invalid as a regular expression (consider, for example, a string containing "**"). If this is the case, it is impossible to create a Pattern object that represents it.
Ideally, we'd like the compiler to pre-flight this at compile time. But, since RegExs aren't first class types, and are somewhat complex, this is asking a lot of the compiler. And, even this wouldn't prove to be a comprehensive solution. It is possible, for example, that the string is read in at runtime from a file or the user.
As a result, whatever mechanism is used to instantiate a new Pattern must play defense against a bad pattern string. And, in Java, this more-or-less means verifying the string and throwing an exception if it doesn't represent a RegEx.
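
In practice, Pattern.compile() reports a bad pattern string by throwing PatternSyntaxException, which is an unchecked exception. The sketch below shows one way to play defense when the string comes from outside the program; the class name SafeCompile and the method compileOrNull are invented for this example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SafeCompile {
  // Returns the compiled Pattern, or null if the string is not a valid regex.
  static Pattern compileOrNull(String regex) {
    try {
      return Pattern.compile(regex);
    } catch (PatternSyntaxException e) {
      // getPattern() and getDescription() say what was wrong.
      System.out.println("Bad pattern " + e.getPattern()
          + ": " + e.getDescription());
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(compileOrNull("a*") != null); // true: a valid pattern
    System.out.println(compileOrNull("**") != null); // false: nothing to repeat
  }
}
```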
This brings us to the most likely reason that the designers of the Pattern class chose to instantiate patterns via a static method instead of via a traditional constructor. Although Java, like many OO languages, allows constructors to throw exceptions, there is a school of thought in software engineering that constructors should be "clean" and only throw exceptions in those cases that are completely out of the control of the program (e.g., out of memory). The reason for this design principle is that it often makes handling these exceptions easier if the offending operations are factored out of the constructor into other methods, such that there might be a more structured opportunity to handle them and a more flexible set of options than do-over-from-scratch. This design principle becomes especially important in languages, such as C++, where temporary objects are frequently automatically created -- and constructor calls can be completely invisible to the programmer. Nonetheless, in a case where there is such a clear and prevalent error mode as a non-pattern passed in via an unconstrained string, there is a strong argument to use a static method to call a programmer's attention to the fact that the object's initialization is non-trivial and can fail.
The, perhaps more common, reason to use a static initializer method instead of a constructor is to implement the "Singleton pattern". The "Singleton Pattern" is a technique for instantiating objects where it is desirable to have not more than one instance of that type of object. Consider, for example, a spell checking window in a word processor. If the window accidentally got hidden or minimized, it would be very confusing to the user if, upon re-requesting the spell checker, additional instances were created. Instead, the original instance is raised. A new instance is only created if no other instance presently exists. As multiple pattern instances can exist simultaneously, and, so far as I know, they don't share any common dynamically allocated machinery, I don't think this case is applicable here, but I explain it for completeness.
The "Singleton" pattern is implemented by using a combination of a static state variable and a static initialization method. This static variable, statically initialized to null, is a reference to the instance of the object, if it exists. The static initialization method first looks at this variable. If it is non-null, it returns a reference to this instance, rather than creating a new one. If it is null, then it knows that no other instances exist, creates a new instance, assigns the static state variable a reference to it, and returns this reference. So, the public static method is the gateway: it calls the private constructor if, and only if, necessary and then returns the value of the static reference to the one instance of the object. In this way, it ensures that not more than one instance of the object exists.
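
A minimal sketch of the Singleton technique described above follows. The class name SpellChecker echoes the spell-checker example, and the method name getInstance is a common convention rather than anything required by Java; marking it synchronized is an extra assumption so that the sketch also behaves with multiple threads.

```java
class SpellChecker {
  // Static reference to the one instance; null until first requested.
  private static SpellChecker instance = null;

  // Private constructor: only getInstance() can create the object.
  private SpellChecker() { }

  // The public gateway: creates the instance only if none exists yet.
  public static synchronized SpellChecker getInstance() {
    if (instance == null) {
      instance = new SpellChecker();
    }
    return instance;
  }

  public static void main(String[] args) {
    SpellChecker first = SpellChecker.getInstance();
    SpellChecker second = SpellChecker.getInstance();
    System.out.println(first == second); // true: both refer to one object
  }
}
```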