September 15, 2009 (Lecture 7)

Regular Expressions and String Manipulation

These days, there are many great reasons to program in Perl. First among them is its natural ability to play with strings and, in particular, regular expressions.

The following two operators, =~ (match) and !~ (no match), are among the most basic. =~ applies a regular expression to the supplied string. In scalar context it returns true (1) if a matching substring is found and false (the empty string) if not, so it is normally used as a true/false expression. It does not return a count of matches; to count them, match with the /g modifier in list context and count the results. The "not in" operator !~ returns true if no match is found.

The general forms are as follows:


    $found = ($somestring =~ /regular expression/); 
    $notin = ($somestring !~ /regular expression/); 
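
To make the return values concrete, here is a minimal sketch. The sample string and the variable names ($found, $notin, $count) are made up for illustration. It shows the true/false result of a scalar-context match and one common idiom for counting matches:

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the lecture.
my $somestring = "one fish, two fish, red fish";

# Scalar context: true (1) if the pattern is found, false ("") otherwise.
my $found = ($somestring =~ /fish/);
my $notin = ($somestring !~ /bird/);

# Counting idiom: with /g in list context the match returns every hit;
# assigning that list to () and then to a scalar yields how many there were.
my $count = () = $somestring =~ /fish/g;

print "found=$found notin=$notin count=$count\n";   # found=1 notin=1 count=3
```

The "= () =" in the middle forces list context, which is what makes /g hand back all of the matches rather than just true or false.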
  

If you group parts of a regular expression within parentheses, and the regular expression is matched, the text matched by each parenthesized group will be saved into a special variable -- much as was the case with, for example, sed. These special variables are $1, $2, etc. Careful! Everyone wants to believe that these variables represent command-line arguments as they do in shell. They do not! It is also worth noting that Perl understands the \1, \2, \3, etc., notation common in many other programs: inside the pattern itself it refers back to an earlier group, and, although not preferred, it is also accepted in the replacement part of a substitution. Regardless, here's a quick example:

  if ( $somestring =~ /([0-9]+)[a-zA-Z]*([0-9]+)/) {
    # $1 is the first number matched
    # $2 is the number that follows the (optional) letters
  } else {
    # $1 and $2 are unchanged
  }
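
A runnable version of the same idea might look like the sketch below. The input strings and the names $first, $last, and $doubled are assumptions for illustration. It also shows \1 used as a backreference inside a pattern:

```perl
use strict;
use warnings;

# Hypothetical input, not from the lecture.
my $somestring = "42abc7";
my ($first, $last);

# Copy $1 and $2 right away: the next successful match overwrites them.
if ($somestring =~ /([0-9]+)[a-zA-Z]*([0-9]+)/) {
    ($first, $last) = ($1, $2);
}
print "first=$first last=$last\n";   # first=42 last=7

# Inside the pattern itself, \1 refers back to what group 1 captured,
# so this matches a word that is immediately repeated.
my $doubled = ("hello hello world" =~ /\b(\w+) \1\b/) ? $1 : "";
print "doubled=$doubled\n";          # doubled=hello
```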
  

Perl also has a special variable, $_, which represents the default string. Several important operators act on this string by default. For example, Perl can do sed-style searching and replacing. When this type of expression is written on its own, without =~ binding it to another variable, it acts upon $_:


  $_ = "This is an example string: Hello World";

  $changes = s/World/WORLD/g;

  print "$_\n"; # "World" is now WORLD 

  print "$changes\n"; # The number of substitutions made; in this case, 1
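
The substitution operator does not have to work on $_. As a small sketch (the strings and the names $greeting and $name are invented for the example), it can be bound to any variable with =~, and captured groups can be reused in the replacement:

```perl
use strict;
use warnings;

# Hypothetical sample data.
my $greeting = "Hello World, wonderful World";

# Bind s/// to $greeting instead of $_; the return value is still the
# number of substitutions made.
my $changes = ($greeting =~ s/World/WORLD/g);
print "$greeting ($changes)\n";   # Hello WORLD, wonderful WORLD (2)

# Captured groups are available as $1, $2, ... in the replacement.
my $name = "Lovelace, Ada";
$name =~ s/(\w+), (\w+)/$2 $1/;
print "$name\n";                  # Ada Lovelace
```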
  

The tr operator is also very powerful. It acts much like the tr command. It allows the user to define a mapping of character-for-character substitutions and applies them to $_. Each character in the first field will be replaced by the corresponding character in the second field. As with the s operator above, it returns a count, in this case the number of characters it translated:


  $changes = tr/abc/123/; # a becomes 1, b becomes 2, c becomes 3
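
Like s, tr can be bound to a variable other than $_ with =~. The sketch below uses a made-up string and made-up names ($dna, $num_a, $changed). It also shows a handy side effect: with an empty second field, tr changes nothing and simply counts.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my $dna = "GATTACA";

# With an empty replacement list, tr leaves the string alone and just
# returns how many characters from the first list it saw.
my $num_a = ($dna =~ tr/A//);

# Map each base to its complement; the return value is the number of
# characters translated.
my $changed = ($dna =~ tr/ACGT/TGCA/);

print "$num_a $changed $dna\n";   # 3 7 CTAATGT
```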
  

Please note: In the examples above, there are no quotes around the tr and s expressions. This is important. If the expressions are quoted, they'll be interpreted as strings and assigned, instead of interpreted as regex operations and performed.

Greedy and Possessive Quantifiers

By default, the quantifiers (?, +, *, and {x,y}) are greedy: each one matches as much as it can while still allowing the overall expression to match. We can change this default behavior to so-called reluctant quantifiers by appending a ?-mark, e.g., "??", "*?", "+?", or "{x,y}?". Reluctant quantifiers match no more than is necessary to make the expression match.
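
The difference is easiest to see side by side. In this sketch the sample string and the names $greedy and $reluctant are just for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data.
my $html = "<b>bold</b> and <i>italic</i>";

# A match in list context returns the captured groups.
my ($greedy)    = $html =~ /<(.+)>/;    # .+ grabs as much as it can
my ($reluctant) = $html =~ /<(.+?)>/;   # .+? stops at the first ">"

print "greedy:    $greedy\n";      # b>bold</b> and <i>italic</i
print "reluctant: $reluctant\n";   # b
```

The greedy version runs all the way to the last ">" in the string, while the reluctant version stops at the first one it can.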

Lastly, Perl (version 5.10 and later) allows quantifiers to be annotated as possessive by adding a "+", e.g., "?+", "*+", "++", and "{x,y}+". These are nasty, selfish quantifiers. As before, they are processed from left-to-right, but they will eat as much as they can and never give any of it back -- even if that leaves nothing to satisfy parts of the expression to the right, in which case the overall match fails.
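
A small sketch of the selfishness, assuming Perl 5.10 or later (the string and the names $greedy and $possessive are made up for the example):

```perl
use strict;
use warnings;

# Hypothetical sample data.
my $string = "aaa";

# Greedy: a+ first takes all three a's, then gives one back so the
# final "a" in the pattern can match.
my $greedy = ($string =~ /^a+a$/) ? "match" : "no match";

# Possessive: a++ takes all three a's and never gives any back, so
# nothing is left for the final "a" and the whole match fails.
my $possessive = ($string =~ /^a++a$/) ? "match" : "no match";

print "greedy: $greedy, possessive: $possessive\n";
# greedy: match, possessive: no match
```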