September 6, 2007 (Lecture 4)

AWK and PERL

Today we discuss two programming languages regularly used for administrative tasks: AWK and PERL. In many ways, AWK is "The Original Pattern Matching Language". It was created to provide a convenient and powerful way of churning through, munging, and sometimes analyzing data files. PERL is a more recent language which was originally designed for essentially the same purpose, but has really developed into a full-fledged, general purpose programming language which is often used any time efficiency isn't a compelling concern.

AWK - An Overview

AWK is an early string processing language that has passed the test of time. It is named after Aho, Weinberger, and Kernighan, all highly esteemed computer scientists whose contributions are many.
In describing the purpose of AWK it would sound very much like I was describing the purpose of Perl, with which many of you are probably familiar. AWK is a language designed for processing strings and is often used to extract, process, or otherwise munge data from reports or other structured text documents. Until the advent of Perl, AWK and sed were the workhorses of text processing.
These days, AWK is showing some of its age. It was designed at a time when processing power was scarce and compilers were weak. As a result, its syntax isn't always user friendly and it can appear quite dense. For this reason, these days, Perl is the preferred choice for most projects of small-to-medium size and up.
But, because AWK is so dense -- in its sweet spots, it can accomplish complex tasks with very little code -- it remains the tool of choice for those "small jobs" that might well be big jobs elsewhere. Short AWK programs can be standalone, but are often incorporated within scripts. Many are just a few lines -- and the one-liners are often the most impressive.
With the advent of Perl, Python, and the like, AWK hasn't become less popular -- it's just seen more targeted use as a quick and dirty tool.

Good References for AWK

The Wikipedia Article on AWK

Versions of AWK

There are basically three versions of AWK running around: the original AT&T version, "awk"; a newer AT&T version, "nawk"; and a version from GNU, "gawk".
In general, the original "awk" isn't really the standard anymore -- most systems have at least "nawk". And, anyone who is anyone, installs GNU's gawk. If you're using Linux -- you're using gawk. gawk contains almost everything from nawk -- and some more. Just be careful to avoid using gawk-only features if you want compatibility with nawk.

The Basic Paradigm

AWK views its input as a collection of records, each of which contains fields. By default, each line is viewed as a record and fields are delimited by whitespace. But, the field separator (FS) and record separator (RS) can be changed.
AWK is basically event driven. It works its way through its input, from top to bottom, processing one record at a time. As it charges through the records, it can make use of global variables to keep track of things and functions to make things reusable and readable. Its only real data structure is the associative array, i.e., a hash table, which can also double as a traditional indexed array (although it might not be as dense), whether single or multidimensional.
Each and every record of a file is processed. But, actions can be dependent on the record matching a certain pattern, for example, the first field containing a particular userid.
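To make the record/field model concrete, here is a minimal sketch run from the shell. The sample data and the userid "gkesden" are invented for illustration; the only AWK features it relies on are the -F flag to set the field separator, a pattern on the first field, and print.

```shell
# Hypothetical sample data in /etc/passwd style: fields separated by ':'
sample='root:x:0:0:root:/root:/bin/sh
gkesden:x:1001:100:Greg:/home/gkesden:/bin/bash'

# -F: sets the field separator (FS) to a colon.
# The pattern selects records whose first field is "gkesden";
# the action prints the 1st and 7th fields of those records.
result=$(printf '%s\n' "$sample" | awk -F: '$1 == "gkesden" { print $1, $7 }')
printf '%s\n' "$result"    # gkesden /bin/bash
```

Every record is still read; the root record simply fails the pattern, so its action is skipped.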

Feeding AWK its Programs

Perhaps the easiest way to write an AWK program is to put the program right on the command line. The following program prints the first field of each record. Notice that the program, 'surrounded by ticks' is given to awk as a command-line argument.
  cat file.txt | awk ' { print $1 } '
  
Another option is to ask awk to read its program from a source file. This is done with the -f option:
  cat file.txt | awk -f sourcefile.awk
  
The last option is to use AWK as an interpreter. In other words, we can create a first-class awk script, much like we did a shell script. Remember the mechanism at work here. A #! at the beginning of the file signals that what follows is the interpreter to run. Then, when this interpreter runs, it is given as its argument the name of the file so it can process it. An AWK script works exactly like this:
  #!/bin/awk -f
  { print $1 }
  
When AWK programs, especially one-liners, are embedded in shell scripts, the program is often passed on the command line. When a whole application is being developed in AWK, it is almost always written as an AWK script. Sometimes AWK programs are passed using the -f option when they are part of shell scripts, but too large to be squeezed onto a single line. There is no Right Way -- just what feels right.
The only place that newlines can appear within a program is immediately after }-closed-braces. If, for readability, lines need to be broken up, newlines can be added -- but only if the preceding line ends with the \-slash. Used in this context, the \-slash is known as the "continuation character". It must be the last thing on the line -- be careful about stray whitespace.
Because of this, AWK and csh shell scripts are not friendly. csh handles \-slashes differently. At a minimum, they will need to be escaped as \\. And, sometimes they'll take more fiddling than that. As a general rule, consider csh scripting and AWK incompatible. But, if you need to put the two together, it can be made to work.
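As a rough sketch of the first two ways of feeding AWK its program, the snippet below runs the same one-line program once from the command line and once from a source file, and checks that the results agree. The sample input and the temporary program file are made up for illustration, and the sketch assumes a Bourne-style shell (not csh) with mktemp available.

```shell
# Hypothetical input: two records, two fields each
input='one two
three four'

# 1. Program given directly on the command line, surrounded by ticks
inline=$(printf '%s\n' "$input" | awk '{ print $1 }')

# 2. The same program read from a source file with -f
prog=$(mktemp)
printf '{ print $1 }\n' > "$prog"
fromfile=$(printf '%s\n' "$input" | awk -f "$prog")
rm -f "$prog"

printf '%s\n' "$inline"
if [ "$inline" = "$fromfile" ]; then echo "same output either way"; fi
```

The third way, the #! script, is the same source file made executable with the interpreter line added on top.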

The Basics

An AWK program basically consists of a list of patterns and actions. As each record is read in, it is checked against each pattern. If the pattern matches, the action happens. If the record doesn't match a particular pattern, the corresponding action is skipped. If a record matches multiple patterns, multiple actions occur in the order that they are listed. If no patterns match, nothing happens.
If an action is specified, but the corresponding pattern is left blank, this is called the null pattern. The null pattern matches each and every record.
The reserved words BEGIN and END represent special patterns. BEGIN matches the beginning of the input -- before any records are read. END matches the ending of the input -- after all of the records are read. In other words, the action associated with BEGIN happens before anything else and the action associated with END happens after everything else.
An AWK program basically consists of patterns, their actions, and possibly functions. We're not going to discuss user-defined functions, but they are described in the AWK Manual and the Wikipedia Article. I'll throw a token example into the notes at the bottom, just for fun, though. The same is true of multidimensional arrays.
Each time a record is read, it is represented by positional variables. These work much like the positional variables of shell scripts. $0 represents the whole record. $1, $2, $3, ..., $n represent the fields, i.e., the 1st field, the 2nd field, the 3rd field, &c. Changing a positional variable changes the value of that field. So, if you change $2 and then print out $0, the record printed will be the mutated record.
The special variable NF is the number of fields within the current record. The special variable NR tells you the current record number, which, unless the RS has been changed, is the line number of the file currently being processed.
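A quick way to see the positional variables, NF, and NR in action is the sketch below. The two input lines are invented for illustration. The null pattern is used, so the action runs for every record.

```shell
# For each record: overwrite the 2nd field, then print the record number,
# the field count, and the (now mutated) whole record $0.
result=$(printf 'alpha beta gamma\none two\n' |
  awk '{ $2 = "CHANGED"; print NR ": " NF " fields: " $0 }')
printf '%s\n' "$result"
# 1: 3 fields: alpha CHANGED gamma
# 2: 2 fields: one CHANGED
```

Notice that assigning to $2 is enough to change what $0 prints; there is no separate step to rebuild the record.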
So, to take a step up from the trivial programs we've looked at so far, here's a more full example:
  #!/bin/awk -f

  # This script counts the number of one word lines
  BEGIN { \
    count=0; \
  }

  { \
    printf ("Processing line #%d\n", NR); \
  }

  NF==1 { \
    count++; \
  }

  END { \
    printf ("There are %d one word lines.\n", count); \
  }

  
Okay, in considering that example, here's what to notice:

The use of the special variable NR (Number of Records [so far])
The use of the special variable NF (Number of Fields [in current record])
The use of BEGIN for the initialization
The use of END to display the results at the end
The use of the null pattern (no pattern) to make the "Processing line..." message appear for each and every line
The C-like printf() function
All of the gawd-awful looking \-slash continuation characters to break up long lines
The ability to add new lines, without \-slashes, outside of {}-blocks.

Patterns can also be full regular expressions contained /within/-slashes. The ~-tilde operator checks to see if something matches a regular expression, !~ checks to ensure that something does not match.
Consider the example below. It counts the number of lines in which the first word contains two vowels back-to-back.
  #!/bin/awk -f

  # This script counts the number of lines in which the first word contains
  # back-to-back vowels
  BEGIN { \
    count=0; \
  }

  $1 ~ /.*[aeiou][aeiou].*/ { \
    count++; \
  }

  END { \
    printf ("There are %d lines in which the first word contains back-to-back vowels.\n", count); \
  }
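The script above uses ~ to select matching records. As a small companion sketch, the one-liner below uses !~ to select the records whose first word does NOT contain back-to-back vowels. The four input lines are made up for illustration.

```shell
# Print the first word of every line where that word has no two vowels in a row.
# "tree" (ee) and "book" (oo) match the regex, so !~ rejects them.
result=$(printf 'tree house\nsky blue\nbook shelf\nrhythm\n' |
  awk '$1 !~ /[aeiou][aeiou]/ { print $1 }')
printf '%s\n' "$result"
# sky
# rhythm
```

Because a regular expression matches anywhere within the string by default, the leading and trailing .* are not needed here.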
  

Most Other Syntax

Most other syntax follows C. Examples include for-loops, while-loops, if-else, &c. Consider the following example that counts the total number of words with back-to-back vowels:
  #!/bin/awk -f

  # This script counts the number of words with back-to-back vowels in the 
  # whole file
  BEGIN { \
    count=0; \
  }

  { \
    for (i=1; i <= NF; i++) { \
      if ($i ~ /[aeiou][aeiou]/) \
        count++; \
    } \
  }

  END { \
    printf ("There are %d words with back-to-back vowels.\n", count); \
  }
  

Associative Arrays and the for-each loop: Really Cool Features

AWK has always impressed me for its inclusion of associative arrays. I'm not much of a history buff -- but it has got to be one of the earliest real programming languages to include this feature, now popularized by Perl and Python.
For those who are unfamiliar, associative arrays basically work like hash tables. They are collections of buckets indexed by, well, virtually anything.
In class, someone beat me to the punch and asked a really good question, "How do you iterate through an associative array?" The answer is with a "for each" loop. For those of you who haven't seen these before, they are a recently popular feature that is also a part of many mainstream languages including C++, Perl, Python, and recently Java.
The basic idea for a for-each loop is that it is used to traverse a data structure. Generally the data structure is either an associative array or a data structure with a natural order. In the case of AWK, they are used to traverse associative arrays. The for-each loop provides a way of traversing the keys so you can visit the associative array without worrying about the fact that it is sparse. Or, heck, you can use the keys for any other purpose.
Although these loops are generally called "for-each" loops, they are read "for ___ in ___" and the syntax in AWK follows this.
The classic example of associative arrays and the for-each loop is a program that counts the number of occurrences of each word in a document. The code may vary slightly, but the game always stays the same. I can't help but use it here -- it's, well, the one generations have used to teach the next:
  #!/bin/awk -f

  # This script counts the number of occurrences of each word
  BEGIN { \
    count=0; \
  }

  { \
    for (i=1; i <= NF; i++) { \
      counts[$i]++; \
    } \
  }
  
  END { \
    for (word in counts) { \
      printf ("%s: %d\n", word, counts[word]); \
    } \
  }
  
Associative arrays are also really cool, because they can double as regular arrays -- and even multidimensional arrays. "1" is a key, right? And, so is "1,2". Want a 2-dimensional array? Just use indexes as below:
  grid[row "," col]
  
The example above gives me an opportunity to mention one thing I didn't explicitly mention in class. AWK's concatenation operator, just like the shell's, is a space. So, the example above takes the row, concatenates a ,-comma, and then concatenates the col. The resulting string is the index.
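Here is a minimal sketch of that 2-dimensional trick, run entirely inside a BEGIN block so it needs no input. The grid contents (row times col) are arbitrary and chosen only for illustration; the multi-line layout assumes nawk or gawk.

```shell
result=$(awk 'BEGIN {
  # Fill a 2x3 "grid"; each key is really the string row "," col
  for (row = 1; row <= 2; row++) {
    for (col = 1; col <= 3; col++) {
      grid[row "," col] = row * col;
    }
  }
  # The literal string "2,3" and the concatenation 1 "," 2 are both valid keys
  print grid["2,3"], grid[1 "," 2];
}')
printf '%s\n' "$result"    # 6 2
```

Since the key is just a string, looking up grid["2,3"] and grid[2 "," 3] reach the same bucket.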
About the only gotcha' is that there is no way of traversing it "in order", because hash tables aren't ordered. The same goes when we use one-dimensional arrays as traditional arrays -- for each doesn't necessarily walk through them in order, either.
If you are using AWK within a shell script, the easy thing to do is to use UNIX's "sort" command to put things in order. Unfortunately, AWK's only real data structure is the associative array -- so it can't do this internally.
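As a sketch of that approach, the pipeline below runs the word-count program from above and hands its unordered output to sort. The two input lines are invented for illustration.

```shell
# Count each word, then let sort(1) impose an order on the for-in output
result=$(printf 'the cat saw the dog\nthe dog ran\n' |
  awk '{ for (i = 1; i <= NF; i++) counts[$i]++ }
       END { for (word in counts) print word, counts[word] }' |
  sort)
printf '%s\n' "$result"
# cat 1
# dog 2
# ran 1
# saw 1
# the 3
```

Without the final sort, the five lines could come out in any order, and the order may differ between AWK implementations.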

AWK Special Variables

AWK has a few special variables that can be accessed from within the program. We've mentioned some before -- but we'll put them all here for reference:

FS - The field separator. Defaults to whitespace. Can be set directly and/or w/-F flag
RS - The record separator. Defaults to newline. Can be set directly (the -F flag sets only FS)
NR - The number of records [processed so far]. Keeps track of the current record number, usually the same as the line number
NF - The number of fields [within the current record].
OFS - The Output Field Separator. Used to separate fields in output if printing w/print rather than formatting, e.g. printf()
ORS - The Output Record Separator. Used to separate records in output if printing w/print rather than formatting.
FILENAME - The name of the current input file, or a - (dash) for a pipe. This can't be changed. Input files are named on the command line after the program; the -f flag names the program file, not the input
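The sketch below shows FS and OFS working together: the input is split on colons and print joins its comma-separated arguments with a dash. The input data is invented for illustration.

```shell
# FS controls how each record is split; OFS is what print puts between
# its comma-separated arguments. Both are set once, in BEGIN.
result=$(printf 'a:b:c\nd:e:f\n' |
  awk 'BEGIN { FS = ":"; OFS = "-" } { print $1, $3 }')
printf '%s\n' "$result"
# a-c
# d-f
```

Note that OFS applies only to print with commas; printf ignores it and uses exactly the format string it is given.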

Functions

Functions are available in nawk and gawk. I promised you I'd put an example within the notes. So, I'll deliver.
Before presenting it, I should mention that the way you get local variables within AWK is to declare extra arguments -- but not pass them in. Yep, you heard that right, it doesn't check the compatibility. So, you use the first arguments as arguments and the rest as locals. By convention you add a bunch of space between the two, so the caller knows what to do.
Notice that the \-slashes are gone. No continuation characters in this example. That's because functions are available with nawk and gawk, but not the original awk. And, as it turns out, only the original awk requires the continuation character at the end of interrupted lines.
  #!/bin/nawk -f
  
  {
    numbers[NR] = $0;
  }

  END {
    min = minValue(numbers);
    printf ("The minimum value is %d\n", min);
  }

  function minValue (numberList,      i, minNumber) {
    minNumber = "";
    for (i in numberList) {
      if ((minNumber == "") || (minNumber > numberList[i])) {
        minNumber = numberList[i];
      }
    }
    return minNumber;
  }
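To see that the extra arguments really do behave as locals, here is a small sketch. The function name sumTo and the values are hypothetical, chosen only for illustration. A global named i is set before the call and is still intact afterwards, even though the function uses its own i as a loop counter.

```shell
result=$(awk '
function sumTo(n,      i, total) {   # i and total are never passed: locals
  total = 0;
  for (i = 1; i <= n; i++) {
    total += i;
  }
  return total;
}
BEGIN {
  i = 99;               # a global that happens to share the name
  print sumTo(4), i;    # the global i is untouched by the call
}')
printf '%s\n' "$result"    # 10 99
```

If i were left out of the parameter list, the loop inside sumTo would use the global i and overwrite it.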
  

Built-in functions

AWK has tons of built in functions. For your reference a few of the math and string functions are listed below. They'll be mostly self-explanatory. But, if you need more details, check out the references listed on top.
Math functions:

atan2(y, x)
cos(x)
int(x)
log(x)
sin(x)
sqrt(x)

Random functions:

rand()
srand(x)

String functions:

gsub (regex, replacement, original) # substitution: all non-overlapping, returns number made
index (string, searchstring)
length (string)
match (string, regex)
sprintf(...)
sub (regex, replacement, original) # only first
substr(string, start, length)
tolower(string)
toupper(string)
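A few of the string functions above are exercised in the sketch below. The sample string is arbitrary; everything runs in a BEGIN block, so no input is needed.

```shell
result=$(awk 'BEGIN {
  s = "hello world";
  n = gsub(/o/, "0", s);     # replaces every o in s; returns how many: 2
  print n, s;
  # index and substr count positions starting from 1
  print index(s, "w"), length(s), toupper(substr(s, 1, 5));
}')
printf '%s\n' "$result"
# 2 hell0 w0rld
# 7 11 HELL0
```

Note that gsub and sub modify their third argument in place and return a count, rather than returning the new string.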

PERL - The Practical Extraction and Report Language

Perl isn't exactly a scripting language. But, we've got a spare day in the schedule -- and Perl is certainly one of the tools of the trade.
Perl is an important language for many "quick programs". It was originally designed, as its name implies, as a tool for system administrators and others to "extract and report" -- basically to process log files, &c. As a result, it has a tremendously flexible and powerful regular expression capability -- something lacking in C, C++, and, to some extent, in Java.
And, this capability, combined with a language designed to make the common case convenient, has made Perl the language of choice for not only system administrators, but also as the "glue" used by IT developers, Web developers, &c. Basically, Perl is an interpreted language -- somewhere between a shell script and a full-fledged compiled HLL (but much closer to a compiled HLL, in many respects).
Shell scripting provides an excellent way to solve complex tasks with very little effort. But, it does this by pulling together powerful programs, usually using files and pipes as IPC tools. And, these techniques can be slow and cumbersome.
By comparison, the building blocks in Perl tend to be a bit smaller, but much more integrated. As a result, shell is often excellent for solving small but complex problems quickly. Perl is often used for medium-sized problems. And, truly large problems might be better done in a compiled language. But, especially with the current availability of tremendous processing power -- economies there are often insignificant.

Hello World!

A Perl program looks much like a shell script, except the program exec'd by the shell to process the script isn't, well, a shell -- it is the Perl interpreter. And the program that it is interpreting isn't, well, written in the language of the shell -- it is written in Perl.
The program below shows the invocation of the Perl interpreter at the top of the program -- just like the shell -- and also a quick "Hello World!"
There are a few other features to note. Just like shell, comments begin with a #. Much like C, C++, or Java, statements end with a ";". And, lastly, quotes are interpreted just as they are in shell. "Double quoted strings allow for the interpretation of escapes, such as the newline\n", whereas 'single quoted strings are exactly literal -- no interpretation at all.'
  
  #!/bin/perl
  
  # The usual hello world program -- and an example of a comment
  
  print "Hello world.\n"; # Much like C, all statements end in a ;
  

Scalar Variables

Variables in Perl are typeless. They can hold strings, as well as characters, integers, and decimal numbers. Much like in shell, "typeless" really means "strings available for interpretation". But, in some ways, this interpretation is more natural in Perl. For example, mathematical operations can be performed without need for an external program.
In Perl, scalar variables always have the prefix $. We'll soon see that scalars, lists and arrays have different prefixes.
I guess I should also note that literal values can be used just as in other languages, except that, for example, '3' and 3, are equivalent. Why? Everything is typeless and interpreted on the fly.

The Arithmetic operators

Perl basically uses the same set of arithmetic operators as C -- plus some:

  
  $sum = 1 + 2;
  $difference = $value1 - $value2;
  $product = 5 * $value;
  $quotient = $value1 / $value2;
  $remainder = $value1 % $value2;
  $incrementafter++;
  ++$incrementbefore;

  # and, here's a new and very cool one: The power operator
  $value = $base ** $exponent;

  # The following also work, as usual...more soon
  # <, >, <=, >=, ==, !=

String operators

Strings in Perl seem to have been inspired by shell scripts. As we already discussed, the "" vs '' works the same way. And, variable substitution to form strings works the same way:
  
  $firstname = "Greg";
  $lastname = "Kesden";
  $fullname = "${firstname} ${lastname}";

  # and, eq, ne, ge, gt, le, and lt compare strings (==, !=, <, >, &c. compare numbers)
  
  
In addition, the "." and "x" operators are also lots of fun. "." is concatenate and "x" literally causes a string to be repeated. Incidentally, the ".=" operator works just fine, too.
Please note: Although I don't think we discussed it in the context of shell scripts, the ${var} notation is also part of shell. It is used to set off the name of a variable within a string. The reason is that in some cases, the string and the variable name would otherwise become impossible to distinguish, $varfollowedbysomethingelse, for example.
  
  $fullname = $firstname." ".$lastname;
  $treepeat = "Ninety-nine barrels of beer" x 3;
  
  
One important note for Java programmers is that Perl treats strings as values, not objects. So, when strings are assigned, values are copied, not aliased via references.

Arrays

Perl provides traditional indexed arrays -- with some really cool operators. In Perl, array variables begin with an @ instead of a $. But, when referencing a single element of an array, the $ is used, because the value is that of a single element, not the entire array. Array indexing begins with 0. As is the case with the Java ArrayList or C++ STL vector, Perl arrays grow dynamically.
The following code segment declares an empty array and also demonstrates the creation of an array with several initialized elements and the access, by index, of a single item within the array:
  
  @winners = (); # An empty array
  @contestants = ("Greg", "Mark", "Rich", "Tim", "Angie"); # initialized array

  print "${contestants[2]}\n"; # Prints Rich
  
  

Array operations

The push and pop operations should be pretty intuitive once you're thinking in the right context: think LIFO stack. push adds an item to the end (high index) of the array and then returns the length of the array. pop removes the last item and returns it. $#arrayname returns the index of the last item in the array -- not its length.
  
  @winners = (); # An empty array
  @contestants = ("Greg", "Mark", "Rich", "Tim", "Angie"); # initialized array

  print "${contestants[2]}\n"; # Prints Rich

  $winner = pop (@contestants);
  push (@winners, $winner);

  print "@winners\n";
  print "@contestants\n";

  print $#contestants, "\n"; # Prints 3, the index of the last remaining item
  
  
Arrays can also be used to initialize other arrays. Check this one out:
  $wildcard = "Jeff";

  @winners = (); # An empty array
  @contestants = ("Greg", "Mark", "Rich", "Tim", "Angie"); # initialized array

  print "${contestants[2]}\n"; # Prints Rich

  $winner = pop (@contestants);
  push (@winners, $winner);

  print "@winners\n";
  print "@contestants\n";

  @nextcontestants = ($wildcard, @winners);

  print "@{nextcontestants}\n";
  
  
Here's another nice array trick: Initializing a string from an array:
  
  @namesarray = ("Greg", "Mark", "Jeff", "Rich", "Tim");
  $names = "@namesarray";
  print "${names}\n";
  
  
But, we need to be careful. Check out the example below. Notice the absence of the quotes. This will assign the length of the array to the variable on the left:
  
  @namesarray = ("Greg", "Mark", "Jeff", "Rich", "Tim");
  $count = @namesarray;
  print "${count}\n";
  
  
Arrays can also be used in a bizarre way to make parallel assignments:
  # $item1 = $item1prime
  # $item2 = $item2prime
  ($item1, $item2) = ($item1prime, $item2prime);
  

Conditionals

Conditionals in Perl work much like conditionals in C, except that the braces are always required and they also offer the optional elsif construct, the counterpart of the elif that we saw in shell scripting:
  if ($x == $y) {
    # blah blah
  }
  else {
    # ha ha 
  }



  if ($x == $y) {
    # blah blah
  }
  elsif ($x == $a) {
    # blah blah
  }
  elsif ($x == $b) {
    # blah blah
  } 
  else {
    # blah blah
  }
  

The Traditional for and while loops

Perl has for and while loops that exactly mimic the syntax of C, C++, or Java:
  for ($count=0; $count < 10; $count++) {
    print "$count\n";
  }


  
  while ( $option ne "Quit") {
    dosomethinguseful();
  }
  

The foreach Loop

Perl also has a special form of the for loop designed to make array access more convenient. It basically allows you to walk through the array in an iterator-like fashion. Below is the same example written with both a foreach loop and a traditional for loop.
The new-fangled, but super-convenient foreach version:
  @contestants = ("Greg", "Mark", "Rich", "Tim", "Angie"); # initialized array

  foreach $contestant (@contestants) {
    print "$contestant\n";
  }
  
  
The venerable, familiar, good ol' fashioned version:
  @contestants = ("Greg", "Mark", "Rich", "Tim", "Angie"); # initialized array

  for ( $count=0; $count <= $#contestants; $count++) {
    print "${contestants[$count]}\n";
  }
  
  

Files

Perl file manipulation will be very familiar to those who have worked in Java or C. Files are manipulated through file handles, which are basically special identifiers used for open files. They are not prefixed with a $ and, by convention, they are written in all CAPITALs.
Below is a pretty typical example. It opens $filename as DATA_FILE and then uses the <> operator to construct an array representation. The lines are then printed from the array and the file is closed.
  # open the file named $filename and associate it with the handle DATA_FILE
  open (DATA_FILE, $filename);

  # Read each line of the file into the array @lines
  @lines = <DATA_FILE>;

  # Iterate through the lines, printing each one
  foreach $line (@lines) {
    print "$line";   # Notice: No \n. This is already at the end of the $line
  }

  # Close the file
  close (DATA_FILE);
  
Output to a file is often handled simply by specifying the handle before the formatting string of a print, as follows:
  print OUTPUT_FILE "This will land in the file!";
  
In addition to the plain open shown above, which opens the file for reading, Perl allows files to be opened for other types of access. This is done by placing a <, >, >>, +<, +>, or +>> before the file name, inside the quoted string, as shown below:

open (DATA_FILE, $filename); # No prefix: read-only, same as <
open (DATA_FILE, "<$filename"); # Read-only
open (DATA_FILE, "+<$filename"); # Read/Write, must exist prior
open (DATA_FILE, ">$filename"); # Write/create (truncates an existing file)
open (DATA_FILE, "+>$filename"); # Read/Write/Create (truncates an existing file)
open (DATA_FILE, ">>$filename"); # Append
open (DATA_FILE, "+>>$filename"); # Read/Write/Create/Append

Additionally, it shouldn't come as a surprise that STDIN, STDOUT, and STDERR are predefined file handles. They work as expected.
The last thing I want to mention is that piping is really just another form of file I/O. $filename can be replaced with "| command" or "command |", for output and input piping, respectively:

# Whatever is printed to DATA_FILE is piped to sort, which sends its output to a file
open (DATA_FILE, '| sort > sortedfile.txt');
# Invoke "ls -l" and tie it to DATA_FILE for input
open (DATA_FILE, 'ls -l |');

Regular Expressions and String Manipulation

These days, there are many great reasons to program in Perl. One of them is also the original one: its natural ability to play with strings and, in particular, regular expressions.
The following two operators, =~ (match) and !~ (no match), are among the most basic. In a scalar context, =~ returns true if a substring matching the regular expression is found in the supplied string and false if not, so it is usually used as a true/false expression. The "not in" operator !~ returns true if no match is found.
The general forms are as follows:
    $found = ($somestring =~ /regular expression/); 
    $notin = ($somestring !~ /regular expression/); 
  
perl also has a special variable, $_, which represents the default string. Several important operators act on this string by default. For example, perl can do sed-style searching and replacing. When this type of expression is defined, it is acting upon $_:
  $_ = "This is an example string: Hello World";

  $changes = s/World/WORLD/g;

  print "$_\n"; # "World" is now WORLD 

  print "$changes\n"; # The number of substitutions made; in this case, 1
  
The tr function is also very powerful. It acts much like the tr command. It allows the user to define a mapping of character-for-character substitutions and applies them to $_. Each character in the first field will be replaced by the corresponding character in the second field. As with the s function above, it returns the number of substitutions:
  $changes = tr/abc/123/; # a becomes 1, b becomes 2, c becomes 3
  
Please note: In the examples above, there are no quotes around the tr and s expressions. This is important. If the expressions are quoted, they'll be interpreted as strings and assigned, instead of interpreted as regex operations and performed.