September 6, 2007 (Lecture 4)

Good References

AWK - An Overview

AWK is an early string processing language that has passed the test of time. It is named after Aho, Weinberger, and Kernighan, all highly esteemed computer scientists whose contributions are many.

If I described the purpose of AWK, it would sound very much like I was describing the purpose of Perl, with which many of you are probably familiar. AWK is a language designed for processing strings and is often used to extract, process, or otherwise munge data from reports or other structured text documents. Until the advent of Perl, AWK and sed were the workhorses of text processing.

These days, AWK is showing some of its age. It was designed at a time when processing power was scarce and compilers were weak. As a result, its syntax isn't always user friendly and it can appear quite dense. For this reason, Perl is now the preferred choice for most projects of medium-small size and up.

But, because AWK is so dense -- in its sweet spots it can accomplish complex tasks with very little code -- it remains the tool of choice for those "small jobs" that might well be big jobs elsewhere. Short AWK programs can be standalone, but they are often incorporated within scripts. Many are just a few lines -- and the one-liners are often the most impressive.

With the advent of Perl, Python, and the like, AWK hasn't become less popular -- it's just seen more targeted use as a quick and dirty tool.

Versions of AWK

There are basically three versions of AWK running around: the original AT&T version, "awk"; a newer AT&T version, "nawk"; and a version from GNU, "gawk".

In general, the original "awk" isn't really the standard anymore -- most systems have at least "nawk". And anyone who is anyone installs GNU's gawk. If you're using Linux, you're using gawk. gawk contains almost everything from nawk -- and some more. Just be careful to avoid using gawk-only features if you want compatibility with nawk.

The Basic Paradigm

AWK views its input as a collection of records, each of which contains fields. By default, each line is viewed as a record and fields are delimited by whitespace. But, both the field separator (FS) and the record separator (RS) can be changed.

AWK is basically event driven. It works its way through its input, from top to bottom, processing one record at a time. As it charges through the records, it can make use of global variables to keep track of things and functions to make the code reusable and readable. Its only real data structure is the associative array, i.e., a hash table, which can also double as a traditional indexed array (although it might not be as dense), whether single or multidimensional.

Each and every record of a file is processed. But, actions can be dependent on the record matching a certain pattern, for example, the first field containing a particular userid.
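For example, here's a quick sketch (any colon-delimited file would do; /etc/passwd is just a handy one) that resets FS in a BEGIN block and then prints the first field, the username, of each record:

  awk 'BEGIN { FS=":" } { print $1 }' /etc/passwd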

Feeding AWK its Programs

Perhaps the easiest way to write an AWK program is to put the program right on the command line. The following program prints the first field of each record. Notice that the program, surrounded by single quotes (ticks), is given to awk as a command-line argument.

  cat file.txt | awk ' { print $1 } '
  

Another option is to ask awk to read its program from a source file. This is done with the -f option:

  cat file.txt | awk -f sourcefile.awk
  

The last option is to use AWK as an interpreter. In other words, we can create a first-class AWK script, much like we did a shell script. Remember the mechanism at work here. A #! at the beginning of the file signals that what follows is the interpreter to run. Then, when this interpreter runs, it is given as its argument the name of the file so it can process it. An AWK script works exactly like this:

  #!/bin/awk -f
  { print $1 }
  

When AWK, especially an AWK one-liner, is embedded in a shell script, the program is often passed on the command line. When a whole application is being developed in AWK, it is almost always written as an AWK script. Sometimes AWK programs are passed using the -f option when they are part of shell scripts but too large to be squeezed onto a single line. There is no Right Way -- just what feels right.

The only place that newlines can appear within a program is immediately after }-closed-braces. If, for readability, lines need to be broken up, newlines can be added -- but only if the preceding line ends with the \-backslash. Used in this context, the \-backslash is known as the "continuation character". It must be the last thing on the line -- be careful about stray whitespace.

Because of this, AWK and csh shell scripts are not friendly. csh handles \-backslashes differently. At a minimum, they will need to be escaped as \\. And sometimes they'll take more fiddling than that. As a general rule, consider csh scripting and AWK incompatible. But, if you need to put the two together, it can be made to work.

The Basics

An AWK program basically consists of a list of patterns and actions. As each record is read in, it is checked against each pattern. If the pattern matches, the action happens. If the record doesn't match a particular pattern, the corresponding action is skipped. If a record matches multiple patterns, multiple actions occur, in the order that they are listed. If no patterns match, nothing happens.

If an action is specified, but the corresponding pattern is left blank, this is called the null pattern. The null pattern matches each and every record.

The reserved words BEGIN and END represent special patterns. BEGIN matches the beginning of the input -- before any records are read. END matches the end of the input -- after all of the records are read. In other words, the action associated with BEGIN happens before anything else and the action associated with END happens after everything else.
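Here's a tiny sketch of both (file.txt is just a placeholder name): BEGIN announces the run before any input is read, the null pattern counts every record, and END reports the total after all of the input has been read.

  awk 'BEGIN { print "Starting..." } { count++ } END { printf ("Read %d lines.\n", count) }' file.txt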

An AWK program basically consists of pattern-action pairs and, possibly, functions. We're not going to discuss user-defined functions, but they are described in the AWK Manual and the Wikipedia article. I'll throw a token example into the notes at the bottom, just for fun, though. The same is true of multidimensional arrays.

Each time a record is read, it is represented by positional variables. These work much like the positional variables of shell scripts. $0 represents the whole record. $1, $2, $3, ..., $n represent the fields, e.g., the 1st field, the 2nd field, the 3rd field, &c. Changing a positional variable changes the value of that field. So, if you change $2 and then print out $0, the record will be the mutated record.

The special variable NF is the number of fields within the current record. The special variable NR tells you the current record number, which, unless RS has been changed, is the number of the line of the file currently being processed.
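As a sketch of these in action (file.txt is again a placeholder), the one-liner below reports the record number and field count for each line, then mutates the second field and prints the rebuilt record:

  awk ' { printf ("record %d has %d fields\n", NR, NF); $2 = "CHANGED"; print $0 } ' file.txt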

So, to take a step up from the trivial programs we've looked at so far, here's a fuller example:

  #!/bin/awk -f

  # This script counts the number of one-word lines
  BEGIN { \
    count=0; \
  }

  { \
    printf ("Processing line #%d\n", NR); \
  }

  NF==1 { \
    count++; \
  }

  END { \
    printf ("There are %d one word lines.\n", count); \
  }

  

Okay, in considering that example, here's what to notice: the BEGIN action initializes the counter before any records are read; the action with no pattern runs on every record, printing a progress message; the NF==1 pattern selects only the one-word lines; the END action reports the total after all of the input has been read; and each line broken inside an action ends with a continuation character.

Patterns can also be full regular expressions contained /within/-slashes. The ~-tilde operator checks to see if something matches a regular expression; !~ checks to ensure that something does not match.
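As a small sketch of both operators (words.txt is a stand-in file name), the one-liner below prints the first word of any line that starts with a capital letter and counts the lines that do not:

  awk ' $1 ~ /^[A-Z]/ { print $1 } $1 !~ /^[A-Z]/ { misses++ } END { print misses+0 " lines did not" } ' words.txt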

Consider the example below. It counts the number of lines in which the first word contains two vowels back-to-back.

  #!/bin/awk -f

  # This script counts the number of lines in which the first word contains
  # back-to-back vowels
  BEGIN { \
    count=0; \
  }

  $1 ~ /.*[aeiou][aeiou].*/ { \
    count++; \
  }

  END { \
    printf ("There are %d lines in which the first word contains back-to-back vowels.\n", count); \
  }
  

Most Other Syntax

Most other syntax follows C. Examples include for-loops, while-loops, if-else, &c. Consider the following example that counts the total number of words with back-to-back vowels:

  #!/bin/awk -f

  # This script counts the number of words with back-to-back vowels in the 
  # whole file
  BEGIN { \
    count=0; \
  }

  { \
    for (i=1; i <= NF; i++) { \
      if ($i ~ /[aeiou][aeiou]/) \
        count++; \
    } \
  }

  END { \
    printf ("There are %d words with back-to-back vowels.\n", count); \
  }
  

Associative Arrays and the for-each loop: Really Cool Features

AWK has always impressed me for its inclusion of associative arrays. I'm not much of a history buff -- but it has got to be one of the earliest real programming languages to include this feature, now popularized by Perl and Python.

For those who are unfamiliar, associative arrays basically work like hash tables. They are collections of buckets indexed by, well, virtually anything.

In class, someone beat me to the punch and asked a really good question: "How do you iterate through an associative array?" The answer is with a "for each" loop. For those of you who haven't seen these before, they are a recently popular feature that is also part of many mainstream languages, including C++, Perl, Python, and, more recently, Java.

The basic idea of a for-each loop is that it traverses a data structure. Generally the data structure is either an associative array or a data structure with a natural order. In the case of AWK, for-each loops are used to traverse associative arrays. The for-each loop provides a way of traversing the keys so you can visit every element of the associative array without worrying about the fact that it is sparse. Or, heck, you can use the keys for any other purpose.

Although these loops are generally called "for-each" loops, they are read "for ___ in ___", and the syntax in AWK follows this reading.

The classic example of associative arrays and the for-each loop is a program that counts the number of occurrences of each word in a document. The code may vary slightly, but the game always stays the same. I can't help but use it here -- it's, well, the one generations have used to teach the next:

  #!/bin/awk -f

  # This script counts the number of occurrences of each word
  { \
    for (i=1; i <= NF; i++) { \
      counts[$i]++; \
    } \
  }

  END { \
    for (word in counts) { \
      printf ("%s: %d\n", word, counts[word]); \
    } \
  }
  

Associative arrays are also really cool because they can double as regular arrays -- and even multidimensional arrays. "1" is a key, right? And so is "1,2". Want a 2-dimensional array? Just use indexes as below:

  grid[row "," col]
  

The example above gives me an opportunity to mention one thing I didn't explicitly mention in class: AWK's concatenation operator, just like the shell's, is a space. So, the example above takes the row, concatenates a ,-comma, and then concatenates the col. The resulting string is the index.
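Here's a tiny sketch showing that the key really is just the concatenated string: we store under row and col, then read the value back using the literal string "3,5".

  awk 'BEGIN { row = 3; col = 5; grid[row "," col] = "hello"; print grid["3,5"] }'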

About the only gotcha is that there is no way of traversing it "in order", because hash tables aren't ordered. The same goes when we use one-dimensional arrays as traditional arrays -- for-each doesn't necessarily walk through them in order, either.

If you are using AWK within a shell script, the easy thing to do is to use UNIX's "sort" command to put things in order. Unfortunately, AWK's only real data structure is the associative array -- so it can't do this internally.
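For example, if the word-counting program from above lives in wordcount.awk (a placeholder name), its output can be sorted by count, highest first, like this:

  awk -f wordcount.awk file.txt | sort -rn -k2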

AWK Special Variables

AWK has a few special variables that can be accessed from within the program. We've mentioned some before -- but we'll put the most commonly used ones here for reference (see the references at the top for the complete list):
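  FS       - the input field separator (whitespace, by default)
  OFS      - the output field separator, used when $0 is rebuilt or fields are printed with commas
  RS       - the input record separator (newline, by default)
  ORS      - the output record separator
  NF       - the number of fields in the current record
  NR       - the number of records read so far
  FNR      - the record number within the current input file (nawk/gawk)
  FILENAME - the name of the current input file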

Functions

Functions are available in nawk and gawk. I promised you I'd put an example within the notes. So, I'll deliver.

Before presenting it, I should mention that the way you get local variables within AWK is to declare extra arguments -- but not pass them in. Yep, you heard that right: AWK doesn't check that the number of arguments matches. So, you use the first arguments as arguments and the rest as locals. By convention, you add a bunch of space between the two, so the caller knows what to do.

Notice that the \-backslashes are gone. There are no continuation characters in this example. That's because functions are available with nawk and gawk, but not the original awk. And, as it turns out, only the original awk requires the continuation character at the end of interrupted lines.

  #!/bin/nawk -f
  
  {
    numbers[NR] = $0;
  }

  END {
    min = minValue(numbers);
    printf ("The minimum value is %d\n", min);
  }

  # numberList is the real argument; i and minNumber are locals, declared
  # as extra parameters (note the extra space separating the two groups)
  function minValue (numberList,      i, minNumber) {
    minNumber = "";
    for (i in numberList) {
      if ((minNumber == "") || (minNumber > numberList[i])) {
        minNumber = numberList[i];
      }
    }
    return minNumber;
  }
  

Built-in functions

AWK has tons of built-in functions. For your reference, a few of the math and string functions are listed below. They'll be mostly self-explanatory. But, if you need more details, check out the references listed on top.

Math functions:
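  sin(x), cos(x), atan2(y, x) - the usual trigonometric functions
  exp(x), log(x), sqrt(x)     - exponential, natural logarithm, and square root
  int(x)                      - truncates x to an integer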

Random functions:
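  rand()      - returns a random number between 0 and 1
  srand(seed) - seeds the random number generator; with no argument, it uses the time of day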

String functions:
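  length(s)                           - the length of s (or of $0, with no argument)
  substr(s, start, len)               - the substring of s beginning at position start
  index(s, t)                         - the position of t within s, or 0 if it isn't there
  split(s, arr, fs)                   - splits s into the array arr on the separator fs
  sub(re, repl, s), gsub(re, repl, s) - replace the first (or every) match of re in s
  match(s, re)                        - the position where re matches s (sets RSTART and RLENGTH)
  sprintf(fmt, ...)                   - printf-style formatting into a string
  tolower(s), toupper(s)              - case conversion (nawk/gawk)

As a quick sketch of a couple of them together (file.txt is a placeholder), this one-liner capitalizes the first letter of the first word on each line:

  awk ' { print toupper(substr($1, 1, 1)) substr($1, 2) } ' file.txt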