September 2, 2010 (Lecture 4)

Predicates

The convention among UNIX programmers is that programs should return a 0 upon success. Typically a non-0 value indicates that the program couldn't do what was requested. Some (but not all) programmers return a negative number upon an error, such as file not found, and a positive number upon some other terminal condition, such as the user choosing to abort the request.

As a result, the shell notion of true and false is a bit backward from what most of us might expect. 0 is considered to be true and non-0 is considered to be false.

We can use the test command to evaluate an expression. The following example will print 0 if gkesden is the user and 1 otherwise. It illustrates not only test but also the use of the status variable, $?, which is automatically set to the exit value of the most recently exited program. The notation $var, such as $LOGNAME, evaluates the variable.


  test "$LOGNAME" = gkesden
  echo $?
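
The same status variable can be inspected after any command. As a small sketch, using the standard true and false utilities (which do nothing except exit with a success or failure status) and a path, /no/such/directory, that is only assumed not to exist:

```shell
#!/bin/sh
true
status_true=$?     # exit status of true: 0, which the shell treats as "true"

false
status_false=$?    # exit status of false: non-0, which the shell treats as "false"

# The path below is hypothetical; test -d fails when it is not a directory.
test -d /no/such/directory
status_missing=$?

echo "$status_true $status_false $status_missing"
```

Each status is saved into its own variable right away because $? is overwritten by the very next command that runs.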
  

Shell scripting languages are typeless. By default everything is interpreted as a string, so when using variables we need to specify how we want them to be interpreted. The operators we use therefore vary with how we want the data interpreted.

Operators for strings, ints, and files

  string
    x = y       comparison: equal
    x != y      comparison: not equal
    x           not null/not 0 length
    -n x        not null/not 0 length
    -z x        is null/0 length

  ints
    x -eq y     equal
    x -ge y     greater or equal
    x -le y     lesser or equal
    x -gt y     strictly greater
    x -lt y     strictly lesser
    x -ne y     not equal

  file
    -f x        is a regular file
    -d x        is a directory
    -r x        is readable by this script
    -w x        is writeable by this script
    -x x        is executable by this script

  logical
    x -a y      logical and, like && in C (0 is true, though)
    x -o y      logical or, like || in C (0 is true, though)
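
As a rough sketch of how these operators are used, the following exercises one operator from each family. The variable names and values (name, count) are assumptions made up for illustration; the only path used is /, which is assumed to exist.

```shell
#!/bin/sh
# Illustrative values only.
name="gkesden"
count=7

# String comparison: = compares the text of the two operands.
test "$name" = "gkesden" && echo "string match"

# Integer comparison: -gt interprets both operands as integers.
test "$count" -gt 5 && echo "count is greater than 5"

# File test: -d is true when the operand is a directory.
test -d / && echo "/ is a directory"

# Logical and: both sub-expressions must be true.
test -n "$name" -a "$count" -le 10 && echo "name is set and count <= 10"
```

Note that the same operand can be treated as a string in one test and as an integer in the next; the operator, not the variable, decides the interpretation.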

[ Making the Common Case Convenient ]

We've looked at expressions evaluated as below:

  test -f somefile.txt
  

Although this form is the canonical technique for evaluating an expression, the shorthand, as shown below, is universally supported -- and much more reasonable to read:

  [ -f somefile.txt ]
  

You can think of the [] operator as a form of the test command. But, one very important note -- there must be a space to the inside of each of the brackets. This is easy to forget or mistype. But, it is quite critical.
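
A short sketch of the equivalence, assuming only that / exists and is a directory:

```shell
#!/bin/sh
test -d /
long_form=$?

[ -d / ]
short_form=$?

# Both forms run the same check, so the two statuses agree.
echo "test: $long_form  [ ]: $short_form"
```

Without the spaces, as in [-d /], the shell would go looking for a command literally named "[-d" and would normally complain that it cannot be found.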

Making Decisions

Like most programming languages, shell script supports the if statement, with or without an else. The general form is below:

  if command
  then
      command
      command
      ...
      command
  else
      command
      command
      ...
      command
  fi
  
Without an else:

  if command
  then
      command
      command
      ...
      command
  fi

The command used as the predicate can be any program or expression. The results are evaluated with a 0 return being true and a non-0 return being false.

If ever there is the need for an empty if-block, the null command, a :, can be used in place of a command to keep the syntax legal.
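
A minimal sketch of the null command in use. The variable names dir and result are invented for illustration, and / is assumed to exist:

```shell
#!/bin/sh
dir="/"
result="ok"

if [ -d "$dir" ]
then
  :   # nothing to do; the colon keeps the then-block syntactically legal
else
  result="missing"
fi

echo "$result"
```

Leaving the then-block completely empty would be a syntax error, which is why the : is needed.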

The following is a nice, quick example of an if-else:

  if [ "$LOGNAME" = "gkesden" ]
  then
    printf "%s is logged in\n" "$LOGNAME"
  else
    printf "Intruder! Intruder!\n"
  fi
  

The elif construct

Shell scripting also has another construct that is very helpful in reducing deep nesting. It is unfamiliar to those of us who come from languages like C and Perl. It is the elif, the "else if". This probably made its way into shell scripting because it drastically reduces the nesting that would otherwise result from the many special cases that real-world situations present -- without functions to hide complexity (shell does have functions, but not named parameters -- and they are more frequently used by csh shell scripters than traditionalists).

  if command
  then
    command
    command
    ...
    command
  elif command
  then
    command
    command
    ...
    command
  elif command
  then
    command
    command
    ...
    command
  fi
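
As a concrete sketch of the form above, here is a chain that classifies a number. The names n and kind and the thresholds are assumptions chosen only for illustration:

```shell
#!/bin/sh
n=42

if [ "$n" -lt 0 ]
then
  kind="negative"
elif [ "$n" -eq 0 ]
then
  kind="zero"
elif [ "$n" -lt 10 ]
then
  kind="small"
else
  kind="large"
fi

printf "%d is %s\n" "$n" "$kind"
```

Written with nested if-else blocks instead, the same logic would need three levels of indentation and three closing fi's.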
  

The switch statement

Much like C, C++, or Java, shell has a case/switch statement. The form is as follows:

  case var
  in
  pat) command
              command
              ...
              command
              ;; # Two ;;'s serve as the break
  pat) command
              command
              ...
              command
              ;; # Two ;;'s serve as the break
  pat) command
              command
              ...
              command
              ;; # Two ;;'s serve as the break
  esac
  

Here's a quick example:

   #!/bin/sh

   echo "$1"

   case "$1"
   in
     "+") ans=`expr "$2" + "$3"`
          printf "%d %s %d = %d\n" "$2" "$1" "$3" "$ans"
          ;;
     "-") ans=`expr "$2" - "$3"`
          printf "%d %s %d = %d\n" "$2" "$1" "$3" "$ans"
          ;;
     # Quoted, so this matches only a literal asterisk.
     # On the command line it must be passed quoted, as '*'.
     "*") ans=`expr "$2" \* "$3"`
          printf "%d %s %d = %d\n" "$2" "$1" "$3" "$ans"
          ;;
     "/") ans=`expr "$2" / "$3"`
          printf "%d %s %d = %d\n" "$2" "$1" "$3" "$ans"
          ;;

     # Notice this: the default case is a simple *
     *) printf "Don't know how to do that.\n"
          ;;
   esac
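
The patterns in a case are shell wildcard patterns, not just fixed strings, so one branch can cover a whole family of values. A small sketch, with a made-up filename value:

```shell
#!/bin/sh
# filename is an invented value for illustration.
filename="notes.txt"

case "$filename"
in
  *.txt)     kind="text" ;;
  *.c|*.h)   kind="C source" ;;      # | separates alternative patterns
  [0-9]*)    kind="starts with a digit" ;;
  *)         kind="unknown" ;;       # default case
esac

printf "%s: %s\n" "$filename" "$kind"
```

The branches are tried in order and only the first matching one runs, which is why the catch-all * goes last.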
  

The for Loop

The for loop provides a tool for processing a list of input. The input to the for loop is a list of values. On each trip through the loop, it extracts one value into a variable and then enters the body of the loop. The loop stops when the extraction fails because there are no more values in the list.

Let's consider the following example which prints each of the command line arguments, one at a time. We'll extract them from "$@" into $var:

  for var in "$@"
  do
    printf "%s\n" "$var"
  done
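
The list does not have to come from the command line. As a sketch, the loop below walks a list written directly in the script; the words and the variable name joined are invented for illustration:

```shell
#!/bin/sh
joined=""
for word in alpha beta gamma
do
  # Append each value, wrapped in angle brackets, to the running result.
  joined="$joined<$word>"
done
echo "$joined"
```

After the loop, joined holds one bracketed entry per list item, in the order the items were listed.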
  

Much like C or Java, shell has a break command, also. As you might guess, it can be used to break out of a loop. Consider this example which stops printing command line arguments, when it gets to one whose value is "quit":

  for var in "$@"
  do
    if [ "$var" = "quit" ]
    then
      break
    fi
    printf "%s\n" "$var"
  done
  

Similarly, shell has a continue that works just like it does in C or Java. This one can be used to censor me!

  for var in "$@"
  do
    if [ "$var" = "shit" ]
    then
      continue
    elif [ "$var" = "fuck" ]
    then
      continue
    elif [ "$var" = "damn" ]
    then
      continue
    fi
    if [ "$var" = "quit" ]
    then
      break
    fi
    printf "%s\n" "$var"
  done
  

The while and until Loops

Shell has a while loop similar to that seen in C or Java. It continues until the predicate is false. And, like the other loops within shell, break and continue can be used. Here's an example of a simple while loop:

  # This lists the files in a directory in alphabetical order
  # It continues until the read fails because it has reached the end of input

  ls | sort |
  while read file
  do
    echo "$file"
  done
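
The predicate can also be an ordinary test. The following sketch, with invented variable names i and total, adds up the numbers 1 through 5 using expr for the arithmetic:

```shell
#!/bin/sh
i=1
total=0
while [ "$i" -le 5 ]
do
  total=`expr "$total" + "$i"`
  i=`expr "$i" + 1`
done
printf "total = %d\n" "$total"
```

The loop ends as soon as [ "$i" -le 5 ] returns non-0, which happens when i reaches 6.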
  

There is a similar loop, the until loop, that continues until the condition is successful -- in other words, while the command fails. This will pound the user for input until it gets it:

  printf "ANSWER ME! "
  until read answer
  do
    printf "ANSWER ME! "
  done
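
For a version that does not need a user at the keyboard, here is a sketch of an until loop driven by a counter. The name tries and the limit of 3 are assumptions for illustration:

```shell
#!/bin/sh
tries=0
until [ "$tries" -ge 3 ]
do
  tries=`expr "$tries" + 1`
  printf "attempt %d\n" "$tries"
done
```

The body runs while the test fails, so it executes for tries equal to 0, 1, and 2 and stops once the test succeeds.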
  

Regular Expressions

Much of what we do with shell scripts involves finding things and acting on them. We already know how we can find and act upon exact matches. But many aspects of the shell language, such as the case, allow more open-ended matching. And the same is especially true of popular tools, such as grep and sed.

This more flexible matching is implemented through a language known as regular expressions. Using regular expressions, you can combine regular text and special symbols to match only certain things.

Special symbols aside, regular expressions match literally. So, things match only if they are exactly the same. Special symbols, including ., *, $, ^, \, -, [, and ], need to be escaped using \ in order to be interpreted literally.

Regular expressions are composed of literals and metacharacters. Most symbols used within regular expressions are literals, which match exactly themselves. Literals, for example, include, in most contexts, all of the letters and numbers. Metacharacters are symbols that have special meanings. Some of the more common metacharacters are listed below:

  Symbol    Meaning
  .         Matches any single character
  ^         Matches the beginning of a line
  $         Matches the end of a line
  []        Matches any one of the symbols within the brackets; hyphen (-) notation for ranges is okay
  [^]       Matches any one symbol not within the list
  *         Matches 0 or more of the pattern to the immediate left
  {x,y}     Matches between x and y of the pattern to the immediate left, inclusive
  (x)       Saves a match of pattern x to a register, also known as "marking" an expression
  \n        Register #n (there are registers 1-9)
  &         The matched string. In other words, the literal string that matched the pattern you supplied
  <word>    Matches word as a whole word, not as a substring of a longer word

Notes:

Examples of Regular Expressions
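
A few small examples, sketched with grep against sample lines that are invented purely for illustration:

```shell
#!/bin/sh
# Sample lines, made up for this sketch.
sample='cat
cart
concat
dog
c.t'

# $( ) is used rather than backticks so that backslashes reach grep unchanged.

# . matches any single character, so both cat and c.t match.
any_char=$(printf '%s\n' "$sample" | grep '^c.t$')

# \. matches only a literal period: c.t
literal_dot=$(printf '%s\n' "$sample" | grep '^c\.t$')

# $ anchors the match to the end of the line: cat and concat.
ends_in_at=$(printf '%s\n' "$sample" | grep 'at$')

# [a-d] matches one character in the range a through d: dog.
in_range=$(printf '%s\n' "$sample" | grep '^[a-d]og$')

# * allows 0 or more of the preceding r: cat and cart.
optional_r=$(printf '%s\n' "$sample" | grep '^car*t$')

printf '%s\n' "$any_char" "$literal_dot" "$ends_in_at" "$in_range" "$optional_r"
```

Each pattern is single-quoted so that the shell passes the metacharacters to grep untouched instead of trying to expand them itself.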

Non-standard Features, Irregular Expressions

Not all regular expression implementations are equal. For example, some versions of grep do not accept {ranges}, or allow the recall of marked expressions. The reason for this is that these features, although long-time features of various "regular expressions", aren't regular.

"Regular expressions" are so named because any language recognizable by a true regular expression is a regular language. In other words, if a language is recognizable by a regular expression, it should also be recognizable by a FSM, whether a DFA or an NFA. Ranges can be regular -- if they have a finite range. But, the marking and recall of regions is entirely incompatible with a regular language. DFAs and NFAs are equivalently powerful -- and have no memory beyond their states and transitions. There is no way of remembering an expression for use later. This would only be possible if there existed a unique state for each remembered expression. And, that isn't possible if expressions can have an infinite length -- an infinite number of states would be required.

The practical consequence of this is that "regular" expressions that implement this feature aren't regular and need to be implemented with a technique such as backtracking instead of through a DFA or NFA. The consequence is that the complexity goes through the roof -- and the running time can go down the drain.

As a consequence, some implementations of "regular expressions" leave out this type of feature in order to remain regular and preserve the recognizer's bound on the running time.
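
To make marking and recall concrete, here is a sketch using sed, which is one tool that does support them. It assumes sed's basic syntax, in which marking is spelled \( \); the input strings are invented for illustration.

```shell
#!/bin/sh
# $( ) is used rather than backticks so that the backslashes reach sed unchanged.

# \( \) marks two sub-patterns; \2 and \1 recall them in the replacement.
swapped=$(echo "hello world" | sed 's/\(hello\) \(world\)/\2 \1/')
echo "$swapped"      # world hello

# & stands for the entire matched string.
bracketed=$(echo "abc" | sed 's/b/[&]/')
echo "$bracketed"    # a[b]c

# Recall inside the pattern itself: \1 must repeat whatever \(.*\) matched,
# so this recognizes a line made of some string written twice.
doubled=$(echo "abab" | sed 's/^\(.*\)\1$/twice: \1/')
echo "$doubled"      # twice: ab
```

The last pattern is the kind that a true FSM cannot handle: to check the second half, the matcher has to remember an arbitrarily long first half.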

A Note On Escaping

What if we want to match a *-asterisk? Or a ^-caret? In order to match the literal value of a character that happens to be a metacharacter, it is necessary to "escape" it. This is done by placing a \-backslash before the character, for example, "\.". And, if one wants a literal backslash -- well, backslash-backslash, "\\", of course.
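
A brief sketch of the difference escaping makes, using grep on made-up input lines:

```shell
#!/bin/sh
# $( ) is used rather than backticks so that the backslashes reach grep unchanged.

# Unescaped, * means "0 or more of the preceding a", so both lines match.
unescaped=$(printf '%s\n' 'a*b' 'aab' | grep -c 'a*b')
echo "$unescaped"    # 2

# Escaped, \* matches only a literal asterisk.
escaped=$(printf '%s\n' 'a*b' 'aab' | grep 'a\*b')
echo "$escaped"      # a*b

# A literal backslash is written as \\ in the pattern.
backslash=$(printf '%s\n' 'C:\temp' 'C:/temp' | grep 'C:\\temp')
printf '%s\n' "$backslash"    # C:\temp
```

The final result is printed with printf rather than echo because some versions of echo would themselves interpret the \t in the output as a tab.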

One thing that often confuses those new to Java is that, in order to match, for example, a tab, the \ itself must be escaped -- "\\t". The reason for this is that, unless the \-backslash is escaped, it will be translated into the escaped character before it makes its way into the library function. This is true, of course, for all of the escaped characters (\d, \r, etc.).

Greedy vs Reluctant vs Possessive

Let's consider the quantifiers: ?, *, {x,y}, and +. If we construct an expression that contains pairs of these back-to-back, we can run into interesting questions. For example, please consider the following regular expression:

  (.*)([a-zA-Z]*)(.*)
  

And, the following input strings:

  12345abcDEF67890
  abcdefghijklmnop
  

In the first case, it is fairly clear that "12345" needs to be matched against the initial ".*". But, beyond that, it becomes less clear. Since the ".*" specifies, in effect, 0 or more of anything, it can eat the entire input. Should this happen, the subsequent "[a-zA-Z]*" and ".*", which request zero or more, can be satisfied with nothing. Or, as another example, the first group could match "12345abcD", leaving "E" for the second group, which could subsequently leave "F67890" for the third group. How are these conflicts resolved?

Most programs use greedy regular expressions by default. This means that, moving from left to right, each greedy quantifier will match as much as it can -- without breaking the rest of the expression.
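
One way to watch the greedy behavior is with sed, whose quantifiers are greedy. This is a sketch, assuming sed's basic syntax, in which the groups are written \( \); the variable names are invented, and the square brackets in the replacement are only there to make the group boundaries visible.

```shell
#!/bin/sh
# The three groups from the expression above, each echoed back inside [ ].
script='s/\(.*\)\([a-zA-Z]*\)\(.*\)/[\1][\2][\3]/'

# $( ) keeps the command substitution simple to read.
first=$(echo "12345abcDEF67890" | sed "$script")
second=$(echo "abcdefghijklmnop" | sed "$script")

echo "$first"     # [12345abcDEF67890][][]
echo "$second"    # [abcdefghijklmnop][][]
```

In both cases the leading ".*" takes the entire input, and the two groups that follow are satisfied with nothing.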