January 22, 2008 (Lecture 3)

Setting Positionals

Unlike other variables, positionals can't be assigned values using the = operator. Instead, they can only be changed in a very limited way.

The set command sets these values. Consider the following example:

  set a b c
  # $1 is now a
  # $2 is now b
  # $3 is now c
  

One thing should be noted about the set command: it accepts arguments of its own, which begin with the - sign. As a result, it can get confused and begin to interpret values that it should be assigning to positionals as if they were its own flags. To avoid this, the -- flag can be used:

  set -- -a- -b- -c- 
  # $1 is now -a-
  # $2 is now -b-
  # $3 is now -c-
  

If there are more than 9 command-line arguments, there is a bit of a problem -- there are only 9 positionals: $1, $2, ..., $9. $0 is special and is the shell script's name.

To address this problem, the shift command can be used. It shifts all of the arguments to the left, throwing away $1. What would otherwise have been $10 becomes $9 -- and addressable. We'll talk more about shift after we've talked about while loops.
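
As a quick taste, here is a minimal sketch built on the set example from above:

  set -- a b c
  echo $1     # prints a
  shift
  echo $1     # prints b; the old $1 has been thrown away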

Quotes, Quotes, and More Quotes

Shell scripting has three different styles of quoting -- each with a different meaning. "Double quotes" group the enclosed text into a single value, but still allow $-variables within to be expanded. 'Single quotes' also group, but take everything inside them literally. `Backquotes` run the enclosed text as a command and substitute its output in place.

I think "quotes" and 'quotes' are pretty straight-forward -- and will be constantly reinforced. But, I do want to show an example using `quotes`:

  day=`date | cut -d" " -f1`
  printf "Today is %s.\n" $day
  

expr

The expr program can be used to manipulate variables -- normally interpreted as strings -- as integers. Consider the following "adder" script:

  sum=`expr $1 + $2`

  printf "%s + %s = %s\n" $1 $2 $sum
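
Assuming the script above is saved as adder.sh -- the name is just for illustration -- a run might look like this:

  sh adder.sh 3 4
  3 + 4 = 7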
  

A Few Other Special Variables

Beyond the positionals, the shell maintains a handful of other special variables that we'll lean on below: $# holds the number of positional arguments, $@ expands to all of the positionals, $? holds the exit status of the most recently exited program, and $$ holds the process ID of the shell running the script.

Predicates

The convention among UNIX programmers is that programs should return a 0 upon success. Typically a non-0 value indicates that the program couldn't do what was requested. Some (but not all) programmers return a negative number upon an error, such as file not found, and a positive number upon some other terminal condition, such as the user choosing to abort the request.

As a result, the shell notion of true and false is a bit backward from what most of us might expect. 0 is considered to be true and non-0 is considered to be false.

We can use the test command to evaluate an expression. The following example will print 0 if gkesden is the user and 1 otherwise. It illustrates not only test but also the use of the exit status variable, $?. $? is automatically set to the exit value of the most recently exited program. The notation $var, as in $LOGNAME, evaluates the variable.


  test "$LOGNAME" = "gkesden"
  echo $?
  

Shell scripting languages are typeless. By default, everything is interpreted as a string. So, when using variables, we need to specify how we want them to be interpreted -- and the operators we use vary with how we want the data interpreted.

Operators for strings, ints, and files:

  string   x = y     comparison: equal
           x != y    comparison: not equal
           x         true if x is not null / not zero-length
           -n x      true if x is not null (non-zero length)
           -z x      true if x is null (zero length)

  ints     x -eq y   equal
           x -ne y   not equal
           x -gt y   strictly greater
           x -ge y   greater or equal
           x -lt y   strictly lesser
           x -le y   lesser or equal

  file     -f x      is a regular file
           -d x      is a directory
           -r x      is readable by this script
           -w x      is writeable by this script
           -x x      is executable by this script

  logical  x -a y    logical and, like && in C (0 is true, though)
           x -o y    logical or, like || in C (0 is true, though)
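
For example, the file operators can be combined with -a (the file name is just for illustration):

  # exit status is 0 (true) only if notes.txt exists and is readable
  test -f notes.txt -a -r notes.txt
  echo $?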

[ Making the Common Case Convenient ]

We've looked at expressions evaluated as below:

  test -f somefile.txt
  

Although this form is the canonical technique for evaluating an expression, the shorthand, as shown below, is universally supported -- and much more reasonable to read:

  [ -f somefile.txt ]
  

You can think of the [] operator as a form of the test command. But, one very important note -- there must be a space to the inside of each of the brackets. This is easy to forget or mistype. But, it is quite critical.
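
For example (somefile.txt is hypothetical):

  [ -f somefile.txt ]     # fine
  [-f somefile.txt]       # error: the shell goes looking for a command named "[-f"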

Making Decisions

Like most programming languages, shell script supports the if statement, with or without an else. The general form is below:

  if command
  then
      command
      command
      ...
      command
  else
      command
      command
      ...
      command
  fi
  
The form without an else is similar:

  if command
  then
      command
      command
      ...
      command
  fi

The command used as the predicate can be any program or expression. The results are evaluated with a 0 return being true and a non-0 return being false.

If ever there is the need for an empty if-block, the null command, a :, can be used in place of a command to keep the syntax legal.
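
For example, the following is legal, if a bit silly, thanks to the : (somefile.txt is, again, hypothetical):

  if [ -f somefile.txt ]
  then
    :    # nothing to do -- the : keeps the empty block legal
  else
    printf "somefile.txt is missing\n"
  fi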

The following is a nice, quick example of an if-else:

  if [ "$LOGNAME" = "gkesden" ]
  then
    printf "%s is logged in" $LOGNAME
  else
    printf "Intruder! Intruder!"
  fi
  

The elif construct

Shell scripting also has another construct that is very helpful in reducing deep nesting. It is unfamiliar to those of us who come from languages like C. It is the elif, the "else if". This probably made its way into shell scripting because it drastically reduces the nesting that would otherwise result from the many special cases that real-world situations present -- without functions to hide complexity (shell does have functions, but they are more frequently used by csh shell scripters than traditionalists).

  if command
  then
    command
    command
    ...
    command
  elif command
  then
    command
    command
    ...
    command
  elif command
  then
    command
    command
    ...
    command
  fi
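
As a concrete sketch (the user names are just for illustration):

  if [ "$LOGNAME" = "gkesden" ]
  then
    printf "Hello, Greg.\n"
  elif [ "$LOGNAME" = "guest" ]
  then
    printf "Hello, guest.\n"
  else
    printf "Hello, %s.\n" "$LOGNAME"
  fi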
  

The switch statement

Much like C, C++, or Java, shell has a case statement -- its version of the switch. The form is as follows:

  case var
  in
  pat) command
              command
              ...
              command
              ;; # The two ;'s serve as the break
  pat) command
              command
              ...
              command
              ;;
  pat) command
              command
              ...
              command
              ;;
  esac
  

Here's a quick example:

   #!/bin/sh

   echo $1

   case "$1"
   in
     "+") ans=`expr $2 + $3`
          printf "%d %s %d = %d\n" $2 "$1" $3 $ans
          ;;
     "-") ans=`expr $2 - $3`
          printf "%d %s %d = %d\n" $2 "$1" $3 $ans
          ;;
     "*") ans=`expr $2 \* $3`
          printf "%d %s %d = %d\n" $2 "$1" $3 $ans
          ;;
     "/") ans=`expr $2 / $3`
          printf "%d %s %d = %d\n" $2 "$1" $3 $ans
          ;;

     # Notice this: the default case is a simple *
     *) printf "Don't know how to do that.\n"
        ;;
   esac
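
Assuming the script above is saved as calc.sh -- the name is just for illustration -- a couple of runs might look like this. Notice that the * must be quoted on the command line to keep the shell from expanding it:

   sh calc.sh + 2 3
   +
   2 + 3 = 5

   sh calc.sh "*" 2 3
   *
   2 * 3 = 6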
  

The for Loop

The for loop provides a tool for processing a list of input. The input to the for loop is a list of values. Each trip through the loop, it extracts one value into a variable and then enters the body of the loop. The loop stops when the extraction fails because there are no more values in the list.

Let's consider the following example, which prints each of the command line arguments, one at a time. We'll extract them from "$@" into $var:

  for var in "$@"
  do
    printf "%s\n" "$var"
  done
  

Much like C or Java, shell also has a break command. As you might guess, it can be used to break out of a loop. Consider this example, which stops printing command line arguments when it gets to one whose value is "quit":

  for var in "$@"
  do
    if [ "$var" = "quit" ]
    then
      break
    fi
    printf "%s\n" "$var"
  done
  

Similarly, shell has a continue that works just like it does in C or Java. This one can be used to censor me!

  for var in "$@"
  do
    if [ "$var" = "shit" ]
    then
      continue
    elif [ "$var" = "fuck" ]
    then
      continue
    elif [ "$var" = "damn" ]
    then
      continue
    fi
    if [ "$var" = "quit" ]
    then
      break
    fi
    printf "%s\n" "$var"
  done
  

The while and until Loops

Shell has a while loop similar to that seen in C or Java. It continues until the predicate is false. And, like the other loops within shell, break and continue can be used. Here's an example of a simple while loop:

  # This lists the files in a directory in alphabetical order
  # It continues until the read fails because it has reached the end of input

  ls | sort |
  while read file
  do
    echo "$file"
  done
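
This is also a good place to make good on the earlier promise about shift. $# holds the number of positionals remaining, so a while loop and shift together can walk through any number of command-line arguments -- even more than 9:

  while [ $# -gt 0 ]
  do
    printf "%s\n" "$1"
    shift
  done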
  

There is a similar loop, the until loop, that continues until the condition is successful -- in other words, while the command fails. This will pound the user for input until it gets it:

  printf "ANSWER ME! "
  until read answer
  do
    printf "ANSWER ME! "
  done
  

Regular Expressions

Much of what we do with shell scripts involves finding things and acting on them. We already know how to find and act upon exact matches. But many aspects of the shell language, such as the case statement, allow more open-ended matching. And the same is especially true of popular tools, such as grep and sed.

This more flexible matching is implemented through a language known as regular expressions. Using regular expressions, you can combine regular text and special symbols to match only certain things.

Special symbols aside, regular expressions match literally -- things match only if they are exactly the same. The special symbols, including ., *, $, ^, \, -, [, and ], need to be escaped using \ in order to be interpreted literally.

Here are some special symbols and their meanings:

  Symbol     Meaning
  .          Matches any single character
  ^          Matches the beginning of a line
  $          Matches the end of a line
  [ ]        Matches any one of the symbols within the brackets; hyphen (-) notation for ranges is okay
  [^ ]       Matches anything not within the list
  *          Matches 0 or more of the pattern to the immediate left
  {x,y}      Matches between x and y of the pattern to the immediate left, inclusive
  (x)        Saves a match of pattern x to a register, also known as "marking" an expression
  \n         Register #n (there are registers 1-9)
  &          The matched string -- in other words, the literal string that matched the pattern you supplied
  <word>     Matches word as a word, not as just any substring

Notes: within tools like grep and sed, the range and marking operators are written in escaped form -- \{x,y\} and \(x\) -- as in the sed examples below.

Examples of Regular Expressions
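
For example:

  ^$                matches an empty line
  ^[0-9][0-9]*$     matches a line consisting entirely of digits
  [Gg]reg           matches "Greg" or "greg"
  ^From:            matches lines that begin with "From:"
  \.txt$            matches lines that end in ".txt"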

Non-standard Features, Irregular Expressions

Not all regular expression implementations are equal. For example, classic grep does not accept {ranges} or allow the recall of marked expressions. The reason for this is that these features, although long-time features of various "regular expression" implementations, aren't regular.

"Regular expressions" are so-known because any language recognizable by a true regular expression is a regular language. In other words, if a language is recognizable by a regular expression, it should also be recognizable by an FSM, whether a DFA or an NFA. Ranges can be regular -- if they have a finite range. But, the marking and recall of regions is entirely incompatible with a regular language. DFAs and NFAs are equivalently powerful -- and have no memory beyond their states and transitions. There is no way of remembering an expression for use later. This would only be possible if there existed a unique state for each remembered expression. And, that isn't possible if expressions can have an infinite length -- an infinite number of states would be required.

The practical consequence of this is that "regular" expressions that implement these features aren't regular and need to be implemented with a technique such as backtracking instead of through a DFA or an NFA. The consequence is that the complexity goes through the roof -- and the running time can go down the drain.

As a consequence, some implementations of "regular expressions" leave out this type of feature in order to remain regular and preserve the bound on the recognizer's running time.

grep

grep is the standard tool for searching a text file for a substring that matches a particular pattern. Oftentimes this pattern is a literal, such as a word. But, on other occasions, it can be something defined using a regular expression.

grep assumes the input files are text files, rather than binary files. It assumes that each file is organized into one or more lines and reports those lines that contain matches for the provided regular expression.

Please check the man pages for the details of grep, but the basics are as follows:

  grep [flags] [pattern] [file_list]
  

The pattern can be any truly regular expression. The file_list is a space-separated list of the files to search. This list can include things defined using wildcards. The flags control the details of the way grep works. My favorite flags include the following:

  -i    ignore case while matching
  -v    invert the match -- report the lines that don't match
  -n    prefix each matching line with its line number
  -l    list only the names of the files that contain matches
  -c    print only a count of the matching lines

The following is an example of a grep:

  grep -n '[Gg]reg' *.txt people/*.txt
  

sed

sed, the stream editor, is another one of my favorite tools. Its history is somewhat interesting. Back in "the day", UNIX's editor was a simple line-editor affectionately known as ed. It was designed for a teletype and, as a consequence, displayed only one line at a time. But, to make things easier, it could do fairly sophisticated edits on that line, including searches and replaces, &c.

Eventually, with the advent of CRT terminals, it was extended into vi, the visual editor that we all know and love. sed, the stream editor, is another one of the original ed's offspring.

When it comes right down to it, sed is a programmable filter. Text is piped into its stdin, gets filtered, and gets pumped back out via stdout in its mutated form. The programming is done via regular expressions. Since sed is a filter, its normal behavior is to output everything that it is given as input, making any changes that it needs to make along the way.

Perhaps the most common use of sed is to perform some type of search and replace. Here's an example. It will change every instance of the |-pipe into a ,-comma. This might be used, for example, to change the delimiter within some database "flat file".

  cat somefile.txt | sed 's/|/,/g' > outfile
  

Notice that the input is piped into sed. Notice the output is captured from sed's standard out. It can also be piped into another process. Now, let's take a look at the program, itself: 's/|/,/g'.

The leading "s" is the command: substitute. A substitute consists of the "s" command, the pattern to be found, the replacement, and the flag(s). These three parameters are separated by a delimiter. This delimiter, by convention, is usually a /-slash. But, it can be anything -- the character immediately after the s-command is the delimiter. One might change it, for example, if one wants to use the delimiter as part of the pattern. I find myself doing this, for example, if my pattern is going to include directory names. At that point, I might use a |-pipe or a #-pound. For example, the following is equivalent but uses a #-pound as the delimiter:

  cat somefile.txt | sed 's#|#,#g' > outfile
  

Having explained that, let me also mention that the delimiter can always be escaped, using a \-backslash, and used as a literal within the pattern or replacement.

The pattern can be a regular expression, and the replacement is usually the literal replacement. The flag is usually empty, replacing only the first match within each line, or a "g" for global, replacing each and every match within a line. A specific number n can also be used as the flag, replacing only the nth match within each line. At this point, let's remember that sed is designed to work on text files. It has a routine understanding of a line.
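
To make the first-match versus global distinction concrete:

  echo "a|b|c" | sed 's/|/,/'     # prints a,b|c -- first match only
  echo "a|b|c" | sed 's/|/,/g'    # prints a,b,c -- every match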

sed can also be directed to pay attention to only certain lines. These lines can be restricted by number, or to those that match a particular pattern. The following affects only lines 1-10. Notice the restriction before the substitute.

  cat somefile.txt | sed '1,10 s/|/,/g' > outfile
  

The pattern below would affect lines 10 through the end:

  cat somefile.txt | sed '10,$ s/|/,/g' > outfile
  

The example below will operate only on those lines beginning with a number. Notice the pattern is contained within //'s:

  cat somefile.txt | sed '/^[0-9][0-9]*/ s/|/,/g' > outfile
  

Another one of my favorite uses of sed is to generate a more powerful grep. Remember, most greps work only with truly regular expressions. And, remember that most greps can't use {ranges}. Consider the example below where sed is used as a more powerful grep. It prints any line that begins and ends with the same 1-3 digit number:

  cat somefile.txt | sed -n '/^\([0-9]\{1,3\}\).*\1$/ p' > outfile
  

Okay. So, let's decode the example above. Recall that sed is normally a pass-through filter. Everything that comes in goes out, perhaps with some changes. The "-n" flag tells sed that it should not print. This, under normal circumstances, makes it quiet. But, in this case, we are using the "p" command to tell sed to print.

So, now we see the whole magic. First, we tell sed to be quiet by default. Then, we tell sed to print the selected lines. We make the line selection as we did above, by specifying a pattern.

While we are talking about the power of regular expressions within sed, let me mention one of its features: the &-ampersand. When used on the right-hand side of a substitute, the &-ampersand represents whatever was actually matched. Consider the following example:

  cat somefile.txt | sed 's#[0-9][0-9]*[-+*/][0-9][0-9]*#(&)#g' > outfile
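
For example, a line containing 12+34 becomes (12+34). Notice, too, the #-pound delimiter, which keeps the /-slash inside the bracket expression from being mistaken for the end of the pattern:

  echo "12+34" | sed 's#[0-9][0-9]*[-+*/][0-9][0-9]*#(&)#g'
  # prints: (12+34)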
  

sed has many more features that we're not discussing. "man sed" for more details.

cut

cut is a quick and dirty utility that comes in handy across all sorts of scripting. It selects one portion of a string. The portion can be determined by some range of bytes, some range of characters, or using some delimiter-field_list pair.

The example below prints the first three characters (-c) of each line within the file:

  cat file | cut -c1-3
  

The next example uses a :-colon as a field delimiter and prints the third and fifth fields within each line. In this respect, lines are treated as records:

  cat file | cut -d: -f3,5
  

In general, the ranges can be expressed as a single number, a comma-separated list of numbers, or a range using a hyphen (-).
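
For example, to print the first character plus characters 3 through 5 of each line (file is illustrative):

  cat file | cut -c1,3-5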

tr

The tr command translates certain characters within a file into certain other characters. It actually works with bytes within a binary file as well as characters within a text file.

tr accepts as arguments two quoted strings of equal length. It translates characters within the first quoted string into the corresponding character within the next quoted string. The example below converts a few lowercase letters into uppercase:

  cat file.txt | tr "abcd" "ABCD" > outfile.txt
  

tr can also accept ranges, as below:

  cat file.txt | tr "a-z" "A-Z" > outfile.txt
  

Special characters can be represented by escaping their octal value. For example, '\r' can be represented as '\015' and '\n' as '\012'. "man ascii" if you'd like to see the character-to-number mapping.

The "-d" option can be used to delete, outright, each and every instance of a particular character. The example below removes '\r' carriage-returns from a file:

  cat file.txt | tr -d "\015" > outfile.txt
  

A Note On Input and Output Files

Oftentimes in the process of shell scripting, we want to mutate a file in place. In other words, when piping, we want the input and output files to be the same file. The temptation is to write code as below:

  cat somefile.txt | tr "\015" "\012" > somefile.txt
  

Please let me assure you that no good can come from this. Depending on exactly how the pipeline executes, it is possible that it might miraculously work -- but that isn't the likely case. Too much badness can happen.

For example, if the output redirection is as above, the redirection will likely truncate the input file before it gets read by the first process in the pipeline. Another possibility is that it gets partially read. If the output redirection is an append (>>), the pipeline might never end. All sorts of corruption is possible.

Instead, save the output to a temporary file and then move the temporary file on top of the original. A first attempt at this might be as follows:

  #!/bin/sh

  cat ${1} | tr "\015" "\012" > tempfile.txt
  mv tempfile.txt ${1}
  

Although better, the example above is still not safe. Consider what could happen if multiple instances of the script are run at the same time. It is possible that the first will write to the file, then it will get over-written by the second. Then, the first will move it over its original file. Then the second will fail, because the file is now gone. This leaves the second script to run without any changes and the first with the results of the second's run. And, this is only one variation of the possible interference-related corruption.

To solve this problem, we need a unique filename for each temporary file. One quick-and-dirty way of achieving this is to append $$ to the file name. Recall that $$ is the process ID of the shell running the script. Since PIDs are unique at any point in time, this ensures that no two instances of the script end up using the same temporary file.

  #!/bin/sh

  cat ${1} | tr "\015" "\012" > ${1}.$$
  mv ${1}.$$ ${1}
  

Sometimes it is desirable to put these temporary files into /tmp or /usr/tmp. These directories are garbage collected upon a reboot.

  #!/bin/sh

  cat ${1} | tr "\015" "\012" > /usr/tmp/${1}.$$
  mv /usr/tmp/${1}.$$ ${1}
  

During class there was a brief sidebar about the possibility of exploiting temporary files within /tmp or /usr/tmp. The basic concern is that, since these directories are writeable by anyone, an attacker can possibly prevent the script from writing the file by creating one of the same name without write permissions. Then, the attacker could re-install write permissions, enabling the move. By doing this, the attacker can replace the original data with bogus data. Other attacks of a similar nature are also possible. To guard against these, one can choose a temporary name randomly, manage file system permissions more carefully, and/or pre-flight the files rigorously.
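
Where it is available, the mktemp utility handles the random naming for us. Here is a minimal sketch of the earlier script rewritten around it (the fixnl prefix is arbitrary):

  #!/bin/sh

  # mktemp creates a uniquely, randomly named file and prints its name
  tmpfile=`mktemp /tmp/fixnl.XXXXXX` || exit 1

  cat ${1} | tr "\015" "\012" > $tmpfile
  mv $tmpfile ${1}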

But, I want to emphasize that this type of concern is beyond the scope of this course. Our focus is on using shell scripts as tools to help with our own development -- not on deploying production-grade tools in hostile environments.