January 5, 2008 (Lecture 5)

Overview

Over the last couple of classes, we've taken a good look at the syntax of the shell's language, as well as how shell scripts are commonly used to solve problems. Today, we're going to take a look at some more challenging, more interesting concepts. Specifically, we are going to look at subshells, the nature of pipes, and writing shell scripts that can safely be run by multiple users at the same time.

Pipes as an Inter-Process Communication (IPC) Primitive

We've made pretty heavy use of pipes this semester. In a very real way, they are the glue that ties shell scripts together. But, interestingly enough, pipes are tools for "Inter-Process Communication (IPC)". In other words, they connect two different processes together.

And, at the command line, this makes sense. Consider "ls | cat | more". If we launch this from the UNIX shell, we've got four processes in play: the shell itself, ls, cat, and more. In order to launch each new command, the shell "forks" a new process and then "execs" the desired program within it. So, one pipe bridges "ls" and "cat", and a second pipe bridges "cat" and "more". And, based on our experience using the shell, this makes good sense.

But, how does this work within a shell script? Exactly the same way. The shell interpreter is running the shell script. But, each time the shell runs a command, it forks a new process and execs that command. So, each command is a different process and can be bridged using a pipe.
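If you'd like to see the separate processes for yourself, here is a tiny sketch of my own (the PIDs printed will, of course, vary from run to run):

  # each stage of a pipeline is a separate process with its own PID
  echo "interpreter pid: $$"
  sh -c 'echo "child pid: $$"' | cat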

Pipes Can't Edit In Place

Oftentimes I want to edit a file by applying filters using pipes. The following example illustrates a very common problem. It is intended to translate carriage-returns into line-feeds using tr and then wrap long lines using fold. And, depending on your use cases, it'll often seem to work -- but it is badly broken. Do you see the problem?

  cat somefile.txt | tr "\015" "\012" | fold > somefile.txt
  

The input file at the beginning of the pipeline and the output file at the end are the same file. If this file is small, and is read in its entirety before output begins to emerge from the fold at the end, this example may appear to work. But, life becomes interesting when the fold emits output before the input file is fully read by cat. When fold's output file is opened for writing, it is truncated and then overwritten from the beginning. So, as cat proceeds to read the file, it will either find it empty and end early, or end up swallowing newly written content rather than the original data.

To understand the nature of the problem, we need to understand how a pipe works. It is a finite buffer -- a typical size might be 8K. If the input file is smaller than this, it is likely to be quickly read in and placed within the buffer in its entirety. In this case, it probably gets there before the last command in the pipeline begins producing output -- and before the truncation of the input file can do any harm.

But, think about what happens if the input file is larger than the buffer. If cat starts feeding the pipe and gets ahead of the processing in the pipeline, the buffer can become full. When this happens, the cat can't write into the full buffer, so the operating system temporarily pauses it until it can. This temporary pausing is called blocking. In the meantime, the output might start flowing, truncating the original input file and making it impossible for the cat to get the rest of it.
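If you'd like to watch blocking happen, here is a throwaway one-liner of my own: yes produces output endlessly, while sleep never reads its input. Once the pipe's buffer fills, the operating system blocks yes until sleep exits.

  # yes blocks once the pipe buffer fills; when sleep exits, the pipe
  # closes and yes is killed by SIGPIPE
  yes | sleep 5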

In order to solve this problem, we always send the output to a temporary file. Then, once done, we move the temporary file back over the original. Consider the example below:

  cat ${1} | tr "\015" "\012" | fold > ${1}.tmp
  mv ${1}.tmp ${1}
  

The example above is a step in the right direction, but it isn't quite as we'd like. If two users run the same program at the same time, it can run into problems. Both instances of the script will try to make use of the same temporary file -- even though they are likely to be at different points in the process and operating on different input files. The first one to start will own the file and get to make use of it -- the other one is likely to lose. It won't be able to write to the file, and might not be able to read it. So, at best, it does nothing, as it can neither read nor write the temporary file. At worst, it takes the output of the other instance as its result.

To solve this problem, we make use of the $$ special variable. You'll recall that each instance of the shell script will have its own process ID. As a result, the $$ acts as a great uniquifier -- it makes sure that each instance operates upon its own temporary file.

  cat ${1} | tr "\015" "\012" | fold > ${1}.tmp.$$
  mv ${1}.tmp.$$ ${1}
  

As a final, minor revision, we'll put the temp file into the "/usr/tmp" directory. The "/usr/tmp" and "/tmp" directories are writable by anyone. And, as an added bonus, these directories are cleared upon reboot. This way, if we forget to clean up our temp files, there is some hope that they'll eventually get thrown away. Also, by writing into this directory, we make the purpose of these files clear -- they are temporary files. I've also enclosed the file names within ""-quotes -- it makes them more robust in the event of unescaped spaces.

  cat "${1}" | tr -d "\015" "\012" | fold > "/usr/tmp/${1}.$$"
  mv "/usr/tmp/${1}.$$" "${1}"
  


Security-Minded Code

For those who would like to write industrial-strength code, you might want to look up the mktemp (man mktemp) command. Rather than using the PID from the $$-variable to uniquify a file name, it uses a random string. In the common case, this isn't important -- the name is unique either way. But, it can make a difference if security is a concern.
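As a sketch of what our earlier cleanup script might look like using mktemp -- the name "cleanup" is my own illustration, and option details vary a bit across systems:

  # mktemp replaces the XXXXXX with a random string, creates the file
  # safely, and prints the resulting name
  TMPFILE=`mktemp /usr/tmp/cleanup.XXXXXX` || exit 1
  tr "\015" "\012" < "${1}" | fold > "${TMPFILE}"
  mv "${TMPFILE}" "${1}"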

Since PIDs are relatively small numbers, if the PID is used as a uniquifier, a hacker can create all possible temp files in advance of the script's execution. If this happens, the hacker can set the permissions on the temporary files to prevent the script from writing to them. Or, even worse, make the temporary files world-readable, enabling the hacker, and others, to access otherwise protected data. For this reason, if you are interested in security, it is always a good idea to verify the mode of the temporary files, or at least their prior non-existence, within a script, before beginning to use them.

Also, since the PIDs on a system increase sequentially and then roll over as processes are created, they are very predictable. This predictability makes it possible to launch a high-precision attack that needs to create only a few temp files.

Regardless, this is mostly FYI -- security issues are beyond the scope of this course. But, you'll get there, I promise!


Pipes, Loops, and Subshells

Consider the example below. Notice the pipe between the cat and the while loop:

  #!/bin/sh

  FILE=${1}

  cat ${FILE} |
  while read value
  do
    echo ${value}
  done
  

This is a bit curious. A pipe can only connect two different processes. The while loop is interpreted by the shell. How does this work? Where is the second process?

The answer is that the shell creates a subshell for the while loop. A subshell is a separate shell spawned by the primary shell. The loop is run within the subshell rather than within the primary shell. So, the pipe connects the primary shell's "cat" with the subshell's "read". Voila, we have genuine IPC within a shell script.

Danger, Danger! Hidden Consequences

Check out the following example. Does it work as apparently intended? What do you think?

  #!/bin/sh

  FILE=${1}
  max=0

  cat ${FILE} |
  while read value
  do
    if [ ${value} -gt ${max} ]; then
      max=${value}
    fi
  done

  echo ${max}
  

Well, we all know the rules of classroom-style "Correct or Not?" It surely isn't correct. But, what is the problem? The logic seems fine, but when we run it, it acts weird. At the end, "max" is still 0. But, if we inspect it within the loop, such as with an echo, it is updating correctly.

The problem here has to do with subshells. We already know that a new subshell needs to be created for the while loop. So, what does that mean for the "max" variable? How does it get into the subshell? Well, the new subshell gets a clone of most of the parent shell's variables. So, we've actually got two "max" variables -- one in the original shell and one in the loop's subshell. The loop updates its own copy of "max" -- leaving the "max" within the original shell unchanged.

Okay. So what is the fix? Well, I've been looking for one for years. And, to be honest, I still don't know a good way of making this subshell "pass by in-out", such that the results get copied back. Instead, we are going to try a bit of a paradigm shift.

Instead of piping the results into the loop, we are going to capture them into a variable, and then iterate through the resulting list. Notice that, in the example below, there is no pipe and, as a result, no subshell to cause problems. No variables are shadowed. This "capture and loop" idiom is very common in shell scripting.

  #!/bin/sh

  FILE=${1}
  max=0

  values=`cat ${FILE}`

  for value in ${values}
  do
    if [ ${value} -gt ${max} ]; then
      max=${value}
    fi
  done

  echo ${max}
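
As an aside, many shells offer one more pipe-free variant: redirect the file into the loop, rather than piping into it. In most modern Bourne-compatible shells (bash, ksh, dash), a redirected while loop runs in the primary shell, so "max" survives the loop. Some older Bourne shells ran redirected loops in subshells, though, so treat this as a sketch to verify on your own system:

  #!/bin/sh

  FILE=${1}
  max=0

  # no pipe here, so (in most modern shells) no subshell -- max is
  # not shadowed and keeps its final value after the loop
  while read value
  do
    if [ ${value} -gt ${max} ]; then
      max=${value}
    fi
  done < ${FILE}

  echo ${max}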

  

Creating Your Own Subshells

In the past, I've mentioned that ()-parentheses need to be escaped if they are to be used literally within a shell script. But, I have never explained their syntactic meaning. When commands are placed within ()-parentheses, we are asking, explicitly, for them to be placed into their own subshell.

One might choose to place commands into a subshell for any number of reasons. Perhaps the most common reason is to isolate environmental changes to only the small set of commands within the subshell. Consider the following example:

  #!/bin/sh

  pwd
  ( cd src; make 2> errors.log ) 
  pwd
  
The above example makes use of the make utility, which is a tool to manage compilation and other aspects of building software. We'll talk about it soon enough. But, for now, the interesting part of the example is the use of the subshell.

Notice that the working directory is changed within the subshell -- and that it does not affect the working directory of the primary shell. This means that, once the subshell is done, our shell script is in the same directory as it started -- and it did not need to remember the directory in order to move back.

On a multiprocessor system, we might want to compile multiple parts of the same project at the same time. We can do this by doing each compilation within a background subshell and stacking them up. Consider the example below. Notice that we don't wait for the first one to finish before starting the next, and so on. If we've got multiple processors, or even multiple cores, this lets us make use of more of our processing power at the same time.

  #!/bin/sh

  pwd
  ( cd src/tools; make 2> errors.log ) &
  ( cd src/server; make 2> errors.log ) &
  ( cd src/client; make 2> errors.log ) &
  pwd
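
One caveat worth adding (my note, not part of the original example): the script can reach its final pwd, and even exit, while the background builds are still running. The shell builtin wait blocks until all of the background children have finished:

  #!/bin/sh

  ( cd src/tools;  make 2> errors.log ) &
  ( cd src/server; make 2> errors.log ) &
  ( cd src/client; make 2> errors.log ) &

  # wait, with no arguments, blocks until every background child exits
  wait
  echo "all three builds done"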
  

With this in mind, if we look back at the examples above, we see that the ()-parentheses help us to group multiple commands into the same subshell. Without the parentheses, we'd still have a subshell, but it would only include the make -- the cd would be part of the primary shell. The ;-semicolon just separates commands -- it does not create a subshell.

To be clear, the following script runs the cds within the primary shell, but runs the builds within individual background subshells. Notice the need to manage the working directory ourselves:

  #!/bin/sh

  pwd

  cd src/tools; make 2> errors.log &
  cd ../server; make 2> errors.log &
  cd ../client; make 2> errors.log &

  cd ../..
  pwd
  

Background Execution Requires a Subshell

It is also worth noting that any time a command is run in the background, it is run in a new subshell. This is the only way that the command can be left to run without the rest of the script waiting for it to finish. This includes loops, as shown in the example below:

  for machine in ${machines} 
  do
    rsync -avSHz -e ssh Updates/* root@${machine}:/Updates/
    ssh root@${machine} /usr/local/scripts/installupdates.sh
  done &
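
Here's a tiny sketch of my own that makes the hidden subshell visible -- a backgrounded cd cannot change the primary shell's working directory:

  #!/bin/sh

  pwd
  cd /tmp &    # the cd runs in a background subshell
  wait         # let the background job finish
  pwd          # unchanged -- the subshell's cd died with the subshell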
  

A Brush With Concurrency

When we talk about concurrency, we are talking about the situation where multiple related things are occurring at the same time. Oftentimes the concurrency involves critical resources. If you take operating systems or databases, you'll likely learn a very formal definition of a critical resource. But, for our purposes, a critical resource is something that is shared -- but cannot safely be used in an unrestricted way by more than one process.

Earlier today, when we discussed the use of the $$-special variable to create unique temp file names, the situation we were in looked a whole lot like a concurrency problem. We had multiple instances of the script sharing the same temporary file -- and making a mess of it. But, as our solution showed, it was really false concurrency. There was no reason for the two instances to share the same file. So, instead of finding a way to make the concurrent use safe, we eliminated the false concurrency by giving each process its own unique file.

But, now, I'd like to consider a genuine concurrency problem and figure out the characteristics of a solution and develop an idiom we can use to solve similar problems.

Real Concurrency: A CGI Example

Last semester, in 15-123, we gave on-line exams. Students were asked to sign up for their choice of time slots. In order to make the process straightforward, we wanted to do this on-line, so late one night, I hacked together a quick "registration script" to let students sign up on the Web.

It had a plain-text file for each exam date. Each time a student signed up for an exam, the student's andrewid was appended to the end of the file representing the selected exam session. So, in this way, the roster for an exam session was nothing more than a plain-text file containing the andrewids of those registered for that session.

The script registered a student for a session by appending the student's andrewid to the bottom of the file, as below:

  echo ${USER} >> ${SELECTION_FILE}
  

Each time a student changed a reservation, the andrewid needed to be removed from the original file before it could be added to the new one. I did this by using grep. The trick was to use the -v flag to ask grep to invert the match -- in other words, to report only non-matching lines. I also added the -w flag so that only whole words would be matched -- to prevent substring matches, for example, the pattern "john" matching the line "johnson". Lastly, I used the -h flag so that grep echoed only the non-matching lines, not the file name. The result was that grep processed an input file to produce an output file that was an exact copy -- less the matching userid that was to be removed:

  grep -wvh ${USER} data/${file} >  /tmp/${file}.$$
  rm -f data/${file}
  mv /tmp/${file}.$$ data/${file}
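
To see why the -w flag matters, here's a quick illustration with two hypothetical andrewids:

  # without -w, deregistering "john" would also wipe out "johnson";
  # with it, only the whole-word match is removed
  printf "john\njohnson\n" | grep -wv john    # prints only: johnson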
  

Concurrency Problems

The system described above works well -- at least until two students try to use the system at the same time. Consider, for example, what happens if there are two simultaneous resections, or a resection and an add. In either case, the script needs to remove a user from a roster file.

Recall that it will do this by creating a per-instance temp file that contains the roster file with the resectioning user removed. It then copies this file back over the original. Now, imagine this: I read the file and go about the process of creating a temp file without my entry. Meanwhile, you append yourself to the original file. What happens to your registration when my temp file gets copied back over the registration file?

Well, it gets nuked. We call this the lost update problem. The problem here is that neither instance has a completely up-to-date copy. The registering user has a copy with the new registration -- but that copy still contains the registration of the deregistering user. The deregistering user has a copy with the deregistering userid removed -- but without the new add. There can be no winner. We would get into a similar problem with two removes, though two adds are safe.

Unlike the false concurrency problem we saw earlier -- this is genuine, natural concurrency. The instances of the script need to share the same roster file -- how else are all of the registrations going to land in one place?

In class, some folks offered suggestions about designing the registration system in various ways that might escape the concurrency problem. And, some (though not most) of them might be doable -- but that's not the point. There might be other ways of designing a registration system -- but this is a reasonable one. And, so, we'll solve the concurrency problem rather than trying to find a way to get away from it. Unlike the earlier situation with the file names, this problem isn't accidental. It occurs because we choose to keep all of the userids for a particular session in the same dense text file -- a reasonable design decision that is convenient for the consumers of the text files.

Solving Concurrency Problems

In order to solve this problem, we need to get processes to wait if another process is editing the file. Our approach, at a high level, will work like the "in use" sign on an office conference room. When approaching a conference room, if the "in use" sign is up, we wait. If not, we hang the sign and enter. In the context of software, this is a classic spin-lock-based solution: the waiting process "spins" in a loop until it can continue.

We need to hang the "in use" sign somewhere where it can be seen by other processes. Since all of the processes have access to the file system, we'll hang the sign there. We'll create a lockfile. If the lockfile is present, we know that the conference room, or in our case, the roster file, is locked and we'll wait. Otherwise, we'll hang the sign and enter.

But, in practice, this is more ticklish than it seems. Imagine this scenario. Two instances of the program concurrently look for the agreed-upon lockfile. Neither sees it -- so each independently decides to create it and move on:

  until [ ! -f ${LOCKFILE} ]
  do
    sleep 2
  done

  touch ${LOCKFILE}

  # Code to manipulate files here

  rm ${LOCKFILE} # to release lock
  

Do you see the problem? Each process reaches the test and decides that it is safe to continue. Each then exits the spin-lock loop and attempts to create the lock file to ward away the other. But, sadly, it is too late -- we already have an incursion. Both processes have passed the barrier and gotten to continue.

The only way we can solve this problem is to find a way to atomically test and create the lock file. In other words, we need to be able to check the state of the lock file and create a new lock file in one motion, without the possibility of being interrupted. Processors include special instructions to do this at a finer grain. They are often called test-and-set or compare-and-swap. But, both share the same common goal -- testing, and setting, the state of shared space.

In our case, we need to do something with the file system that can simultaneously tell us "yes" and exclude all others. Let's consider a mkdir operation. mkdir creates a new directory.

If the directory already exists, it returns failure; otherwise, it returns success. So, we can, for example, try to create a new directory. If we fail, we know that someone else has created it and is presently using the critical resource, so we spin. If we succeed, we know that it didn't previously exist -- but that it does now. But, take careful note -- by virtue of the return value, we were able to simultaneously check the state of the directory and set it. If the directory didn't previously exist, we are able to discover that, and to create it, without the possibility of interruption. Since we can't get interrupted, our concurrency problem is gone. So, the directory created by mkdir can safely serve as our "in use" sign. This solution reads as follows:

  until mkdir ${LOCKDIR} 2> /dev/null
  do
    sleep 2
  done

  # Access the roster files
 
  rm -rf ${LOCKDIR} # release the lock
  

This solution is good. But, for the purpose of having some fun, and of giving you another example, I'd like to do it one other way. This technique will demonstrate the use of a subshell and also highlight some interesting, but subtle, details:

  until (umask 222; echo $$ >${TMP_DIR}/${LOCK_NAME}) 2>/dev/null   
  do
    # Wait so we don't burn too much CPU
    sleep 2
  done

  # edit roster files

  rm -f ${TMP_DIR}/${LOCK_NAME} # Release the lock

  

Let's focus in on the atomic operation -- the predicate of the until loop:

  (umask 222; echo $$ >${TMP_DIR}/${LOCK_NAME})
  

Note that the value of a subshell is the return code of the last of the processes to execute within the subshell -- in this case, the echo that is redirected to the output file. This echo will succeed if, and only if, it can be redirected to the output file. If the output file can't be created or written, the echo will fail.
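A two-line sketch of that rule, for the curious:

  ( true; false ); echo $?    # prints 1 -- the status of the last command
  ( false; true ); echo $?    # prints 0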

And, that's the beauty of this particular approach. Notice the "umask 222" at the beginning. UNIX has what is called the "default file creation mask". A common default is "022": those bits are masked out of the "rw-rw-rw-" mode that new files would otherwise receive, so newly created files typically have an initial mode of "rw-r--r--". The umask of "222" used within this subshell masks out "-w--w--w-", ensuring that the lock file is created as "r--r--r--". Notice that not even the owner can write.
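You can watch the mask at work with a throwaway file (the exact ls output will vary by system):

  ( umask 222; touch /usr/tmp/demo.$$ )   # created with mode r--r--r--
  ls -l /usr/tmp/demo.$$
  rm -f /usr/tmp/demo.$$    # removal needs directory write permission, not file write permission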

This is interesting because a file's mode doesn't apply until after the file is created -- the open that creates the file still yields a writable descriptor. So, we are free to create this lock file and write into it once -- but, after that, no one can open it for writing again. So, let's take a second look at that line of code:

  (umask 222; echo $$ >${TMP_DIR}/${LOCK_NAME})
  

If the file doesn't exist, we can create it and write our PID into it. But, if it does exist, the write will fail. This approach is just as good as the mkdir approach that we saw earlier. But, it has the additional benefit of letting us store the ID of the process that obtained the lock.

So, at this point, we can solve the concurrency-safe lock by using either of these techniques. But, what if a process dies after obtaining the lock and before releasing it? We can solve this problem in two ways.

The easiest way is to make use of a little-known shell command called trap. A trap asks the shell to execute a particular command upon any of several events, such as the script being killed by a CTRL-C. If the condition is listed as "0", the shell will execute the requested command when the script exits for any reason:

  trap "if [ grep -w $$ ${TMP_DIR}/${LOCK_NAME} ]; then \
           rm -f ${TMP_DIR}/${LOCK_NAME} \
        fi" 0
  until (umask 222; echo $$ >${TMP_DIR}/$LOCK_NAME) 2>/dev/null   
  do
    # Wait so we don't burn to much CPU
    sleep 2
  done

  # Play with roster files

  # Do *NOT* remove lock file -- it is being done by the trap
  

Notice the detail in the way we handled the trap. We remove the lock file if, and only if, it contains our own PID. We do this just in case we get interrupted by a CTRL-C before we create, or even test for, the lock file. Without this test, we might nuke a lock file created by another instance of the script. It is the ability to manage this somewhat ticklish situation that makes this version of the script better than the simpler-looking mkdir version. In that case, we had a directory with a well-known name -- and no way of recording our PID for use in this special case. We could add the PID as a file within the lock directory -- but we can't do that atomically. As a result, we could find ourselves needing the information before it has been recorded, but after the lock directory has been created.

As a second defense, just in case a lock-holding instance dies and somehow the lockfile isn't removed, we'll clean up stale lock files. If a lock file is really old, we'll clean it up. First, we'll kill the offending process. Then, we'll remove the lock file. We find stale lock files using a command called find. It finds files that match certain criteria -- do a man find to learn about it.
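As a taste of find, here is roughly the invocation the script below uses, with hypothetical concrete names filled in:

  # print any file named "roster.lock" under /usr/tmp whose status
  # last changed more than 10 minutes ago
  find /usr/tmp -name roster.lock -cmin +10 -print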

The code below looks for a stale lock file. If it finds one, it looks inside for the PID and kills the process. Then, it removes the stale lock file. Note that this is a bit dangerous. There is no guarantee that the process that got stuck left things in a usable form. But, what to do?

  # Obtain mutex
  trap "if grep -qw $$ ${TMP_DIR}/${LOCK_NAME} 2>/dev/null; then \
           rm -f ${TMP_DIR}/${LOCK_NAME}; \
        fi" 0
  until (umask 222; echo $$ >${TMP_DIR}/${LOCK_NAME}) 2>/dev/null   # test & set
  do
    # Wait so we don't burn too much CPU
    sleep 2

    # Clear out the lock file, if it became stale
    stalefile=`find ${TMP_DIR} -name ${LOCK_NAME} -cmin +${TIMEOUT} -print`
    if [ "${stalefile}" != "" ]; then
      pid=`cat ${stalefile}`
      kill -9 ${pid}
      rm -f ${TMP_DIR}/${LOCK_NAME}
    fi
  done

  # Play with roster files

  # Do NOT remove lock file. This is done by the trap
  

The Signup Scripts

In class, we walked through the two scripts used for last semester's exam sign-ups. index.cgi is fairly uninteresting. It mostly spits out a bunch of HTML.

About the only thing worth seeing there is the way it uses cut to translate the roster file name into the final exam date and time. It isolates the file name from the full path by using "/" as a delimiter and selecting the right field. Then it removes the extension by using "." as a delimiter and selecting the first field. Lastly, it uses tr to turn the underscores into spaces, represented by the octal value "\040" (man ascii, to see why):

  CURRENT_EXAM=`grep -l ${CURRENT_USER} data/*.txt | cut -d/ -f2 | cut -d. -f1 | tr "_" "\040"`
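
To make that pipeline concrete, here is a hypothetical roster file name pushed through each stage (the real file names differed):

  # "data/Tue_May_6_1pm.txt" is the sort of thing grep -l emits
  #   cut -d/ -f2    ->  Tue_May_6_1pm.txt
  #   cut -d. -f1    ->  Tue_May_6_1pm
  #   tr "_" "\040"  ->  Tue May 6 1pm
  echo "data/Tue_May_6_1pm.txt" | cut -d/ -f2 | cut -d. -f1 | tr "_" "\040"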
  

The more interesting script is select.cgi. It is invoked when the user chooses to register or change a registration. It does all of the cool locking and roster file manipulation that we've been discussing.

On the Exam?

Pipes and subshells will be covered on the exam. This includes the proper technique for using temporary files. Please wrap your brain around this portion of today's lecture prior to the exam. If we can help, please let us know -- that's why we're here.

What about concurrency and locking? Well, yes and no. You don't yet have enough experience with this for me to put it on the upcoming exam, at least in the form of a complex coding question. But, we'll get there. And, you should understand it -- because it reinforces a lot of what we've learned. So, please do make sure that you know what this code does, and that you can walk through and explain it. But, you won't need, at this time, to recreate it from scratch.