October 2, 2007 (Lecture 7)

October 7, 2007(Lecture 7)

Project #4 Assigned

Project #4 was assigned today. Check under the "labs" section for the lab, itself, and other related materials.

The Task

A task is an "instance of a program in execution". What does this mean? Basically, it means that a task is a running program. A task is created by reading a program's image from a file into memory, performing appropriate initialization, and making it known to the operating system's scheduler.

Representing a Task in Software

How do we represent a task within the context of an operating system? We build a data structure, sometimes known as a task_struct or (for processes) a Process Control Block (PCB) that contains all of the information our OS needs about the state of the task. This includes, among many other things:

content of registers (like the PC)
content of the stack
memory pages/segments
open files

When a context switch occurs, it is this information that needs to be saved and restored to change the executing process.

A Task's Memory

Let's consider how a tasks's memory is organized. In particular, let's look at how the code and data are organized in the virtual memory.
Please note that the heap grows upward through dynamic allocation (like malloc) and the stack grows downward as stack frames are added throguh function calls. Such things as return addresses, return values, parameters, local variables, and other state are stored in the runtime stack.

The text area holds the program code.
The data area holds global and static variables that must be stored in the executible, since they are initialized.
The bss area holds variables that are unitialized in the sense that their need not be persistently stored on disk -- they can be plugged in later.

Task State

Just like people, tasks have lifestyles. They aren't always running and they don't live forever. Typical UNIX systems view tasks as existing in one of several states:

New: Recently created
Ready: Ready to run, but not yet assigned to a processor
Running: Actually executing on a processor
Waiting: Not currently executing. Not currently able to execute. Won't be runnable until some specific external event occurs, such as an I/O operation.
Terminated: Done executing, won't accomplish anything else useful or again be placed on a CPU.

A task moves from the new state to ready state after it is created. Once this happens, we say that the task is "admitted."
After the scheduler selects a task and assigns it to a processor, we say that the task has been "dispatched."
When a task is done, it "exits." It is then in the terminated state.
If a task is waiting for an event, such as a disk read to complete, it can "block" itself yielding the CPU. It is then in the "wait" state. The system has many different wait queues -- not one universal wait queue -- in fact, there is one wiat queue for each possible reason to wait. This is because it would be very expensive to sift through a long list each time a resource became available or other event occured. It is not a case of needing a list of lists, either -- since each list is associated with the event, it doesn't require any searching -- if we take care of the queue when we handle the event, we're already in the right place.
After the event occurs, the operating system can move it to the "ready" state.
After a task has exhausted its time slice, it can be moved into the ready state to allow another task access to the processor.
Please pay careful attention. The operating system is responsible for creating tasks, dispatching them, readying them after an event, and interrupting them after their time expires. Tasks must exit and block voluntarily.

Creating New Tasks

One of the functions of the operating system is to provide a mechanism for existing tasks to create new tasks. When this happens, we call the original task the parent. The new task is called the child. It is possible for one task to have many children. In fact, even the children can have children.
In UNIX, child tasks can either share resources with the parent or obtain new resources. But existing resources are not partitioned.
In UNIX when a new task is created, the child is a clone of the parent. The new task can either continue to execute with a copy of the parent image, or load another image. Well talk more about this soon, when we talk about the fork() and exec-family() of calls.
After a new task is created, the parent may either wait for the child to finish or continue and execute concurrently (real or imaginary) with the child.

Task Termination

A child may end as the result of the normal completion, it may be terminated by the operating system for "breaking the rules", or it might be killed by the parent. Often times parents will kill their children before they themselves exit, or when their function is no longer required.
In UNIX, when a task terminates, it enters the defunct state. It remains in this state until the parent recognizes the fact that it has ended via the wait-family() of calls. Although a defuct task has given up most of its resources, much of the state information is preserved so that the parent can find out the circumstances of the child's death.
In UNIX children can outlive their parents. When this happens, there is a small complication. The parent is not around to acknowlege the child's death. A dead process is known as a zombie if its parent has already died. The init process waits for all zombies, allowing for them to have a proper burial. Sometimes zobies are known as orphans.

Fork -- A traditional implementation

fork() is the system call that is used to create a new task on UNIX systems. In a traditional implementation, it creates a new task by making a nearly exact copy of the parent. Why nearly exact? Some things don't make sense to be duplicated exactly, the ID number, for example.
The fork() call returns the ID of the child process in the parent and 0 in the child. Other than this type of subtle differences, the two tasks are very much alike. Execution picks up at the same point in both.
If execution picks up at the same point in both, how can fork() return something different in each? The answer is very straightforward. The stack is duplicated and a different value is placed on top of each. (If you don't remeber what the stack is, don't worry, we'll talk about it soon -- just realize that the return value is different).
The difference in the return value of the fork() is very significant. Most programmers check the result of the fork in order to determine whether they are currently the child or parent. Very often the child and parent to very different things.

The Exec-family() of calls

Since the child will often serve a very different purpose that its parent, it is often useful to replace the child's memory space, that was cloned form the parent, with that of another program. By replace, I am referring to the following process:

Deallocate the process' memory space (memory pages, stack, etc).
Allocate new resources
Fill these resources with the state of a new process.
(Some of the parent's state is preserved, the group id, interrupt mask, and a few other items.)

Memory Mapping and Modern Execs

In traditional implementations of exec(), each time a process was created a new image was loaded from disk. In modern operating systems, it is possible, and even the norm,, and even the normo for several instances of the same program to share the static components, such as code and constants, while each retains its own copy of data and runtime state. The basic idea is that the same physical frames are mapped into each of the tasks' address spaces.

Fork w/copy-on-write

Copying all of the pages of memory associated with a process is a very expensive thing to do. It is even more expensive considering that very often the first act of the child is to deallocate this recently created space.
One alternative to a traditional fork implementation is called copy-on-write. the details of this mechanism won't be completely clear until we study memory management, but we can get the flavor now.
The basic idea is that we mark all of the parent's memory pages as read-only, instead of duplicating them. If either the parent or any child try to write to one of these read-only pages, a page-fault occurs. At this point, a new copy of the page is created for the writing process. This adds some overhead to page accesses, but saves us the cost of unnecessarly copying pages.

vfork()

Another alternative is also available -- vfork(). vfork is even faster, but can also be dangerous in the worng hands. With vfork(), we do not duplicate or mark the parent's pages, we simply loan them, and the stack frame to the child process. During this time, the parent remains blocked (it can't use the pages). The dangerous part is this: any changes the child makes will be seen by the aprent process.
vfork() is most useful when it is immediately followed by an exec_(). This is because an exec() will create a completely new process-space, anyway. There is no reason to create a new task space for the child, just to have it throw it away as part of an exec(). Instead, we can loan it the parent's space long enough for it to get started (exec'd).
Although there are several (4) different functions in the exec-family, the only difference is the way they are parameterizes; under-the-hood, they all work identically (and are often one).
After a new task is created, the parent will often want to wait for it (and any siblings) to finish. We discussed the defunct and zombie states last class. The wait-family of calls is used for this purpose.

A Quick Overview of the File System from the OS Point of View

The operating system maintains two data structures representing the state of open files: the per-process file descriptor table and the system-wide open file table.
When a process calls open(), a new entry is created in the open file table. A pointer to this entry is stored in the process's file descriptor table. The file descriptor table is a simple array of pointers into the open file table. We call the index into the file descriptor table a file descriptor. It is this file descriptor that is returned by open(). When a process accesses a file, it uses the file descriptor to index into the file descriptor table and locate the corresponding entry in the open file table.
The open file table contains several pieces of information about each file:

the current offset (the next position to be accessed in the file)
a reference count (we'll explain below in the section about fork())
the file mode (permissions),
the flags passed into the open() (read-only, write-only, create, &c),
a pointer to an in-RAM version of the inode (a slightly light-weight version of the inode for each open file is kept in RAM -- others are on disk), and a structure that contains pointers to all of the .
A pointer to the structure containing pointers to the functions that implement the behaviors like read(), write(), close(), lseek(), &c on the file system that contains this file. This is the same structure we looked at last week when we discussed the file system interface to I/O devices.

Each entry in the open file table maintains its own read/write pointer for three important reasons:

Reads by one process don't affect the file position in another process

Write are visible to all processes, if the file pointer subsequently reaches the location of the write

The program doesn't have to supply this information each call.

One important note: In modern operating systems, the "open file table" is usually a doubly linked list, not a static table. This ensures that it is typically a reasonable size while capable of accomodating workloads that use massive numbers of files.

Session Semantics

Consider the cost of many reads or writes may to one file.

Each operation could require pathname resolution, protection checking, &c.

Implicit information, such as the current location (offset) into the file must be maintained,

Long term state must also be maintained, especially in light of the fact that several processes using the file might require different view.
Caches or buffers may need to be initialized

The solution is to amortize the cost of this overhead over many operations by viewing operations on a file as within a session. open() creates a session and returns a handle and close() ends the session and destroys the state. The overhead can be paid once and shared by all operations.

Consequences of Fork()ing

In the absence of fork(), there is a one-to-one mapping from the file descriptor table to the open file table. But fork introduces several complications, since the parent task's file descriptor table is cloned. In other words, the child process inherits all of the parent's file descriptors -- but new entries are not created in the system-wide open file table.
One interesting consequence of this is that reads and writes in one process can affect another process. If the parent reads or writes, it will move the offset pointer in the open file table entry -- this will affect the parent and all children. The same is of course true of operations performed by the children.
What happens when the parent or child closes a shared file descriptor?

remember that open file table entries contain a reference count.

this reference count is decremented by a close

the file's storage is not reclaimed as long as the reference count is non-zero indicating that an open file entry to it exists

once the reference count reaches zero, the storage can be reclaimed

i.e., "rm" may reduce the link count to 0, but the file hangs around until all "opens" are matched by "closes" on that file.

Why clone the file descriptors on fork()?

it is consistent with the notion of fork creating an exact copy of the parent

it allows the use of anonymous files by children. The never need to know the names of the files they are using -- in fact, the files may no longer have names.

The most common use of this involves the shell's implementation of I/O redirection (< and >). Remember doing this?

Memory-Mapped Files

Now that we've talked about how the file system usually maintains and access files, let me take the opportunity to point out that there is actually another way. It is actually possible to hand a file over to the VMM and ask it to manage it, as if it were backing store for virtual memory. If we do this, we only use the file system to set things up -- and then, only to name the file.
If we do this, when a page is accessed, a page fault will occur, and the page will be read into a physical frame. The access to the data in file is conducted as if it were an access to data in the backing-store. The contents of the file are then accessed via an address in virtual memory. The file can be viewed as an array of chars, ints, or any other primitive variable or struct.
Only those pages that are actually used are read into memory. The pages are cached in physical memory, so frequently accessed pages will not need to be read from external storage each access. It is important to realize that the placement and replacement of the pages of the file in physical memory competes with the pages form other memory mapped files and those from other virtual memory sources like program code, data, &c and is subject to the same placement/replacement scheme.
As is the case with virtual memory, changes are written upon page-out and unmodified pages do not require a page-out.
The system call to memory map a file is mmap(). It returns a pointer to the file. The pages of the file are faulted in as is the case with any other pages of memory. This call takes several parameters. See "man mmap" for the full details. But a simplified version is this:
void *mmap (int fd, int flags, int protection)

The file descriptor is associated with an already open file. In this way the filesystem does the work of locating the file. Flags specifies the usual type of stuff: executable, readable, writable, &c. Protection is something new.
Consider what happens if multiple processes are using a memory-mapped file. Can they both share the same page? What if one of them changes a page? Will each see it?
MAP_PRIVATE ensures that pages are duplicated on write, ensuring that the calling process cannot affect another process's view of the file.
MAP_SHARED does not force the duplication of dirty pages -- this implies that changes are visible to all processes.

A memory mapped file is unmapped upon a call to munmap(). This call destroys the memory mapping of a file, but it should still be closed using close() (Remember -- it was opened with open()). A simplified interface follows. See "man munmap" for the full details.
int munmap (void *address) // address was returned by mmap.

If we want to ensure that changes to a memory-mapped file have been committed to disk, instead of waiting for a page-out, we can call msync(). Again, this is a bit simplified -- there are a few options. You can see "man msync" for the details.
int msync (void *address)

Cost of Memory Mapped Access To Files

Memory mapping files reduces the cost of accessing files imposed by the need for traditional access to copy the data first from the device into system space and then from system space into user space.
But it does come at another, somewhat interesting cost. Since the file is being memory mapped into the VM space, it is competing with regular memory pages for frames. That is to say that, under sufficient memory pressure, access to a meory-mapped file can force the VMM to push a page of program text, data, or stack off to disk.
Now, let's consider the cost of a copy. Consider for example, this "quick and dirty" copy program:
int main (int argc, char *argv)
{
  int fd_source; 
  int fd_dest;
  struct stat info;
  unsigned char *data;

  fd_source = open (argv[1], O_RDONLY);
  fd_dest = open (argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0666);

  fstat (fd_source, &info);

  data = mmap (0, info.st_size, PROT_READ, MAP_SHARED, fd_source, 0);
  write (fd_dest, data, info.st_size);

  munmap (data, info.st_size);
  close (fd_source);
  close (fd_dest);
}
Notice that in copying the file, the file is viewed as a collection of pages and each page is mapped into the address space. As the write() writes the file, each page, individually, will be faulted into physical memory. Each page of the source file will only be accessed once. After that, the page won't be used again.
The unfortunate thing is that these pages can force pages that are likely to be used out of memory -- even, for example, the text area of the copy program. The observation is that memory mapping files is best for small files, or those (or parts) that will be frequently accessed.