Andrew Software Management

1.0 Why?

We're all busy people. Why do any of us want to spend time learning our esoteric software management system? The goal is to minimize the amount of time we spend working on individual systems. Rewashing software for the same platform again and again (or manually installing the software over and over) takes a lot of effort, allows for more mistakes to creep in, and all around is just a bummer. We simply don't have enough people to do individual management of all of our unix machines.

The Andrew environment is designed to allow all of us to share the grunge work of making systems work, without preventing per-machine flexibility. The downside is that it has a learning curve all its own—but one which you will hopefully find well worth it.

2.0 What is this junk?

depot is primarily responsible for managing collection versioning. depot takes over the management of a directory hierarchy (in our case, depot manages /usr/local, /usr/contributed, /usr/ng and /usr/host). No changes happen inside of this hierarchy without depot making them, ensuring that changes are reversible and reproducible. depot works by linking or copying various collections to the target directory and ensuring that these collections don't conflict. Individual collections can then be upgraded or installed and each file belongs to one and only one collection. depot by itself understands very little of versioning or per-machine customization; we use dpp, the depot pre-processor, for per-machine customization.
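The "one and only one collection" rule can be pictured with a small sketch. This is not depot's actual implementation, just a toy model of the conflict check; the collection names and file lists below are made up for illustration.

```python
from collections import defaultdict


def find_conflicts(collections):
    """Return {path: [collection names]} for every path claimed by
    more than one collection. An empty result means no conflicts."""
    owners = defaultdict(list)
    for name, paths in collections.items():
        for path in paths:
            owners[path].append(name)
    return {path: sorted(names)
            for path, names in owners.items()
            if len(names) > 1}


# Hypothetical collections and file lists, for illustration only.
collections = {
    "gnucc": ["bin/gcc", "lib/libgcc.a"],
    "gdb": ["bin/gdb"],
    "othercc": ["bin/gcc"],  # claims a file gnucc already provides
}

conflicts = find_conflicts(collections)
print(conflicts)
```

In this model a non-empty result is the point at which a tool like depot would refuse to continue rather than let two collections fight over one file.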

package is responsible for management of the operating system and other boot-time configuration. It is the primary method by which we customize individual machines or classes of machines and keep them consistent over time. package itself is very stupid; it merely knows how to make a filesystem resemble its configuration file. package is also usually the most irritating program on our system, since it will delete files that don't match its configuration file; all of us have seen package delete something we wanted to keep. Like depot, package by itself doesn't allow any simple inheritance. We use yet another pre-processor, mpp, to provide such features for package. Along with mpp, we use a large set of conventions to make our package environment comprehensible.
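To see why package deletes things you wanted to keep, it helps to think of it as a reconciliation step. The sketch below is a simplified model, not package's real logic or configuration syntax: it compares a desired state against what is on disk and plans installs and deletions. All paths and contents are hypothetical.

```python
def plan_changes(desired, actual):
    """desired: {path: content} derived from the configuration file.
    actual:  {path: content} currently on the filesystem.
    Returns (to_install, to_delete) as sorted lists of paths."""
    # Anything missing or different from the configuration gets (re)installed.
    to_install = sorted(p for p, c in desired.items() if actual.get(p) != c)
    # Anything on disk that the configuration doesn't mention gets deleted.
    to_delete = sorted(p for p in actual if p not in desired)
    return to_install, to_delete


# Hypothetical state, for illustration only.
desired = {"/etc/inetd.conf": "v2", "/etc/motd": "hello"}
actual = {
    "/etc/inetd.conf": "v1",
    "/etc/motd": "hello",
    "/etc/local.hack": "mine",  # a hand-made file nobody told package about
}

to_install, to_delete = plan_changes(desired, actual)
print(to_install, to_delete)
```

The hand-made file ends up in the delete list because the configuration never mentions it, which is the behavior described above.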

${wsadmin} or /afs/andrew.cmu.edu/wsadmin is the directory hierarchy on AFS which holds large numbers of fragments of package (and some depot) configuration files. The mpp processor knits these fragments together to form a complete package configuration. There are specific conventions we use to enable services and tweak options on machines. For example, to enable Apache on a machine all that is needed is %define doesapache in the /etc/package.proto. This line automatically includes the tens to hundreds of lines of package configuration Apache would normally need.
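A toy model of how a single %define can pull in a whole fragment is sketched below. It is only an illustration of the idea: mpp's real directive handling and the actual layout of fragments under ${wsadmin} are not shown in this document, so the fragment table and its placeholder lines are assumptions.

```python
def expand(proto_text, fragments):
    """Collect '%define NAME' lines from a proto file and return the
    configuration lines contributed by the fragment for each NAME.
    Other non-blank lines are passed through unchanged."""
    defined = []
    passthrough = []
    for line in proto_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "%define":
            defined.append(parts[1])
        elif line.strip():
            passthrough.append(line)
    out = list(passthrough)
    for name in defined:
        # A define with no matching fragment contributes nothing.
        out.extend(fragments.get(name, []))
    return out


# Hypothetical fragment table; the lines are placeholders, not real
# package configuration syntax.
fragments = {"doesapache": ["apache-line-1", "apache-line-2"]}

proto = "%define doesapache\n%define beta\n"
config = expand(proto, fragments)
print(config)
```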

emt works with adm to perform delegated software management. emt manages a set of environments (a "beta" and a "gamma" environment for each operating system type, aka systype) and allows collection maintainers to release software to those environments. Since emt uses fairly long and annoying commands, the Perl script carpe generates the appropriate command to run after a simple interactive dialog and automatically e-mails it to a bboard (these bboards start with org.acs.asg.request). Individual maintainers can generally affect the beta environment directly. Gatekeepers are responsible for releases to the gamma environment.

3.0 How can we use the process to our advantage?

Most of these things are ideas on what we should be doing, not necessarily what we're doing now.

3.1 Which environment do I use?

/usr/local - anything with a command that an end-user may need to run.

/usr/contributed - anything not officially supported or not "system overhead."

/usr/ng - If you are in the network group and have something that is mainly to support the activities of the Network Group, it goes here. If you're unsure whether something belongs here or in /usr/local, put it in /usr/local.

/usr/host - Software that provides services but has no user runnable commands or libraries that people may want to link against for their own programs. Currently, machines by default don't have an actual /usr/host directory. This should be changed.

3.2 Where do files go? /afs/andrew/wsadmin? data/db?

/afs/andrew.cmu.edu/wsadmin/services - Things that one expects lots of other people to use.

/afs/andrew.cmu.edu/data/db/<a_service> - Things that one expects lots of other people to use.

/afs/andrew.cmu.edu/wsadmin/<your_service> - The specific instance of your service. Put specific server configurations, both package configuration and configuration files that package pulls in (e.g. inetd.conf, user.permits) in this directory hierarchy.

4.0 Specific examples

4.1 Major upgrades

4.2 Minor upgrades

4.3 Root disk crash

4.4 Emergency infrastructure fixes

4.5 bboard posts, release, upgrades, etc.

4.6 Emergency application fixes

5.0 Configuration files

5.1 Workstation configuration

Clusters are gamma machines.

Computing service desktop machines are generally beta machines. You might want a /usr/local/depot/depot.pref.proto like this:

%define beta
%define tree local
%include /afs/andrew.cmu.edu/wsadmin/depot/src/depot.include


searchpath * ${local}

collection.installmethod copy lemacs,kerberos,com_err,gnucc,gdb

Add to the list depending on what applications you use frequently. (This is only for better performance.)

Your workstation will depot nightly. You can cause depot to use a specific version of a collection with a line like:

path cyrus ${dest}/cyrus/064

This will cause Cyrus version 064 to be installed on your computer. This is useful for testing new versions before beta release to ensure proper functionality, or for examining how old versions worked.

You want to reboot whenever new OS versions are put into beta (see bboards); around once a month is probably a good choice, or after you run package. Always reboot after running package!

5.2 Production servers

The primary question for production machines is "how often should they update?" The more frequently they update, the more often something may break, and frequent updates mean that people are probably not paying close attention to each update. On the other hand, less frequent updates make each update much bigger, which means tracking down which change caused a breakage can be much more complicated. Infrequent updates can also complicate security fixes: ideally, a security fix would require a very small software change, but if a machine is too far behind the times, it will require a special version or a large update to stabilize.

If possible, production machines should reboot weekly, causing depot and package to run at each reboot. Generally, redundant services such as SMTP servers, Unix servers, or DNS servers should have no problems meeting this requirement, since they can reboot on a staggered schedule and cause little or no user-visible outage. (Our users are remarkably tolerant of daily outages: the Unix servers are unavailable for 10-30 minutes every day with few complaints.) A single redundant server can be down for an extended period of time, so if an environment change has broken the server it is not a catastrophe.
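One simple way to picture a staggered schedule is to hand out weekly reboot days round-robin across a group of redundant servers. This is only a sketch of the idea, not something our tools do today; the hostnames are hypothetical.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def stagger(hosts):
    """Assign each host a weekly reboot day, round-robin over the
    sorted hostnames, so the result is stable from run to run and
    up to seven redundant servers never share a day."""
    return {host: DAYS[i % len(DAYS)]
            for i, host in enumerate(sorted(hosts))}


# Hypothetical redundant SMTP servers, for illustration only.
schedule = stagger(["smtp3.example.org",
                    "smtp1.example.org",
                    "smtp2.example.org"])
print(schedule)
```

Sorting the hostnames first keeps the assignment deterministic, so adding a server at the end of the list doesn't reshuffle everyone else's day.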

TODO: We should make it easier to stagger reboots of 'identical' systems with a %define

Non-redundant servers need to balance the need for uptime against the resources we want to spend as system administrators. While we've made some changes to package and depot to have them run faster, our server hardware tends to reboot slowly. Non-replicated file servers (such as Cyrus backends or AFS user servers) can raise interesting questions. Lately, we've rebooted AFS servers weekly (with little complaint) but have attempted to minimize the downtime for Cyrus backends. Non-replicated services can also suffer from the "unintended upgrade" effect: a seemingly unrelated change causes downtime, and does so when no system administrator is immediately available to fix it. Possible remedies to this include:

Production servers may also want to have /etc/NoPackage and /etc/NoDepot created after the machine starts. This way, if the machine happens to crash hard during the day, recovery time is much faster. One must take care to remove these files on regular reboots to ensure updates happen.
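The create-after-boot, remove-before-reboot cycle can be sketched as two small helpers. Where these would be hooked in (a late startup script, a scheduled-reboot wrapper) is an assumption, and the demo below runs against a temporary directory rather than the real /etc.

```python
import os
import tempfile

MARKERS = ("NoPackage", "NoDepot")


def set_markers(etc_dir):
    """Create the marker files once the machine is up, so an
    unplanned reboot skips package and depot and recovers faster."""
    for name in MARKERS:
        with open(os.path.join(etc_dir, name), "w"):
            pass  # an empty file is enough


def clear_markers(etc_dir):
    """Remove the marker files before a planned reboot so that
    updates actually happen. Missing files are ignored."""
    for name in MARKERS:
        try:
            os.remove(os.path.join(etc_dir, name))
        except FileNotFoundError:
            pass


# Demo against a temporary directory standing in for /etc.
with tempfile.TemporaryDirectory() as etc:
    set_markers(etc)
    after_boot = sorted(os.listdir(etc))
    clear_markers(etc)
    before_reboot = sorted(os.listdir(etc))

print(after_boot, before_reboot)
```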

TODO: We should make this a %define

Specifying specific paths in the depot.pref.proto of your production servers as the default behavior is discouraged. This is fine for early testing or to work around a specific bug, but the goal should be to not have to specify specific versions of a collection. The reason is that unless you pay attention to the releases, versions may get deleted out from under you, and dependency problems may sneak in and cause problems later.

It is also discouraged to use /afs/andrew/system/dest paths in package configuration files. If you need to reference a specific version of a collection, it is preferable (though still discouraged) to do so by specifying a version via the depot.pref.proto. Having specific versions referenced in package files makes it more difficult to upgrade systems as some software may not exist in the new @sys. Dealing with this is much easier via depot than package.

6.0 Recommendations

This section summarizes recommendations buried in the text.

7.0 Changelog

$Log: env.html,v $
Revision 1.1.1.1  2003/02/25 19:35:05  wcw


Revision 0.7  2003/02/22 16:18:45  wcw
. fixed style sheet path
. added comments about specific versions in depot.pref.proto and dest
paths in package.proto

Revision 0.6  2003/02/22 16:01:15  wcw
minor formatting

Revision 0.5  2003/01/21 19:17:21  wcw
larry's pass