Andrew Software Management

1.0 Why?

We're all busy people. Why do any of us want to spend time learning our esoteric software management system? The goal is to minimize the amount of time we spend working on individual systems. Rewashing software for the same platform again and again (or manually installing the software over and over) takes a lot of effort, allows for more mistakes to creep in, and all around is just a bummer. We simply don't have enough people to do individual management of all of our unix machines.

The Andrew environment is designed to allow all of us to share the grunge work of making systems work, without preventing per-machine flexibility. The downside is that it has a learning curve all its own—but one which you will hopefully find well worth it.

2.0 What is this junk?

depot is primarily responsible for managing collection versioning. depot takes over the management of a directory hierarchy (in our case, depot manages /usr/local, /usr/contributed, /usr/ng and /usr/host). No changes happen inside of this hierarchy without depot making them, ensuring that changes are reversible and reproducible. depot works by linking or copying various collections to the target directory and ensuring that these collections don't conflict. Individual collections can then be upgraded or installed and each file belongs to one and only one collection. depot by itself understands very little of versioning or per-machine customization; we use dpp, the depot pre-processor, for per-machine customization.
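The "one and only one collection" rule is easy to picture as a check over file lists. The following is only a rough sketch of the idea, not depot's actual code; the collection contents and the find_conflicts helper are made up for illustration.

```python
from collections import defaultdict


def find_conflicts(collections):
    """Return {relative_path: [collection names]} for every path that
    more than one collection wants to install into the target tree."""
    owners = defaultdict(set)
    for name, files in collections.items():
        for path in files:
            owners[path].add(name)
    return {path: sorted(names) for path, names in owners.items() if len(names) > 1}


# Hypothetical collections and file lists, for illustration only.
collections = {
    "cyrus": ["bin/imtest", "lib/libcyrus.a"],
    "kerberos": ["bin/kinit", "lib/libkrb.a"],
    "com_err": ["lib/libcom_err.a", "bin/kinit"],  # deliberately clashes with kerberos
}

conflicts = find_conflicts(collections)
for path, names in conflicts.items():
    print(f"conflict: {path} is claimed by {', '.join(names)}")
```

An empty result means every file has exactly one owner, which is the state depot insists on before it will touch the target directory.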

package is responsible for management of the operating system and other boot-time configuration. It is the primary method by which we customize individual machines or classes of machines and keep them consistent over time. package itself is very stupid; it merely knows how to make a filesystem resemble its configuration file. package is also usually the most irritating program on our system, since it will delete files that don't match its configuration file; all of us have seen package delete something we wanted to keep. Like depot, package by itself doesn't allow any simple inheritance. We use yet another pre-processor, mpp, to provide this for package. Along with mpp, we use a large set of conventions to make our package environment comprehensible.
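To see why package ends up deleting things, it helps to think of it as a plain reconciliation between "what the configuration says" and "what is on disk". This is a minimal sketch under that assumption; the plan_actions helper, the paths, and the contents are hypothetical, and the real package surely has more rules about which directories it is allowed to clean.

```python
def plan_actions(desired, actual):
    """Compare the configured state with the on-disk state.

    Both arguments map a path to some description of its contents.
    Returns a list of (action, path) tuples.
    """
    actions = []
    for path in sorted(desired):
        if path not in actual:
            actions.append(("create", path))
        elif actual[path] != desired[path]:
            actions.append(("update", path))
    # Anything on disk that the configuration does not mention gets removed.
    for path in sorted(set(actual) - set(desired)):
        actions.append(("delete", path))
    return actions


# Hypothetical example data.
desired = {"/etc/inetd.conf": "v2", "/etc/motd": "hello"}
actual = {"/etc/inetd.conf": "v1", "/etc/scratch.txt": "something we wanted to keep"}

plan = plan_actions(desired, actual)
for action, path in plan:
    print(action, path)
```

The last loop is the irritating part: a file that is not in the configuration is simply not supposed to exist.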

${wsadmin} or /afs/andrew.cmu.edu/wsadmin is the directory hierarchy on AFS which holds large numbers of fragments of package (and some depot) configuration files. The mpp processor knits these fragments together to form a complete package configuration. There are specific conventions we use to enable services and tweak options on machines. For example, to enable Apache on a machine all that is needed is %define doesapache in the /etc/package.proto. This line automatically includes the tens to hundreds of lines of package configuration Apache would normally need.
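The effect of a %define can be pictured as "one short line in, many lines out". The sketch below only models that effect; it is not how mpp is implemented, and the fragment table, its contents, and the knit helper are invented for illustration.

```python
def knit(proto_lines, fragments):
    """Expand %define lines into the configuration fragments they enable.

    proto_lines: lines of a package.proto-style file.
    fragments:   {define name: list of configuration lines} (hypothetical).
    """
    defines = set()
    local_lines = []
    for line in proto_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "%define":
            defines.add(parts[1])
        elif line.strip():
            local_lines.append(line)
    knitted = []
    for name, lines in fragments.items():
        if name in defines:
            knitted.extend(lines)
    return knitted + local_lines


# Placeholder fragments; the real ones live under ${wsadmin}.
fragments = {
    "doesapache": ["apache fragment line 1", "apache fragment line 2"],
    "doesdns": ["dns fragment line 1"],
}
proto = ["%define doesapache", "host-specific line"]

config = knit(proto, fragments)
print("\n".join(config))
```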

emt works with adm to perform delegated software management. emt manages a set of environments (a "beta" and a "gamma" environment for each operating system type, aka systype) and allows collection maintainers to release software to those environments. Since emt uses fairly long and annoying commands, the Perl script carpe generates the appropriate command after a simple interactive dialog and automatically e-mails it to a bboard (these bboards start with org.acs.asg.request). Individual maintainers can generally affect the beta environment directly. Gatekeepers are responsible for releases to the gamma environment.

3.0 How can we use the process to our advantage?

Most of these things are ideas about what we should be doing, not necessarily what we're doing now.

Beta releases are at the discretion of the collection commanders.
Gamma releases go out every Wednesday afternoon. All queued release requests go out every Wednesday unless there are known problems with the version in beta. Why Wednesday? The Help Center has the lightest call volume at the end of the week, and people will still be around to fix things in case they break.
How about gamma releases go out every Tuesday afternoon? Test machines would reboot Wednesday morning and production machines Thursday morning. Of course, this likely increases downtime for the user, since the systems are then down EVERY week for a (hopefully) short period of time, instead of being down at some longer interval for perhaps a longer period of time, or us struggling to bring things back after a big failure (e.g. power loss).
This assumes that there is a large class of test gamma machines. This is not the case. I don't think every production machine rebooting every week is a good goal to shoot for.
Gamma software does not change except on Wednesday unless there is an emergency fix (see below).
All software should go to gamma before it is deployed on any production machine.
Any software that sits in beta for more than 4 weeks without a gamma request or a new beta release causes the collection commanders to be bugged.
(How to automate this or make sure it happens?)
Back up /usr/{local,host,contributed}/depot/depot.pref for all production servers (for last resort restore when a root disk crashes). depot.pref is the result of dpp running on the depot.pref.proto and is sufficient for depot to rebuild the trees exactly as they last were. It reflects the current state of released software with any overrides in the depot.pref.proto.
Use beta machines for development. Compile all software besides "emergency" fixes on beta machines. If your software relies on another collection that is different between beta and gamma, make every effort to push that collection out instead of just compiling against gamma.
Verify some sort of functionality before releasing to beta (program starts, doesn't reformat hard disk, etc.)
Testing machines may be beta (early testing) or gamma (late testing and verification before pushing to production servers). Some internal services (such as asg.web.cmu.edu or bugzilla.andrew.cmu.edu) may run as beta machines to try to ensure greater testing of the beta environment.
Use gamma machines for production use. Production machines should copy as much software as possible to avoid problems with AFS outages or unexpected software changes when something gets released.
(Create and use some sort of "%define copymost"?)
Avoid referencing specific versions in depot.pref.proto. It's tempting to require a production machine to use specific versions of software, but experience shows that unless the machine maintainer is paying close attention, version skews between collections can start causing subtle problems.
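On the question of automating the four-week nag: one possible shape for it is a small check run from cron. This is a sketch only; it assumes we can get the date of each collection's latest beta release and the set of collections with a pending gamma request from somewhere (emt logs, the request bboards), and the data below is invented.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(weeks=4)


def stale_collections(beta_releases, gamma_requests, today):
    """Return collections whose latest beta release is more than four
    weeks old and which have no gamma request queued.

    beta_releases:  {collection: date of its most recent beta release}
    gamma_requests: set of collections with a pending gamma request
    """
    return sorted(
        name
        for name, released in beta_releases.items()
        if name not in gamma_requests and today - released > STALE_AFTER
    )


# Hypothetical data for illustration.
beta_releases = {
    "cyrus": date(2003, 1, 15),   # old, nothing queued
    "gdb": date(2003, 2, 20),     # recent beta release
    "lemacs": date(2003, 1, 1),   # old, but gamma already requested
}
gamma_requests = {"lemacs"}

to_bug = stale_collections(beta_releases, gamma_requests, today=date(2003, 2, 26))
print("bug the commanders of:", ", ".join(to_bug))
```

A new beta release resets the clock automatically, since only the most recent release date is considered.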

3.1 Which environment do I use?

/usr/local - anything with a command that an end-user may need to run.

/usr/contributed - anything not officially supported or not "system overhead."

/usr/ng - If you are in the network group and have something that is mainly to support the activities of the Network Group, it goes here. If you're unsure whether something belongs here or in /usr/local, put it in /usr/local.

/usr/host - Software that provides services but has no user runnable commands or libraries that people may want to link against for their own programs. Currently, machines by default don't have an actual /usr/host directory. This should be changed.

3.2 Where do files go? /afs/andrew/wsadmin? data/db?

/afs/andrew.cmu.edu/wsadmin/services - Things that one expects lots of other people to use.

/afs/andrew.cmu.edu/data/db/<a_service> - Things that one expects lots of other people to use.

/afs/andrew.cmu.edu/wsadmin/<your_service> - The specific instance of your service. Put specific server configurations, both package configuration and configuration files that package pulls in (e.g. inetd.conf, user.permits) in this directory hierarchy.

4.0 Specific examples

4.1 Major upgrades

4.2 Minor upgrades

4.3 Root disk crash

4.4 Emergency infrastructure fixes

4.5 bboard posts, release, upgrades, etc.

4.6 Emergency application fixes

5.0 Configuration files

5.1 Workstation configuration

Clusters are gamma machines.

Computing service desktop machines are generally beta machines. You might want /usr/local/depot/depot.pref.proto to contain:

%define beta
%define tree local
%include /afs/andrew.cmu.edu/wsadmin/depot/src/depot.include


searchpath * ${local}

collection.installmethod copy lemacs,kerberos,com_err,gnucc,gdb

Add to the list depending on what applications you use frequently. (This is only for better performance.)

Your workstation will depot nightly. You can cause depot to use a specific version of a collection with a line like:

path cyrus ${dest}/cyrus/064

This will cause Cyrus version 064 to be installed on your computer. This is useful for testing new versions before beta release to ensure proper functionality, or for examining how old versions worked.

You want to reboot whenever new OS versions are put into beta (see bboards); probably around once a month is a good choice, or after you run package. Always reboot after running package!

5.2 Production servers

The primary question for production machines is "how often should they update?" The more frequently they update, the more often something may break, and frequent updates mean that people are probably not paying close attention to each update. On the other hand, less frequent updates cause each update to be much bigger, which means tracking down what change caused a bustage can be much more complicated. Infrequent updates can also complicate security fixes: ideally, a security fix would require a very small software change, but if a machine is too far behind the times, it will require a special version or a large update to stabilize.

If possible, production machines should reboot weekly, causing depot and package to run at each reboot. Generally, redundant services such as SMTP servers, Unix servers, or DNS servers should have no problems meeting this requirement, since they can reboot on a staggered schedule and cause few or no user-visible outages. (Our users are remarkably tolerant of daily outages: the Unix servers are unavailable for 10-30 minutes every day with few complaints.) A single redundant server can be down for an extended period of time, so it is not a catastrophe if an environment change has broken the server.

TODO: We should make it easier to stagger reboots of 'identical' systems with a %define
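Until such a %define exists, the staggering itself is simple to compute. A minimal sketch, assuming we have a list of the 'identical' hosts in a group and a list of acceptable reboot slots; the host names, the slot strings, and the stagger helper are all hypothetical.

```python
def stagger(hosts, slots):
    """Assign each host in a redundant group a reboot slot, round-robin
    over the sorted host names so the result is stable between runs."""
    if not slots:
        raise ValueError("need at least one reboot slot")
    return {host: slots[i % len(slots)] for i, host in enumerate(sorted(hosts))}


# Hypothetical redundant group and reboot windows.
hosts = ["smtp3.example.org", "smtp1.example.org", "smtp2.example.org"]
slots = ["Tue 04:00", "Wed 04:00"]

schedule = stagger(hosts, slots)
for host, slot in sorted(schedule.items()):
    print(f"{host} reboots {slot}")
```

With at least as many slots as hosts, no two members of the group are ever down at the same time; with fewer slots, neighbours in the sorted list still land in different slots.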

Non-redundant servers need to balance the need for uptime against the resources we want to spend as system administrators. While we've made some changes to package and depot to have them run faster, our server hardware tends to reboot slowly. Non-replicated file servers (such as Cyrus backends or AFS user servers) raise interesting questions. Lately, we've rebooted AFS servers weekly (with little complaint) but have attempted to minimize the downtime for Cyrus backends. Non-replicated services can also suffer from the "unintended upgrade" effect: a seemingly unrelated change causes downtime, and does so when no system administrator is immediately available to fix it. Possible remedies to this include:

careful monitoring of release logs by system administrators
automated testing by a regression suite on a beta machine

Production servers may also want to have /etc/NoPackage and /etc/NoDepot created after the machine starts. This way, if the machine happens to crash hard during the day, recovery time is much faster. One must take care to remove these files on regular reboots to ensure updates happen.

TODO: We should make this a %define
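In the meantime, the create-after-boot and remove-before-planned-reboot steps could be scripted along these lines. This is a sketch, not an existing tool: the two function names are invented, and the etc_dir parameter exists only so the code can be tried against a scratch directory instead of the real /etc.

```python
from pathlib import Path

FLAG_NAMES = ("NoPackage", "NoDepot")


def block_updates(etc_dir="/etc"):
    """Create the flag files once the machine is up, so an unplanned
    crash during the day comes back without running package or depot."""
    for name in FLAG_NAMES:
        (Path(etc_dir) / name).touch()


def allow_updates(etc_dir="/etc"):
    """Remove the flag files before a planned reboot so updates happen.
    Safe to call when the files are already gone (needs Python 3.8+)."""
    for name in FLAG_NAMES:
        (Path(etc_dir) / name).unlink(missing_ok=True)
```

One would call block_updates() late in the boot sequence and allow_updates() from whatever job triggers the regular reboot.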

Specifying specific paths in the depot.pref.proto of your production servers as the default behavior is discouraged. This is fine for early testing or to work around a specific bug, but the goal should be to not have to specify specific versions of a collection. Unless you pay attention to the releases, versions may get deleted out from under you, and dependency problems may sneak in and cause trouble later.

It is also discouraged to use /afs/andrew/system/dest paths in package configuration files. If you need to reference a specific version of a collection, it is preferable (though still discouraged) to do so by specifying a version via the depot.pref.proto. Having specific versions referenced in package files makes it more difficult to upgrade systems as some software may not exist in the new @sys. Dealing with this is much easier via depot than package.

6.0 Recommendations

This section summarizes recommendations buried in the text.

Make all hosts have a /usr/host symlink or an actual local depot repository.
Encourage beta releases at any time.
Require that gamma releases go out on a regular schedule. See the process section for details.
Don't hard code paths in the depot.pref.proto.
Don't use dest paths in the package files.

7.0 Changelog

$Log: env.html,v $
Revision 1.1.1.1  2003/02/25 19:35:05  wcw


Revision 0.7  2003/02/22 16:18:45  wcw
. fixed style sheet path
. added comments about specific versions in depot.pref.proto and dest
paths in package.proto

Revision 0.6  2003/02/22 16:01:15  wcw
minor formatting

Revision 0.5  2003/01/21 19:17:21  wcw
larry's pass