Andrew Web Publishing System Design Document

by Doug DeJulio (based on a template by Joshua Finkler and Audrey Mcloghlin)

Essentials

Introduction

Computing Services currently offers two web publishing mechanisms; they offer different features, and use different interfaces. Users find this situation confusing, and are also looking for features beyond those supported by either of the current systems. This document describes the proposed design for a replacement system.

This system allows users to:

Historical Considerations

Design Tradeoffs

All of the competing web publishing systems I was able to discover were intended for organizations with stricter requirements than we will be placing on our users. For example, most were designed to use templates to enforce a consistent look and feel across all pages. Because of this, they often ended up placing constraints on the sort of content published through them. For example, some of them require that content be written in XML rather than plain HTML.

Systems that did not impose this sort of restriction also did not attack enough of the problem to be considered a solution (e.g. Apache's mod_dav is really only useful for moving content from the desktop to the web server). Some of these may be useful as components of the system we build (e.g. we're looking at using mod_dav in conjunction with some access and authorization modules to support some users).

Storage subsystems were another area that required attention early in the process, since this selection affected future design choices. I evaluated high-end fileserver systems from NetApps, Oracle's iFS product, and AFS as well as "standard" local disk storage. Pricing seems to eliminate NetApps and iFS from consideration; our current AFS quotas do not meet the "must" level of our requirements for storage (150 MB vs 200 MB), and are well short of our "should" requirement (2 GB). As a result, the current plan is to use local disk as the storage medium for all hardware in the web publishing system.

In broad strokes, the current design is this: users transfer content from their client systems to a staging server via FTP or kFTP. WebDAV and other transfer protocols are future options but will not be implemented in this release. Staging server content is not available to the general public, but only to each collection's specified owners, administrators, writers, and reviewers (detailed descriptions of these roles follow in User Operations, below). The collection owner/admin can specify whether their collection uses a simple or complex publishing model. In the simple model, new content on the staging server is published to the production server in roughly ten-minute intervals. In the complex model, content is available for review on the staging server until the collection owner, admin or writer chooses to publish the content.

User Operations

There are five types of user for this system: collection owners, web publishers/writers, reviewers, readers, and system administrators.

Addressing the last two roles first:

Readers are persons permitted to view collection content on the production web server. (By default, production server content is accessible to unauthenticated users unless a reader list is specified.) Collection readers have no ability to change anything in the system; an entity specified as a reader can simply view web content on the production server once they have authenticated. As a result, this system offers no user interface for readers.

System administrators have two special abilities: they can change collection owners, and they can view all log files (publication and web/access). These functions are accessed through Unix administrative tools, and the specifics of these actions have not yet been determined. Since they are internal to the system and have no user visibility, we will defer detailed description of these functions.

The abilities accorded the first three roles can be summarized as follows:

Collection Owners

Web Publisher/Writer

Reviewer

Note that a collection owner has all abilities accorded to writers and reviewers. This section will describe the functions available to a collection owner; these functions behave identically in each role. Each description below assumes that the user has already authenticated to prove their identity, and that the user is authorized for this role. Further, it refers only to the collection in which they serve this role; a given individual may serve as owner, writer, reviewer, and reader for a variety of collections.

View staging server content

Content on the staging server is not accessible to the general public by default; for a user to access content for a given collection on the staging server, they must be an owner, admin, writer, or reviewer for that collection.

Access publishing log

This log records all publishing events; that is, when content was moved from the staging server to the production server.

Immediate link checking

Performs an on-demand check of all links in a collection, and returns a list of broken links (if any) to the user upon completion.

Web log access

This log records all http accesses to a collection. The user can view simple reports, or download the raw log data.

Upload content

The ability to transfer content from a client machine to the staging server. Initially this is limited to FTP and kFTP but may expand in the future.

Publish content

Move the entire contents of a collection from the staging server to the production server. This option is only available for the complex publishing model; in the simple publishing model, content moves from the staging server to the production server on a preset, systemwide interval. In the complex model, users can specify that the content be published immediately, or at a specific time.

Set link checking reports

The ability to specify when the automatic link checking utility is to be run, and to whom broken link reports should be sent. These reports may only be arranged by owners, admins, or writers within a given collection.

Set link checking escalation

In some instances, the person maintaining a web page may not be the page's owner. In a typical situation, a department head may have a student maintaining a page. When broken links persist, the page's owner should be notified. This option allows collection owners to specify an interval for running the report, as well as how "stale" the links should be. For example, a user might want a weekly report of all links that have been broken for longer than two weeks. In our example, a department head may want to know every week about all links broken for more than two weeks in order to arrange for repairs, but they might not care about a regular weekly report (which their student staff would normally address in the normal course of maintenance).
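The staleness rule described above can be sketched as a small filter. This is only an illustration: the link checker is a separate project, and the function name `stale_links` and the `broken_since` mapping (URL to the time the link was first seen broken) are assumptions, not part of this design.

```python
from datetime import datetime, timedelta

def stale_links(broken_since, now, min_age=timedelta(weeks=2)):
    """Return the URLs that have been broken for longer than min_age.

    broken_since maps each broken URL to the time it was first seen broken
    (a hypothetical structure; the real link checker may store this differently).
    """
    return sorted(url for url, since in broken_since.items()
                  if now - since > min_age)

# Example: an escalation report run on 2001-11-01 (made-up data).
report = stale_links(
    {
        "http://example.org/old": datetime(2001, 10, 1),
        "http://example.org/new": datetime(2001, 10, 30),
    },
    now=datetime(2001, 11, 1),
)
# report == ["http://example.org/old"]
```

Running such a filter weekly with a two-week threshold would give the department head in the example only the links their staff has not yet repaired.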

Add/delete writers

The ability to add entities to, or remove them from, the list of writers for a given collection.

Add/delete reviewers

The ability to add entities to, or remove them from, the list of reviewers for a given collection.

Add/delete readers

The ability to add entities to, or remove them from, the list of readers for a given collection.

Select simple/complex model

Select the publishing model to be used for the collection. In the simple model, uploading content results in the queueing of a "publish" event; in all other respects, the models behave identically "under the hood", although we have the option of providing users of the simple model with a simpler user interface.

Administration Front End

To administer a collection, a user accesses a central web administration page. They are asked to authenticate; when they do this, we retrieve their collection memberships from a directory, and present them with the list of collections to which they have access as owners, admins, writers, or reviewers. When they select a collection, a page appears which presents the options appropriate to their role in that collection.

Observations/implications of this design: This design prohibits users from entering random URLs when accessing collections; this should make life simpler for most users, but might be cumbersome for users who have roles in multiple collections, as they must address each collection individually. Web logs will be available both formatted (using Analog or some equivalent) and raw. The page for a given collection will show a log of the most recent publishing events, and will have a mechanism for requesting more of this log.

Data Model Discussion

LDAP Schemas

For information on the format used for these schemas and for examples of other LDAP schemas, see http://ldap.hklc.com/.

objectclass: webCollection

Must Have:
Requires:
May have:

The publishing event handler will only perform operations that are valid for a collection with a given webCollectionStatus. For example, if a collection is currently "archived", deletion may be possible but publishing will not be.
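One way to picture this check is as a lookup table from status to permitted actions. The sketch below is hypothetical: only "archived" and "deleted" are status names that appear in this document, while "active" and the exact contents of each set are assumptions made for illustration.

```python
# Which actions the event handler accepts for each webCollectionStatus.
# Only "archived" and "deleted" are status names taken from this document;
# "active" and the exact contents of each set are assumptions.
PERMITTED_ACTIONS = {
    "active": {"publish", "update", "archive", "delete"},
    "archived": {"delete"},
    "deleted": set(),
}

def is_permitted(status, action):
    """Return True if the action is valid for a collection in this status."""
    return action in PERMITTED_ACTIONS.get(status, set())
```

An unknown status permits nothing, which seems the safer default for a handler that refuses events it does not recognize.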

If you wish to grant webCollectionReviewer, webCollectionReader or webCollectionWriter rights to multiple users or groups, then multiple instances of the attributes must be present with one user or group per instance.

objectclass: webACL

Must Have:
Requires:
May Have:

SQL Tables

WebCollections table

collection    string
wcid          integer (unique counter)

WebPubEvents table

collection    string
revision      string
eventstatus   "pending", "done", "refused", or "failed"
action        "publish", "update", "archive", "delete", or "log"
userid        string
queuetime     date/time stamp
schedtime     date/time stamp
start         date/time stamp
finish        date/time stamp
message       long string

The collection must match the cn of an existing LDAP entry with objectclass webCollection. The queuetime records when the record was inserted into this table, and is used for logging. The schedtime is the schedule time when the action should be taken. The start entry is the time when the publishing system has started the action (ie. it should be set immediately after an SQL "BEGIN" statement), and the finish entry is the time when the publishing system has finished the action (ie. it should be set immediately before an SQL "COMMIT" statement).
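As a rough illustration, the WebPubEvents table could be declared as follows. This sketch uses Python's built-in sqlite3 module purely so that it is runnable; the database engine, the concrete column types, and the helper name `make_events_db` are assumptions, since this document gives only loose types such as "string" and "date/time stamp".

```python
import sqlite3

# Column types are assumptions; the document only gives loose types
# ("string", "date/time stamp", "long string"). The CHECK constraints
# list exactly the values named for eventstatus and action.
SCHEMA = """
CREATE TABLE WebPubEvents (
    collection  TEXT NOT NULL,
    revision    TEXT,
    eventstatus TEXT NOT NULL
        CHECK (eventstatus IN ('pending', 'done', 'refused', 'failed')),
    "action"    TEXT NOT NULL
        CHECK ("action" IN ('publish', 'update', 'archive', 'delete', 'log')),
    userid      TEXT,
    queuetime   TEXT NOT NULL,
    schedtime   TEXT,
    start       TEXT,
    finish      TEXT,
    message     TEXT
);
"""

def make_events_db():
    """Return an in-memory database containing an empty WebPubEvents table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

The constraint that collection must match the cn of an LDAP entry cannot be expressed in SQL and would have to be enforced by whatever code inserts rows.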

WebColResources table

collection    string
url           string
version       string
base          binary

When a web collection is published, all of the URLs for resources in it are written to this table, and the base URL it was published to is written

Legal Transactions

Most of the transactions in the system will be so simple that there's no question of their atomicity. For example, to request a scheduled publication, you just insert a correctly-formatted row into the WebPubEvents table. For a single row insert, there should be no issues surrounding atomicity.

Setting an Expiration Date

Requirement 400.60 states that web publishers should have the ability to specify an "expiration date" for each collection they manage. This would be done by adding either a "delete" or "archive" event in the WebPubEvents table. If there's a forwardingUrl in the corresponding LDAP object, the labeledURI will redirect viewers to the new site.

Note that the difference between "delete" and "archive" is that an archived collection still exists on the staging server. The collection publishers will still be able to access the content on the staging server.

Viewing the Web Publishing Log

SELECT * FROM WebPubEvents WHERE collection='colname';

Generate a List of All a Collection's URLs

This is how the link checking subsystem, for example, could obtain an accurate list of every URL associated with a given collection.

SELECT url FROM WebColResources WHERE collection='colname' AND version='symver';

Uploading a File

This is complex because of two specific items in the requirements document.

Requirement 100.15 requires quotas to be enforced per-collection rather than per-user, and most existing quota systems work on a per-user basis. If we cannot depend on the underlying system to enforce these quotas, the file upload system will have to.

Requirement 200.1 requires content to be published "immediately" after changed files are stored. This means file uploads are going to have to trigger publishing events for the collection they're associated with.

Requirement 100.1 specifies that FTP and kFTP must be valid ways of uploading content. This means we will have to modify our already-modified version of wu-ftpd to support per-directory quotas and running "triggers" on file uploads. To avoid this work, we would have to modify both requirements, "go back to the drawing board" and redesign this whole system, or reconsider some alternative subsystems that we're currently rejecting (such as NetApps or Oracle iFS; reconsidering the use of AFS would require throwing out other requirements as well).
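The per-collection quota check itself is simple, wherever it ends up being implemented. The following sketch shows only the logic, under the assumption that each collection lives under a single directory on local disk; the actual enforcement would live inside the modified wu-ftpd, and the function names here are hypothetical.

```python
import os

def collection_usage(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so linked content is not counted.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total

def upload_allowed(root, incoming_bytes, quota_bytes):
    """Return True if storing incoming_bytes more keeps the collection within quota."""
    return collection_usage(root) + incoming_bytes <= quota_bytes
```

Walking the whole tree on every upload would be slow for large collections; a real implementation would more likely keep a running total per collection.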

If we wish to support WebDAV in addition to FTP, we would have to modify Apache's mod_dav as well, and we would have to make sure the authentication system used by the web server could be made compatible with commonly used WebDAV clients (such as Dreamweaver and GoLive) without compromising security. Note that this could be done at a future date if we decided WebDAV support was desirable but not required for the initial implementation.

The filesystem that the content is transferred to will be available via http. If there are webCollectionReviewer attributes in the LDAP entry for the collection, it will only be viewable to those listed. Otherwise, it will be limited to those listed in any webCollectionReader attributes in the ACL objects. If there are none of those either, it will be visible to anyone who can access the http server (which itself may or may not be limited to members of the CMU community). This will be how the staging server functionality is implemented. When collections are archived, this will also be how the archived content is accessed.
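The fall-through rule in this paragraph can be sketched as below. This covers only the reviewer/reader logic described here; the access granted to owners, admins, and writers, and the special handling of archived collections, are left out. The function names, and the idea of passing the user's ID together with their group names as one set of principals, are assumptions for illustration.

```python
def staging_viewers(reviewers, readers):
    """Return the set of principals allowed to view staged content.

    Returns None when there is no restriction at this level, meaning anyone
    who can reach the http server may view the content.
    """
    if reviewers:
        return set(reviewers)   # reviewer list takes precedence
    if readers:
        return set(readers)     # otherwise fall back to the reader list
    return None

def may_view_staging(principals, reviewers, readers):
    """principals is the user's ID plus the names of their groups (assumed)."""
    allowed = staging_viewers(reviewers, readers)
    return allowed is None or bool(allowed & set(principals))
```

Note that when a reviewer list exists, the reader list is not consulted at all, which matches the "Otherwise" in the text above.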

Listing All a User's Collections

First, do an LDAP search to determine all the groups the user is a member of. Then, iterate through each group. For each group, list all the collections where the group is either the owner, webCollectionAdmin or webCollectionWriter. Then repeat this for the user directly. This will be a complete list of all of the collections that the user has any special administrative access to beyond the ability to read it. So, for example this could be used to generate a menu of web pages that the user could publish.
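The lookup above can be sketched with plain data structures standing in for LDAP results. In this hypothetical example, `groups_of` plays the part of the group-membership search and each dictionary in `collections` plays the part of a webCollection entry; apart from "cs.211", which is used as an example elsewhere in this document, all names and sample data are made up.

```python
ADMIN_ROLES = ("owner", "webCollectionAdmin", "webCollectionWriter")

def collections_for_user(user, groups_of, collections):
    """Map collection name -> list of roles the user holds, directly or via a group."""
    # Treat the user's own ID as one more principal alongside their groups.
    principals = set(groups_of.get(user, ())) | {user}
    found = {}
    for entry in collections:
        roles = [role for role in ADMIN_ROLES
                 if principals & set(entry.get(role, ()))]
        if roles:
            found[entry["cn"]] = roles
    return found

groups_of = {"alice": ["cs-faculty"]}
collections = [
    {"cn": "cs.211", "owner": ["cs-faculty"], "webCollectionWriter": ["bob"]},
    {"cn": "chem.101", "owner": ["carol"], "webCollectionWriter": ["alice"]},
    {"cn": "art.300", "owner": ["dave"]},
]
menu = collections_for_user("alice", groups_of, collections)
# menu == {"cs.211": ["owner"], "chem.101": ["webCollectionWriter"]}
```

Against a real directory, the inner test would become one LDAP search per principal (or a single search with an OR filter) rather than a scan over every collection.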

Handling Account Status Changes

We first subscribe to the campus LDAP trigger server, requesting notification for account status changes. When an account changes status, we query the LDAP server to get the new status. We then make an additional LDAP query to get a list of all groups the account is a member of. Then we iterate through all of the groups plus the account name, searching for all collections that have either one of the groups or the account name as their owner. For each such collection, we determine if any change in the status of the collection is appropriate. Some changes can be implemented by changing the webCollectionStatus attribute in the appropriate LDAP object, and relying upon the web server's access control mechanisms to enforce the new value. Other changes will require the queueing of an event for handling by the event handler(s), or a combination of these.

Queueing an Event

Insert a row into the WebPubEvents table. The collection is the name of the collection. The eventstatus must be "pending". The action must be set to the type of event (eg. "publish"). The queuetime should be set to the time that the row is inserted, for accurate logging. The schedtime may be set, to cause the event to be delayed until after that time. The message may be set, if there's a desire to have a temporary note available on the logging page before the publishing event occurs (eg. "this is going out on Saturday morning because traffic is low then"). Once the row is written, it must not be edited by any subsystems other than the event handler(s).
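A minimal sketch of queueing an event follows, using Python's built-in sqlite3 module so that it can be run as-is. The database engine, the column types, and the function names are assumptions; the sketch also assumes that "log" rows are written by other code paths and so does not accept them here.

```python
import sqlite3
from datetime import datetime, timezone

# Actions handled by the event handler; "log" rows are assumed to be
# written by other code paths and are not queued here.
QUEUEABLE_ACTIONS = {"publish", "update", "archive", "delete"}

def create_events_table(conn):
    """Create a cut-down WebPubEvents table (column types are assumptions)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS WebPubEvents ("
        "collection TEXT, revision TEXT, eventstatus TEXT, \"action\" TEXT, "
        "userid TEXT, queuetime TEXT, schedtime TEXT, start TEXT, "
        "finish TEXT, message TEXT)"
    )

def queue_event(conn, collection, action, userid, schedtime=None, message=None):
    """Insert one pending event row and return its rowid."""
    if action not in QUEUEABLE_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    queuetime = datetime.now(timezone.utc).isoformat()
    with conn:  # commit the single-row insert, or roll it back on error
        cur = conn.execute(
            "INSERT INTO WebPubEvents "
            "(collection, eventstatus, \"action\", userid, queuetime, schedtime, message) "
            "VALUES (?, 'pending', ?, ?, ?, ?, ?)",
            (collection, action, userid, queuetime, schedtime, message),
        )
    return cur.lastrowid
```

The caller never sets revision, start, or finish; those columns belong to the event handler, in keeping with the rule that other subsystems must not edit a row once it is written.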

Handling the "Publish" Event

Note that all behavior of this event will be encapsulated in the event handler code. This means changes to this behavior can be implemented by making modifications in a single place. Other subsystems should not assume all of the steps will always be done in this precise way, and should only concern themselves with the pieces of data they have legitimate interest in.

The publishing system event handler first checks to make sure that the operation is permitted for the collection (eg. its webCollectionStatus is not "archived"). If the transaction is not legal, change its eventstatus from "pending" to "refused" and set the message, so that a record exists in the publishing log.

If the operation is legal, the next step is to check in (via the revision control system in use) the current revision of all of the resources on the staging server. If this fails (for example because it would exceed the collection's quota), the eventstatus is changed from "pending" to "failed" and the message is set. If it succeeds, the revision is set to the symbolic version that was just checked in via the revision control system, and the start time is set.

It then does a "BEGIN". A directory for the new content is created in the production web system storage. An ".htaccess" file is written that implements various collection-specific settings, such as binding the collection name to the log file entries and implementing access control and authorization. For each resource in the collection, the last checked in version of that resource is checked out into the new directory. As each resource is checked out, a row is written to the WebColResources table with the URLs that resource will be available at, and the symbolic version of the current content (taken from the revision control system). When we're done putting all the new resources in their new home on the production server, the last step is to atomically replace the link to the old directory with the link to the new directory so the httpd process picks up the change.
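The final link replacement can be done atomically on a POSIX filesystem by creating the new symbolic link under a temporary name and renaming it over the old one. The sketch below shows that technique; the function name and the ".new" suffix are assumptions, and it presumes the live path is a symbolic link (or does not exist yet), not a real directory.

```python
import os

def switch_live_link(live_link, new_dir):
    """Point live_link at new_dir by renaming a freshly made symlink over it."""
    tmp_link = live_link + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)          # clear debris from an interrupted run
    os.symlink(new_dir, tmp_link)
    # rename(2) replaces the old link in one step, so a reader sees either
    # the old target or the new one, never a missing link.
    os.replace(tmp_link, live_link)
```

Removing the old link and then creating a new one would leave a brief window in which the collection's URL resolves to nothing; the rename avoids that window.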

The very first time a collection is published to a given URL, some special one-time data must be written to a server configuration file. In order for this data to be picked up by the web server, the server must be restarted (for example with an "apachectl graceful" command). We believe this should not result in observable downtime, but restarts will still be scheduled only infrequently (twice a day). As a result, publishing content for brand new collections will appear to take a little longer than publishing content for existing collections.

When all of this has been successfully done, the old directory (if any) is deleted, the finish time is written, the eventstatus is set to "done", and we do a "COMMIT". If it fails, we instead do a "ROLLBACK", delete the new directory we created, change the eventstatus to "failed", and set the message so that the activity is recorded for the log.
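The commit-or-rollback bookkeeping can be sketched as follows, again using Python's sqlite3 module only so the example runs. The table layouts are cut down, and `copy_resources` is a hypothetical callback standing in for the checkout step; directory creation and cleanup are omitted.

```python
import sqlite3
from datetime import datetime, timezone

def _now():
    return datetime.now(timezone.utc).isoformat()

def create_tables(conn):
    """Create cut-down versions of the two tables (types are assumptions)."""
    conn.execute("CREATE TABLE WebPubEvents (collection TEXT, revision TEXT, "
                 "eventstatus TEXT, start TEXT, finish TEXT, message TEXT)")
    conn.execute("CREATE TABLE WebColResources (collection TEXT, url TEXT, "
                 "version TEXT)")

def run_publish(conn, event_id, copy_resources):
    """Run copy_resources(conn) in a transaction and record the outcome.

    Returns True on success. copy_resources is a hypothetical callback that
    checks files out and inserts WebColResources rows; it raises on failure.
    """
    with conn:  # the start time is recorded before the main transaction
        conn.execute("UPDATE WebPubEvents SET start = ? WHERE rowid = ?",
                     (_now(), event_id))
    try:
        with conn:  # COMMIT on success, ROLLBACK if anything raises
            copy_resources(conn)
            conn.execute("UPDATE WebPubEvents SET eventstatus = 'done', "
                         "finish = ? WHERE rowid = ?", (_now(), event_id))
        return True
    except Exception as exc:
        with conn:  # record the failure in its own transaction
            conn.execute("UPDATE WebPubEvents SET eventstatus = 'failed', "
                         "message = ? WHERE rowid = ?", (str(exc), event_id))
        return False
```

Because the failure is recorded after the rollback, the publishing log keeps a row for the failed attempt while the WebColResources rows written during it disappear.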

Handling the "Update" Event

The "update" event is used to update the metadata of a collection without updating its content. It is a much more lightweight operation than a full "publish" event.

The publishing system event handler first checks to make sure that the operation is permitted for the collection (eg. its webCollectionStatus is not "archived"). If the transaction is not legal, change its eventstatus from "pending" to "refused" and set the message, so that a record exists in the publishing log.

In the next step, each ".htaccess" file associated with the collection is dynamically regenerated. Note that a collection will only have multiple ".htaccess" files if it has been published to more than one location. A collection with multiple directories that is published to exactly one location will only have an ".htaccess" file at the top level directory, because this will give it sufficient scope to cover the entire collection.

If there is an error writing any of the ".htaccess" files, change the eventstatus to "failed" and set the message so that the activity is recorded for the log. If there is no error, set the eventstatus to "done" and set the finish timestamp.

Handling the "Archive" Event

The "archive" event is used to remove content from the production servers while making sure it remains available to the people working on the collection.

When the status of a web collection is "archived", the authorization logic on the staging server ignores the webCollectionReviewer attributes. This restricts the content to those working on the collection.

Handling the "Delete" Event

System Interfaces

Filesystem Access Interface

For the initial implementation, we plan to use either a network attached storage device or local storage on the web servers. In either case, the publishing system server will be able to access the filesystem as if it were native (for example via NFS). Web publishing filesystem operations will then be performed as if they were occurring on local disk. If we do not use an encrypted method for this, the systems will have second network interfaces and a private network, and filesystems will only be published on this private network.

Note that if there's more than one system providing production web service for a given collection, there's a very minor atomicity issue that would be extremely difficult to overcome. If local storage is used, it's in practical terms impossible for us to make a single operation formally atomic when it occurs on two distinct filesystems on two distinct loosely-coupled systems. If network storage is used, an atomic change to that storage may not be picked up by all the clients at exactly the same moment, and so may very briefly appear to be inconsistent between different production servers.

In the long term, we must support multiple mechanisms. For example, we might publish data to a web server in the chem department by using "scp" for file transfer and "ssh" to execute commands to link and rename files, and we might use WebDAV between the staging server and production server to push content to departmental servers running Microsoft's IIS. This should only require changes to the "event handler" software of the publishing system, as all other subsystems will be isolated from the filesystem and will only access it by queueing events.

Group Authentication Interface

The publishing system will need access to information about groups, such as all the faculty members of an academic department, or all the students taking a given class. This information is going to need to be in the LDAP directory at some point. We will rely on the generic LDAP group mechanism, and interact with it entirely by doing LDAP queries.

Backup System Interface

We will have to use the campus-wide backup mechanism that already serves other servers that do not store all of their data on AFS. It is hoped that the backup mechanism can be provided access to the file store used, so that it can "pull" data rather than forcing the publishing system to "push" it.

Future Improvements/Areas of Likely Change

If the initial implementation only allows file uploads via FTP/kFTP, then one obvious area for improvement would be additional mechanisms for file upload. One likely mechanism would be WebDAV, implemented via a modified mod_dav for Apache, and another would be SMB, implemented via a modified Samba.

The capability to publish a single collection to multiple virtual servers is certainly planned. The capability to publish to multiple physical servers, not all of which share filesystem space, is something we can consider adding.

The initial implementation uses a revision control system internally, but does not expose it to web publishers. More sophisticated use of the revision control system is planned for future development, including the capability to roll back content and to schedule the release of particular versions. The latter capability will, for example, permit users to upload a semester's worth of weekly-changing content at the beginning of the semester and set up a schedule to publish each piece at the correct time.


Appendix 1: Implementing User Operations in Detail

Top Level Interface

The individual web publisher will access the publishing system by pointing their web browser at a particular URL on the staging server. They will be asked to authenticate. The user ID will be used to generate a list of all the collections the user has access of any kind to, along with the type of access the user has to each collection (owner, admin, publisher, reviewer). The list will be presented to the user -- ordinary users will never type in a collection name.

From this list, the user will be able to follow a link to the collection's content on the staging server, to the collection's content on the production server, and to the collection's administration page. The administration page will show only the options that the user is permitted for the collection chosen.

Operations on Collections

View staging server content

Each collection gets a URL on the staging server, and anyone with the correct rights can just point their web browser at it. For example, the "cs.211" collection will appear at "http://kresgie.andrew.cmu.edu/collections/cs.211/".

When a web browser is pointed at this URL, the Apache module for access control will check to see if the collection has a status of "deleted", and will deny access if so. If the content on the staging server is viewable by the public, it will be delivered. Otherwise, the user will have to authenticate, and the Apache module for authorization will check the LDAP server to make sure the authenticated user has the right to view the content before delivering it.

Access publishing log

The web publishing log can be fetched via an extremely simple SQL query. The default administration page for each collection will show the last ten publishing events. The user will have the option to display the entire publishing log.
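The "last ten events" query might look like the sketch below. It uses Python's sqlite3 module and a cut-down table so that it runs standalone; the function name, the sample data, and the assumption that queuetime is stored in a sortable format such as ISO 8601 are not part of this design.

```python
import sqlite3

def recent_events(conn, collection, limit=10):
    """Return the newest publishing-log rows for one collection."""
    cur = conn.execute(
        "SELECT queuetime, eventstatus, userid, message "
        "FROM WebPubEvents WHERE collection = ? "
        "ORDER BY queuetime DESC LIMIT ?",
        (collection, limit),
    )
    return cur.fetchall()

# Demonstration with twelve made-up events, one per day.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WebPubEvents (collection TEXT, eventstatus TEXT, "
             "userid TEXT, queuetime TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO WebPubEvents VALUES ('cs.211', 'done', 'alice', ?, NULL)",
    [(f"2001-11-{day:02d}T09:00:00",) for day in range(1, 13)],
)
latest = recent_events(conn, "cs.211")
# latest holds the ten newest rows, starting with the event of 2001-11-12
```

Passing a larger limit would serve the "display the entire publishing log" option.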

Immediate link checking

An interface for doing an immediate one-time run of the link checking system will be provided, but as the link checking system is a distinct project, the details for what will occur will have to appear elsewhere.

Access web log

An interface for accessing simple "cooked" log reports as well as downloading raw log data will be provided, but as the logging system is a distinct project, the details for what will occur will have to appear elsewhere.

Upload content

Users will upload content via FTP and/or kFTP.

Publish content

There will be a "publish" button. Optionally, the user may enter a date and time at which to schedule the publishing. In any event, the data is used as per "Queueing an Event" in the "Legal Transactions" section of this document.

Set link checking reports

Set link checking escalation

Again, interfaces for these will be provided, but the details of what information is required and how that information will be used will have to come from the link checking project.

Add/delete writers

Add/delete reviewers

Both of these will be extremely simple. For addition, the user will be able to enter either a user or group name to be added. We can do an LDAP lookup to verify that the user or group exists, and then we add another instance of the appropriate attribute to the LDAP object.

For deletion, the user will be presented with a list of current users and groups in each category, and will be able to select items for deletion. The corresponding attributes will be removed from the LDAP object.

Note that this is a relatively expensive operation, as it involves writing to the LDAP server. Users should be discouraged from doing this frequently.

Add/delete ACLs

This involves taking a URL and a list of users that can access that URL, and creating a new LDAP object based on these.

Add/delete readers

This is similar to the operation that adds and deletes writers and reviewers, except that it operates on ACL objects rather than collection objects.

Select simple/complex model

A simple switch will allow the user to switch to or from the complex publishing model. When going from the simple model to the complex model, the attribute in the LDAP object is updated. When going from the complex model to the simple model, the LDAP object is updated and a "publish" event is queued (to make sure the current content in the staging area is in sync with the production server).
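The asymmetry between the two directions can be sketched as below. A dictionary stands in for the collection's LDAP object, "webPubSimpleModel" is the boolean attribute named in the revision history, and `queue_event` is a hypothetical callback for the event-queueing step; the function name is an assumption too.

```python
def set_publishing_model(entry, simple, queue_event):
    """Update the collection's model flag; queue a publish when moving to simple."""
    was_simple = entry.get("webPubSimpleModel", False)
    entry["webPubSimpleModel"] = simple
    if simple and not was_simple:
        # Bring the production server in sync with the staging area.
        queue_event(entry["cn"], "publish")
```

Selecting the model that is already in effect changes nothing and queues nothing in this sketch.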


Revision History

Document Revision # Action Taken, Notes When? By Whom?
0.8 Reintroduced "System Interfaces" section, to give more information on how various transactions will actually be implemented. 2001/7/9 Doug DeJulio
0.81 Added the "webCollectionAdmin" LDAP attribute, for cases where the "owner" must delegate day-to-day operations -- apparently, this is expected to be common. 2001/10/25 Doug DeJulio
0.85 Added the "webACL" LDAP objectclass, and changed the rest of the document to reflect its addition, because ACLs with a granularity of whole collections were too coarse. 2001/10/31 Doug DeJulio
0.86 Changed the LDAP "webPublishingModel" to "webPubSimpleModel" (a boolean attribute) to reflect a change reality inflicted upon our data model. The change occurred some time ago, but I forgot to update the document, sorry. 2001/11/12 Doug DeJulio
0.87 Changed the description of the "labeledURI" attribute of the "webACL" objectclass, to clarify how it is used. The change is particularly important for collections that are published to multiple URLs. 2001/11/13 Doug DeJulio
0.88 Added "userid" field to "WebPubEvents" SQL table, so the publishing log can track who was responsible for each event. 2001/12/03 Doug DeJulio
0.9 Added "log" value to "action" column of "WebPubEvents" SQL table, to be used for items that must be logged after they're acted upon but which are not acted upon by event handlers. An example of this is the editing of an ACL. 2002/04/26 Doug DeJulio