Web Publishing System Design Document

by Doug DeJulio (based on a template by Joshua Finkler and Audrey Mcloghlin)

I. Essentials

II. Introduction

III. Historical Considerations

IV. Competitive Analysis

All of the competing web publishing systems I was able to discover were intended for organizations with stricter requirements than we will be placing on our users. For example, most were designed to use templates to enforce a consistent look and feel across all pages. Because of this, they often ended up placing constraints on the sort of content published through them. For example, some of them require that content be written in XML rather than plain HTML.

Systems that did not impose this sort of restriction also did not attack enough of the problem to be considered a solution (eg. Apache's mod_dav is really only useful for moving content from the desktop to the web server). Some of these may be useful as components of the system we build (eg. we're looking at using mod_dav in conjunction with some access and authorization modules to support some users).

V. Design Tradeoffs

VI. Data Model Discussion

LDAP Schemas

For information on the format used for these schemas and for examples of other LDAP schemas, see http://ldap.hklc.com/.

objectclass: webCollection

Must Have:
Requires:
May have:

The publishing event handler will only perform operations that are valid for a collection with a given webCollectionStatus. For example, if a collection is currently "archived", deletion may be possible but publishing will not be.
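As a rough illustration of this check, here is a minimal Python sketch. Only the "archived" status and the "publish" action are named in this document; the "active" status, the "delete" action name, and the is_permitted function are assumptions made for the example.

```python
# Map each webCollectionStatus to the actions the event handler may perform.
# "active" and "delete" are assumed names; only "archived" and "publish"
# appear in the design document.
ALLOWED_ACTIONS = {
    "active": {"publish", "delete"},
    "archived": {"delete"},
}


def is_permitted(status, action):
    """Return True if the action is valid for a collection in this status."""
    # Unknown statuses permit nothing, so new statuses fail closed.
    return action in ALLOWED_ACTIONS.get(status, set())
```

Keeping the rules in one table means a new status or action can be added without touching the handler logic.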

SQL Tables

WebPubEvents table

collection   string
revision     string
eventstatus  (one of a short list of enumerated values)
action       (one of a short list of enumerated values)
queuetime    date/time stamp
schedtime    date/time stamp
start        date/time stamp
finish       date/time stamp
message      long string

The collection must match the cn of an existing LDAP entry with objectclass webCollection. The queuetime records when the record was inserted into this table, and is used for logging. The schedtime is the scheduled time at which the action should be taken. The start entry is the time when the publishing system started the action (ie. it should be set immediately after an SQL "BEGIN" statement), and the finish entry is the time when the publishing system finished the action (ie. it should be set immediately before an SQL "COMMIT" statement).

WebColResources table

collection   string
url          string
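The two tables could be declared roughly as follows. This is a minimal sketch that assumes SQLite through Python's sqlite3 module, since this document does not name a database. The eventstatus values in the CHECK constraint are the four that appear elsewhere in this document; the action column is left unconstrained because only "publish" is named. The NOT NULL choices and the create_schema function are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE WebPubEvents (
    collection  TEXT NOT NULL,   -- cn of an LDAP webCollection entry
    revision    TEXT,            -- set by the event handler after check-in
    eventstatus TEXT NOT NULL
        CHECK (eventstatus IN ('pending', 'refused', 'failed', 'done')),
    action      TEXT NOT NULL,   -- eg. 'publish'; full list not given here
    queuetime   TEXT NOT NULL,   -- when the row was inserted
    schedtime   TEXT,            -- optional: do not act before this time
    start       TEXT,            -- set when the handler starts the action
    finish      TEXT,            -- set when the handler finishes the action
    message     TEXT
);
CREATE TABLE WebColResources (
    collection  TEXT NOT NULL,
    url         TEXT NOT NULL
);
"""


def create_schema(conn):
    """Create both tables on an open sqlite3 connection."""
    conn.executescript(SCHEMA)
```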

VII. Legal Transactions

Most of the transactions in the system will be so simple that there's no question of their atomicity. For example, to request a scheduled publication, you just insert a correctly-formatted row into the WebPubEvents table. For a single row insert, there should be no issues surrounding atomicity.

Viewing the Web Publishing Log

SELECT * FROM WebPubEvents WHERE collection = 'colname';

Generate a List of All a Collection's URLs

This is how the link checking subsystem, for example, could obtain an accurate list of every URL associated with a given collection.

SELECT url FROM WebColResources WHERE collection = 'colname';
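When these two queries are issued from a program, the collection name is best passed as a bound parameter rather than pasted into the SQL text. The following is a minimal sketch assuming Python's sqlite3 module; the function names are hypothetical, and the ORDER BY on queuetime is an addition for readable log output, not something this document requires.

```python
import sqlite3


def publishing_log(conn, collection):
    """Return all WebPubEvents rows for one collection, oldest first."""
    cur = conn.execute(
        "SELECT * FROM WebPubEvents WHERE collection = ? ORDER BY queuetime",
        (collection,),
    )
    return cur.fetchall()


def collection_urls(conn, collection):
    """Return every URL recorded for one collection."""
    cur = conn.execute(
        "SELECT url FROM WebColResources WHERE collection = ?",
        (collection,),
    )
    return [row[0] for row in cur]
```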

Uploading a File

This is complex because of two specific items in the requirements document.

Requirement 100.15 requires quotas to be enforced per-collection rather than per-user, and most existing quota systems work on a per-user basis. If we cannot depend on the underlying system to enforce these quotas, the file upload system will have to.

Requirement 200.1 requires content to be published "immediately" after changed files are stored. This means file uploads are going to have to trigger publishing events for the collection they're associated with.

Requirement 100.1 specifies that FTP and kFTP must be valid ways of uploading content. This means we will have to modify our already-modified version of wu-ftpd to support per-directory quotas and running "triggers" on file uploads. To avoid this work, we would have to modify both requirements, "go back to the drawing board" and redesign this whole system, or reconsider some alternative subsystems that we're currently rejecting (such as NetApps or Oracle iFS; reconsidering the use of AFS would require throwing out other requirements as well).
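If the upload path has to enforce per-collection quotas itself, the check could be as simple as the following Python sketch. It assumes each collection occupies a single directory tree on the staging filesystem and that the quota is expressed in bytes; neither is stated in this document, and the function names are hypothetical.

```python
import os


def collection_usage(root):
    """Return the total size in bytes of regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so linked content is not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def upload_allowed(root, upload_size, quota_bytes):
    """Return True if storing upload_size more bytes stays within quota."""
    return collection_usage(root) + upload_size <= quota_bytes
```

Walking the tree on every upload is slow for large collections, so a real implementation would probably cache the running total.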

If we wish to support WebDAV in addition to FTP, we would have to modify Apache's mod_dav as well, and we would have to make sure the authentication system used by the web server could be made compatible with commonly used WebDAV clients (such as Dreamweaver and GoLive) without compromising security. Note that this could be done at a future date if we decided WebDAV support was desirable but not required for the initial implementation.

The filesystem that the content is transferred to will be available via HTTP. If there are webCollectionReviewer attributes in the LDAP entry for the collection, the content will be viewable only by those listed. Otherwise, it will be limited to those listed in any webCollectionReader attributes. If there are none of those either, it will be visible to anyone who can access the HTTP server (which itself may or may not be limited to members of the CMU community). This will be how the staging server functionality is implemented. When collections are archived, this will also be how the archived content is accessed.
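The precedence among these attributes can be sketched in Python as follows. The sketch assumes an LDAP entry has already been fetched and is represented as a dictionary mapping attribute names to lists of values; that representation and the function names are assumptions, and group membership expansion is left out.

```python
def staging_viewers(entry):
    """Return the set of principals allowed to view, or None for anyone."""
    reviewers = entry.get("webCollectionReviewer", [])
    if reviewers:
        # Reviewers, when present, take precedence over readers.
        return set(reviewers)
    readers = entry.get("webCollectionReader", [])
    if readers:
        return set(readers)
    # Neither attribute is set: anyone who can reach the server may view.
    return None


def can_view(entry, user):
    """Return True if user may view the collection's staged content."""
    viewers = staging_viewers(entry)
    return viewers is None or user in viewers
```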

Listing All a User's Collections

First, do an LDAP search to determine all the groups the user is a member of. Then, iterate through each group. For each group, list all the collections where the group is either the owner or a webCollectionWriter. Then repeat this for the user directly. This will be a complete list of all of the collections that the user has any special administrative access to beyond the ability to read them. So, for example, this could be used to generate a menu of web pages that the user could publish.
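The matching step can be sketched in Python as follows. The two LDAP searches are assumed to have been done already: user_groups is the list of groups the user belongs to, and collections is a list of webCollection entries represented as dictionaries of attribute lists. That representation and the function name are assumptions.

```python
def collections_for_user(user, user_groups, collections):
    """Return the cn of every collection the user can administer."""
    # The user counts as a principal alongside each of their groups.
    principals = set(user_groups) | {user}
    result = []
    for entry in collections:
        holders = set(entry.get("owner", []))
        holders |= set(entry.get("webCollectionWriter", []))
        if principals & holders:
            result.append(entry["cn"])
    return sorted(result)
```

Treating the user and their groups as one set of principals gives the same result as iterating over the groups and then repeating the search for the user.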

Handling Account Status Changes

We first subscribe to the trigger server, requesting notification for account status changes. When an account changes status, we query the LDAP server to get the new status. We then make an additional LDAP query to get a list of all groups the account is a member of. Then we iterate through all of the groups plus the account name, searching for all collections that have either one of the groups or the account name as their owner. For each such collection, we determine if any change in the status of the collection is appropriate. Some changes can be implemented by changing the webCollectionStatus attribute in the appropriate LDAP object, and relying upon the web server's access control mechanisms to enforce the new value. Other changes will require the queueing of an event for handling by the event handler(s), or a combination of these.
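The iteration described above can be sketched in Python as follows. Because the policy for deciding a collection's new status is not specified here, the sketch takes it as a caller-supplied function. The dictionary representation of LDAP entries, the function name, and any status values passed in are assumptions.

```python
def status_changes(account, new_status, groups, collections, decide):
    """Return {collection cn: new webCollectionStatus} for owned collections.

    decide(entry, new_status) is a hypothetical policy hook that returns
    the status the collection should have, or None for no change.
    """
    principals = set(groups) | {account}
    changes = {}
    for entry in collections:
        # Only collections owned by the account or one of its groups count.
        if not principals & set(entry.get("owner", [])):
            continue
        wanted = decide(entry, new_status)
        if wanted is not None and wanted != entry.get("webCollectionStatus"):
            changes[entry["cn"]] = wanted
    return changes
```

The caller would then apply each change to LDAP, queue an event, or both, as appropriate for the change.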

Queueing an Event

Insert a row into the WebPubEvents table. The collection is the name of the collection. The eventstatus must be "pending". The action must be set to the type of event (eg. "publish"). The queuetime should be set to the time that the row is inserted, for accurate logging. The schedtime may be set, to cause the event to be delayed until after that time. The message may be set, if there's a desire to have a temporary note available on the logging page before the publishing event occurs (eg. "this is going out on Saturday morning because traffic is low then"). Once the row is written, it must not be edited by any subsystems other than the event handler(s).
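A minimal sketch of this insert, assuming Python's sqlite3 module and time stamps stored as ISO 8601 strings in UTC (this document specifies neither). The queue_event function name is hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def queue_event(conn, collection, action, schedtime=None, message=None):
    """Insert one pending event row into WebPubEvents."""
    queuetime = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute(
            "INSERT INTO WebPubEvents"
            " (collection, eventstatus, action, queuetime, schedtime, message)"
            " VALUES (?, 'pending', ?, ?, ?, ?)",
            (collection, action, queuetime, schedtime, message),
        )
```

The eventstatus is fixed to "pending" inside the function so that callers cannot queue an event in any other state.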

Handling The "Publish" Event

Note that all behavior of this event will be encapsulated in the event handler code. This means changes to this behavior can be implemented by making modifications in a single place. Other subsystems should not assume all of the steps will always be done in this precise way, and should only concern themselves with the pieces of data they have legitimate interest in. In particular, note that the first draft only covers the case of publishing each collection to a single URL, but this can be (and probably will be) changed in the future without having much impact on the rest of the system.

The publishing system event handler first checks to make sure that the operation is permitted for the collection (eg. its webCollectionStatus is not "archived"). If the operation is not permitted, the handler changes the eventstatus from "pending" to "refused" and sets the message, so that a record exists in the publishing log.

If the operation is legal, the next step is to check in (via the revision control system in use) the current revision of all of the resources on the staging server. If this fails (for example because it would exceed the collection's quota), the eventstatus is changed from "pending" to "failed" and the message is set. If it succeeds, the revision is set to the symbolic version that was just checked in via the revision control system, and the start time is set.

The handler then issues a "BEGIN" and deletes all entries in the WebColResources table for the collection. A directory for the new content is created in the production web system storage. An ".htaccess" file is written that implements various collection-specific settings, such as binding the collection name to the log file entries and implementing access control and authorization. For each resource in the collection, the last checked-in version of that resource is checked out into the new directory. As each resource is checked out, a row is written to the WebColResources table with the URL at which that resource will be available. Once all the new resources are in place on the production server, the last step is to atomically replace the link to the old directory with a link to the new directory (as determined by the collection's url LDAP attribute) so that the httpd process picks up the change. When this has been done successfully, the old directory is deleted, the finish time is written, the eventstatus is set to "done", and the handler issues a "COMMIT". If it fails, the handler instead issues a "ROLLBACK", deletes the new directory it created, changes the eventstatus to "failed", and sets the message so that the activity is recorded for the log.
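Two pieces of this procedure are sketched here in Python: replacing a collection's WebColResources rows inside a single transaction, and atomically switching the link to the new directory. The sketch assumes Python's sqlite3 module, a POSIX filesystem on which renaming over an existing symbolic link is atomic, and that the link in question is a symbolic link. The function names and the ".new" suffix for the temporary link are hypothetical.

```python
import os
import sqlite3


def replace_resources(conn, collection, urls):
    """Replace a collection's WebColResources rows in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "DELETE FROM WebColResources WHERE collection = ?",
            (collection,),
        )
        conn.executemany(
            "INSERT INTO WebColResources (collection, url) VALUES (?, ?)",
            [(collection, url) for url in urls],
        )


def switch_link(link_path, new_dir):
    """Point the symbolic link at link_path to new_dir in one atomic step."""
    tmp_link = link_path + ".new"  # hypothetical temporary name
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an earlier failed run
    os.symlink(new_dir, tmp_link)
    # rename() replaces the old link itself, so readers see either the
    # old target or the new one, never a missing link.
    os.replace(tmp_link, link_path)
```

Creating the new link under a temporary name and renaming it over the old one avoids the window that deleting and recreating the link would leave.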

VIII. System Interfaces

IX. User Interfaces

There are three primary categories of user.

The owner of a collection has access to all of the same interfaces as all of the individual publishers for that collection. The system administrator is a special and very rare case. Unless a discussion mentions that it's specifically about the System Administrator, it will be considered to only apply to the other users.

When a user goes to the web administration page, we can generate a list of the collections that they have any kind of access to and use the list to build a menu or web page. So, there's no reason to ever let ordinary users type in the name of an existing collection.

The page for a given collection must show a log of the most recent publishing events, and must have a mechanism for requesting more of this log. It must provide a mechanism for publishing the collection immediately, and for scheduling a publishing event at a particular date and time. It should also provide a mechanism for editing the list of users and groups that can view the collection on the public server.

The owner must have certain other capabilities as well. They must be able to edit the list of groups and users with write access to the collection, and those with read access to the collection on the staging server. They also need to be able to switch between the "simple" and "complex" publishing models.

The precise details of the user interface, such as page layout, will not be specified in this document, as decisions about these details are more appropriately within the domain of User Services than the development group.

X. Future Improvements/Areas of Likely Change

If the initial implementation only allows file uploads via FTP/kFTP, then one obvious area for improvement would be additional mechanisms for file upload. One likely mechanism would be WebDAV, implemented via a modified mod_dav for Apache, and another would be SMB, implemented via a modified Samba.

The initial implementation only supports publishing a single collection to a single URL (for example, the collection "cs.courses.211" might be published to "http://www.cs.cmu.edu/courses/211"). The capability to publish a single collection to multiple virtual servers is certainly planned. The capability to publish to multiple physical servers, not all of which share filesystem space, is something we can consider adding.