Clusters Login Statistics
Specification Document
Revision 002 - 12/18/2001
Mike Kelleher - Clusters

1. Overview:
Clusters desires login data for its public labs for the purpose of generating statistics. These statistics will help Clusters determine appropriate upgrade paths and concentrate its support where it is needed most. However, login records are considered very sensitive personal information and cannot be exported en masse to Clusters directly. Additionally, Clusters desires affiliation/class information about its users but does not have the knowledge required to correlate affiliations with user IDs.
To that end, the Systems Group and Clusters will cooperate to get Clusters the login data it needs to generate statistics while preserving the integrity of the Kerberos logs and the privacy of its users.
The result of this cooperation shall be a script, written and maintained by the Systems Group, which will pull data from a Clusters-provided list of Cluster machines (defined below) and generate output files which will be readable by Clusters staff and used to generate statistics. These output files will not contain any personally identifiable user information.


2. Script Input:
2.1. Clusters will maintain a text file on AFS which the script will use as input to filter the Kerberos logs. The file shall be TAB-delimited and contain the following fields, in order:
	2.1.1. MachineIPAddress (Internet dot-notation format)
	2.1.2. MachineLocation
	2.1.3. MachinePlatform
2.2. This file shall be located at /afs/andrew/acs/ac/clusters/login-stats/cluster-machines and will be known as the "cluster-machines" file.
2.3. MachineLocation and MachinePlatform have no semantic meaning for the script. They are used as sort terms. These strings are case sensitive, so BH140 and bh140 will be considered separate locations. The strings must not contain the tab character.
2.4. If any line does not contain exactly three TAB-delimited columns, the script will exit with an error.
2.5. Sample cluster-machines data can be found at /afs/andrew/acs/ac/clusters/login-stats/cluster-machines-sample
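
The input rules in 2.1 through 2.4 can be illustrated with a minimal Python sketch. This is not the Systems Group's script; the function name, the exception class, and the choice to return a dictionary keyed by IP address are all assumptions made for illustration. The sketch treats a blank line as an error, which is one strict reading of 2.4, and it does not validate the IP address format.

```python
class ClusterMachinesError(Exception):
    """Hypothetical error type for a malformed cluster-machines file."""


def parse_cluster_machines(lines):
    """Map MachineIPAddress -> (MachineLocation, MachinePlatform).

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    machines = {}
    for lineno, raw in enumerate(lines, start=1):
        # Strip only the line terminator; tabs and spaces are significant.
        fields = raw.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            # Section 2.4: anything other than three columns is fatal.
            raise ClusterMachinesError(
                "line %d: expected 3 TAB-delimited fields, found %d"
                % (lineno, len(fields))
            )
        ip, location, platform = fields
        # Section 2.3: location and platform are opaque, case-sensitive strings.
        machines[ip] = (location, platform)
    return machines
```

A caller would open the cluster-machines file named in 2.2, pass the file object to parse_cluster_machines, and exit with a non-zero status if ClusterMachinesError is raised.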


3. Script Output:
3.1. The script will be run weekly in production use. The script shall place its output in the directory /afs/andrew/acs/ac/clusters/login-stats/. Given a PTS group or login ID, Clusters will grant appropriate read/write access in AFS to this directory.
The login stats script will generate two output files. In the filenames below, yyyy-mm-dd should be replaced with the date the script runs, in yyyy-mm-dd format.

3.2. Bulk Login Data
3.2.1. The first file, yyyy-mm-dd-logins, will have one line per Cluster machine login. The file will be TAB-delimited. Any tabs in the input data shall be converted to space characters before output.
3.2.2. The logins file will have the following fields, in order:
	3.2.2.1. Timestamp - mm/dd/yyyy hh:mm format.
	3.2.2.2. eduPersonAffiliation - (Student|Staff|Faculty)
	3.2.2.3. cmuDepartment - string
	3.2.2.4. cmuStudentClass - (Freshman|Sophomore|Junior|Senior|5thYear|GradStudent)
	3.2.2.5. MachinePlatform - from the cluster-machines input file
	3.2.2.6. MachineLocation - from the cluster-machines input file
3.2.3. Sample bulk login output can be found in /afs/andrew/acs/ac/clusters/login-stats/yyyy-mm-dd-logins-sample
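
As a rough illustration of 3.2.1 and 3.2.2, the following Python sketch builds one line of the logins file. The function and parameter names are hypothetical, and reading "hh:mm" as a 24-hour clock is an assumption, since the specification does not say. Missing values are written as "NIL" per section 3.4.

```python
from datetime import datetime

NIL = "NIL"  # Section 3.4: marker for missing values


def clean(value):
    """Return an output-safe field: NIL if missing, tabs turned into spaces."""
    if value is None or value == "":
        return NIL
    return str(value).replace("\t", " ")


def format_login_line(when, affiliation, department, student_class,
                      platform, location):
    """Build one TAB-delimited record in the field order of section 3.2.2."""
    fields = [
        when.strftime("%m/%d/%Y %H:%M"),  # assumes a 24-hour hh:mm
        affiliation,
        department,
        student_class,
        platform,
        location,
    ]
    return "\t".join(clean(f) for f in fields)
```

Note that no user ID appears among the fields, consistent with the requirement in section 1 that the output carry no personally identifiable information.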


3.3. Summary Data
3.3.1. The second output file, yyyy-mm-dd-summary, will contain summary information about user logins. This file will also be TAB-delimited. The file will have one unique platform/room/date combination per line.
3.3.2. It will contain the fields:
	3.3.2.1. Date - mm/dd/yyyy
	3.3.2.2. MachinePlatform - from the cluster-machines input file
	3.3.2.3. MachineLocation - from the cluster-machines input file
	3.3.2.4. totalLogins - int
	3.3.2.5. uniqueUserLogins - int (number of unique users seen)
3.3.3. The script will also generate daily totals by MachinePlatform, by MachineLocation, and by both (see sample output).
3.3.4. Sample summary data can be found in /afs/andrew/acs/ac/clusters/login-stats/yyyy-mm-dd-summary-sample
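
The per-line counts in 3.3.2 could be computed along the following lines. This is a sketch under stated assumptions: the function name and the input tuple layout (date, platform, location, user ID) are hypothetical, and the additional daily totals of 3.3.3 are left out because their layout is defined only by the sample output. The user ID is used solely to count unique users and is never written out.

```python
from collections import defaultdict


def summarize(logins):
    """Return summary lines in the field order of section 3.3.2.

    `logins` is an iterable of (date, platform, location, user_id) tuples,
    with date already formatted as mm/dd/yyyy.
    """
    totals = defaultdict(int)   # key -> totalLogins
    users = defaultdict(set)    # key -> set of distinct user IDs seen
    for date, platform, location, user_id in logins:
        key = (date, platform, location)
        totals[key] += 1
        users[key].add(user_id)
    # Sort keys so the output order is deterministic.
    return [
        "\t".join([key[0], key[1], key[2],
                   str(totals[key]), str(len(users[key]))])
        for key in sorted(totals)
    ]
```

Because the keys are compared as case-sensitive strings, BH140 and bh140 produce separate summary lines, matching 2.3.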


3.4. Missing values in any output file will be indicated by "NIL".