Log file analysis

Current log analysis

Log data is currently analysed in an Excel spreadsheet through an ODBC connection to the PIE database. Several queries have been set up, and most of these are re-run manually approximately every month. The current queries and analyses are detailed below under "Log file analysis software - requirements specification".

Advantages of the current method

Disadvantages of the current method

Log file analysis software - requirements specification

The purpose of log file analysis software would be to produce analyses of user behaviour and of resource usage. These analyses could be used to inform collection development decisions, decisions regarding the optimum number of concurrent users (where licensing restrictions exist), and the effective marketing of the PIE and of the resources that it provides access to. The analyses could also be used to inform the forthcoming studies of PIE user behaviour and of PIE performance (the latter will look at the technology specifications required to support specified numbers of users).

The main advantage of log analysis software is that it would automate the collection and collation of data, and hence reduce the administrative overheads. The software would also include a wider range of queries than are currently available. However, a considerable amount of time would be required to develop it.

The log file analysis software would perform all of the current queries and would also include the 'desirable functionality' relating to these queries (detailed below). The forthcoming 'Library staff evaluation study' will seek to identify further desirable queries.

Current queries

1. Number of users and the number of times each user has used the PIE

Data is collected from the following fields in the userdata table: mail, lastactiondate

This particular query is run every few days. The data is manually listed according to user type (e.g. student, academic, library staff, guest user, etc.). The total number of users and the number of users within each type are manually counted. The 'lastactiondate' indicates when the user last used the PIE. If it has changed since the query was last run, it is manually highlighted, so that the approximate number of times that each user has used the PIE can be recorded.
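The manual counting and highlighting described above could be automated along the following lines. This is only a sketch under stated assumptions: the function name `summarise_users` and the `classify` callable (which maps a mail value to a user type) are hypothetical, since the source does not say how user type is derived from the userdata fields. The previous run's values are assumed to have been kept as a simple mapping of mail to lastactiondate.

```python
from collections import Counter


def summarise_users(rows, previous, classify):
    """Summarise one run of the user query.

    rows:     iterable of (mail, lastactiondate) pairs from the userdata table.
    previous: dict mapping mail -> lastactiondate as recorded on the last run.
    classify: hypothetical callable mapping a mail value to a user type
              (the real rule for assigning user types is not specified).
    """
    counts = Counter()
    changed = []
    snapshot = {}
    for mail, last_action in rows:
        counts[classify(mail)] += 1
        snapshot[mail] = last_action
        # A changed lastactiondate means at least one visit since the last run.
        # Users who were not present on the last run are not flagged here.
        if mail in previous and previous[mail] != last_action:
            changed.append(mail)
    return {
        "total": sum(counts.values()),
        "by_type": dict(counts),
        "changed": sorted(changed),
        "snapshot": snapshot,  # keep this to pass as `previous` next time
    }
```

As with the manual method, this can only detect that a user has returned at least once between runs, so the resulting usage counts remain approximate.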

Desirable functionality

2. The total number of resources

Data is collected from the following fields in the item table: itemid (filtered by type equals resource), location

This query allows the total number of resources contained within the PIE to be (manually) calculated. The resource location will be either a location within the rdb or, in the case of web-links, the URL. The number of rdb resources and the number of web-links are manually calculated.
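The split between rdb resources and web-links might be calculated automatically as sketched below. The sketch uses Python's built-in sqlite3 module purely as a stand-in for the ODBC connection to the PIE database. It assumes a table named `item` with columns `itemid`, `type` and `location`, and assumes that a web-link can be recognised by a location beginning with "http://" or "https://". None of these details are confirmed by the source.

```python
import sqlite3


def count_resources(conn):
    """Count resources, split into web-links and rdb resources.

    conn: a DB-API connection (sqlite3 here as a stand-in for ODBC).
    """
    rows = conn.execute(
        "SELECT itemid, location FROM item WHERE type = 'resource'"
    ).fetchall()
    # Assumption: a location that starts with an http(s) scheme is a web-link;
    # anything else (including an empty location) is counted as an rdb resource.
    web_links = sum(
        1
        for _, location in rows
        if (location or "").lower().startswith(("http://", "https://"))
    )
    return {
        "total": len(rows),
        "web_links": web_links,
        "rdb": len(rows) - web_links,
    }
```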

3. Resources not owned by headliner

Data is collected from the following fields in the item table: itemid, ownerid (filtered by type equals resource and ownerid does not equal headliner)

This query retrieves the itemid and ownerid of all resources that are owned by PIE users other than headliner. The number of non-headliner resources provides an indication of how actively the users have been adding resources to their personal pages within their PIE. However, the number of non-headliner resources can rise or fall because users are able to both add and delete resources (and whole pages of resources).
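A possible form of this query, with a per-owner count added, is sketched below. As before, sqlite3 stands in for the ODBC connection, and the table name `item` and the literal owner value 'headliner' are assumptions about how the data is stored.

```python
import sqlite3


def non_headliner_resources(conn):
    """Count resources owned by users other than headliner, per owner.

    Rows with no ownerid are excluded, because a comparison with NULL
    is never true in SQL.
    """
    rows = conn.execute(
        "SELECT ownerid, COUNT(*) FROM item "
        "WHERE type = 'resource' AND ownerid <> 'headliner' "
        "GROUP BY ownerid ORDER BY ownerid"
    ).fetchall()
    per_owner = dict(rows)  # each row is an (ownerid, count) pair
    return {"total": sum(per_owner.values()), "per_owner": per_owner}
```

Because users can delete as well as add resources, recording the total with the date of each run would show only the net change between runs, not the number of additions and deletions.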

Desirable functionality

4. Number of pages

Data is collected from the following fields in the item table: itemid (filtered by type equals page), ownerid

This query enables the manual calculation of the total number of pages, the number of pages owned by users and the number of system (and headliner) pages. The number of users that own more than one page is manually calculated.
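The page counts described above could be derived from the query results as in the following sketch. The function name `summarise_pages` is hypothetical, and the owner values used to recognise system and headliner pages ("system" and "headliner") are assumptions; the actual ownerid values would need to be checked against the database.

```python
from collections import Counter

# Assumed ownerid values for system and headliner pages (not confirmed).
SYSTEM_OWNERS = frozenset({"system", "headliner"})


def summarise_pages(rows, system_owners=SYSTEM_OWNERS):
    """Summarise page ownership.

    rows: iterable of (itemid, ownerid) pairs for items whose type is page.
    """
    pages_per_owner = Counter(ownerid for _, ownerid in rows)
    user_counts = {
        owner: n for owner, n in pages_per_owner.items()
        if owner not in system_owners
    }
    user_pages = sum(user_counts.values())
    total = sum(pages_per_owner.values())
    return {
        "total": total,
        "user_pages": user_pages,
        "system_pages": total - user_pages,
        # Owners (other than system/headliner) holding more than one page.
        "users_with_multiple_pages": sorted(
            owner for owner, n in user_counts.items() if n > 1
        ),
    }
```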

Desirable functionality

Desirable queries

1. List creation

2. Search behaviour

3. Use of specific pages, including Help and Customisation pages

4. Use of discussion areas (if this feature were to be implemented)

5. Use of an integrated EEDD (if this feature were to be integrated within the PIE)

Interface description

The administrator will first authenticate themselves. All possible queries (detailed above) will then be listed. The administrator will be able to select which of these pre-defined queries are to be run and how frequently, and to generate either a complete report for a specified period (including data from all possible queries) or a report containing the data from selected queries. The administrator will be able to specify how often a particular report should be run. It will be possible for the administrator to print reports, to save reports and to download data from reports into an Excel spreadsheet.

Processing a log file is likely to be a large task which requires a lot of resources. What we did in Decomate was to take a month's worth of log file, store it separately, and process the information to be put into a database. This process took several hours for each log file, though Decomate is much more heavily used than the PIE. Accessing more than a month's worth of data from Decomate's database proved to be impossible in Excel for memory reasons. We probably need to do something along the lines of storing each query's results when it is run and then using Excel to combine them; this would be much quicker once the log file for a particular period has been processed.
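One way of storing each query's results and combining them later is sketched below. It is an illustration only: the function names, the directory layout (one folder per query, one CSV file per period) and the use of CSV as the format to be opened in Excel are all assumptions, not decisions that have been made.

```python
import csv
from pathlib import Path


def store_result(base_dir, query_name, period, header, rows):
    """Write one run of one query to <base_dir>/<query_name>/<period>.csv."""
    query_dir = Path(base_dir) / query_name
    query_dir.mkdir(parents=True, exist_ok=True)
    path = query_dir / f"{period}.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def combine_results(base_dir, query_name, out_path):
    """Merge all stored runs of a query into one CSV with a 'period' column.

    out_path should lie outside the query's own folder so that the combined
    file is not picked up as a stored run. Returns the number of runs merged.
    """
    files = sorted((Path(base_dir) / query_name).glob("*.csv"))
    with Path(out_path).open("w", newline="") as out:
        writer = csv.writer(out)
        header_written = False
        for path in files:
            with path.open(newline="") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:
                    continue  # skip an empty file
                if not header_written:
                    writer.writerow(["period"] + header)
                    header_written = True
                for row in reader:
                    # The file name (without .csv) identifies the period.
                    writer.writerow([path.stem] + row)
    return len(files)
```

With this arrangement only the small, already summarised result files are combined, so the full log data for earlier periods never needs to be loaded into Excel.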