Log file analysis
Current log analysis
Log data is currently analysed in an Excel spreadsheet through an ODBC
connection to the PIE database. Several queries have been set up; most of
these are re-run manually approximately every month. The current queries and
analyses are detailed below under "Log file analysis software -
requirements specification".
Advantages of the current method
- The queries were relatively quick and easy to set up (but knowledge
of how to set them up and what data to sample and correlate is required).
- The current method didn't require any programming (processing the PIE
database would be a time-consuming task) and is a solution that uses only
third-party products likely to be readily available in an academic environment.
Disadvantages of the current method
- The amount of time that it takes to collate the data can be
considerable. It currently takes about 15 minutes to collate the data for query
1. below (Number of users and the number of times each user has used the PIE).
The amount of time taken to collate the data increases as the number of PIE
users increases.
- The reliance on the administrator to remember to run the queries
regularly is also a disadvantage. It is not always possible to run the queries
regularly due to other task commitments.
- The data is sampled only periodically, so changes that occur between two
sample dates aren't recorded. For example, it isn't possible to accurately
calculate the number of times that a person has used the PIE. The current
'lastactiondate' is compared with the one recorded when the query was last run
(which could have been a few days previously), so it isn't possible to tell
whether the user has used the PIE on more than one occasion between the two
sample dates.
The PIE database wasn't designed to be used for logging the number of times
that users use the PIE. The purpose of the lastactiondate/time is to work out
whether a user needs to be asked to log on.
- Information from the PIE database cannot be correlated with the resource
database (rdb), so it is not possible, for example, to obtain a list of all
resources that have not been accessed.
Log file analysis software - requirements specification
The purpose of log file analysis software would be to produce analyses
of user behaviour and of resource usage. These analyses could be used to inform
collection development decisions, decisions regarding the optimum number of
concurrent users (where licensing restrictions exist), and the effective
marketing of the PIE and of the resources that it provides access to. The
analyses could
also be used to inform the forthcoming studies of PIE user behaviour and of PIE
performance (the latter will look at the technology specifications required to
support specified numbers of users).
The main advantage of log analysis software is that it would automate
the collection and collation of data and hence the administrative overheads
would be reduced. The log analysis software would include a wider range of
queries than are currently available. However, a considerable amount of time
would be required to develop the software.
The log file analysis software would perform all of the current queries
and would also include the 'desirable functionality' relating to these queries
(detailed below). The forthcoming 'Library staff evaluation study' will seek to
identify further desirable queries.
Current queries
1. Number of users and the number of times each user has used the PIE
Data is collected from the following fields in the userdata table: mail,
lastactiondate
This particular query is run approximately every few days. The data is
manually listed according to user type (e.g. student, academic, library staff,
guest user, etc.). The total number of users and the number of users within
each type is manually counted. The 'lastactiondate' indicates when the user
last used the PIE. This is manually highlighted if it has changed since the
query was last run and hence the approximate number of times that each user
has used the PIE can be recorded.
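The manual comparison described above could be automated along the following lines. This is a minimal sketch, assuming each run of the query yields a mapping from mail to lastactiondate; the function name `changed_users` and the sample data are hypothetical and are not taken from the PIE itself.

```python
from datetime import date


def changed_users(previous, current):
    """Return the users whose lastactiondate differs between two query runs.

    Both arguments map a user's mail address to the lastactiondate seen
    when the query was run. Users absent from the previous run count as
    changed.
    """
    return sorted(
        mail for mail, last in current.items()
        if previous.get(mail) != last
    )


# Hypothetical sample data standing in for two runs of query 1.
run_1 = {"a@example.org": date(2001, 3, 1), "b@example.org": date(2001, 3, 2)}
run_2 = {"a@example.org": date(2001, 3, 5), "b@example.org": date(2001, 3, 2)}

changed = changed_users(run_1, run_2)
```

Each changed date adds at most one to a user's tally, however many sessions took place between the two runs, so the resulting figure is only a lower bound on actual use.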
Desirable functionality
- An accurate record of how many times each user has logged into the
PIE.
- A record of the number of users contained within defined 'groups'
(e.g. each course/each department - using the groups that have been defined in
the rdb - these mirror the groups to which users are assigned by the
authentication broker).
[The log analysis software would need to be able to differentiate users of
different types and in different groups.]
2. The total number of resources
Data is collected from the following fields in the item table: itemid
filtered by type equals resource, location
This query allows the total number of resources contained within the PIE to
be (manually) calculated. The resource location will either be a location
contained within the rdb or, in the case of web-links, the URL. The number of
rdb resources and web-links is manually calculated.
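The manual count could be replaced by a small classification step. The sketch below assumes that a web-link can be recognised by a URL scheme at the start of its location and that anything else is an rdb location; that rule, the `WEB_SCHEMES` set, the function name and the sample rows are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumption: these URL schemes mark a location as a web-link.
WEB_SCHEMES = {"http", "https", "ftp"}


def count_resources(rows):
    """Count rdb resources and web-links from (itemid, location) rows."""
    counts = {"rdb": 0, "web-link": 0}
    for _itemid, location in rows:
        scheme = urlparse(location).scheme.lower()
        kind = "web-link" if scheme in WEB_SCHEMES else "rdb"
        counts[kind] += 1
    counts["total"] = counts["rdb"] + counts["web-link"]
    return counts


# Hypothetical rows standing in for the output of query 2.
rows = [
    (1, "http://www.example.org/journal"),
    (2, "rdb-record-17"),
    (3, "https://www.example.org/data"),
]
resource_counts = count_resources(rows)
```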
3. Resources not owned by headliner
Data is collected from the following fields in the item table: itemid
filtered by type equals resource and ownerid does not equal headliner
This query retrieves the itemid and ownerid of all resources that are owned
by PIE users other than headliner. The number of non-headliner resources
provides an indication of how actively the users have been adding resources to
their personal pages within their PIE. However, the number of non-headliner
resources can rise or fall because users are able to both add and delete
resources (and whole pages of resources).
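Expressed as SQL, the filter described above might look like the following sketch. It uses an in-memory SQLite table as a stand-in for the PIE database (which is actually reached over ODBC); the table name `item`, the column types and the sample rows are assumptions, while the column names itemid, type and ownerid come from the description above.

```python
import sqlite3

# In-memory stand-in for the PIE database; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (itemid INTEGER, type TEXT, ownerid TEXT)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [
        (1, "resource", "headliner"),
        (2, "resource", "user17"),
        (3, "page", "user17"),
        (4, "resource", "user42"),
    ],
)

# Resources whose owner is anyone other than headliner.
non_headliner = conn.execute(
    "SELECT itemid, ownerid FROM item "
    "WHERE type = ? AND ownerid <> ? ORDER BY itemid",
    ("resource", "headliner"),
).fetchall()
```

Because this is a snapshot of current ownership, it shares the limitation noted above: additions and deletions between two runs cancel out and are not visible in the count.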
Desirable functionality
- A regular 'statement' providing, for a specified period, the total number of
rdb resources and web-links added and deleted by the group of
non-administrative users, and the current balance of rdb resources and
web-links owned by these users. The statement could also provide the average
number of rdb resources and web-links that the users have added to their
page(s). It would be useful to be able to generate a similar statement for
administrative users and by user type (e.g. undergraduate, postgraduate,
academics) or group (e.g. particular courses/departments).
The statement could also include the following:
- A rank of user types (e.g. undergraduate, postgraduate,
academics) and of user groups (e.g. particular courses/departments) by the
number of resources that they have added. These rankings would enable patterns
to be identified - do particular types/groups of users add more resources than
others?
- A record of how many times each particular resource had been
accessed by the total group of PIE users and by particular groups/types of PIE
users (e.g. by course, department or by students, academics, etc.).
- A ranking of each rdb resource and each web-link by the number of
times that users had added them to their pages. This would indicate which
resources users most commonly add to their pages.
- A record of resources that had never been added to a page and a
record of resources that had never been accessed.
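The ranking and the 'never added' record in the list above could be derived from a log of add events, as in the sketch below. It assumes that the full list of resource identifiers is available from the rdb and that each add event is logged with the identifier of the resource concerned; the function name and sample data are hypothetical.

```python
from collections import Counter


def rank_and_unused(all_resources, add_events):
    """Rank resources by how often they were added; list those never added."""
    added = Counter(add_events)
    # Most frequently added first; ties are broken by resource identifier.
    ranking = sorted(added.items(), key=lambda pair: (-pair[1], pair[0]))
    never_added = sorted(set(all_resources) - set(added))
    return ranking, never_added


# Hypothetical data: four known resources and five logged add events.
all_resources = ["r1", "r2", "r3", "r4"]
add_events = ["r2", "r1", "r2", "r2", "r1"]
ranking, never_added = rank_and_unused(all_resources, add_events)
```

The same function would serve for the record of resources never accessed if it were given access events instead of add events.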
4. Number of pages
Data is collected from the following fields in the item table: itemid
filtered by type equals page, ownerid
This query enables the manual calculation of the total number of pages, the
number of pages owned by users and the number of system (and headliner) pages.
The number of users that own more than one page is manually calculated.
Desirable functionality
- A record of the total and average number of pages owned by users of
each type and of each group.
- The number of users who own different numbers of pages (e.g. 1 page =
340 users, 2 pages = 20 users etc.)
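The distribution in the last item could be computed by counting twice: first pages per owner, then owners per page count. This is a sketch only; it assumes the input rows have already had system and headliner pages filtered out, and the function name and sample rows are hypothetical.

```python
from collections import Counter


def page_distribution(page_rows):
    """Map 'number of pages owned' to 'number of users owning that many'.

    page_rows holds (itemid, ownerid) pairs for items of type page.
    """
    pages_per_user = Counter(ownerid for _itemid, ownerid in page_rows)
    return dict(sorted(Counter(pages_per_user.values()).items()))


# Hypothetical rows standing in for the output of query 4.
page_rows = [(10, "u1"), (11, "u2"), (12, "u2"), (13, "u3")]
distribution = page_distribution(page_rows)
```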
Desirable queries
1. List creation
- The total and average number of lists created by users of each type
and user group.
- The average number of lists created per page.
2. Search behaviour
- The total and average number of rdb searches run during a particular
period.
- The total and average number of rdb searches run per user type and
per user group.
- A log of search terms used.
- A record detailing the number of unsuccessful searches and the search
terms and parameters used. In order to find out whether the PIE is meeting the
users' needs (in terms of the resources that it contains) we could look at what
users had been searching for when no results were returned to them.
If the number of targets available for cross-searching were to be increased:
- The number of cross-searches that users of particular types and
groups run, ranked by the number of targets per search.
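The record of unsuccessful searches mentioned above could be produced from a search log along these lines. The sketch assumes each log entry carries the search term and the number of results returned; that record layout, the normalisation of terms to lower case, the function name and the sample entries are all assumptions.

```python
from collections import Counter


def unsuccessful_terms(search_log):
    """Rank search terms that returned no results, most frequent first."""
    misses = Counter(
        term.strip().lower()
        for term, result_count in search_log
        if result_count == 0
    )
    return misses.most_common()


# Hypothetical log entries: (search term, number of results returned).
search_log = [
    ("Econometrics", 0),
    ("law reports", 12),
    ("econometrics ", 0),
    ("patents", 0),
]
ranked = unsuccessful_terms(search_log)
```

Terms near the top of this ranking would be the first candidates to check against the collection.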
3. Use of specific pages, including Help and Customisation pages
- The total and average number of times that users of each type and
user group have accessed each of the pages within the PIE.
- The total and average number of times that particular help topics are
accessed.
4. Use of discussion areas (if this feature were to be implemented)
- The total and average number of times that users of each type and
group have used the discussion areas.
5. Use of an integrated EEDD (if this feature were to be integrated within
the PIE)
- The total and average number of times that users of each type and
group have used the EEDD.
Interface description
The administrator will first authenticate themselves. All possible
queries (detailed above) will then be listed. The administrator will be able
to select which of these pre-defined queries are to be run and how frequently,
and to generate either a complete report for a specified period (including
data from all possible queries) or a report containing the data from selected
queries. The administrator will also be able to specify how often a
particular report should be run. It will be possible for the administrator to
print out reports, to save reports and to download data from reports into an
Excel spreadsheet.
Processing a log file is likely to be a large, resource-intensive task. In
Decomate, we took a month's worth of log file, stored it separately, and
processed the information to be put into a database. This process took several
hours for each log file, though Decomate is much more heavily used than the
PIE. Accessing more than a month's worth of data from Decomate's database
proved to be impossible for Excel for memory reasons. We probably need to do
something along the lines of storing each query's results when it is run and
then using Excel to combine them; this would be much quicker once the log file
for a particular period has been processed.
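One possible shape for this approach is sketched below: each query's results for a period are written to a small CSV file, and the stored files are later merged into one table that Excel could open. The file naming scheme, the function names and the sample figures are hypothetical; the point is only that combining stored results never touches the raw log again.

```python
import csv
import tempfile
from pathlib import Path


def save_result(directory, query, period, rows):
    """Store one query's results for one period as a small CSV file."""
    path = Path(directory) / f"{query}_{period}.csv"
    with path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return path


def combine_results(directory, query):
    """Merge every stored period for a query into one list of rows."""
    combined = []
    for path in sorted(Path(directory).glob(f"{query}_*.csv")):
        period = path.stem[len(query) + 1:]  # text after "<query>_"
        with path.open(newline="") as handle:
            combined.extend([period, *row] for row in csv.reader(handle))
    return combined


# Hypothetical results for two monthly periods, kept in a temporary store.
with tempfile.TemporaryDirectory() as store:
    save_result(store, "logins", "2001-01", [("student", 40), ("academic", 5)])
    save_result(store, "logins", "2001-02", [("student", 55)])
    combined = combine_results(store, "logins")
```

Values read back from CSV are strings, so any arithmetic on the combined rows would need an explicit conversion first.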