PIE Current Awareness Services - Technical Possibilities
Technical Requirements
The purpose of this particular aspect of the PIE's current awareness service
(CAS) is to allow users to monitor web sites that they themselves have input
into the PIE. This is separate from CAS functions that might be implemented in
the PIE which are concerned with items in the Resource Database or with saved
searches of other resources. (These functions have to some extent been
implemented already.)
Requirement List
- (Necessary) The communication between monitoring software and end user
needs to be mediated by the PIE software.
- (Necessary) The monitoring software needs to be able to work with an
arbitrary number of URLs (for an external service, it makes sense to just
register a single user for the PIE).
- (Useful) The ability to compare the current HTML of a page against a
previously downloaded copy rather than just relying on the page modification
information available via HTTP. (This makes the service much more effective,
even though it places greater demands on it.)
- (Useful) The ability to filter out particular tags so that modifications
involving these tags can be ignored.
- (Useful) The ability to use at least some authentication information to
access pages.
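The two comparison requirements above (checking against a saved copy, and
ignoring particular tags) could be combined along the following lines. This is
only a minimal sketch in Python rather than a description of any existing PIE
code: the names TagFilter, fingerprint and has_changed are hypothetical, and it
assumes that "filtering out" a tag means dropping the tag and its attributes
while keeping the text inside it.

```python
import hashlib
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Rebuild a page's markup, leaving out the tags named in `ignored`."""

    def __init__(self, ignored):
        super().__init__(convert_charrefs=True)
        self.ignored = {name.lower() for name in ignored}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.ignored:
            rendered = "".join(f' {key}="{value}"' for key, value in attrs)
            self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag not in self.ignored:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always kept, even when it sits inside an ignored tag.
        self.parts.append(data)


def fingerprint(html, ignored_tags=()):
    """Hash of the page with ignored tags removed and whitespace collapsed."""
    parser = TagFilter(ignored_tags)
    parser.feed(html)
    parser.close()
    normalised = " ".join("".join(parser.parts).split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(saved_html, current_html, ignored_tags=()):
    """True if the pages differ once the ignored tags are discounted."""
    return fingerprint(saved_html, ignored_tags) != fingerprint(
        current_html, ignored_tags
    )
```

Storing only the fingerprint would be enough to detect a change; keeping the
full saved copy is needed only if the alert is to show what changed.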
Problems With Monitoring
- Web servers should return information about a requested page saying when it
was last modified; this is what you see in Netscape's Page Info. They should
also accept requests which are conditional on the page having been modified
since a particular date/time (see the HTTP RFC, sections 14.25 and 14.29). The
most naive content checkers will just ask for the page only if it has been
modified since the date on which the site was last visited/checked; this is how
Netscape's update bookmarks feature works. However, some web servers still
don't do this correctly and return unknown for the date, which defeats this
approach. This problem can be overcome by a monitor which compares the current
HTML to a saved copy.
- Many websites are poorly designed for current awareness, encouraging users
to bookmark a front page which never changes. In this case, current awareness
monitoring of the site will only work if it has a recent updates list somewhere,
and this is the page that is monitored. Yahoo is an example of this (updates
are listed at What's New), and the same is true of many sites which act as
gateways. There is no way for monitoring software itself to solve this problem.
- Where the user needs to authenticate to a Website, the monitor will not be
able to keep track of the content without information about access. This is more
complicated than just knowing the username and password to use to gain access,
since tokens can be contained in cookies, or submitted via Web forms rather than
through the use of the HTTP authentication mechanism. In the first case, the
monitor will not receive the same information as the user, and in the second,
complex configuration is likely to be needed. (This is a problem which will be
looked at in more detail when considering seamless access to resources via the
PIE in general.)
- When a page contains scripted information - database lookups, complex
Java, Flash animations, etc - the content can change without the page
changing. (This is a similar issue to the caching problems in the PIE.)
Then there is no way that current awareness software can pick this up without
storing a copy of the page as rendered by a browser, and this just isn't
possible (as, for example, scripts might render things differently depending on
the browser type or user location). A lot of pages still have information
contained in graphics without textual equivalents, and so the graphics can
change without the page apparently changing; this is becoming less common as
awareness of the needs of search engines and disabled users becomes more
widespread. These are general usability issues for many web surfers; see Jakob
Nielsen's Alertbox columns for more
information about the issues.
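The first problem above (unreliable modification dates) suggests a monitor
that tries the conditional request first and falls back to comparing against a
saved copy. The following Python sketch shows only the decision logic, with no
network access; the function names are hypothetical, and it assumes the caller
has already made the request and passes in the status code, a plain dict of
response headers keyed as "Last-Modified", and the body as bytes.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def digest(body):
    """Fingerprint of a page body (bytes), stored between checks."""
    return hashlib.sha256(body).hexdigest()


def conditional_headers(last_checked):
    """Request headers for a conditional GET.

    last_checked must be a timezone-aware datetime in UTC.
    """
    return {"If-Modified-Since": format_datetime(last_checked, usegmt=True)}


def page_changed(status, headers, body, last_checked, saved_digest):
    """Decide whether a page has changed since last_checked."""
    if status == 304:
        # The server honoured the condition: Not Modified.
        return False
    raw = headers.get("Last-Modified")
    if raw:
        try:
            modified_at = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            modified_at = None  # unparseable date, e.g. "unknown"
        if modified_at is not None:
            if modified_at.tzinfo is None:
                # Assumption: treat a date without a zone as UTC.
                modified_at = modified_at.replace(tzinfo=timezone.utc)
            return modified_at > last_checked
    # No usable date from the server: compare against the saved copy.
    return digest(body) != saved_digest


# Example: headers to send when the page was last checked on 1 March 2001.
example_headers = conditional_headers(
    datetime(2001, 3, 1, 12, 0, tzinfo=timezone.utc)
)
```

This does nothing for the other problems in the list (unchanging front pages,
authentication, scripted content), which are not a matter of HTTP headers.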
External Candidates
Google's directory provides a
useful list of candidates. Such a list is much more difficult to identify on
most search engines, which confuse current awareness software with content
filtering (to prevent access to pornography etc.) software.
External services
Some of the available services seem to monitor search engine results rather than
specific URLs; examples include TracerLock. Slightly more
advanced, Karnak allows monitoring of the pages
returned in a search engine's results as well as just checking to see if the
results list itself has changed.
NetMind's Mind-it service is aimed at Webmasters rather than end users, so
that they can put a box on their pages and allow users to monitor updates. Its
features include special tags to put round HTML which is to be ignored by the
monitoring software, and the ability to send customised messages to users as
part of their alert emails. It is therefore not a possibility for use through
the PIE.
The Informant is more like the kind of service we will want, with email
notification of changes, but it only allows five URLs per user. Since the PIE
would be registered as a single user, this rules it out.
EoMonitor, used by Daily Diffs, allows up to 200 URLs in a
free account, with as many as desired in a paid account. The free account
doesn't send email alerts. Daily Diffs is a database of the 40,000 most
requested URLs from EoMonitor, with listings of changes. EoMonitor essentially
monitors the HTML of a page for changes. The user sees a graphic with a
schematic representation of changes in the page's HTML. Because it works only
from the HTML, it would not correctly monitor a page connected to a database or
requiring user authentication, and it doesn't really filter out unimportant
changes.
Spyonit will notify the user about three kinds of change to a page: when the
page changes in any way, when a phrase is added, or when a phrase is removed.
It also allows the entry of a username and password for pages requiring
authentication.
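The three kinds of notification are simple to express. The following Python
sketch is not Spyonit's interface (which is not described here); the function
name classify_change and the event names are hypothetical, and the old and new
versions of the page are assumed to be available as strings.

```python
def classify_change(old_text, new_text, phrase=None):
    """Return the list of events that a change between two versions triggers."""
    events = []
    if old_text != new_text:
        events.append("changed")
    if phrase:
        was_present = phrase in old_text
        is_present = phrase in new_text
        if is_present and not was_present:
            events.append("phrase_added")
        elif was_present and not is_present:
            events.append("phrase_removed")
    return events
```

A phrase alert is always accompanied by a "changed" event, since adding or
removing a phrase necessarily alters the page.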
The feature that really makes Spyonit stand out is Spybuilder, which allows a
site like the PIE to customise exactly what is monitored about external sites,
including features where Perl-style regular expressions are used to filter out
information that is not to be monitored (such as background colour changes).
The format of the notification message can also be specified (short, normal,
detailed; ASCII text/HTML), which aids in the parsing of the message by PIE
software.
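As an illustration of this kind of regular expression filtering, the following
Python sketch removes bgcolor attributes before two versions of a page are
compared. It is not Spybuilder's syntax; the pattern list and function names
are hypothetical, and a pattern this simple would also match the same text
appearing outside a tag.

```python
import re

# Patterns for markup whose changes should not trigger an alert.
# Here: a bgcolor attribute with a double-quoted, single-quoted or bare value.
IGNORE_PATTERNS = [
    re.compile(r"""\s+bgcolor\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE),
]


def strip_ignored(html, patterns=IGNORE_PATTERNS):
    """Remove every match of the ignore patterns from the page source."""
    for pattern in patterns:
        html = pattern.sub("", html)
    return html


def differs_after_filtering(old_html, new_html, patterns=IGNORE_PATTERNS):
    """True if the pages still differ once ignored markup is removed."""
    return strip_ignored(old_html, patterns) != strip_ignored(new_html, patterns)
```

Further patterns can be appended to the list for other changes that are not of
interest, without altering the comparison itself.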
Locally Installable Software
CheckURL is a simple
script which checks the header information from the HTTP server, which should
contain the date of last modification for a page. This can be incorrect for a
wide variety of reasons (detailed above).
Web Secretary is more
sophisticated, checking a saved version of each page against the current one
(the alert messages it sends contain the full HTML of the new page with changes
highlighted) and having some scope for HTTP authentication. It is also written
in Perl, which will make it easier to integrate with the PIE.
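The comparison of a saved copy against the current page, with the changes
marked, can be illustrated with a short Python sketch using the standard
difflib module. This is not how Web Secretary itself is implemented (it is a
Perl script and highlights changes within the HTML); the function name
describe_changes is hypothetical, and the output here is a plain unified diff
suitable for the body of an alert message.

```python
import difflib


def describe_changes(saved_html, current_html):
    """Return a unified diff of the two versions, or "" if they are identical."""
    diff = difflib.unified_diff(
        saved_html.splitlines(),
        current_html.splitlines(),
        fromfile="saved copy",
        tofile="current page",
        lineterm="",
    )
    return "\n".join(diff)
```

Removed lines are prefixed with "-" and added lines with "+", so software
receiving the alert can pick out the changes without parsing HTML.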
Conclusion
In general, locally installed software is a more useful option for PIE
components than reliance on external products. This is because it is easier to
integrate it into existing and future PIE functionality (e.g. seamless access to
get round authentication problems) and because with external services we are
dependent on the format of the service interface not changing.
However, in this case, the possible candidates which consist of installed
software do not seem to be as mature as the external services which are
available, in particular that offered by Spyonit/Spybuilder. Currently,
Spyonit/Spybuilder seems to be the best option. In the long term, though,
customising Web Secretary is probably the most sensible way to add web
monitoring functionality to the PIE.