Cheshire II: Design and Evaluation of a Next-Generation
Online Catalog System: a digest
Ray R. Larson, Ralph Moon, Jerome McDonough, Lucy Kuntz, Paul
O'Leary, 1995
Introduction
This document gives an interesting perspective on the problems
that earlier generations of on-line catalogue systems have
experienced and the solutions that Cheshire II is hoping to
implement.
The principal failings lie in the area of topical searching
where users can experience both search failure and information
overload. The Cheshire II design, intended to address these
problems, is a client/server architecture whose search engine
uses the Z39.50 Information Retrieval Protocol. It is being
tested in a library environment, the UC Berkeley
Astronomy-Mathematics-Statistics Library, and the paper details
how that library is involved in the project's evaluation strategy.
The paper begins by highlighting the difficulties experienced
in topical searching in catalogues, citing studies which report
that 30-50% of subject searches fail completely, and gives the
chief reasons for this.
On the other hand, the high density of information in a given
topical area and the increase in items indexed by a given keyword
or subject heading are reasons for the occurrence of information
overload, i.e. unmanageable lists of search results.
Conclusions
Although the authors' conclusions might more naturally be placed
at the end of this summary, they appear at this point to
highlight the issues confronting developers and the solutions
that the authors feel are essential in their design.
The conclusions list the failings identified in first- and
second-generation on-line catalogue systems under three
categories:
- Search failure:
  - poor translation of user query terms against the catalogue's
    own vocabulary
  - poor response to search failure:
    - little or no help in providing alternate search formulation
    - failure to execute different search methods
- Failure to provide further material:
  - failure to lead the user to corresponding subject headings or
    class numbers of a broader range of related materials even
    after a successful free text search
  - failure to facilitate open-ended browsing or create
    pre-established links between database records
- Information overload:
  - failure to give descending order of probable relevance in
    search results
  - failure to provide an on-line thesaurus for subject focusing
    and topic/treatment discrimination
  - failure of bibliographic records that are retrieved to provide
    the user with sufficient information on which to judge those
    records' usefulness.
Summary of Cheshire II Design Features
Cheshire II is claimed to be a third-generation system which
makes an important advance over existing systems. Core design
features include SGML as the primary database format of the
search engine and a client/server architecture communicating via
Z39.50. Both Boolean and probabilistic "best match" ranked
searching are supported, as is browsing via automatically
produced hypertext links.
Cheshire II and SGML
The paper claims that SGML provides the flexibility to retrieve
highly structured text, as in MARC records, whilst also
permitting the retrieval of text in, say, journal articles. This
is because all text records in Cheshire II are stored as tagged
SGML text.
This permits great flexibility in generating search indexes as
well as considerable ease in searching full text articles. SGML
could also permit extensions such as browsing within one document
and the handling of different character sets to enhance records
in languages other than English. The authors also envisage the
possibility of incorporating other information in SGML records to
allow less directed searching than probabilistic and Boolean
searches do, for example, citation-chain searching via hypertext
linking of documents.
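To make the indexing idea concrete, the following is a minimal
Python sketch, not Cheshire II's actual code. It assumes records
are flat, non-nested tagged text with hypothetical tag names
(TITLE, SUBJECT) and builds one inverted index per tag. Real SGML
needs a DTD-aware parser, which the regular expression here does
not attempt to replace.

```python
import re
from collections import defaultdict

# Matches simple, non-nested <TAG>content</TAG> pairs only (an assumption).
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def build_field_indexes(records):
    """Build one inverted index per tag: tag -> word -> set of record ids."""
    indexes = defaultdict(lambda: defaultdict(set))
    for record_id, text in records.items():
        for tag, content in TAGGED.findall(text):
            for word in content.lower().split():
                indexes[tag.lower()][word].add(record_id)
    return indexes


# Hypothetical records; the tag names are illustrative, not from the paper.
records = {
    "r1": "<TITLE>Stellar Dynamics</TITLE><SUBJECT>Astronomy</SUBJECT>",
    "r2": "<TITLE>Galactic Dynamics</TITLE>",
}
indexes = build_field_indexes(records)
print(sorted(indexes["title"]["dynamics"]))
```

Because every element is tagged, adding an index for a new field
only requires naming its tag; the record format itself does not
change.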
The Search Engine
The paper also indicates that the Cheshire II search engine's
advanced retrieval techniques have overcome the problem of
typically limited topical information in MARC records. This has
been achieved by automatically grouping terms derived from the
same area of classification. This version of Cheshire supports
various search and browsing capabilities, the storage and
retrieval of result sets, and free-text queries. Another
important feature is the conversion of the user's search terms to
the vocabulary of the database being searched, including support
for field-specific stopword lists and query-to-key conversion
functions as well as stemming algorithms.
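The conversion of query terms into index keys might look roughly
like the Python sketch below. The stopword lists, the field names
and the toy suffix-stripping stemmer are all assumptions made for
illustration; the digest does not say which lists or stemming
algorithms Cheshire II uses.

```python
# Hypothetical field-specific stopword lists (not taken from the paper).
FIELD_STOPWORDS = {
    "title": {"a", "an", "the", "of", "and"},
    "subject": {"the", "of", "and", "in"},
}


def naive_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def query_to_keys(query, field):
    """Convert a free-text query into index keys for one field."""
    stopwords = FIELD_STOPWORDS.get(field, set())
    tokens = [t.strip(".,;:!?\"'()").lower() for t in query.split()]
    return [naive_stem(t) for t in tokens if t and t not in stopwords]


print(query_to_keys("The Orbits of Binary Stars", "title"))
# prints ['orbit', 'binary', 'star']
```

Keeping the stopword list per field means a word that is noise in
a title can still be a meaningful key in another field.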
The search engine prototype has adopted a two-stage search
method where probabilistic "best match" techniques match the
user's initial topical query against a set of classification
clusters which are retrieved in descending order of probable
relevance. This aids the user's subject focusing. Also supported
is the direct probabilistic searching of any indexed field of an
SGML record. The probabilistic ranking method is based on staged
logistic regression algorithms developed by Berkeley
researchers.
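As a rough illustration of ranking by logistic regression, the
Python sketch below scores classification clusters by an
estimated probability of relevance and sorts them in descending
order. The feature names, the coefficients and the cluster labels
are hypothetical, and the sketch uses a single regression rather
than the staged procedure the paper refers to.

```python
import math

# Hypothetical coefficients; the digest does not give fitted values.
INTERCEPT = -3.5
WEIGHTS = {"matched_terms": 1.2, "mean_query_tf": 0.4, "mean_doc_tf": 0.6}


def relevance_probability(features):
    """Map query/cluster match features to a probability via the logistic function."""
    log_odds = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank_clusters(cluster_features):
    """Return (cluster_id, probability) pairs in descending order of probability."""
    scored = [(cid, relevance_probability(f)) for cid, f in cluster_features.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative clusters keyed by a made-up classification label.
clusters = {
    "QA": {"matched_terms": 1, "mean_query_tf": 1.0, "mean_doc_tf": 0.5},
    "QB": {"matched_terms": 3, "mean_query_tf": 1.0, "mean_doc_tf": 2.0},
}
for cluster_id, probability in rank_clusters(clusters):
    print(cluster_id, round(probability, 3))
```

In the two-stage method described above, the user would pick from
the top-ranked clusters and the second search would then run
against records in the chosen area.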
Searching Methods
The paper goes on to explain the value of combining Boolean and
probabilistic searching, which are treated as two parallel
logical search engines despite being implemented in one
algorithm. Since no single retrieval algorithm is markedly better
than the others, the two methods working in tandem exploit each
other's strengths. In short, combining them gives the system more
evidence, which improves the accuracy of probabilistic ranking.
Merging Results
Whilst research suggests that users are unimpressed by the
differing nature of results from Boolean and probabilistic
searches, it is important to remember that the two sets are often
unrelated; this is a major consideration when merging the result
sets following a combined search. The merging algorithm contains
coefficients which assign relative values to the differing sets.
In general terms the order of precedence is: Boolean known-item
searches first, then title or abstract keyword searches or
probabilistic enquiries, and lastly full-text keyword searches.
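A coefficient-weighted merge of this kind could be sketched in
Python as follows. The three source names and their weights are
assumptions chosen only to respect the stated order of
precedence; the paper's actual coefficients are not given in this
digest, and summing weighted scores is one plausible combination
rule among several.

```python
# Hypothetical coefficients, ordered to match the precedence described above.
SOURCE_WEIGHTS = {
    "boolean_known_item": 1.0,
    "keyword_or_probabilistic": 0.6,
    "full_text_keyword": 0.3,
}


def merge_result_sets(result_sets):
    """Merge per-source results ({record_id: score in [0, 1]}) into one ranked list."""
    merged = {}
    for source, results in result_sets.items():
        weight = SOURCE_WEIGHTS[source]
        for record_id, score in results.items():
            # A record found by several searches accumulates weighted evidence.
            merged[record_id] = merged.get(record_id, 0.0) + weight * score
    return sorted(merged.items(), key=lambda pair: pair[1], reverse=True)


result_sets = {
    "boolean_known_item": {"r1": 1.0},
    "keyword_or_probabilistic": {"r1": 0.5, "r2": 0.9},
    "full_text_keyword": {"r3": 1.0},
}
print([record_id for record_id, _ in merge_result_sets(result_sets)])
# prints ['r1', 'r2', 'r3']
```

Here a record matched by both the Boolean and the probabilistic
search rises to the top, which reflects the idea that agreement
between the two methods is additional evidence of relevance.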
Browsing and Relevance Feedback
The system also supports browsing via dynamically generated links
between database records. This "nearest neighbour" search means
that the user selects a citation or document he or she has seen,
which is then fed back into the probabilistic search algorithm.
The latter identifies the records in the database closest to the
one specified. The GUI also supports dynamically generated
hypertext links from elements in a given record (e.g. subject
headings) to other records through automatic Boolean query
formulation.
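The "nearest neighbour" idea can be illustrated with the short
Python sketch below. It substitutes plain cosine similarity over
word counts for Cheshire II's probabilistic algorithm, which this
digest does not specify, so it shows only the feedback pattern:
a selected record becomes the query and the closest other records
are returned. The record texts are made up.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a record as a bag of lower-cased words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbours(selected_id, records, k=2):
    """Use the selected record as the query and return the k closest other records."""
    seed = term_vector(records[selected_id])
    scored = [
        (record_id, cosine(seed, term_vector(text)))
        for record_id, text in records.items()
        if record_id != selected_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


records = {
    "a": "binary star orbits",
    "b": "orbits of binary star systems",
    "c": "statistical regression methods",
}
print(nearest_neighbours("a", records, k=1)[0][0])
# prints b
```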
The Client Interface
Whilst addressing the usual difficulties encountered in graphical
user interface development, this paper also examines those
generated by the need to develop a client capable of exploiting
the full range of functionality across a variety of servers in a
distributed network. It suggests a set of requirements which
include:
- supporting the user in the correct initial selection of an
appropriate resource
- providing a variety of mechanisms for specifying a query
across a variety of search engines
- providing a variety of display formats for the different
types of documents retrieved
- permitting the user the ability to print, save and forward
retrieved record sets
- permitting the user to respond to resource control
messages
- ensuring sensible error message handling
The authors promote two guiding principles in the GUI
development: keep the client's functionality highly visible
irrespective of the server in operation and keep change in
interface to a minimum when moving from server to server. They
illustrate these requirements and principles with a range of
screen dumps of GUIs.
The Project Evaluation Plan
The Project Evaluation Plan is based on two methods of evidence
gathering: a quantitative approach through transaction monitoring
and a qualitative approach through on-line questionnaires for
local and remote users. It will investigate how the system will
be used across differing social groups. It will consider specific
questions such as whether trends in searching will shift whilst
determining the degree of user satisfaction over the range of
types of search. It will seek to determine whether the design of
the GUI has an influence on searching trends and user
satisfaction as well as users' accuracy of search.