The IMesh Toolkit


Cheshire II: Design and Evaluation of a Next-Generation Online Catalog System: a digest




Ray R. Larson, Ralph Moon, Jerome McDonough, Lucy Kuntz, Paul O'Leary, 1995 Original Article
Introduction
The paper gives a useful perspective on the problems experienced by earlier generations of on-line catalogue systems and on the solutions that Cheshire II aims to implement.

The principal failings lie in the area of topical searching, where users can experience both search failure and information overload. The Cheshire II design intended to address these problems is a client/server architecture whose search engine employs the Z39.50 Information Retrieval Protocol. It is being tested in a library environment, the UC Berkeley Astronomy-Mathematics-Statistics Library, and the paper details how that library is involved in the project's evaluation strategy.

The paper begins by highlighting the difficulties experienced in topical searching in catalogues, citing studies which state that 30-50% of subject searches fail completely, and gives the chief reasons for this. On the other hand, the high density of information in a given topical area and the increase in the number of items indexed by a given keyword or subject heading lead to information overload, i.e. unmanageable lists of search results.

Conclusions
Whilst the authors' conclusions may seem more sensibly mentioned at the end of this summary, they are placed at this point to highlight the issues confronting developers and the solutions that the authors feel are essential in their design.

The conclusions list the failings identified in first- and second-generation on-line catalogue systems under three categories:
  1. Search failure:
    poor translation of user query terms against the catalogue's own vocabulary
    poor response to search failure:
      little or no help in providing alternate search formulation
      failure to execute different search methods
  2. Failure to provide further material:
    failure to lead the user to corresponding subject headings or class numbers of a broader range of related materials even after a successful free text search
    failure to facilitate open-ended browsing or create pre-established links between database records
  3. Information overload:
    failure to give descending order of probable relevance in search results
    failure to provide an on-line thesaurus for subject focusing and topic/treatment discrimination
    failure of retrieved bibliographic records to provide the user with sufficient information on which to judge their usefulness.


Summary of Cheshire II Design Features
Cheshire II is claimed to be a third-generation system that makes an important advance over existing systems. Core design features include SGML as the primary database format of the search engine and a client/server application communicating via Z39.50. Both Boolean and probabilistic "best match" ranked searching are supported, as is browsing via automatically produced hypertext links.

Cheshire II and SGML
The paper claims that SGML provides the flexibility to retrieve highly structured text, as in MARC records, whilst also permitting the retrieval of text in, say, journal articles. This is because all text records in Cheshire II are stored as tagged SGML text, which permits great flexibility in generating search indexes as well as considerable ease in searching full text articles. SGML could also permit extensions such as browsing within one document and the handling of different character sets to enhance records in languages other than English. The authors also envisage incorporating other information in SGML records to allow less directed searching than probabilistic and Boolean searches do, for example citation-chain searching via hypertext linking of documents.
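As a rough illustration of why tagged text makes index generation flexible, the sketch below builds one index per element name from a tagged record. The record, its element names and the helper functions are hypothetical, and a regular expression stands in for a real SGML parser; it is not Cheshire II's own indexing code.

```python
import re
from collections import defaultdict

# Hypothetical tagged record; the element names are illustrative only.
RECORD = (
    "<record><title>Galactic dynamics</title>"
    "<subject>Astronomy</subject>"
    "<subject>Stellar dynamics</subject></record>"
)

# Matches leaf elements only (elements whose content contains no further tags).
TAG_RE = re.compile(r"<(\w+)>([^<]*)</\1>")

def extract_fields(record):
    """Return (tag, text) pairs for the leaf elements of a tagged record."""
    return [(tag.lower(), text.strip()) for tag, text in TAG_RE.findall(record)]

def build_index(records):
    """Map (tag, word) to the set of record ids with that word in that element."""
    index = defaultdict(set)
    for rec_id, record in records.items():
        for tag, text in extract_fields(record):
            for word in text.lower().split():
                index[(tag, word)].add(rec_id)
    return index
```

Because every index key carries the element name, a title index and a subject index fall out of the same pass, and a new element type needs no change to the indexing code.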

The Search Engine
The paper also indicates that the advanced retrieval techniques of the Cheshire II search engine have overcome the problem of typically limited topical information in MARC records. This has been achieved by automatically grouping terms derived from the same area of classification. This version of Cheshire supports various search and browsing capabilities, the storage and retrieval of result sets, and free text queries. Another important feature is the conversion of the user's search terms to the vocabulary of the database being searched, including support for field-specific stopword lists, query-to-key conversion functions and stemming algorithms.
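The conversion of search terms described above can be pictured with a small sketch. The stopword lists, the suffix-stripping rule and the function names are all assumptions made for illustration; a production system would use a proper stemming algorithm rather than this crude stand-in.

```python
# Hypothetical field-specific stopword lists.
STOPWORDS = {
    "title": {"the", "a", "an", "of", "and"},
    "author": set(),  # no stopwords: "The" may be part of a name
}

def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def query_to_keys(query, field):
    """Convert a free text query into index keys for the given field."""
    stop = STOPWORDS.get(field, set())
    words = [w.strip(".,;:!?\"'()").lower() for w in query.split()]
    return [crude_stem(w) for w in words if w and w not in stop]
```

The same query text yields different keys depending on the field searched, which is the point of keeping the stopword lists field-specific.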

The search engine prototype has adopted a two-stage search method in which probabilistic "best match" techniques match the user's initial topical query against a set of classification clusters, which are retrieved in descending order of probable relevance. This aids the user's subject focusing. Also supported is direct probabilistic searching of any indexed field of an SGML record. The probabilistic ranking method is based on the staged logistic regression algorithms developed by Berkeley researchers.
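The two-stage idea can be sketched as follows. This uses plain logistic regression over two simple match features, not the staged variant the paper refers to; the coefficients, features, cluster data and record data are hypothetical, and the second stage simply ranks the records of the top cluster, whereas in the real system the user would choose among the clusters.

```python
import math

# Hypothetical coefficients; real values would be fitted to relevance judgements.
INTERCEPT = -3.0
W_MATCHED = 1.5   # weight for the number of query terms matched
W_FRACTION = 2.0  # weight for the fraction of the query matched

def relevance_probability(query_terms, item_terms):
    """Logistic-regression estimate of the probability that an item is relevant."""
    query = set(query_terms)
    matched = len(query & set(item_terms))
    fraction = matched / len(query) if query else 0.0
    log_odds = INTERCEPT + W_MATCHED * matched + W_FRACTION * fraction
    return 1.0 / (1.0 + math.exp(-log_odds))

def rank(query_terms, candidates):
    """Return candidate names in descending order of estimated relevance."""
    return sorted(
        candidates,
        key=lambda name: relevance_probability(query_terms, candidates[name]),
        reverse=True,
    )

# Hypothetical classification clusters and the records filed under each.
CLUSTERS = {
    "QA (mathematics)": ["algebra", "geometry", "orbit"],
    "QB (astronomy)": ["star", "galaxy", "orbit", "telescope"],
}
RECORDS = {
    "QA (mathematics)": {"rec-3": ["algebra", "orbit"]},
    "QB (astronomy)": {
        "rec-1": ["galaxy", "orbit", "dynamics"],
        "rec-2": ["telescope", "design"],
    },
}

def two_stage_search(query_terms):
    """Stage 1 ranks clusters; stage 2 ranks records within the top cluster."""
    chosen = rank(query_terms, CLUSTERS)[0]
    return chosen, rank(query_terms, RECORDS[chosen])
```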

Searching Methods
The paper goes on to explain the value of combining Boolean and probabilistic searching, which operate as two parallel logical search engines despite being implemented in one algorithm. Since no single retrieval algorithm is markedly better than another, the two methods working in tandem exploit each other's strengths. In short, the combination provides more evidence, which improves the accuracy of probabilistic ranking.

Merging Results
Whilst research says users are unimpressed by the differing nature of results based on Boolean and probabilistic searches, it is important to remember that the two sets of results are often unrelated; this is a major consideration when merging the result sets following a combined search. The merging algorithm contains coefficients which assign relative values to the differing sets. In general terms the order of precedence is: Boolean known-item searches first, then title or abstract keyword searches or probabilistic enquiries, and lastly full text keyword searches.
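One simple way to picture such a merge is a weighted sum of per-source scores, as in the sketch below. The source names, the coefficient values and the additive combination are assumptions chosen only to mirror the order of precedence described above; the paper's actual merging algorithm is not reproduced here.

```python
# Hypothetical coefficients reflecting the order of precedence described above.
SOURCE_WEIGHTS = {
    "boolean_known_item": 3.0,
    "title_keyword": 2.0,
    "probabilistic": 2.0,
    "fulltext_keyword": 1.0,
}

def merge_results(result_sets):
    """Merge {source: {record_id: score in [0, 1]}} into one ranked list of ids."""
    combined = {}
    for source, results in result_sets.items():
        weight = SOURCE_WEIGHTS[source]
        for rec_id, score in results.items():
            combined[rec_id] = combined.get(rec_id, 0.0) + weight * score
    # Highest combined score first; ties broken by record id for stable output.
    return sorted(combined, key=lambda rec_id: (-combined[rec_id], rec_id))
```

A record found by more than one method accumulates evidence from each, so it tends to rise above records found by a single low-precedence method.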

Browsing and Relevance Feedback
The system also supports browsing via dynamically generated links between database records. In this "nearest neighbour" search the user selects a citation or document he or she has seen, which is fed back into the probabilistic search algorithm. The latter then identifies the records in the database closest to the one specified. The GUI also supports dynamically generated hypertext links from elements in a given record (e.g. subject headings) to other records through automatic Boolean query formulation.
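The nearest-neighbour idea can be illustrated with a minimal sketch. Here cosine similarity over raw term counts is swapped in as the closeness measure, which is an assumption made for brevity; Cheshire II feeds the selected record back into its probabilistic search algorithm instead. The function names and sample records are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_neighbours(seed_id, records, k=2):
    """Return the ids of the k records most similar to the selected record."""
    vectors = {rid: Counter(text.lower().split()) for rid, text in records.items()}
    seed = vectors[seed_id]
    scored = [(cosine(seed, vec), rid) for rid, vec in vectors.items() if rid != seed_id]
    # Most similar first; ties broken by record id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [rid for _, rid in scored[:k]]
```

For example, with records "a" ("galaxy dynamics orbit"), "b" ("galaxy orbit theory") and "c" ("linear algebra theory"), selecting "a" brings "b" back first because the two share two of their three terms.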

The Client Interface
Whilst addressing the usual difficulties encountered in graphical user interface development, the paper also examines those generated by the need to develop a client capable of exploiting the full range of functionality across a variety of servers in a distributed network. It suggests a set of requirements which include:
  1. supporting the user in the correct initial selection of an appropriate resource
  2. providing a variety of mechanisms for specifying a query across a variety of search engines
  3. providing a variety of display formats for the different types of documents retrieved
  4. permitting the user the ability to print, save and forward retrieved record sets
  5. permitting the user to respond to resource control messages
  6. ensuring sensible error message handling
The authors promote two guiding principles in GUI development: keep the client's functionality highly visible irrespective of the server in operation, and keep changes in the interface to a minimum when moving from server to server. They illustrate these requirements and principles with a range of screen dumps of GUIs.

The Project Evaluation Plan
The Project Evaluation Plan is based on two methods of evidence gathering: a quantitative approach through transaction monitoring and a qualitative approach through on-line questionnaires for local and remote users. It will investigate how the system is used across differing social groups. It will consider specific questions such as whether trends in searching shift, whilst determining the degree of user satisfaction over the range of search types. It will also seek to determine whether the design of the GUI influences searching trends, user satisfaction and the accuracy of users' searches.