
The IMesh Toolkit


WHOIS++ Search Protocol

Overall Purpose

Based on Whois, a rather restricted white pages directory service, Whois++ is a lightweight extension offering cross-searching over a distributed network of databases including multiple gateways. It is designed to function as a simple lookup service but with a degree of flexibility that avoids imposing constraints upon developers.

In its evolution beyond the original Whois, this protocol has acquired more advanced search processes enhanced by the addition of global and local constraints and the use of Boolean operators. Further options include languages other than English, additional character sets and, most importantly, the use of structured data to make searching more effective. The structured data is in the form of an information template which is central to the Whois++ operation.

Whereas the relatively restricted information in the Whois system was unstructured, a record in this system is set out as an information set, i.e. a structured template of organised data elements (or attribute-value pairs). The purpose of Whois++ is to render searching relatively simple and a major contribution towards this is made through the use of different types of template in order to categorise the data set each contains. This means that searches can be effected more quickly if the type of template is included in the search terms, thereby eliminating from the search process all templates and therefore records which do not conform to the chosen template type. Each record is identifiable by a combination of two handles or unique identifiers, that of the server which holds the record and the identifier of the record itself.

The IAFA template, which has gained recognition for its organisation of data, is compatible with the Whois++ search protocol and was first implemented by the ALIWEB search system. [1]

Without some form of forward knowledge the entire network of databases could, theoretically, be interrogated before the desired record is located. The purpose of forward knowledge is to reduce the number of interrogations that have to be made by providing the searcher with a notion of where best to look for the information.

The method by which Whois++ achieves forward knowledge is by the generation of centroids, which are indexed across the system. A centroid is a set of word lists, one for each attribute existing in the database. Each list holds all the unique words occurring in the values for that attribute. [2] How a centroid is formed from a set of records is best illustrated as follows:

Record 1                  Record 2                  Record 3
Template: Person          Template: Person          Template: Domain
Last_Name: Brown          Last_Name: Brown          Domain_Name: evans.edu
First_Name: George        First_Name: James         Contact_Name: Brian Evans
Transport_Type: Company   Transport_Type: Company
                Car                       Rail Pass

The centroid for this simple example is formed from all the attributes of each template type, with any duplicated values of those attributes eliminated. The resultant centroid would therefore look thus:
Centroid of Records 1-3:

Template: Person
Last_Name: Brown
First_Name: George
            James
Transport_Type: Company
                Car
                Rail Pass

Template:Domain
Domain_Name:evans.edu
Contact_Name: Brian Evans

Note that the centroid eliminates those values which are duplicate, e.g. "Brown" and "Company".
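
The construction above can be sketched in a few lines of Python. This is a minimal illustration only, not an implementation of the protocol: the record layout (a dictionary per record, with each attribute holding a list of values) and the function name build_centroid are assumptions made for the example. Like the illustration above, it treats each whole value as one entry rather than splitting values into individual words.

```python
from collections import defaultdict

# Hypothetical in-memory copies of Records 1-3 from the example above.
RECORDS = [
    {"Template": "Person", "Last_Name": ["Brown"], "First_Name": ["George"],
     "Transport_Type": ["Company", "Car"]},
    {"Template": "Person", "Last_Name": ["Brown"], "First_Name": ["James"],
     "Transport_Type": ["Company", "Rail Pass"]},
    {"Template": "Domain", "Domain_Name": ["evans.edu"],
     "Contact_Name": ["Brian Evans"]},
]


def build_centroid(records):
    """Return {template: {attribute: [unique values, in first-seen order]}}."""
    centroid = defaultdict(lambda: defaultdict(list))
    for record in records:
        template = record["Template"]
        for attribute, values in record.items():
            if attribute == "Template":
                continue
            for value in values:
                # Duplicate values (e.g. "Brown", "Company") are stored once.
                if value not in centroid[template][attribute]:
                    centroid[template][attribute].append(value)
    # Convert to plain dicts so the result compares and prints cleanly.
    return {t: dict(attrs) for t, attrs in centroid.items()}


centroid = build_centroid(RECORDS)
```

Here centroid["Person"]["Last_Name"] holds "Brown" once, even though the value occurs in two records.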

The distributed network comprises two basic kinds of server: base servers, which hold only filled templates [3], and indexing servers. Whilst both types of server generate forward knowledge, it is the indexing servers which collect, retain and pass on forward knowledge, in the form of centroids, about the servers thought to be holding the desired records. Such a routing is called a referral. It is possible for both a base server and an indexing server to exist in one physical server.
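
How an indexing server might turn its stored centroids into referrals can be sketched as follows. This is a simplified illustration under stated assumptions: the server handles serverA and serverB, the CENTROIDS table and the referrals function are all hypothetical, and a real indexing server would answer queries expressed in the protocol rather than a Python call.

```python
# Hypothetical centroids held by an indexing server, keyed by server handle.
CENTROIDS = {
    "serverA": {"Person": {"Last_Name": {"Brown"},
                           "First_Name": {"George", "James"}}},
    "serverB": {"Domain": {"Domain_Name": {"evans.edu"}}},
}


def referrals(centroids, template, attribute, value):
    """Return the handles of servers whose centroid lists the value."""
    hits = []
    for handle, centroid in centroids.items():
        # A server is worth querying only if its centroid contains the value
        # under the requested template type and attribute.
        if value in centroid.get(template, {}).get(attribute, set()):
            hits.append(handle)
    return sorted(hits)


brown_servers = referrals(CENTROIDS, "Person", "Last_Name", "Brown")
smith_servers = referrals(CENTROIDS, "Person", "Last_Name", "Smith")
```

A search for a value absent from every centroid yields no referrals, so no server need be interrogated for it.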

How the passing of centroids assists the search process is described in the next section.

Brief Overview of Functionality

This section divides into the process by which indexing produces referrals to assist speedier searching and the functionality of Whois++ in terms of a user.

Indexing servers obtain centroids from other servers at regular intervals by a process called polling. When a server is polled by an indexing server, it determines whether it has any changes in its centroids to report. It can even inform other polling indexers via the DATA-CHANGED command, which allows 'interactive' updating of critical information. [4]
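
The polling exchange can be pictured with the sketch below. It is an assumption-laden model, not the mechanism defined in [4]: the class names, the poll methods and in particular the integer version counter used to detect changes are all invented for the illustration.

```python
class PolledServer:
    """A server that can be polled for its centroid (hypothetical model)."""

    def __init__(self, handle, centroid):
        self.handle = handle
        self.centroid = centroid
        self.version = 1  # hypothetical change counter, not part of the protocol

    def update(self, centroid):
        self.centroid = centroid
        self.version += 1

    def poll(self, last_seen_version):
        """Return (version, centroid) if changed since the last poll, else None."""
        if last_seen_version == self.version:
            return None
        return self.version, self.centroid


class IndexingServer:
    """Keeps the most recent centroid received from each polled server."""

    def __init__(self):
        self.centroids = {}
        self.versions = {}

    def poll(self, server):
        reply = server.poll(self.versions.get(server.handle))
        if reply is None:
            return False  # nothing new to report
        self.versions[server.handle], self.centroids[server.handle] = reply
        return True


base = PolledServer("serverA", {"Person": {"Last_Name": ["Brown"]}})
index = IndexingServer()
first = index.poll(base)    # centroid transferred
second = index.poll(base)   # unchanged, nothing transferred
base.update({"Person": {"Last_Name": ["Brown", "Evans"]}})
third = index.poll(base)    # changed, centroid transferred again
```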

If the client initially fails to locate a record, it can adopt the normal approach of automatically expanding the search across the mesh. It is for the client to keep track of the servers queried in this process with a loop detection algorithm. [5] However an alternative approach exists in the form of a special server termed the Directory of Servers. This polls all other indexing servers for common information and so shortens the referral process.

A typical transaction might occur as follows: The client opens a connection to the server, sends a query, receives a reply, most likely a referral to another server and the connection closes. [5]
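
The client's bookkeeping while following referrals can be sketched as below. This is not the mesh traversal algorithm of [5], only a minimal illustration of loop detection with a set of visited servers. The MESH table, the server names and the query function are hypothetical stand-ins for real network transactions.

```python
from collections import deque

# Hypothetical mesh: each server replies with records and/or referrals.
MESH = {
    "index1": {"records": [], "referrals": ["index2", "base1"]},
    "index2": {"records": [], "referrals": ["index1", "base2"]},  # refers back
    "base1": {"records": ["base1:rec42"], "referrals": []},
    "base2": {"records": [], "referrals": []},
}


def query(server, term):
    """Stand-in for one open / query / reply / close transaction."""
    return MESH[server]


def search_mesh(start, term):
    """Follow referrals from start, querying each server at most once."""
    visited = set()
    pending = deque([start])
    found = []
    while pending:
        server = pending.popleft()
        if server in visited:
            continue  # loop detection: this server has already been queried
        visited.add(server)
        reply = query(server, term)
        found.extend(reply["records"])
        pending.extend(reply["referrals"])
    return found, visited


found, visited = search_mesh("index1", "Brown")
```

Although index2 refers the client back to index1, each server is queried only once and the search terminates.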

For the user there is basically one search command, which may be modified by constraints attached to it. Any search command may contain more than one search term, each of which may itself be further locally constrained (e.g. by the use of Boolean operators against a value, such as NOT). The core set of constraints is SEARCH, to determine the search type, FORMAT, to determine the output format, and MAXHITS, to set the maximum number of matches to be returned. [3]
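
The effect of these constraints can be modelled in Python as follows. The sketch does not reproduce the Whois++ query syntax of [3]; the function names, the tuple layout of a search term and the two search types shown are assumptions chosen for the illustration. Terms are simply ANDed together, and a term may be negated to mimic NOT.

```python
RECORDS = [
    {"Template": "Person", "Last_Name": "Brown", "First_Name": "George"},
    {"Template": "Person", "Last_Name": "Brown", "First_Name": "James"},
    {"Template": "Domain", "Domain_Name": "evans.edu"},
]


def matches(record, attribute, value, search="exact"):
    """Apply one search term; 'search' plays the role of the SEARCH constraint."""
    field = record.get(attribute, "")
    if search == "exact":
        return field.lower() == value.lower()
    if search == "substring":
        return value.lower() in field.lower()
    raise ValueError("unsupported search type: " + search)


def run_search(records, terms, template=None, maxhits=None):
    """terms is a list of (attribute, value, negate, search_type) tuples."""
    hits = []
    for record in records:
        # Naming a template type discards all other records at once.
        if template is not None and record["Template"] != template:
            continue
        if all(matches(record, a, v, s) != negate for a, v, negate, s in terms):
            hits.append(record)
            if maxhits is not None and len(hits) >= maxhits:
                break  # the MAXHITS limit has been reached
    return hits


# Last_Name is Brown AND NOT First_Name is George, restricted to Person.
not_george = run_search(
    RECORDS,
    [("Last_Name", "Brown", False, "exact"),
     ("First_Name", "George", True, "exact")],
    template="Person",
)
# At most one hit for Last_Name containing "bro".
limited = run_search(RECORDS, [("Last_Name", "bro", False, "substring")], maxhits=1)
```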

The user has a small number of system commands with which to interrogate a server as to the types of templates it holds, the constraints it supports and the servers it has polled or which have tracked that server.

Deployment

Whois++ does not rely on a hierarchical representation of data space [6] and permits a more flexible approach to cross-searching. Its simplicity avoids the imposition of constraints upon its use in other information service areas.

Just as its simplicity is likely to encourage interest, its status as open source lowers the hurdle of entry cost for interested parties.

Its close relationship to the Common Indexing Protocol (CIP) is a major strength of Whois++. Indeed, CIP was embedded in version 1 of Whois++ and came to be abstracted from it in subsequent versions. [7] Consequently there is a particularly close mapping, whereas other pre-existing protocols may require more work to collaborate with CIP. Equally, if the Whois++ handle is replaced by the DSI (Dataset Identifier), the original Whois++ mesh traversal algorithm [5] can operate unchanged with CIP.

Conversely, it might be argued that Whois++ is not sufficiently sophisticated to offer a wider range of search tactics. Equally, its simplicity of operation means, for example, that a client has to be programmed in order to be able to follow all query referrals automatically. [7]

However, in practical terms the take-up of Whois++ is not widespread, and some contend it may be eclipsed by the rise of XML, XQL and the Common Name Resolution Protocol. [10]

Moreover whilst Whois++ addresses authentication in the sense that it does provide a framework for the process, it does not extend beyond a simple login name and password operation. This may be sufficient for some uses but it will represent a limitation in certain environments where there might be a need for access control lists for different entries in the database. Equally the protocol possesses no provision for encryption. [2]

Related Standards

Z39.50 [UKOLN Z39.50 review]

Z39.50 is a powerful searching tool using a generalised search syntax. It has the capacity to facilitate distributed applications and retrieve structured data from remote heterogeneous databases. In its terminology it describes a client as an "origin" and a server as a "target" (of the origin's requests). Origin and target communicate by PDUs (Protocol Data Units), which largely operate in pairs, request and response.

Its key functionality can be summarised as a series of services:

Initialisation: sets up the association between origin and target.

Search: involves the passing of a Z39.50-compliant query, its use to search a database, and any subsequent storage of the results. It should be noted that the target sends the origin not the records themselves but details of those records.

Present: permits the origin or user to request those records or a subset of them.

Authentication, record deletion and the use of resources within the current client/server dialogue are also addressed. Later versions include Explain, which permits the client to retrieve information on server-side components, and the retention of current results for later use.

Z39.50 is not without its difficulties. It would no longer be entirely true to characterise the development of Z39.50 as divided across two continents, North America and Europe, and across two environments, TCP/IP (Transmission Control Protocol/Internet Protocol) and OSI (Open Systems Interconnection) respectively. However, the earliest implementations of Z39.50 were thus divided and suffered from a lack of interoperability as a result. Part of the motivation of the EUROPAGATE Project was to create relevant solutions to the problem.

Other projects and initiatives in a European context are IRIS, a functioning service in Eire, and projects such as DALI (Document and Library Integration), funded by the European Libraries Programme, Pica (Holland). Of note are also SOCKER (SR Origin Communication Kernel) and PARAGON, both coordinated by UNI-C (Denmark), ONE (OPAC Network in Europe) and a German national project, DBV-OSI II. Z39.50 is employed commercially in the following products: Index+, SiteSearch and MetaStar, and as freeware in ASF, Cheshire II and Isite.

Despite complaints that Z39.50 can be long and costly to implement, it is seen by its supporters as an application protocol that is capable of "gluing" together the various components of a distributed network architecture whether characterised by the MODELS Information Architecture or other systems. [8]

LDAP [UKOLN LDAP review]

The Lightweight Directory Access Protocol evolved to meet the need for a less bulky and resource-consuming alternative to the X.500 Directory Access Protocol. It can run directly on top of TCP/IP and employs simpler encoding than X.500 DAP. It could be argued that interest in the protocol has waned since the appearance of more powerful PCs, but this would be an over-simplification, for LDAP has regained a degree of acceptance and some users report significant activity with it. In its purest form, an LDAP scenario greatly resembles Whois++ in its generation of referrals to the likeliest servers for the user. LDAP is employed by the ISAAC Project based at the University of Wisconsin-Madison.

Relevance to IMesh context

The major relevance of Whois++ lies in its employment in systems already associated with this project, e.g. ROADS, Harvest and MetaWeb; it therefore has some performance history in addressing the needs of the project, namely to provide cross-searching across distributed networks.

It is worthy of note that Whois++ is in use in a European context, in the sense that it is used for cross-searching in the ROADS system in U.K. services such as SOSIG and OMNI. Furthermore, it currently forms the basis of the resource finder in RDN. Elsewhere in Europe it is most prominent in the Finnish Virtual Library, which uses the ROADS (v2) software in conjunction with CIP and encompasses 5 FVL gateways across a very wide range of disciplines.

However, as the functionality required in the Renardus project becomes more apparent, it is possible that doubt will arise over the extensibility of Whois++, a protocol which has not seen a great deal of work recently. That said, Patrik Faltstrom and Leslie Daigle published an Internet-Draft in mid-June 2000 regarding the expression of Whois++ protocol [3] queries within MIME media types. Their intention is to enable MIME-enabled mail software, and other systems using Internet media types, to carry out Whois++ transactions. [9]

References

[1] A review of metadata: a survey of current resource description formats, Work Package 3 of Telematics for Research project DESIRE(RE004):IAFA/WHOIS++Templates

http://www.ukoln.ac.uk/metadata/desire/overview/

[2] CNIDR (Clearinghouse for Networked Information Discovery and Retrieval):"Distributed Directory Services Based on the Whois++ Protocol"

http://dcas.ucdavis.edu/projects/whois/prop.html#chapter1

[3] RFC 1835, 1995, Architecture of the WHOIS++ service. (P. Deutsch, R. Schoultz, P. Faltstrom and C. Weider). Internet Engineering Task Force, Network Working Group, August.

http://www.ietf.org/

[4] RFC 1913, 1996, Architecture of the Whois++ Index Service. (C. Weider, J. Fullton and S. Spero). Internet Engineering Task Force, Network Working Group, February.

http://www.ietf.org/

[5] RFC 1914, 1996, How to interact with a Whois++ Mesh. (P. Faltstrom, R. Schoultz and C. Weider). Internet Engineering Task Force, Network Working Group, February.

http://www.ietf.org/

[6] DESIRE Handbook: Section 3, Technical implementation: Interoperability,

http://www.ukoln.ac.uk/metadata/desire/handbook/drafts/standards/

[7] RFC 2651, The Architecture of the Common Indexing Protocol (CIP). (J. Allen and M. Mealling).

http://www.ietf.org/

[8] "Program" Vol 30, No 1, January 1996 : Towards distributed library systems: Z39.50 in a European context, Lorcan Dempsey, Rosemary Russell and John Kirriemuir

http://www.aslib.co.uk/program/1996/jan/02.html

[9] "The application/whoispp-query Content-Type", Patrik Faltstrom, Leslie Daigle, 06/13/2000

http://www.ietf.org/internet-drafts/draft-daigle-wppquery-02.txt

[10] Martin Hamilton, imesh-toolkit mailbase archive 12 June 2000
