The IMesh Toolkit
WHOIS++ Search Protocol
Overall Purpose
Based on Whois, a rather restricted white
pages directory service, Whois++ is a lightweight extension
offering cross-searching over a distributed network of databases
including multiple gateways. It is designed to function as a
simple lookup service but with a degree of flexibility that
avoids imposing constraints upon developers.
In its evolution beyond the original Whois, this protocol has
acquired more advanced search processes enhanced by the addition
of global and local constraints and the use of Boolean operators.
Further options include languages other than English, additional
character sets and, most importantly, the use of structured data
to make searching more effective. The structured data is in the
form of an information template which is central to the Whois++
operation.
Whereas the relatively restricted information in the Whois
system was unstructured, a record in Whois++ is set out as an
information set, i.e. a structured template of organised data
elements (or attribute-value pairs). The purpose of Whois++ is to
make searching relatively simple, and a major contribution
towards this comes from the use of different types of template,
each categorising the data set it contains. Searches run more
quickly if the type of template is included in the search terms,
because all records which do not conform to the chosen template
type are eliminated from the search process. Each record is
identified by a combination of two handles, or unique
identifiers: that of the server which holds the record and that
of the record itself.
The IAFA template, which has gained recognition for its
organisation of data, is compatible with the Whois++ search
protocol and was first implemented by the ALIWEB search system.
[1]
Without some form of forward knowledge the entire network of
databases could, theoretically, be interrogated before the
desired record is located. The purpose of forward knowledge is to
reduce the number of interrogations that have to be made by
giving the searcher a notion of where best to look for the
information.
The method by which Whois++ achieves forward knowledge is by
the generation of centroids, which are indexed across the system.
A centroid is a set of word lists, one for each attribute
existing in the database. Each list holds all the unique words
occurring in the values for that attribute. [2] How a centroid is
formed from a set of records is best illustrated as follows:
Record 1                     Record 2                           Record 3
Template: Person             Template: Person                   Template: Domain
Last_Name: Brown             Last_Name: Brown                   Domain_Name: evans.edu
First_Name: George           First_Name: James                  Contact_Name: Brian Evans
Transport_Type: Company Car  Transport_Type: Company Rail Pass
The centroid for this simple example is formed from all the
attributes, with any duplicated values of those attributes
eliminated. The resultant centroid would therefore look thus:
Centroid of Records 1-3:
Template: Person
Last_Name: Brown
First_Name: George
            James
Transport_Type: Company
                Car
                Rail
                Pass
Template: Domain
Domain_Name: evans.edu
Contact_Name: Brian
              Evans
Note that the centroid eliminates those values which are
duplicate, e.g. "Brown" and "Company".
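The formation of a centroid can be sketched in a few lines of Python. This is a minimal illustration only, not code from any Whois++ implementation: the function name build_centroid and the dictionary layout are assumptions, the record values are a reading of the three records above, and words are compared case-sensitively for simplicity.

```python
from collections import defaultdict


def build_centroid(records):
    """Collect, per template type and attribute, the unique words in the values.

    Hypothetical helper: records are plain dicts with a "Template" key.
    """
    centroid = defaultdict(lambda: defaultdict(set))
    for record in records:
        template = record["Template"]
        for attribute, value in record.items():
            if attribute == "Template":
                continue
            # A set keeps each word once, so duplicates such as "Brown" vanish.
            centroid[template][attribute].update(value.split())
    return centroid


records = [
    {"Template": "Person", "Last_Name": "Brown", "First_Name": "George",
     "Transport_Type": "Company Car"},
    {"Template": "Person", "Last_Name": "Brown", "First_Name": "James",
     "Transport_Type": "Company Rail Pass"},
    {"Template": "Domain", "Domain_Name": "evans.edu",
     "Contact_Name": "Brian Evans"},
]

centroid = build_centroid(records)
for template, attributes in centroid.items():
    print(f"Template: {template}")
    for attribute, words in attributes.items():
        print(f"  {attribute}: {', '.join(sorted(words))}")
```

Holding each word list as a set means that duplicates are discarded as a side effect of insertion, which mirrors the elimination described in the text.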
The distributed network comprises two basic kinds of server:
base servers, which hold only filled templates [3], and indexing
servers. Whilst both types of server generate forward knowledge,
it is the indexing servers which collect and retain it in the
form of centroids and use it to tell the user which servers are
thought to be holding the desired records. Such a routing is
called a referral. It is possible for both a base and an indexing
server to exist in one physical server.
How the passing of centroids assists the search process is
described in the next section.
Brief Overview of Functionality
This section divides into the process by
which indexing produces referrals to assist speedier searching
and the functionality of Whois++ in terms of a user.
Indexing servers obtain centroids from other servers at
regular intervals by a process called polling. When a server is
polled by an indexing server, it determines whether it has any
changes in its centroids to report. It can also notify the
indexers that poll it via the DATA-CHANGED command, which allows
'interactive' updating of critical information. [4]
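How polled centroids lead to referrals might be sketched as follows. This is a toy model under stated assumptions rather than the behaviour laid down in [4]: the IndexServer class, its method names and the server handles are all hypothetical, and change detection is done here on the indexing side purely to keep the example short.

```python
class IndexServer:
    """Toy indexing server that keeps the last centroid polled from each server."""

    def __init__(self):
        # server handle -> {template: {attribute: set of words}}
        self.centroids = {}

    def poll(self, server_handle, fetch_centroid):
        """Fetch a centroid via the supplied callable; return True if it changed."""
        new_centroid = fetch_centroid()
        changed = self.centroids.get(server_handle) != new_centroid
        self.centroids[server_handle] = new_centroid
        return changed

    def referrals(self, template, attribute, word):
        """Return handles of servers whose centroid contains the word."""
        return sorted(
            handle
            for handle, centroid in self.centroids.items()
            if word in centroid.get(template, {}).get(attribute, set())
        )


index = IndexServer()
index.poll("serverA", lambda: {"Person": {"Last_Name": {"Brown"}}})
index.poll("serverB", lambda: {"Person": {"Last_Name": {"Evans"}}})
print(index.referrals("Person", "Last_Name", "Brown"))
```

A query for a word that appears in no stored centroid yields an empty list, so the client learns at once that none of the polled servers is worth contacting.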
If the client initially fails to locate a record, it can adopt
the normal approach of automatically expanding the search across
the mesh. It is for the client to keep track of the servers
queried in this process by means of a loop detection algorithm.
[5] However, an alternative approach exists in the form of a
special server termed the Directory of Servers. This polls all
other indexing servers for common information and so shortens
the referral process.
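The client-side bookkeeping can be illustrated with a short sketch. It is a simplified stand-in for the algorithm in [5], not a transcription of it: the function search_mesh, the ask callable and the in-memory MESH table are hypothetical, and a real client would contact servers over the network instead.

```python
from collections import deque


def search_mesh(start_server, query, ask):
    """Follow referrals breadth-first, never querying the same server twice.

    ask(server, query) is a hypothetical callable returning a pair
    (records, referrals), where referrals is a list of server handles.
    """
    visited = set()
    pending = deque([start_server])
    results = []
    while pending:
        server = pending.popleft()
        if server in visited:
            continue  # loop detection: this server has already been queried
        visited.add(server)
        records, referrals = ask(server, query)
        results.extend(records)
        pending.extend(r for r in referrals if r not in visited)
    return results, visited


# Invented mesh for illustration: index2 refers back to index1, forming a loop.
MESH = {
    "index1": ([], ["index2", "base1"]),
    "index2": ([], ["index1", "base1"]),
    "base1": (["Brown, George"], []),
}


def ask(server, query):
    return MESH[server]


results, visited = search_mesh("index1", "Last_Name=Brown", ask)
print(results, sorted(visited))
```

The visited set is what prevents the mutual referral between the two indexing servers from sending the client round in circles.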
A typical transaction might occur as follows: the client opens
a connection to the server, sends a query and receives a reply,
most likely a referral to another server, and the connection
closes. [5]
For the user there is basically one search command, which may
be modified by constraints attached to it. Any search command may
contain more than one search term, each of which may itself be
locally constrained (e.g. by the use of Boolean operators against
a value, such as NOT). The core set of constraints is SEARCH, to
determine the search type, FORMAT, to determine the output
format, and MAXHITS, to set the maximum number of matches to be
returned. [3]
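As an illustration, a search line combining terms and constraints could be assembled as below. The layout is an assumption based on a simplified reading of [3]: local constraints follow their term after a semicolon, and global constraints follow a colon at the end of the line. The helper build_query is hypothetical and performs no quoting or validation, so the grammar in [3] should be consulted before relying on it.

```python
def build_query(terms, global_constraints=None):
    """Join search terms with "and" and append any global constraints.

    terms: list of (attribute, value, local_constraints) tuples, where
    local_constraints is a dict applying to that term only.
    The separator layout is an assumed simplification of the grammar in [3].
    """
    parts = []
    for attribute, value, local_constraints in terms:
        term = f"{attribute}={value}"
        for name, setting in local_constraints.items():
            term += f";{name}={setting}"
        parts.append(term)
    query = " and ".join(parts)
    if global_constraints:
        query += ":" + ";".join(
            f"{name}={setting}" for name, setting in global_constraints.items()
        )
    return query


query = build_query(
    [("template", "person", {}), ("last_name", "brown", {"search": "exact"})],
    {"format": "full", "maxhits": 10},
)
print(query)
```

Including the template type as a term is what allows a server to discard records of other template types early, as noted under Overall Purpose.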
The user has a small number of system commands with which to
interrogate a server as to the types of templates it holds, the
constraints it supports and the servers it has polled or which
have tracked that server.
Deployment
Whois++ does not rely on a hierarchical
representation of data space [6] and permits a more flexible
approach to cross-searching. Its simplicity avoids the imposition
of constraints upon its use in other information service areas.
Just as its simplicity is likely to encourage interest, its
status as open source lowers the hurdle of entry cost for
interested parties.
Its close relationship to the Common Indexing Protocol (CIP) is
a major strength of Whois++. Indeed, CIP was embedded in version 1
of Whois++ and came to be abstracted from it in subsequent
versions. [7] Consequently there is a particularly close mapping
between the two, whereas other pre-existing protocols may require
more work to collaborate with CIP. Equally, if the Whois++ handle
is replaced by the DSI (Dataset Identifier), the original
Whois++ mesh traversal algorithm [5] can operate unchanged with
CIP.
Conversely, it might be argued that Whois++ is not sufficiently
sophisticated to offer a wider range of search tactics. Equally,
such is its simplicity of operation that, for example, a client
has to be programmed in order to be able to follow all query
referrals automatically. [7]
However, in practical terms the take-up of Whois++ is not
widespread and some contend it may be eclipsed by the rise of
XML, XQL and the Common Name Resolution Protocol. [10]
Moreover whilst Whois++ addresses authentication in the sense
that it does provide a framework for the process, it does not
extend beyond a simple login name and password operation. This
may be sufficient for some uses but it will represent a
limitation in certain environments where there might be a need
for access control lists for different entries in the database.
Equally the protocol possesses no provision for encryption.
[2]
Related Standards
Z39.50 [UKOLN Z39.50 review]
Z39.50 is a powerful searching tool using a generalised search
syntax. It has the capacity to facilitate distributed
applications and retrieve structured data from remote
heterogeneous databases. In its terminology it describes a client
as an "origin" and a server as a "target" (of the origin's
requests). Origin and target communicate by PDUs (Protocol Data
Units), which largely operate in pairs, request and response.
Its key functionality can be summarised as a series of
services:
Initialisation : which seeks to set up the association between
origin and target.
Search : involving the passing of a Z39.50 compliant query, its
use to search a database and any subsequent storage of the
results. It should be noted that the target sends the origin not
the records themselves but details of those records.
Present : this permits the origin or user to request those
records or a subset of them.
Authentication, record deletion and the use of resources
within the current client/server dialogue are also addressed.
Later versions include Explain, which permits the client to
retrieve information on server-side components, and the retention
of current results for later use.
Z39.50 is not without its difficulties. It would no longer be
entirely true to characterise the development of Z39.50 as
divided across two continents, North America and Europe, and
across two environments, TCP/IP (Transmission Control
Protocol/Internet Protocol) and OSI (Open Systems
Interconnection) respectively. However, the earliest
implementations of Z39.50 were thus divided and suffered from a
lack of interoperability as a result. Part of the motivation of
the EUROPAGATE Project was to create relevant solutions to the
problem.
Other projects and initiatives in a European context are IRIS,
a functioning service in Eire, and projects such as DALI
(Document and Library Integration), funded by the European
Libraries Programme, and Pica (Holland). Also of note are SOCKER
(SR Origin Communication Kernel) and PARAGON, both coordinated by
UNI-C (Denmark), ONE (OPAC Network in Europe) and a German
national project, DBV-OSI II. Z39.50 is employed commercially in
the following products: Index+, SiteSearch and MetaStar, and as
freeware in ASF, Cheshire II and Isite.
Despite complaints that Z39.50 can be long and costly to
implement, it is seen by its supporters as an application
protocol that is capable of "gluing" together the various
components of a distributed network architecture whether
characterised by the MODELS Information Architecture or other
systems. [8]
LDAP [UKOLN LDAP review]
The Lightweight Directory Access Protocol evolved to meet the
need for a less bulky and resource-consuming alternative to the
X.500 Directory Access Protocol (DAP). It can run directly on top
of TCP/IP and employs simpler encoding than X.500 DAP. It could
be argued that interest in the protocol has waned since the
appearance of more powerful PCs, but this would be an
over-simplification, for LDAP has regained a degree of acceptance
and some users report significant activity with it. In its purest
form, an LDAP scenario greatly resembles Whois++ in its
generation of referrals to likeliest servers for the user. LDAP
is employed by the ISAAC Project based at the University of
Wisconsin-Madison.
Relevance to IMesh context
The major relevance of Whois++ lies in its employment in systems
already associated with this project, e.g. ROADS, Harvest and
MetaWeb; it therefore has some performance history in addressing
the needs of the project, namely to provide cross-searching
across distributed networks.
It is worthy of note that Whois++ is in use in a European
context, in the sense that it is used for cross-searching in the
ROADS system in U.K. services such as SOSIG and OMNI. Furthermore
it currently forms the basis of the resource finder in RDN.
Elsewhere in European associated projects it is most prominent
in the Finnish Virtual Library (FVL), which uses the ROADS (v2)
software in conjunction with CIP and encompasses 5 FVL gateways
across a very wide range of disciplines.
However, as the functionality required in the Renardus project
becomes more apparent, doubt may arise over the extensibility of
Whois++, a protocol which has not seen a great deal of work
recently. That said, Patrik Faltstrom and Leslie Daigle published
an Internet-Draft in mid-June 2000 regarding the expression of
Whois++ protocol [3] queries within MIME media types. Their
intention is to enable MIME-enabled mail software, and other
systems using Internet media types, to carry out Whois++
transactions. [9]
References
[1] A review of metadata: a survey of current resource
description formats, Work Package 3 of Telematics for Research
project DESIRE (RE004): IAFA/WHOIS++ Templates
http://www.ukoln.ac.uk/metadata/desire/overview/
[2] CNIDR (Clearinghouse for Networked Information Discovery
and Retrieval):"Distributed Directory Services Based on the
Whois++ Protocol"
http://dcas.ucdavis.edu/projects/whois/prop.html#chapter1
[3] RFC 1835, 1995, Architecture of the WHOIS++ service. (P.
Deutsch, R. Schoultz, P. Faltstrom and C. Weider). Internet
Engineering Task Force, Network Working Group, August.
http://www.ietf.org/
[4] RFC 1913, 1996, Architecture of the Whois++ Index Service.
(C. Weider, J. Fullton and S. Spero). Internet Engineering Task
Force, Network Working Group, February.
http://www.ietf.org/
[5] RFC 1914, 1996, How to interact with a Whois++ Mesh. (P.
Faltstrom, R. Schoultz and C. Weider). Internet Engineering Task
Force, Network Working Group, February.
http://www.ietf.org/
[6] DESIRE Handbook: Section 3, Technical implementation:
Interoperability,
http://www.ukoln.ac.uk/metadata/desire/handbook/drafts/standards/
[7] RFC 2651, The Architecture of the Common Indexing Protocol
(CIP). (J. Allen and M. Mealling).
http://www.ietf.org/
[8] "Program" Vol 30, No 1, January 1996 : Towards distributed
library systems: Z39.50 in a European context, Lorcan Dempsey,
Rosemary Russell and John Kirriemuir
http://www.aslib.co.uk/program/1996/jan/02.html
[9] "The application/whoispp-query Content-Type", Patrik
Faltstrom, Leslie Daigle, 06/13/2000
http://www.ietf.org/internet-drafts/draft-daigle-wppquery-02.txt
[10] Martin Hamilton, imesh-toolkit mailbase archive 12 June
2000