The IMesh Toolkit
Common Indexing Protocol (CIP)
|
Overall Purpose
|
The role of the Common Indexing Protocol or
CIP is to pass information about the contents of a record between
servers and so facilitate access by clients to the data they seek
at a later point. This process of referring or replicating
queries is known as query routing and is designed to reduce
server overload. The latter is a frequent consequence in a system
which merely broadcasts searches across a distributed network
without a mechanism to direct searching in any fashion. [1]
CIP implements index passing, providing the forward knowledge
necessary to generate the referrals used for query routing. Query
routing therefore directs queries towards repositories most
likely to hold the desired records by referring to the indexing
information created. [2] CIP operates on the principle of index
summaries, otherwise known in this context as centroids. A
centroid is a list of tuples (each comprising template name, attribute name
and token) from which all duplicate token values have been
removed. [1] Indiscriminate cross-searching by multiple subject
gateways increases the likelihood of server overload, hence the
usefulness of CIP, centroids and Tagged Index Objects. (Unlike a centroid,
the Tagged Index Object has the capacity to support the exchange of index update
information although the size of such an index object is inevitably greater. It
labels items of information with identifiers that relate individual object attributes
back to the object as a whole, i.e. a process of tagging. This enables an index server
to direct a specific query to the correct information server more effectively than a
centroid, which does not support incremental updates natively.) These
elements enable the system to forward queries only to those
services identified as likely to hold a requested record. However
in adopting query routing, it is incumbent upon developers to
avoid both the circular routing of queries and their transmission
to mirrors of the same metadata. Otherwise the user, for a start,
will inevitably experience multiple hits. [3]
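The centroid principle described above can be illustrated with a minimal sketch. It assumes a simple dictionary layout for records and whitespace tokenisation, neither of which is laid down by CIP; the names build_centroid and route_query, and the server names, are hypothetical.

```python
def build_centroid(records):
    """Collapse records into unique (template, attribute, token) tuples."""
    centroid = set()
    for record in records:
        for attribute, value in record["fields"].items():
            for token in value.lower().split():
                # A set drops duplicates, so a repeated token is stored once.
                centroid.add((record["template"], attribute, token))
    return centroid


def route_query(centroids, template, attribute, token):
    """Return the servers whose centroid contains the queried tuple."""
    key = (template, attribute, token.lower())
    return sorted(name for name, centroid in centroids.items() if key in centroid)


# Two invented servers with invented white-pages records.
records_a = [
    {"template": "USER", "fields": {"name": "Ann Smith", "org": "UKOLN"}},
    {"template": "USER", "fields": {"name": "Bob Smith", "org": "UKOLN"}},
]
records_b = [
    {"template": "USER", "fields": {"name": "Carl Jones", "org": "DTV"}},
]

centroids = {
    "server-a": build_centroid(records_a),
    "server-b": build_centroid(records_b),
}

print(len(centroids["server-a"]))                        # 4
print(route_query(centroids, "USER", "name", "Smith"))   # ['server-a']
```

Only server-a is referred for the query on "Smith", so server-b is spared a search it could not have answered.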
CIP was originally embedded in version 1 of Whois++ (a
lightweight extension of Whois, a rather restricted white pages
directory service) and was not referred to at that point as CIP,
but came to be abstracted from it in subsequent versions. In
version 1 it could only support ISO-8859-1 characters and the
centroid index object type. CIP seeks to disaggregate the
indexing elements of Whois++ and to turn the ad hoc data
representation of the data access protocol into
MIME-specification-based structures. Whilst there was development
in version 2 of Whois++ and the use of its centroid by Bunyip
Information Systems' Digger Software, it is in version 3 of CIP
that major changes have occurred to the protocol. It should be
appreciated that CIP is a "backend" protocol; it is implemented
in the context of network servers which must themselves employ
some form of data access protocol to communicate with clients.
During query resolution in the native protocol implementation,
the server will refer to the indexing information collected by
the CIP implementation for guidance on how to route the query.
Data access protocols used with CIP must have some provision for
control information in the form of a referral.
|
Brief Overview of Functionality
|
In other words, a distinction should be
drawn between data access protocols such as Whois++, LDAP, HTTP
and Z39.50 and an indexing protocol such as CIP. Whilst it is
perfectly possible for the former to provide information from a
local dataset, they cannot refer to records held externally since
they do not possess any external indices. It is the task of an
indexing protocol such as CIP to furnish external indices
obtained from peer servers. However the distinction should also
be drawn between the referrals that CIP-equipped servers can
provide and the client's task of deciding what to do with
referrals it receives. Some may see the need to do this, in the
case of an actual human end-user, as something of an imposition.
Others might argue it maintains user independence and control.
Note also that no single type of index will serve all queries
(consider, for example, a keyword search for
two or more words in close proximity in a text, or for a keyword
in a title). Indices must therefore be specific to the
application domain. Central to the CIP protocol is the CIP index
object which comprises the header and the payload. The latter
holds the actual index and is invisible to CIP itself, defined as
it is by the index object specification that relates to the
object's MIME type. The header, which varies in type, contains
the metadata required to process and use the index object being
passed. The system that passes these indices between servers,
(over reliable transport mechanisms such as TCP or Internet mail
messages), is a CIP mesh. It is arranged, typically, in a
hierarchical tree. Those servers closest to the root of this tree
hold the larger and more comprehensive indices. However there is
no defined structure as such and indeed a degree of flexibility,
particularly in respect of lateral links, is to be encouraged in
order to obviate the threat of referral loops. One
assumption may nevertheless be made: all indices passed across a CIP
mesh are of the same type, as determined by the CIP index
object specification. While it may be possible to realise
gateways between meshes which are passing index objects of
different types, this is as yet far from defined.
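The split between header and payload might be sketched as follows. This is an illustration only: the class IndexObject, the registry PARSERS, the MIME type string, the header field name and the tab-separated payload layout are all invented for the example and are not taken from the CIP specifications.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class IndexObject:
    mime_type: str          # identifies the index object specification in force
    header: Dict[str, str]  # metadata needed to process and use the object
    payload: bytes          # the index itself, opaque to the CIP layer


# Hypothetical registry: one payload parser per index object type.
PARSERS: Dict[str, Callable[[bytes], set]] = {}


def register(mime_type):
    """Decorator that records a payload parser for one index object type."""
    def wrap(parser):
        PARSERS[mime_type] = parser
        return parser
    return wrap


@register("application/x-example-centroid")
def parse_centroid(payload):
    # Invented layout: one "template<TAB>attribute<TAB>token" triple per line.
    lines = payload.decode("utf-8").splitlines()
    return {tuple(line.split("\t")) for line in lines if line}


def forward(obj):
    """Pass an index object on unaltered; the payload is never inspected."""
    return obj


def open_payload(obj):
    """Interpret the payload using the specification named by the MIME type."""
    if obj.mime_type not in PARSERS:
        raise ValueError("no index object specification for " + obj.mime_type)
    return PARSERS[obj.mime_type](obj.payload)


example = IndexObject(
    mime_type="application/x-example-centroid",
    header={"dataset": "dsi-a"},
    payload=b"USER\tname\tsmith\nUSER\tname\tsmith\nUSER\torg\tukoln\n",
)
print(sorted(open_payload(example)))
```

The point of the design is that forward needs no knowledge of the index type, whereas open_payload fails for any type whose specification the server does not hold, which is why a mesh is assumed to carry a single type.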
Within the CIP mesh servers cooperate to pass the indexing
information around the mesh so that forward knowledge is
correctly placed and query routing is indeed effective. In common
with Whois++, there exists between two such servers what is
termed a polling relationship. The server that holds the data of
interest, and creates an index, is termed the polled server; the
polling server is that which collects the index generated by the
polled server. Another relationship, termed an index pushing
relationship, still operates between peers, but here the
polling server is permitted to refuse the generated indices, accept
them, accept and then discard them, or accept only portions of them via
filtering.
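The four responses open to the receiving server can be sketched as below. The enumeration Disposition, the function receive_pushed_index and the dictionary used as an index store are hypothetical names for illustration and do not correspond to any CIP command.

```python
from enum import Enum


class Disposition(Enum):
    REFUSE = "refuse"
    ACCEPT = "accept"
    ACCEPT_AND_DISCARD = "accept and discard"
    ACCEPT_FILTERED = "accept portions via filtering"


def receive_pushed_index(store, sender, centroid, disposition, keep=None):
    """Apply one disposition to a pushed index.

    Returns True if the transfer was accepted, False if it was refused.
    """
    if disposition is Disposition.REFUSE:
        return False
    if disposition is Disposition.ACCEPT_AND_DISCARD:
        return True  # acknowledged, but nothing is stored
    if disposition is Disposition.ACCEPT_FILTERED:
        if keep is None:
            raise ValueError("a filter is required to accept portions")
        centroid = {entry for entry in centroid if keep(entry)}
    store[sender] = set(centroid)
    return True


store = {}
pushed = {("USER", "name", "smith"), ("USER", "org", "ukoln")}

# Keep only the tuples that index the "name" attribute.
receive_pushed_index(store, "leaf-1", pushed, Disposition.ACCEPT_FILTERED,
                     keep=lambda entry: entry[1] == "name")
print(store["leaf-1"])  # {('USER', 'name', 'smith')}
```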
|
Deployment
|
The advantage of this index pushing
relationship is that leaf nodes wishing to participate in the
mesh by making their index available are not obliged to support
the complete CIP protocol, thus lowering the barrier to
participation whilst encouraging the generation of more
information.
One should recall the principle of the centroid, in which
redundant or duplicate values are eliminated. Whilst a CIP server is able
to pass on an index object unaltered, there is a benefit to the system if
it is able to compress or aggregate two or more objects before onward
transmission. A balance must be struck between eliminating duplicate
data through compression and losing, through excessive compression, data that
would provide hints useful for routing enquiries.
In effect the trade-off between excessive aggregation and aggressive referral
chain optimisation can only be analysed within its particular application domain
and after actual operation. The degree of error considered acceptable must be
decided on a per-application-domain basis.
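The cost of aggregation can be seen in a small sketch. It assumes that aggregation is a plain set union of centroids and that a query matches only when every one of its tuples is present; both are simplifications chosen for the example, and the names aggregate and may_match are hypothetical.

```python
def aggregate(*centroids):
    """Merge several centroids into one, dropping duplicate tuples."""
    return set().union(*centroids)


def may_match(centroid, query):
    """True if every tuple in the query appears in the centroid."""
    return all(entry in centroid for entry in query)


leaf_a = {("DOC", "title", "indexing"), ("DOC", "title", "protocol")}
leaf_b = {("DOC", "title", "routing"), ("DOC", "title", "protocol")}
merged = aggregate(leaf_a, leaf_b)

# Four tuples shrink to three because "protocol" is shared.
print(len(leaf_a) + len(leaf_b), len(merged))  # 4 3

# A query for "indexing" AND "routing" matches the merged centroid,
# yet neither leaf could satisfy it: the referral would be wasted.
query = [("DOC", "title", "indexing"), ("DOC", "title", "routing")]
print(may_match(merged, query), may_match(leaf_a, query), may_match(leaf_b, query))
# True False False
```

The saving of one tuple is bought at the price of a false referral, which is the kind of error whose acceptable level has to be judged per application domain.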
However, things get significantly more difficult when CIP is employed in
a multi-protocol application domain. The essential difficulty is to avoid forcing a
referral chain to pass through part of the mesh which does not support the protocol
by which that client initiated the query. If this occurs, the client loses access to
any hits beyond that point in the referral chain; it cannot resolve the referral in
its native data access protocol, resulting in a failure of query routing.
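The difficulty can be made concrete with a toy mesh in which each server lists the data access protocols it speaks. The server names, the protocol assignments and the function reachable are assumptions made for this sketch.

```python
def reachable(polls, speaks, entry, client_protocol):
    """Return the servers a client can reach when every hop of the
    referral chain must speak the client's own protocol."""
    if client_protocol not in speaks[entry]:
        return []
    seen = {entry}
    stack = [entry]
    while stack:
        server = stack.pop()
        for child in polls.get(server, []):
            # A referral to a server that lacks the protocol cannot be chased.
            if child not in seen and client_protocol in speaks[child]:
                seen.add(child)
                stack.append(child)
    return sorted(seen)


# root polls mid, and mid polls leaf; mid does not speak Whois++.
polls = {"root": ["mid"], "mid": ["leaf"], "leaf": []}
speaks = {
    "root": {"whois++", "ldap"},
    "mid": {"ldap"},
    "leaf": {"whois++", "ldap"},
}

print(reachable(polls, speaks, "root", "ldap"))     # ['leaf', 'mid', 'root']
print(reachable(polls, speaks, "root", "whois++"))  # ['root']
```

The Whois++ client loses the leaf server even though that server speaks Whois++, because the only referral chain to it passes through a server that does not.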
A further point of interest is raised in the context of the
referral. When sending the Dataset Identifier (DSI), it is
recommended that the DSI-Description header be
transmitted also, where possible. (A DSI-Description is a human-readable string
optionally carried along with DSIs to make them more
user-friendly.) It gives the client the opportunity to check with
a user prior to chasing the referral and is the clearest
representation of the DSI that CIP offers. [1]
In the same context, if a client is programmed to follow all
referrals that it receives, this raises the issue of how to
avoid an infinite loop of referrals. (The mesh can be
defined as a graph of CIP servers that may contain cycles, and
such cycles could cause these loops.) The solution lies in some form of mesh
traversal algorithm such as that documented for use with Whois++.
[4] By replacing the Whois++ handle (or unique identifier) with
the Dataset Identifier of CIP (version 3), the same algorithm
will operate with CIP. [1]
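The essential idea of loop avoidance, namely remembering which DSIs have already been visited, can be sketched as a generic breadth-first traversal. This is not a transcription of the algorithm in [4]; the function follow_referrals and the DSI strings are placeholders invented for the example.

```python
from collections import deque


def follow_referrals(referrals, entry_dsi):
    """Chase referrals breadth-first, visiting each DSI at most once.

    `referrals` maps a DSI to the DSIs it refers the client on to.
    Returns the DSIs in the order they were contacted.
    """
    visited = {entry_dsi}
    order = []
    queue = deque([entry_dsi])
    while queue:
        dsi = queue.popleft()
        order.append(dsi)
        for referred in referrals.get(dsi, []):
            if referred not in visited:  # skip servers already contacted
                visited.add(referred)
                queue.append(referred)
    return order


# A mesh with cycles: every server refers back to "dsi-a".
referrals = {
    "dsi-a": ["dsi-b", "dsi-c"],
    "dsi-b": ["dsi-c", "dsi-a"],
    "dsi-c": ["dsi-a"],
}
print(follow_referrals(referrals, "dsi-a"))  # ['dsi-a', 'dsi-b', 'dsi-c']
```

Because the visited set is keyed on the DSI, the traversal terminates however many cycles the mesh contains.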
Equally CIP v.3 benefits from the polling process as described
in the Whois++ architecture, [5] since it permits the expansion
of the client search beyond the entry point such that it can
access the hints from those servers polled by the server at the
entry point to the mesh. The likelihood of search failure through
excessive concentration on a single entry point server is
consequently much reduced.
Whichever data access protocol one adopts to work with CIP, a
clearly defined mapping is called for in order to map queries in
the native protocol to searches against an index object. The
resultant mapping rules may vary according to domain.
Theoretically the number of mappings required is the product
of the number of protocols and the number of domains involved, a
possible scaling problem. However, in reality some protocols will
prove wholly inappropriate for certain domains and so the overall
total will be lower. Nonetheless, whereas mapping between CIP and
Whois++ is very close, (thanks to the evolution of CIP from
Whois++ as described above), it will be necessary to ensure that
mappings between CIP and existing or emerging protocols are
specified in the index object specification.
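The arithmetic is simple enough to show directly. The protocol and domain labels below are placeholders, and the pair marked as unsuitable is invented purely to illustrate how the total falls below the theoretical maximum.

```python
from itertools import product


def mappings_needed(protocols, domains, unsuitable=frozenset()):
    """List the (protocol, domain) pairs that each require a mapping."""
    return [pair for pair in product(protocols, domains) if pair not in unsuitable]


protocols = ["P1", "P2", "P3"]
domains = ["D1", "D2"]

upper_bound = mappings_needed(protocols, domains)
realistic = mappings_needed(protocols, domains, unsuitable={("P3", "D1")})
print(len(upper_bound), len(realistic))  # 6 5
```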
However, three security issues also surround CIP and may
present difficulties. Unless carefully controlled, indexing
information may leak unacceptable amounts of proprietary
information. Furthermore, not only does CIP itself require
external security services in order to operate securely, it can
also be misused in order to transmit incorrect information, (for
example, CIP does not support any trust method capable of
filtering false data which could be inserted into the server mesh
for propagation). Therefore any use of CIP with a database on
which resided certain data of a restricted nature would need
careful consideration. Whilst it would be important to create an
index which summarises the information held, it is equally
important to protect the integrity of the database information.
This should be borne in mind when developing a new index object
type.
|
Related Standards
|
Z39.50 [UKOLN Z39.50 review]
Z39.50 is a powerful searching tool using a generalised search
syntax. It has the capacity to facilitate distributed
applications and retrieve structured data from remote
heterogeneous databases. In its terminology it describes a client
as an "origin" and a server as a "target" (of the origin's
requests). Origin and target communicate by PDUs (Protocol Data
Units), which largely operate in pairs, request and response.
Its key functionality can be summarised as a series of
services:
Initialisation: sets up the association between
origin and target.
Search: passes a Z39.50-compliant query, uses it to
search a database and stores any
results. It should be noted that the target sends the origin not the records
themselves but details of those records.
Present: permits the origin or user to request those
records or a subset of them.
Authentication, record deletion and the use of resources
within the current client/server dialogue are also addressed.
Later versions include Explain, which permits the client to
retrieve information on server-side components, and the retention
of current results for later use.
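The division of labour between the three core services can be mimicked with a toy in-memory target. This is a sketch of the sequence only: it involves no PDU encoding and no real Z39.50 library, and the class ToyTarget, its method names, the substring matching and the 1-based start position are all assumptions made for the illustration.

```python
class ToyTarget:
    """A stand-in for a target that holds named result sets."""

    def __init__(self, database):
        self._database = list(database)
        self._result_sets = {}
        self._associated = False

    def init(self):
        # Initialisation: establish the association with the origin.
        self._associated = True
        return True

    def search(self, result_set, term):
        # Search: run the query and store the hits on the target.
        if not self._associated:
            raise RuntimeError("Init must precede Search")
        hits = [i for i, record in enumerate(self._database)
                if term.lower() in record.lower()]
        self._result_sets[result_set] = hits
        return len(hits)  # only details of the result go back, not records

    def present(self, result_set, start, count):
        # Present: return a subset of the stored result set (start is 1-based).
        hits = self._result_sets[result_set]
        return [self._database[i] for i in hits[start - 1:start - 1 + count]]


target = ToyTarget(["Indexing protocols", "Query routing", "Routing tables"])
target.init()
print(target.search("rs1", "routing"))  # 2
print(target.present("rs1", 1, 1))      # ['Query routing']
```

The origin learns from Search only that two records matched; it obtains a record only when it issues Present against the named result set.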
Z39.50 is not without its difficulties. It would no
longer be entirely true to characterise the development of Z39.50
as divided across two continents, North America and Europe, and
across two environments, TCP/IP (Transmission Control
Protocol/Internet Protocol) and OSI (Open Systems
Interconnection) respectively. However, the earliest implementations
of Z39.50 were thus divided and suffered from a lack of
interoperability as a result. Part of the motivation of the
EUROPAGATE Project was to create relevant solutions to the
problem.
Other projects and initiatives in a European context are IRIS,
a functioning service in Eire and projects such as DALI (Document
and Library Integration), funded by European Libraries Programme,
Pica, (Holland). Of note are also SOCKER (SR Origin Communication
Kernel) and PARAGON, both coordinated by UNI-C, (Denmark), ONE
(OPAC Network in Europe) and a German national project DBV-OSI
II. Z39.50 is employed commercially in the following products:
Index+, Site Search and MetaStar and as freeware in ASF, Cheshire
II and Isite.
Despite complaints that Z39.50 can be long and costly to
implement, it is seen by its supporters as an application
protocol that is capable of "gluing" together the various
components of a distributed network architecture whether
characterised by the MODELS Information Architecture or other
systems.
LDAP [UKOLN LDAP review]
The Lightweight Directory Access Protocol evolved to meet the
need for a less bulky and resource-consuming alternative to the
X.500 Directory Access Protocol. It can run directly on top of
TCP/IP and employs simpler encoding than X.500 DAP. It could be
argued that interest has waned in the protocol since the
appearance of more powerful PCs, but this would be an
over-simplification, for LDAP has regained a degree of acceptance
and some users report significant activity with it. In its purest
form, an LDAP scenario greatly resembles Whois++ in its
generation of referrals to likeliest servers for the user. LDAP
is employed by the ISAAC Project based at University of
Wisconsin-Madison.
Whois++ [UKOLN WHOIS++ review]
Based on Whois, a rather restricted white pages directory
service, Whois++ is a lightweight extension offering
cross-searching over a distributed network of databases including
multiple gateways. It is designed to function as a simple lookup
service but with a degree of flexibility that avoids imposing
constraints upon developers.
In its evolution beyond the original Whois, this protocol has
acquired more advanced search processes enhanced by the addition
of global and local constraints and the use of Boolean operators.
Further options include languages other than English, additional
character sets and, most importantly, the use of structured data
to make searching more effective. The structured data is in the
form of an information template which is central to the Whois++
operation.
Its close relationship to the Common Indexing Protocol is a
major strength of Whois++. Indeed CIP was embedded in version 1
of Whois++ and came to be abstracted from it in subsequent
versions. Consequently there is a particularly close mapping
whereas other pre-existing protocols may require more work to
collaborate with CIP. Equally if the Whois++ handle is
substituted by the DSI, (Dataset Identifier), the original
Whois++ mesh traversal algorithm can operate unchanged with
CIP.
|
Relevance to IMesh context
|
Perhaps the principal interest and
relevance to the IMesh Toolkit project lies in the advantages
derived from CIP with regard to avoiding overloading during
search requests. This represents an even greater threat where
subject gateways are operating in an international context with
the possible collaboration of a plethora of repositories and the
attendant increase in search requests and end users. This is
where query routing is able to provide a possible solution to
this potential threat. The use of centroids and Tagged Index
Objects in sending queries to services most likely to hold a
desired record reduces the likelihood of circular routing of
queries.
References
|
[1] RFC2651; The Architecture of the Common
Indexing Protocol (CIP);
http://www.faqs.org/rfcs/rfc2651.html
[2] DESIRE Information Gateways Handbook
http://www.desire.org/handbook/
[3] Renardus Project Technical Standards Report
http://nwi.dtv.dk/RENARDUS/D2.1/RDF.html#S3
[4] RFC 1914, 1996, How to interact with a Whois++ Mesh. (P.
Faltstrom, R. Schoultz and C. Weider). Internet Engineering Task
Force, Network Working Group, February.
http://www.ietf.org/
[5] RFC 1913, 1996, Architecture of the Whois++ Index Service.
(C. Weider, J. Fullton and S. Spero). Internet Engineering Task
Force, Network Working Group, February.
http://www.ietf.org/
Further Information
Cross-Searching Subject Gateways : The Query Routing and
Forward Knowledge Approach, D-Lib Magazine, January 1998
http://www.dlib.org/dlib/january98/01kirriemuir.html
FIND; The Internet Engineering Task Force, charter for FIND.
Last update April 1998.
http://www.ietf.org/html.charters/find-charter.html
"imesh-cip" mailing list: four relevant entries (besides
conference calls) dating back to October 1999.
http://www.mailbase.ac.uk/lists/imesh-cip/
IMesh Toolkit; IMesh Toolkit IDL Project Proposal
http://www.imesh.org/toolkit/proposal/
Lyngby architecture, M. Sandfær; Architectural options,
Renardus technical meeting, Lyngby 1999;
http://www.kb.nl/coop/reynard/restricted/architecture.ppt
RFC2652; MIME Object Definitions for the Common Indexing
Protocol (CIP)
http://www.faqs.org/rfcs/rfc2652.html
RFC2653; CIP Transport Protocols
http://www.faqs.org/rfcs/rfc2653.html