The IMesh Toolkit

Common Indexing Protocol (CIP)

Overall Purpose

The role of the Common Indexing Protocol or CIP is to pass information about the contents of a record between servers and so facilitate access by clients to the data they seek at a later point. This process of referring or replicating queries is known as query routing and is designed to reduce server overload. The latter is a frequent consequence in a system which merely broadcasts searches across a distributed network without a mechanism to direct searching in any fashion. [1]

CIP implements index passing, providing the forward knowledge necessary to generate the referrals used for query routing. Query routing therefore directs queries towards the repositories most likely to hold the desired records by referring to the indexing information created. [2] CIP operates on the principle of index summaries, otherwise known in this context as centroids. A centroid is a list of tuples (i.e. template name, attribute name and token) from which all duplicate token values have been removed. [1] Indiscriminate cross-searching by multiple subject gateways increases the likelihood of server overload, hence the usefulness of CIP, centroids and Tagged Index Objects.

Unlike a centroid, the Tagged Index Object has the capacity to support the exchange of index update information, although the size of such an index object is inevitably greater. It labels items of information with identifiers that relate individual object attributes back to the object as a whole, i.e. a process of tagging. This enables an index server to direct a specific query to the correct information server more effectively than a centroid, which does not support incremental updates natively.

These elements enable the system to forward queries only to those services identified as likely to hold a requested record. However, in adopting query routing, it is incumbent upon developers to avoid both the circular routing of queries and their transmission to mirrors of the same metadata. Otherwise the user will inevitably experience duplicate hits. [3]
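The centroid idea described above can be illustrated with a minimal sketch. It is not taken from the CIP specifications: the record layout, the whitespace tokenisation, the lower-casing and the names build_centroid and route_query are all assumptions made for illustration, and a set is used in place of a list simply to drop duplicates.

```python
from typing import Dict, Iterable, List, Set, Tuple

# One centroid entry: (template name, attribute name, token).
CentroidEntry = Tuple[str, str, str]
# Hypothetical record layout: (template name, {attribute name: value}).
Record = Tuple[str, Dict[str, str]]


def build_centroid(records: Iterable[Record]) -> Set[CentroidEntry]:
    """Summarise records as unique (template, attribute, token) tuples."""
    centroid: Set[CentroidEntry] = set()
    for template, attributes in records:
        for attribute, value in attributes.items():
            # Assumption: tokens are lower-cased, whitespace-separated words.
            for token in value.lower().split():
                centroid.add((template, attribute, token))
    return centroid


def route_query(centroids: Dict[str, Set[CentroidEntry]],
                template: str, attribute: str, token: str) -> List[str]:
    """Return only the servers whose centroid contains the requested token."""
    wanted = (template, attribute, token.lower())
    return sorted(name for name, entries in centroids.items() if wanted in entries)


london_records = [
    ("USER", {"name": "Ada Lovelace", "city": "London"}),
    ("USER", {"name": "Ada Byron", "city": "London"}),
]
paris_records = [("USER", {"name": "Marie Curie", "city": "Paris"})]

centroids = {
    "server-london": build_centroid(london_records),
    "server-paris": build_centroid(paris_records),
}
targets = route_query(centroids, "USER", "name", "Curie")
print(targets)
```

The two London records share the tokens "ada" and "london", so each appears only once in that server's centroid, and a query for the name "Curie" is referred to a single server rather than broadcast to both.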

Indeed CIP was embedded in version 1 of Whois++ (a lightweight extension of Whois, a rather restricted white pages directory service) and was not referred to at that point as CIP, but came to be abstracted from it in subsequent versions. In version 1 it could only support ISO-8859-1 characters and the centroid index object type. CIP seeks to disaggregate the indexing elements of Whois++ and to turn the ad hoc data representation of the data access protocol into MIME-specification-based structures. Whilst there was development in version 2 of Whois++ and the use of its centroid by Bunyip Information Systems' Digger Software, it is in version 3 of CIP that major changes have occurred to the protocol. It should be appreciated that CIP is a "backend" protocol; it is implemented in the context of network servers which must themselves employ some form of data access protocol to communicate with clients. During query resolution in the native protocol implementation, the server will refer to the indexing information collected by the CIP implementation for guidance on how to route the query. Data access protocols used with CIP must have some provision for control information in the form of a referral.

Brief Overview of Functionality

In other words, a distinction should be drawn between data access protocols such as Whois++, LDAP, HTTP and Z39.50 and an indexing protocol such as CIP. Whilst it is perfectly possible for the former to provide information from a local dataset, they cannot refer to records held externally since they do not possess any external indices. It is the task of an indexing protocol such as CIP to furnish external indices obtained from peer servers. However, the distinction should also be drawn between the referrals that CIP-equipped servers can provide and the client's task of deciding what to do with the referrals it receives. Some may see the need to do this, in the case of an actual human end-user, as something of an imposition. Others might argue it maintains user independence and control.

Note also that not all queries will be served by one type of index (say, for example, where a keyword search is looking for two or more words in close proximity in a text, or for a keyword in a title). Indices therefore must be specific to the application domain. Central to the CIP protocol is the CIP index object, which comprises the header and the payload. The latter holds the actual index and is invisible to CIP itself, defined as it is by the index object specification that relates to the object's MIME type. The header, which varies in type, contains the metadata required to process and use the index object being passed. The system that passes these indices between servers (over reliable transport mechanisms such as TCP or Internet mail messages) is a CIP mesh. It is arranged, typically, in a hierarchical tree. Those servers closest to the root of this tree hold the larger and more comprehensive indices. However, there is no defined structure as such and indeed a degree of flexibility, particularly in respect of lateral links, is to be encouraged in order to obviate the threat of referral loops. One assumption may nonetheless be made: all indices passed across a CIP mesh are of the same type, i.e. as determined by the CIP index object specification. While it may be possible to realise gateways between meshes which are passing index objects of different types, this is as yet far from defined.
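The split between header and opaque payload, and the assumption that a mesh carries a single index object type, might be sketched as follows. This is an illustrative model only: the field names object_type, dsi and payload, the MeshServer class and the toy parser are hypothetical and do not reproduce the header fields or MIME definitions of the CIP specifications.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class IndexObject:
    object_type: str  # hypothetical header field naming the index object type
    dsi: str          # hypothetical header field: originating dataset identifier
    payload: bytes    # the actual index; opaque at this layer


class MeshServer:
    """Toy server that accepts only the index object type its mesh carries."""

    def __init__(self, mesh_type: str) -> None:
        self.mesh_type = mesh_type
        self.parsers: Dict[str, Callable[[bytes], object]] = {}
        self.indices: Dict[str, object] = {}

    def register_parser(self, object_type: str,
                        parser: Callable[[bytes], object]) -> None:
        # The parser stands in for the index object specification.
        self.parsers[object_type] = parser

    def receive(self, obj: IndexObject) -> bool:
        # Only the header is inspected here; the payload is handed on untouched.
        if obj.object_type != self.mesh_type:
            return False
        parser = self.parsers.get(obj.object_type)
        if parser is None:
            return False
        self.indices[obj.dsi] = parser(obj.payload)
        return True


server = MeshServer(mesh_type="centroid")
server.register_parser("centroid", lambda raw: set(raw.decode("utf-8").split()))

accepted = server.receive(IndexObject("centroid", "dsi-1", b"ada london"))
rejected = server.receive(IndexObject("tagged", "dsi-2", b"ada london"))
print(accepted, rejected)
```

The design point is that the receiving code never interprets the payload itself; it dispatches on the type named in the header, which is why a mesh mixing object types would need some form of gateway.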

Within the CIP mesh servers cooperate to pass the indexing information around the mesh so that forward knowledge is correctly placed and query routing is indeed effective. In common with Whois++, there exists between two such servers what is termed a polling relationship. The server that holds the data of interest, and creates an index, is termed the polled server; the polling server is that which collects the index generated by the polled server. Another relationship, which is termed an index pushing relationship, still operates between peers, but the polling server is permitted to refuse, accept, accept and discard or merely accept portions of the generated indices via filtering.

Deployment

The advantage to this index pushing relationship is that leaf nodes wishing to participate in the mesh by making their index available, are not obliged to support the complete CIP protocol, thus lowering the entry point for participation whilst encouraging the generation of more information.

One should recall the principle of the centroid, in which redundant or duplicate values are eliminated. Whilst a CIP server is able to pass on an index object unaltered, there is a benefit to the system if it is able to compress or aggregate two or more objects before onward transmission. The balance must be maintained between the elimination through compression of duplicate data and the loss of data that would provide hints useful to routing enquiries through excessive compression. In effect the trade-off between excessive aggregation and aggressive referral chain optimisation can only be analysed within its particular application domain and after actual operation. The degree of error considered acceptable must be decided on a per-application-domain basis.
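The trade-off can be made concrete with a small sketch, assuming centroids are modelled as sets of (template, attribute, token) tuples. The function name aggregate and the sample data are hypothetical.

```python
from typing import Dict, Set, Tuple

Entry = Tuple[str, str, str]  # (template name, attribute name, token)


def aggregate(child_centroids: Dict[str, Set[Entry]]) -> Set[Entry]:
    """Merge several centroids into one, dropping duplicate entries."""
    merged: Set[Entry] = set()
    for entries in child_centroids.values():
        merged |= entries
    return merged


children = {
    "leaf-1": {("DOC", "title", "mesh"), ("DOC", "title", "index")},
    "leaf-2": {("DOC", "title", "index"), ("DOC", "title", "protocol")},
}

merged = aggregate(children)
entries_before = sum(len(entries) for entries in children.values())
entries_after = len(merged)

# Before aggregation it is still known which leaves hold a given token.
holders = sorted(name for name, entries in children.items()
                 if ("DOC", "title", "index") in entries)
print(entries_before, entries_after, holders)
```

The merged object is smaller, but a server holding only the merged set can no longer tell which leaf holds "index" or "protocol"; it can refer a query only to the aggregating server, which lengthens the referral chain by one step.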

However, things get significantly more difficult when CIP is employed in a multi-protocol application domain. The essential difficulty is to avoid forcing a referral chain to pass through part of the mesh which does not support the protocol by which that client initiated the query. If this occurs, the client loses access to any hits beyond that point in the referral chain; it cannot resolve the referral in its native data access protocol, resulting in a failure of query routing.

A further point of interest is raised in the context of the referral. When sending the Dataset Identifier (DSI), it is recommended that the DSI-Description header be transmitted as well, where possible. (A DSI-Description is a human-readable string optionally carried along with DSIs to make them more user-friendly.) It gives the client the opportunity to check with a user prior to chasing the referral and is the clearest representation of the DSI that CIP offers. [1]

In the same context, if a client is programmed to follow all referrals that it receives, this throws up the issue of how to avoid an infinite loop of referrals. (Since the mesh can be defined as a graph of CIP servers that may have cycles, that could cause such loops). The solution lies in some form of mesh traversal algorithm such as that documented for use with Whois++. [4] By replacing the Whois++ handle, (or unique identifier), with the Dataset Identifier of CIP, (version 3), the same algorithm will operate with CIP. [1]
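A minimal sketch of loop avoidance follows, written in the spirit of that approach rather than as a transcription of the documented algorithm. It assumes the client simply remembers every DSI it has already queried; the function follow_referrals and the dictionary standing in for the mesh are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def follow_referrals(entry_dsi: str, referrals: Dict[str, List[str]]) -> List[str]:
    """Visit each dataset at most once, even if the mesh contains cycles.

    `referrals` maps a DSI to the DSIs that server would refer the client to;
    in a real client these would arrive as responses to queries.
    """
    visited: Set[str] = set()
    order: List[str] = []
    queue = deque([entry_dsi])
    while queue:
        dsi = queue.popleft()
        if dsi in visited:
            continue  # already queried: ignore the referral, breaking the loop
        visited.add(dsi)
        order.append(dsi)
        queue.extend(referrals.get(dsi, []))
    return order


# A mesh with cycles: A refers to B and C, B back to A, C back to A.
mesh = {"A": ["B", "C"], "B": ["C", "A"], "C": ["A"]}
queried = follow_referrals("A", mesh)
print(queried)
```

Because the visited set is keyed on the DSI, referrals that reach the same dataset by different routes are followed only once.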

Equally CIP v.3 benefits from the polling process as described in the Whois++ architecture, [5] since it permits the expansion of the client search beyond the entry point such that it can access the hints from those servers polled by the server at the entry point to the mesh. The likelihood of search failure through excessive concentration on a single entry point server is consequently much reduced.

Whichever data access protocol one adopts to work with CIP, a clearly defined mapping is called for in order to map queries in the native protocol to searches against an index object. The resultant mapping rules may vary according to domain. Theoretically the number of mappings required is the product of the number of protocols and the number of domains involved, which is a possible scaling problem. However, in reality some protocols will prove wholly inappropriate for certain domains and so the overall total will be lower. Nonetheless, whereas the mapping between CIP and Whois++ is very close (thanks to the evolution of CIP from Whois++ as described above), it will be necessary to ensure that mappings between CIP and existing or emerging protocols are specified in the index object specification.
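The scaling argument amounts to simple counting, sketched below. The protocol names come from this review; the domain names and the pair marked as unsuitable are invented purely for illustration.

```python
from itertools import product

protocols = ["Whois++", "LDAP", "Z39.50"]
domains = ["white-pages", "bibliographic"]    # hypothetical application domains
unsuitable = {("LDAP", "bibliographic")}      # hypothetical exclusion

# Theoretical upper bound: one mapping per (protocol, domain) pair.
upper_bound = len(protocols) * len(domains)

# In practice, pairs judged inappropriate need no mapping at all.
needed = [pair for pair in product(protocols, domains) if pair not in unsuitable]
print(upper_bound, len(needed))
```

Adding one protocol raises the upper bound by the number of domains, which is why the total grows quickly unless many pairs can be ruled out.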

However, three security issues also surround CIP and may present difficulties. Unless carefully controlled, indexing information may leak unacceptable amounts of proprietary information. Furthermore, not only does CIP itself require external security services in order to operate securely, it can also be misused in order to transmit incorrect information, (for example, CIP does not support any trust method capable of filtering false data which could be inserted into the server mesh for propagation). Therefore any use of CIP with a database on which resided certain data of a restricted nature would need careful consideration. Whilst it would be important to create an index which summarises the information held, it is equally important to protect the integrity of the database information. This should be borne in mind when developing a new index object type.

Related Standards

Z39.50 [UKOLN Z39.50 review]

Z39.50 is a powerful searching tool using a generalised search syntax. It has the capacity to facilitate distributed applications and retrieve structured data from remote heterogeneous databases. In its terminology it describes a client as an "origin" and a server as a "target" (of the origin's requests). Origin and target communicate by PDUs (Protocol Data Units), which largely operate in pairs, request and response.

Its key functionality can be summarised as a series of services:
Initialisation : which seeks to set up the association between origin and target.
Search : involving the passing of a Z39.50 compliant query, its use to search a database and any subsequent storage of any results. It should be noted that the target sends not the records themselves to the origin but details of those records.
Present : this permits the origin or user to request those records or a subset of them.

Authentication, record deletion and the use of resources within the current client/server dialogue are also addressed. Later versions include Explain, which permits the client to retrieve information on server-side components, and the retention of current results for later use.

Z39.50 is not without its difficulties. It would no longer be entirely true to characterise the development of Z39.50 as divided across two continents, North America and Europe, and across two environments, TCP/IP (Transmission Control Protocol/Internet Protocol) and OSI (Open Systems Interconnection) respectively. However, the earliest implementations of Z39.50 were thus divided and suffered from a lack of interoperability as a result. Part of the motivation of the EUROPAGATE Project was to create relevant solutions to the problem.

Other projects and initiatives in a European context are IRIS, a functioning service in Eire, and projects such as DALI (Document and Library Integration), funded by European Libraries Programme, Pica, (Holland). Of note are also SOCKER (SR Origin Communication Kernel) and PARAGON, both coordinated by UNI-C (Denmark), ONE (OPAC Network in Europe) and a German national project, DBV-OSI II. Z39.50 is employed commercially in the following products: Index+, Site Search and MetaStar, and as freeware in ASF, Cheshire II and Isite.

Despite complaints that Z39.50 can be long and costly to implement, it is seen by its supporters as an application protocol that is capable of "gluing" together the various components of a distributed network architecture whether characterised by the MODELS Information Architecture or other systems.

LDAP [UKOLN LDAP review]

The Lightweight Directory Access Protocol evolved to meet the need for a less bulky and resource-consuming alternative to the X.500 Directory Access Protocol. It can run directly on top of TCP/IP and employs simpler encoding than X.500 DAP. It could be argued that interest has waned in the protocol since the appearance of more powerful PCs, but this would be an over-simplification, for LDAP has regained a degree of acceptance and some users report significant activity with it. In its purest form, an LDAP scenario greatly resembles Whois++ in its generation of referrals to the likeliest servers for the user. LDAP is employed by the ISAAC Project based at the University of Wisconsin-Madison.

Whois++ [UKOLN WHOIS++ review]

Based on Whois, a rather restricted white pages directory service, Whois++ is a lightweight extension offering cross-searching over a distributed network of databases including multiple gateways. It is designed to function as a simple lookup service but with a degree of flexibility that avoids imposing constraints upon developers.

In its evolution beyond the original Whois, this protocol has acquired more advanced search processes enhanced by the addition of global and local constraints and the use of Boolean operators. Further options include languages other than English, additional character sets and, most importantly, the use of structured data to make searching more effective. The structured data is in the form of an information template which is central to the Whois++ operation.

Its close relationship to the Common Indexing Protocol is a major strength of Whois++. Indeed CIP was embedded in version 1 of Whois++ and came to be abstracted from it in subsequent versions. Consequently there is a particularly close mapping whereas other pre-existing protocols may require more work to collaborate with CIP. Equally if the Whois++ handle is substituted by the DSI, (Dataset Identifier), the original Whois++ mesh traversal algorithm can operate unchanged with CIP.

Relevance to IMesh context

Perhaps the principal interest and relevance to the IMesh Toolkit project lies in the advantages derived from CIP with regard to avoiding overloading during search requests. This represents an even greater threat where subject gateways are operating in an international context with the possible collaboration of a plethora of repositories and the attendant increase in search requests and end users. This is where query routing is able to provide a possible solution to this potential threat. The use of centroids and Tagged Index Objects in sending queries to services most likely to hold a desired record reduces the likelihood of circular routing of queries.

References

[1] RFC2651; The Architecture of the Common Indexing Protocol (CIP);
http://www.faqs.org/rfcs/rfc2651.html

[2] DESIRE Information Gateways Handbook
http://www.desire.org/handbook/

[3] Renardus Project Technical Standards Report
http://nwi.dtv.dk/RENARDUS/D2.1/RDF.html#S3

[4] RFC 1914, 1996, How to interact with a Whois++ Mesh. (P. Faltstrom, R. Schoultz and C. Weider). Internet Engineering Task Force, Network Working Group, February.
http://www.ietf.org/

[5] RFC 1913, 1996, Architecture of the Whois++ Index Service. (C. Weider, J. Fullton and S. Spero). Internet Engineering Task Force, Network Working Group, February.
http://www.ietf.org/

Further Information

Cross-Searching Subject Gateways : The Query Routing and Forward Knowledge Approach, D-Lib Magazine, January 1998
http://www.dlib.org/dlib/january98/01kirriemuir.html

FIND; The Internet Engineering Task Force, charter for FIND. Last update April 1998.
http://www.ietf.org/html.charters/find-charter.html

"imesh-cip" mailing list: four relevant entries (besides conference calls) dating back to October 1999.
http://www.mailbase.ac.uk/lists/imesh-cip/

IMesh Toolkit IDL Project Proposal
http://www.imesh.org/toolkit/proposal/

Lyngby architecture, M. Sandfær; Architectural options, Renardus technical meeting, Lyngby 1999;
http://www.kb.nl/coop/reynard/restricted/architecture.ppt

RFC2652; MIME Object Definitions for the Common Indexing Protocol (CIP)
http://www.faqs.org/rfcs/rfc2652.html

RFC2653; CIP Transport Protocols
http://www.faqs.org/rfcs/rfc2653.html
