
The IMesh Toolkit

OAI normalization tools - Introduction and Background


Metadata records in the Open Archives Initiative

The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. It defines an interoperability framework with a mechanism for data providers to expose their metadata. Data Providers manage a repository, which is a network-accessible server that exposes the metadata to harvesters. The technical infrastructure is specified in the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

The OAI-PMH defines a mechanism for harvesting XML-formatted metadata from repositories. It mandates that, at a minimum, repositories must be able to return records with metadata expressed in the Dublin Core format, without any qualification. Cooperation between the OAI and the Dublin Core Metadata Initiative has led to a common XML schema for unqualified Dublin Core that is available at http://dublincore.org/schemas/xmls/simpledc20020312.xsd.

The requirements for conformance with record format are explained (with examples) as part of the protocol specification. See:
http://www.openarchives.org/OAI/openarchivesprotocol.htm#Record
http://www.openarchives.org/OAI/openarchivesprotocol.htm#dublincore

Data providers are not restricted to one metadata format and may optionally provide multiple metadata sets, specific to their applications and domains. However, in this work we are concerned only with metadata records conforming to the above XML schema for unqualified Dublin Core.

Kinds of transformations required

Harvesters are client applications that issue OAI-PMH requests to collect metadata from repositories. Harvesters will typically be used by service providers, to aggregate records from different data providers, and provide services based on the collected data.
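As a rough illustration of what a harvester request looks like, the sketch below builds an OAI-PMH ListRecords request URL in Python. The verb and the oai_dc metadata prefix come from the protocol. The repository address and the function name are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (nothing is fetched here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date   # optional: harvest only records changed since this date
    if set_spec is not None:
        params["set"] = set_spec     # optional: restrict the harvest to one set
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, used for illustration only.
url = list_records_url("http://repository.example.org/oai", from_date="2002-01-01")
print(url)
```

A service provider would issue requests of this shape to each data provider in turn and aggregate the records returned.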

One example of a service provider is the Resource Discovery Network (RDN) which has taken a collaborative approach to the development of a network of subject gateways, each of which offers a variety of services to its subject-focused community. The RDN uses the Open Archives Initiative Protocol for Metadata Harvesting as a mechanism for sharing metadata records between those gateways in order to build cross-subject resource discovery services, as described in [1].

Although metadata records made available by data providers are required to conform to the prescribed schema, there is scope for local usage variations (without breaking conformance). Within the RDN, there are agreed guidelines for use of DC, but some divergence is allowed across the gateways for particular elements. For example, each gateway uses a subject-specific classification scheme.

The aggregated records harvested by service providers may thus manifest some inconsistencies arising from the (allowed) variation in local practice at the different sources (i.e. the data providers). Following harvesting, records may need to be normalised to make them more consistent, for example so that they all use the same dc:type vocabulary.

An example of a very simple transformation is one in which the text content of an element is replaced by another piece of text. In the case of the dc:type example, this may mean replacing a dc:type element containing the word 'Pamphlet' with a dc:type element that shows the type 'Text'.
<dc:type>Pamphlet</dc:type> becomes
<dc:type>Text</dc:type>
A similar kind of change is one in which a Dewey notation is changed to a Dewey caption, e.g. <dc:subject>XXX.XX</dc:subject> (where XXX.XX represents the Dewey class number) becomes <dc:subject>Subject</dc:subject>
where Subject is the subject description corresponding to the class number.
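The replacement described above can be sketched as a table lookup on a parsed record. This is a minimal illustration in Python rather than one of the project's tools. The vocabulary table and the function name are assumptions made for the example, and a Dewey notation-to-caption change would follow the same pattern with a different table.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # keeps the dc: prefix when serialising

# Hypothetical mapping from local terms to a common vocabulary.
TYPE_MAP = {"pamphlet": "Text", "leaflet": "Text"}


def normalise_types(metadata):
    """Replace the text of each dc:type element that has an entry in TYPE_MAP."""
    for element in metadata.iter("{%s}type" % DC_NS):
        key = (element.text or "").strip().lower()
        if key in TYPE_MAP:
            element.text = TYPE_MAP[key]
    return metadata


record = ET.fromstring(
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:type>Pamphlet</dc:type>"
    "<dc:description>A pamphlet about fair trade.</dc:description>"
    "</metadata>"
)
normalise_types(record)
print(ET.tostring(record, encoding="unicode"))
```

Only the dc:type element is changed; the word 'pamphlet' in the description is left alone because the lookup is applied to dc:type elements only.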

More complex changes to records may introduce new element tags. Assuming the country of origin of a web resource (described by the metadata record) can be inferred from the URL of the web resource, then an additional <country> tag can be added to records, based on the dc:identifier tag in the metadata portion of the record. An example (based on RDN records) is the following:
A metadata record describing the web resource found at http://www.fairtrade.org.uk/unpeeling.htm might contain within the metadata section of the record the following tag showing the origin of the resource being described by the record: <identifier>http://www.fairtrade.org.uk/unpeeling.htm</identifier>
This would cause the addition of a new tag <country>GB</country>
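One way this inference might work is to map the top-level domain of the identifier URL to a country code. The Python sketch below assumes a small, deliberately incomplete lookup table and hypothetical function names. It is not the rule set used for RDN records.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical, incomplete table from top-level domain to country code.
TLD_TO_COUNTRY = {"uk": "GB", "de": "DE", "fr": "FR"}


def infer_country(url):
    """Return a country code for the URL's top-level domain, or None if unknown."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_TO_COUNTRY.get(tld)


def add_country(metadata):
    """Append a <country> element when a country can be inferred from <identifier>."""
    identifier = metadata.find("identifier")
    if identifier is not None and identifier.text:
        code = infer_country(identifier.text.strip())
        if code is not None:
            ET.SubElement(metadata, "country").text = code
    return metadata


metadata = ET.fromstring(
    "<metadata>"
    "<identifier>http://www.fairtrade.org.uk/unpeeling.htm</identifier>"
    "</metadata>"
)
add_country(metadata)
print(metadata.find("country").text)
```

Generic domains such as .org or .com say nothing about the country of origin, so in this sketch they yield no <country> element at all.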

The process of manipulating the records to introduce consistency can be automated; below we describe two approaches to effect transformations on records (using records from the RDN as an example).

Using the power of Regular Expressions in Perl to effect transformations

Perl has powerful pattern matching capabilities, with patterns specified as regular expressions. Some of the transformations explained above can be thought of as simple string transformations based on pattern matching, and regular expressions can be used to express the changes required. Thus the transformation of <dc:type>Pamphlet</dc:type> to <dc:type>Text</dc:type> is a simple replacement of the string Pamphlet with the string Text. This can be achieved using Perl's substitution operator, s/PATTERN/REPLACEMENT/ (written with a lower-case s). A successful match with PATTERN causes the matched portion to be replaced with REPLACEMENT. For example, s/pamphlet/Text/i substitutes the string Pamphlet with the string Text; the trailing i makes the match case-insensitive, which is why the lower-case pattern matches Pamphlet.
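For readers who want to try the idea without Perl, the following Python sketch gives a close counterpart of s/pamphlet/Text/i using the standard re module. The function name is an assumption made for the example. The count=1 argument mirrors Perl's behaviour of replacing only the first match when no g modifier is given.

```python
import re


def substitute_once(pattern, replacement, text):
    """Case-insensitive replacement of the first match, like Perl's s/.../.../i."""
    return re.sub(pattern, replacement, text, count=1, flags=re.IGNORECASE)


result = substitute_once(r"pamphlet", "Text", "<dc:type>Pamphlet</dc:type>")
print(result)  # <dc:type>Text</dc:type>
```

Applied to a whole record as one string, the same substitution would also change the word wherever else it appears, which is why the element being changed needs to be taken into account.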

To apply several of these transformations in a single run, it is useful to express them in a configuration file which the tool then reads and applies to the records. The element to which each transformation applies also needs to be indicated in the configuration file to allow selective changes: we would not wish an occurrence of the string 'pamphlet' within a <description> element to be replaced with the string Text, so we limit the substitution to the <type> element.
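A configuration-driven version of this idea can be sketched as follows. The JSON rule format, the field names and the function names are all assumptions made for illustration. The configuration format used by the actual tool is not described here.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical configuration: each rule names the element it applies to.
CONFIG_TEXT = """
[
  {"element": "type", "pattern": "^pamphlet$", "replacement": "Text"},
  {"element": "type", "pattern": "^leaflet$", "replacement": "Text"}
]
"""


def load_rules(config_text):
    """Parse the configuration into (element, compiled pattern, replacement) tuples."""
    rules = []
    for entry in json.loads(config_text):
        regex = re.compile(entry["pattern"], re.IGNORECASE)
        rules.append((entry["element"], regex, entry["replacement"]))
    return rules


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def apply_rules(root, rules):
    """Apply each rule only to the text of elements whose name matches the rule."""
    for element in root.iter():
        if element.text is None:
            continue
        for name, regex, replacement in rules:
            if local_name(element.tag) == name:
                element.text = regex.sub(replacement, element.text)
    return root


record = ET.fromstring(
    "<metadata>"
    "<type>Pamphlet</type>"
    "<description>A pamphlet about fair trade.</description>"
    "</metadata>"
)
apply_rules(record, load_rules(CONFIG_TEXT))
print(ET.tostring(record, encoding="unicode"))
```

Because every rule carries an element name, the <description> text keeps its wording while the <type> element is normalised.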

Although at one level we are treating the content of the record as simple strings (and in so far as the text between elements is concerned this is generally correct), the record is itself a structured document encoded in XML. We can make use of the inherent structure of the document, for example to distinguish between text content and element content, by using Perl libraries for XML. In our examples we are using the freely available XML::Simple module, an API for reading and writing XML.

Using XSLT

XSLT (Extensible Stylesheet Language Transformations) is part of the XSL group of W3C recommendations and is designed for transforming one XML document into another. XSLT is a declarative language, in which the transformations required are expressed as a set of rules. The rules define what output should be generated when particular patterns occur in the input.[2]

An XSLT processor applies an XSLT stylesheet to an XML document and produces a result document. A number of XSLT processors are available, some of them free. These include:

  • The Saxon XSLT processor by Michael Kay - a Java application.
  • Xalan from the Apache XML project.
  • Microsoft's MSXML 4.0 includes an XSLT processor.

XSLT stylesheets are written in XML, and consist of a number of rules. Template rules are encoded as <xsl:template> elements and are triggered when a particular part of a source document is processed. By giving an expression (or pattern) for the match attribute, we can declare which parts of the document the template rule should be applied to. XPath expressions are also used in a number of ways in XSLT stylesheets to access or refer to parts of the source XML document, for example as path expressions.

The XSLT tools will thus consist of a number of stylesheets which can be applied to one or more OAI records, generating the modified OAI record expressed in XML. The transformations are written as rules in the stylesheets, and the stylesheet rules can be modified to effect different kinds of transformations.

Discussion of the two approaches

Read more about the tools that implement these ideas

References

[1] http://www.rdn.ac.uk/publications/www10/oaiposter.pdf
[2] Kay, Michael. XSLT Programmer's Reference, Wrox