The IMesh Toolkit

OAI normalization tools - Introduction and Background

Metadata records in the Open Archives Initiative

The Open Archives Initiative (OAI) develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. It defines an interoperability framework with a mechanism for data providers to expose their metadata. Data providers manage a repository, which is a network-accessible server that exposes the metadata to harvesters. The technical infrastructure is specified in the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

The OAI-PMH defines a mechanism for harvesting XML-formatted metadata from repositories. It mandates that, at a minimum, repositories must be able to return records with metadata expressed in the Dublin Core format, without any qualification. Cooperation between the OAI and the Dublin Core Metadata Initiative has led to a common XML schema for unqualified Dublin Core that is available at http://dublincore.org/schemas/xmls/simpledc20020312.xsd. The requirements for conformance with the record format are explained (with examples) as part of the protocol specification. See:

http://www.openarchives.org/OAI/openarchivesprotocol.htm#Record
http://www.openarchives.org/OAI/openarchivesprotocol.htm#dublincore

Data providers are not restricted to one metadata format and may optionally provide multiple metadata sets, specific to their applications and domains. However, in this work we are concerned only with metadata records conforming to the above XML schema for unqualified Dublin Core.

Kinds of transformations required

Harvesters are client applications that issue OAI-PMH requests to collect metadata from repositories. Harvesters will typically be used by service providers to aggregate records from different data providers and to provide services based on the collected data.
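
As a rough illustration of what a harvester request looks like, the following Python sketch builds the URL for an OAI-PMH ListRecords request asking for unqualified Dublin Core (metadata prefix oai_dc). The function name list_records_url and the repository base URL are hypothetical and chosen only for this example; a real harvester would also fetch the response and follow resumption tokens.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    # verb and metadataPrefix are required arguments of a ListRecords request.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # set and from are optional arguments for selective harvesting.
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical repository base URL, for illustration only.
print(list_records_url("http://repository.example.org/oai"))
# prints http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```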
One example of a service provider is the Resource Discovery Network (RDN), which has taken a collaborative approach to the development of a network of subject gateways, each of which offers a variety of services to its subject-focused community. The RDN uses the Open Archives Initiative Protocol for Metadata Harvesting as a mechanism for sharing metadata records between those gateways in order to build cross-subject resource discovery services, as described in [1].

Although metadata records made available by data providers are required to conform with the prescribed schema, there is scope for local usage variations that do not break conformance. Within the RDN, there are agreed guidelines for the use of DC, but some divergence is allowed across the gateways for particular elements. For example, each gateway uses a subject-specific classification scheme. The aggregated records harvested by service providers may thus show inconsistencies arising from the (allowed) variation in local practice at the different sources (i.e. the data providers). Following harvesting, records may need to be normalised to make them more consistent, for example so that they all use the same dc:type vocabulary.
An example of a very simple transformation is one in which the
text content of a tag (or element) is replaced by another
piece of text. So in the case of the dc:type example, this may mean
replacing a dc:type tag containing the word 'pamphlet' with a dc:type
element which shows the type 'Text'.
More complex changes to records may introduce new element tags. Assuming the
country of origin of a web resource (described by the metadata record) can be
inferred from the URL of the web resource, then an additional <country> tag
can be added to records, based on the dc:identifier tag in the metadata
portion of the record. An example (based on RDN records) is the
following:
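
The kind of change described here can be sketched in Python with the standard library (a minimal illustration, not the RDN example itself). The record fragment is simplified, and the add_country function, the TLD_TO_COUNTRY table and the placement of the new element are assumptions made for this sketch; inferring a country from the top-level domain of the URL is only one possible heuristic.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical lookup table from top-level domain to a country label.
TLD_TO_COUNTRY = {"uk": "United Kingdom", "de": "Germany", "fr": "France"}


def add_country(metadata):
    """Append a <country> child inferred from the first dc:identifier URL."""
    identifier = metadata.find("{%s}identifier" % DC_NS)
    if identifier is None or not identifier.text:
        return metadata
    host = urlparse(identifier.text.strip()).hostname or ""
    # Take the last label of the host name, e.g. "uk" in www.example.ac.uk.
    tld = host.rsplit(".", 1)[-1].lower()
    country = TLD_TO_COUNTRY.get(tld)
    if country is not None:
        ET.SubElement(metadata, "country").text = country
    return metadata


# Simplified record fragment, for illustration only.
record = ET.fromstring(
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Example resource</dc:title>"
    "<dc:identifier>http://www.example.ac.uk/guide/</dc:identifier>"
    "</metadata>"
)
add_country(record)
print(record.find("country").text)  # prints United Kingdom
```

Records whose identifier has a top-level domain missing from the table are left unchanged, which keeps the transformation conservative when the country cannot be inferred.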
The process of manipulating the records to introduce consistency can be automated; below we describe two approaches to effecting transformations on records (using records from the RDN as an example).

Using the power of Regular Expressions in Perl to effect transformations

Perl has powerful pattern matching capabilities. Patterns are specified using regular expressions. Some of the transformations explained above can be thought of as simple string transformations based on pattern matching, and regular expressions can be used to express the changes required. Thus the transformation of <dc:type>Pamphlet</dc:type> to <dc:type>Text</dc:type> is a simple replacement of the string Pamphlet with the string Text. This can be achieved using the Perl substitution operator s/PATTERN/REPLACEMENT/. A successful match with PATTERN causes the matched portion to be replaced with REPLACEMENT. For example, s/pamphlet/Text/i replaces the string Pamphlet with the string Text; the i modifier makes the match case-insensitive.

To apply several of these transformations together, it is useful to be able to express them in a configuration file which the tool then reads and applies to the records. The element to which each transformation applies also needs to be indicated in the configuration file to allow selective changes. For instance, we would not wish an occurrence of the string 'pamphlet' within a <description> element to be replaced with the string Text, so we limit the substitution to the <type> element.

Although at one level we are treating the content of the record as simple strings (and as far as the text between elements is concerned this is generally correct), the record is itself a structured document encoded in XML. We can make use of the inherent structure of the document, for example to distinguish between text content and element content, by using Perl libraries for XML.
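
The idea of element-scoped substitution rules read from a configuration file can be sketched as follows. This is a Python illustration using only the standard library, not the Perl implementation, and the pipe-separated configuration format, the names parse_rules and apply_rules, and the simplified record are all assumptions made for this example. As with Perl's s/// without the g modifier, only the first match in each element is replaced.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical configuration format: one rule per line.
CONFIG = """
# element | pattern | replacement | flags
type | pamphlet | Text | i
"""


def parse_rules(config_text):
    """Turn configuration lines into (element, compiled pattern, replacement)."""
    rules = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        element, pattern, replacement, flags = [f.strip() for f in line.split("|")]
        re_flags = re.IGNORECASE if "i" in flags else 0
        rules.append((element, re.compile(pattern, re_flags), replacement))
    return rules


def apply_rules(record, rules):
    """Apply each rule only to the text of the Dublin Core element it names."""
    for element, pattern, replacement in rules:
        for node in record.iter("{%s}%s" % (DC_NS, element)):
            if node.text:
                # count=1 replaces the first match only.
                node.text = pattern.sub(replacement, node.text, count=1)
    return record


# Simplified record fragment, for illustration only.
record = ET.fromstring(
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:type>Pamphlet</dc:type>"
    "<dc:description>A pamphlet about harvesting.</dc:description>"
    "</record>"
)
apply_rules(record, parse_rules(CONFIG))
print(record.find("{%s}type" % DC_NS).text)         # prints Text
print(record.find("{%s}description" % DC_NS).text)  # prints A pamphlet about harvesting.
```

Parsing the record first and matching only against the text of the named element is what keeps the word 'pamphlet' in the description untouched. A pattern that itself contains the | character would need a different field separator in this hypothetical format.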
In our examples we are using the freely available XML::Simple module, an API for reading and writing XML.

Using XSLT

XSLT (Extensible Stylesheet Language Transformations) is part of the XSL group of W3C recommendations, and is designed for transforming one XML document into another. XSLT is a declarative language, in which the transformations required are expressed as a set of rules. The rules define what output should be generated when particular patterns occur in the input [2]. An XSLT processor applies an XSLT stylesheet to an XML document and produces a result document. A number of XSLT processors are available, some of them free. These include:
XSLT stylesheets are written in XML, and consist of a number of rules. Template rules are encoded as <xsl:template> elements and are triggered when a particular part of a source document is processed. By giving an expression (or pattern) in the match attribute, we can declare which parts of the document the template rule should be applied to. XPath expressions are also used in a number of ways in XSLT stylesheets to access or refer to parts of the source XML document, for example as path expressions.

The XSLT tools will thus consist of a number of stylesheets which can be applied to one or more OAI records, generating the modified OAI record expressed in XML. The transformations are written as rules in the stylesheets, and the stylesheet rules can be modified to effect different kinds of transformations.

Discussion of the two approaches

Read more about the tools that implement these ideas.

References

[1] http://www.rdn.ac.uk/publications/www10/oaiposter.pdf
[2] Kay, Michael. XSLT Programmer's Reference, Wrox