As mentioned above, there can be only one indexing pipeline, and configuring the indexing process amounts to writing an XSLT stylesheet which produces XML output containing the magic processing instructions or elements discussed in Section 2.5, “Canonical Indexing Format”. Obviously, there are millions of different ways to accomplish this task, and some comments and code snippets are in order to enlighten the wary.
Stylesheets can be written in the pull or the push style. Pull means that the output XML structure is taken as the starting point for the internal structure of the XSLT stylesheet, and portions of the input XML are pulled out and inserted into the right spots of the output XML structure. Push XSLT stylesheets, on the other hand, call their template definitions recursively. This process is driven by the input XML structure and produces output XML whenever certain conditions in the input XML are met. The pull type is well suited to input XML with a strong, well-defined structure and semantics, like the following OAI indexing example, whereas the push type might be the only possible way to sort out deeply recursive input XML formats.
A pull stylesheet example used to index OAI harvested records could use some of the following template definitions:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra-2.0"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <!-- Example pull and magic element style Zebra indexing -->
  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- disable all default recursive element node traversal -->
  <xsl:template match="node()"/>

  <!-- match only on oai xml record root -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
      <!-- you may use z:rank="{some XSLT function here}" -->
      <!-- explicitly calling defined templates; a bare xsl:apply-templates
           would stop at the empty node() template above -->
      <xsl:apply-templates select="oai:record/oai:header/*"/>
      <xsl:apply-templates select="oai:record/oai:metadata/oai_dc:dc/*"/>
    </z:record>
  </xsl:template>

  <!-- OAI indexing templates -->
  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier:0">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- etc, etc -->

  <!-- DC specific indexing templates -->
  <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
    <z:index name="dc_any:w dc_title:w dc_title:p dc_title:s">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- etc, etc -->

</xsl:stylesheet>
Note also that the names and types of the indexes can be defined dynamically in the indexing XSLT stylesheet, according to the content of the original XML records. This opens opportunities for great power and wizardry as well as grand disaster.
The following excerpt of a push stylesheet might be a good idea if you have strict control of the XML input format (due to rigorous checking against well-defined and tight RelaxNG or XML Schemas, for example):
<xsl:template name="element-name-indexes">
  <z:index name="{name()}:w">
    <xsl:value-of select="'1'"/>
  </z:index>
</xsl:template>
This template creates indexes named after the current node of any input XML file, and assigns the value '1' to each of them. The example query
find @attr 1=xyz 1
finds all files which contain at least one xyz XML element. If you cannot control which element names the input files contain, you are asking for disaster and bad karma with this technique.
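The excerpt above only defines a named template, so something has to call it. The following is a minimal sketch of how it could be wired into a complete push stylesheet, assuming that every element of the input should be visited; the surrounding templates are illustrative and not taken from the Zebra distribution:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra-2.0"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- wrap everything produced for one input document in a single record -->
  <xsl:template match="/">
    <z:record>
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- push style: every element triggers the named template,
       then recursion continues into its children -->
  <xsl:template match="*">
    <xsl:call-template name="element-name-indexes"/>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- the named template from the excerpt above; xsl:call-template keeps
       the current node, so name() is the name of the matched element -->
  <xsl:template name="element-name-indexes">
    <z:index name="{name()}:w">
      <xsl:value-of select="'1'"/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
```

Since name() returns the qualified name, prefixed elements would yield index names containing the prefix and a colon; with namespaced input, local-name() may be the more sensible choice.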
One variation on the theme of dynamically created indexes is definitely unwise:
<!-- match on oai xml record root -->
<xsl:template match="/">
  <z:record>

    <!-- create dynamic index name from input content -->
    <xsl:variable name="dynamic_content">
      <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
    </xsl:variable>

    <!-- create zillions of indexes with unknown names -->
    <z:index name="{$dynamic_content}:w">
      <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
    </z:index>

  </z:record>
</xsl:template>
Don't be tempted to play overly clever tricks with the power of XSLT. The above example will create zillions of indexes with unpredictable names, resulting in severe Zebra index pollution.
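If index names really do need to depend on the input, one safer pattern is to map the input onto a small, fixed set of names, so that the set of indexes stays bounded no matter what the records contain. The following is a minimal sketch under that assumption. The index name dc_creator and the restriction to Dublin Core elements are chosen for illustration only:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra-2.0"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- match on oai xml record root and pull out the DC elements only -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
      <xsl:apply-templates select="oai:record/oai:metadata/oai_dc:dc/dc:*"/>
    </z:record>
  </xsl:template>

  <!-- map each DC element onto one of a fixed set of index names -->
  <xsl:template match="dc:*">
    <xsl:variable name="index_name">
      <xsl:choose>
        <xsl:when test="local-name() = 'title'">dc_title</xsl:when>
        <!-- dc_creator is a hypothetical index name -->
        <xsl:when test="local-name() = 'creator'">dc_creator</xsl:when>
        <!-- everything else ends up in one catch-all index -->
        <xsl:otherwise>dc_any</xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <z:index name="{$index_name}:w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
```

The index name is still computed at indexing time, but it can only ever take one of three values.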
It can be very hard to debug a DOM filter setup due to the many successive MARC syntax translations, XML stream splitting and XSLT transformations involved. As an aid, you always have the power of the -s command line switch to the zebraidx indexing command at hand:
zebraidx -s -c zebra.cfg update some_record_stream.xml
This command line simulates indexing and dumps a lot of debug information in the logs, telling exactly which transformations have been applied, what the documents look like after each transformation, and which record ids and terms are sent to the indexer.