					NISO Metasearch XML Gateway Implementors Guide
                     Version 0.1
                    March 17, 2010

                     Ralph LeVan
                    OCLC Research
1 Overview
The NISO Metasearch Initiative Committee has been charged with identifying and/or
developing standards and best practices to improve interoperability between metasearch
engines and content providers. One of the goals of the committee has been to identify a
search-and-retrieve standard. The requirement for the search-and-retrieve standard was
for an “XML gateway”. Functionally, this was taken to mean that URLs with a standard
format would be submitted to the content provider and an XML response, conformant to
a standard schema, would be returned. A standard query grammar was explicitly
declared out-of-scope. This document describes how to construct the NISO Metasearch
XML Gateway (MXG) URL and response.

2 Incoming URL
URLs will be of the form:
      http://server/context?version=<version>&query=<query>&<optionalParms>

For example, http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query=book is a
compliant URL.

2.1 Parameters
2.1.1 Structure
URL parameters are of the form name=value and are separated by '&'. The example
above has two parameters, version and query.

2.1.2 <version>
The value for the version parameter will be defined as 1.1.

The version parameter is mandatory in all MXG URLs.

2.1.3 <query>
The value for the query parameter can be an arbitrary string within double quotes or a
single word/token (without blanks). Double quotes that are part of the search itself must
be escaped with a preceding backslash.

The query parameter is mandatory in all MXG URLs.
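
The quoting rule above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the MXG specification: the function name format_query is hypothetical, and escaping backslashes in addition to double quotes is an assumption borrowed from the phrase rules in the compliance section of this guide.

```python
def format_query(text: str) -> str:
    """Return a value for the MXG query parameter (before URL encoding).

    A single token with no blanks, quotes or backslashes is passed through
    unchanged. Anything else is wrapped in double quotes, and embedded
    double quotes and backslashes are escaped with a preceding backslash.
    """
    needs_quoting = (
        text == ""
        or any(ch.isspace() for ch in text)
        or '"' in text
        or "\\" in text
    )
    if not needs_quoting:
        return text
    # Escape backslashes first so the backslashes added for quotes
    # are not escaped a second time.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'
```

Under these assumptions, format_query("book") is returned unchanged, while a multi-word search such as "book publishing" comes back wrapped in double quotes.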

2.1.3.1 Result Set IDs
The server may provide a resultSetId. This is a name that has been provided so that a
resultset may be referenced after the query response. Not all servers support resultsets.
Result set IDs are typically used to retrieve more records from an existing result set. The
mechanism for doing this is to submit a new URL with the resultset ID in the query. The
syntax for this query is:
        query=cql.resultSetId=<resultSetId>
Where <resultSetId> is replaced with the value of the resultSetId element in the
searchRetrieveResponse. An example of a URL with a resultset reference is:
       http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query=cql.resultSetId=abc123&startRecord=2&maximumRecords=1
This example asks for the second record in a resultset.

2.1.4 <optionalParms>
There are two optional parameters defined in the MXG URL, startRecord and
maximumRecords.


2.1.4.1 startRecord
The client can specify which record from the result set should be the first record returned.
The value 1 specifies the first record. If omitted, the server may choose any value, but
the recommended default value is 1.


2.1.4.2 maximumRecords
The client can specify the maximum number of records to be returned. The server
may return fewer records. If omitted, the server may choose any value, including zero.
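
How a server might apply startRecord and maximumRecords to a result set can be sketched as follows. This is only a sketch under stated assumptions: the name select_page is hypothetical, the default of 10 for maximumRecords is an arbitrary choice (the guide lets the server pick any value, including zero), and returning None when no records remain is an assumption of this example. The second return value corresponds to the nextRecordPosition element that appears in the sample response later in this guide.

```python
from typing import List, Optional, Sequence, Tuple

DEFAULT_START_RECORD = 1       # the default recommended by this guide
DEFAULT_MAXIMUM_RECORDS = 10   # assumption: any value is allowed


def select_page(
    result_set: Sequence[str],
    start_record: Optional[int] = None,
    maximum_records: Optional[int] = None,
) -> Tuple[List[str], Optional[int]]:
    """Return (records to send, position of the next unsent record or None)."""
    start = DEFAULT_START_RECORD if start_record is None else start_record
    limit = DEFAULT_MAXIMUM_RECORDS if maximum_records is None else maximum_records
    if start < 1 or limit < 0:
        raise ValueError("startRecord must be >= 1 and maximumRecords >= 0")
    first = start - 1  # startRecord is 1-based, Python slices are 0-based
    page = list(result_set[first:first + limit])
    next_position = start + len(page)
    if next_position > len(result_set):
        return page, None  # nothing left after this page
    return page, next_position
```

A request with startRecord=2 and maximumRecords=2 against a five-record result set would yield the second and third records and a next position of 4.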

2.2 Examples
http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query=book sends a
single-term search and doesn't specify how many records should be returned. In this case,
the server defaults to returning zero records.

To ask for a single record from that server, send the URL:
http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query=book&maximumRecords=1

The next record in the result set can be retrieved with the URL:
http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query=book&startRecord=2&maximumRecords=1

A more complex query can be seen in this URL:
http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query="book publishing"
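
A client could assemble URLs like the ones above with a small helper. The following Python sketch is illustrative only: build_mxg_url is a hypothetical name, and percent-encoding the parameter values (so that blanks and double quotes travel safely in the URL) is an assumption of this example; the URLs in this guide are shown unencoded for readability.

```python
from typing import Optional
from urllib.parse import quote, urlencode

MXG_VERSION = "1.1"  # the only version value defined by this guide


def build_mxg_url(
    base_url: str,
    query: str,
    start_record: Optional[int] = None,
    maximum_records: Optional[int] = None,
) -> str:
    """Build an MXG request URL with the two mandatory parameters
    (version, query) and whichever optional parameters were supplied."""
    params = [("version", MXG_VERSION), ("query", query)]
    if start_record is not None:
        params.append(("startRecord", str(start_record)))
    if maximum_records is not None:
        params.append(("maximumRecords", str(maximum_records)))
    # quote_via=quote encodes blanks as %20 rather than '+'.
    return f"{base_url}?{urlencode(params, quote_via=quote)}"
```

Calling build_mxg_url("http://alcme.oclc.org/MXG/search/ORPubs", "book") reproduces the first example URL in this section; a phrase query is sent with its double quotes and blanks percent-encoded.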

3 Outgoing XML record
The XML record returned by MXG is an instance of a searchRetrieveResponse as
defined in the NISO SRW/U standard. The schema that defines the
searchRetrieveResponse, SRW Types, can be found at
http://www.loc.gov/z3950/agency/zing/srw/srw-types.xsd. It imports a schema that
defines the structure of diagnostic messages. There is also a Web Service Description
Language (WSDL) definition of the SRW service
(http://www.loc.gov/z3950/agency/zing/srw/srw-bindings.wsdl). If you are
familiar/comfortable with tools that automatically generate XML from schemas, then you
might want to consider looking into them. This guide will assume that you are hand-
constructing the XML response.

Here is a searchRetrieveResponse that includes optional resultset information and a
returned Dublin Core record:
        <?xml version="1.0" ?>
        <?xml-stylesheet type="text/xsl" href="/MXG/searchRetrieveResponse.xsl"?>
        <searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
           <version>1.1</version>
           <numberOfRecords>30</numberOfRecords>
           <resultSetId>717zar</resultSetId>
           <resultSetIdleTime>30</resultSetIdleTime>
           <records>
             <record>
                <recordSchema>info:srw/schema/1/dc-v1.1</recordSchema>
                <recordPacking>xml</recordPacking>
                <recordData>
                  <srw_dc:dc xmlns="http://www.w3.org/TR/xhtml1/strict"
                     xmlns:dc="http://purl.org/dc/elements/1.1/"
                     xmlns:srw_dc="info:srw/schema/1/dc-v1.1">
                     <dc:identifier>rrl1234</dc:identifier>
                     <dc:title>Dog and Cat</dc:title>
                     </srw_dc:dc>
                  </recordData>
                <recordPosition>1</recordPosition>
                </record>
             </records>
           <nextRecordPosition>2</nextRecordPosition>
           <echoedSearchRetrieveRequest>
             <version>1.1</version>
             <query>cql.any = &quot;dog&quot;</query>
             </echoedSearchRetrieveRequest>
           </searchRetrieveResponse>

The first line of the response is the declaration that this is an XML record. It is
mandatory.

The second line is an optional stylesheet reference. Its presence allows browsers to
render the XML response into something more pleasant. In the case of the example
above, a searchable user interface is rendered. The complete URL for that stylesheet is
http://alcme.oclc.org/MXG/searchRetrieveResponse.xsl. You'll need to provide a pointer
to your own stylesheet if you use this feature.

The third line is the actual searchRetrieveResponse. It includes a namespace attribute
and specifies that that namespace (http://www.loc.gov/zing/srw/) is the default
namespace for this element and all subelements. You can change the namespace prefix
in this attribute if you want to. (If you don't know what a namespace prefix is, don't
worry about it.)

The first subelement is the version element. It is mandatory and its only legal value is
“1.1” (without the quotes).

The next subelement is the numberOfRecords element. It is mandatory and its value
should be a non-negative integer. It will contain the count of the number of records that
satisfy the query.

3.1 Result Set Elements
Next come a pair of optional elements for result sets.

  <resultSetId>717zar</resultSetId>
  <resultSetIdleTime>30</resultSetIdleTime>

If the server generates resultsets that can be referenced after the query is complete, then
this is where it will specify the name for the resultset and indicate how long the resultset
will remain available.

The first element is <resultSetId>. This is an arbitrary string. It can contain anything
that is valid in XML content. (Avoid angle brackets, quotes, apostrophes and
ampersands.)

The second element is <resultSetIdleTime>. If omitted, it means that the server is not
making any promises about how long the result set will be available. If present, it will
contain a non-negative number whose value is the number of seconds that the resultset
will remain available after its last use. Essentially, a countdown timer is started each
time the resultset is used. When it reaches zero, the resultset can be thrown away. Each
time the client references the resultset before the timer reaches zero, the timer
should be reset.

A resultset idle time is not a guarantee; it is a promise of best effort. The server is always
permitted, as necessary, to throw resultsets away arbitrarily. If a resultset that no longer
exists is later referenced, then the server should issue a diagnostic. (See the section
below on diagnostics.)
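
The idle-time behavior described above can be sketched as a small server-side store. This is a minimal sketch, not a prescribed implementation: the class name ResultSetStore and its methods are hypothetical, result sets are held in memory, and the clock is injectable only so the behavior can be exercised without waiting. When get returns None, the caller would be expected to issue the "resultset does not exist" diagnostic described in the diagnostics section.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class ResultSetStore:
    """Keeps result sets until they have been idle for idle_time seconds."""

    def __init__(
        self,
        idle_time: float = 30,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.idle_time = idle_time
        self._clock = clock
        # result set ID -> (expiry time, records)
        self._sets: Dict[str, Tuple[float, List[str]]] = {}

    def put(self, result_set_id: str, records: List[str]) -> None:
        """Store a new result set and start its countdown timer."""
        self._sets[result_set_id] = (self._clock() + self.idle_time, list(records))

    def get(self, result_set_id: str) -> Optional[List[str]]:
        """Return the records and reset the timer, or None if unavailable."""
        entry = self._sets.get(result_set_id)
        if entry is None:
            return None
        expires_at, records = entry
        if self._clock() >= expires_at:
            del self._sets[result_set_id]  # timer reached zero: throw it away
            return None
        # Each use before expiry restarts the countdown.
        self._sets[result_set_id] = (self._clock() + self.idle_time, records)
        return records
```

Because the idle time is only a best-effort promise, a real server could also evict entries early (for example under memory pressure); this sketch does not model that.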

3.2 Record Elements
Next come the elements needed to return records.

  <records>
    <record>
      <recordSchema>info:srw/schema/1/dc-v1.1</recordSchema>
      <recordPacking>xml</recordPacking>
      <recordData>
          <srw_dc:dc xmlns="http://www.w3.org/TR/xhtml1/strict"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:srw_dc="info:srw/schema/1/dc-v1.1">
            <dc:identifier>rrl1234</dc:identifier>
            <dc:title>Dog and Cat</dc:title>
            </srw_dc:dc>
          </recordData>
       <recordPosition>1</recordPosition>
       </record>
     </records>

The first element is the wrapper element, <records>. As its name implies, it can hold
more than one <record> element.

The only legal element within the <records> element is the <record> element. It may
repeat any number of times. There are three mandatory elements within the <record>
element (<recordSchema>, <recordPacking> and <recordData>) and one optional element,
<recordPosition>.

The mandatory <recordSchema> element contains the identifier of the schema used for
the returned record. There are no restrictions on the schemas used. The only requirement
is that you specify which schema is being used. Unique data may require the creation of
schemas unique to the server. This is perfectly legal.

The mandatory <recordPacking> element specifies how the content in the <recordData>
element is structured. For MXG, the only legal value is “xml” (without the quotes). See
http://www.loc.gov/z3950/agency/zing/srw/records.html for more information on this
subject.

The mandatory <recordData> element contains the record itself. In the example above,
this is a Dublin Core record, but the server can use any record schema to encode its data.

The optional <recordPosition> element contains a positive integer whose value is the
position of the returned record in the result set. It is present to support thin clients and its
presence is strongly recommended. See the section below on the importance of thin
clients.
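
On the client side, the record elements described above can be read with any namespace-aware XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the function name parse_records and the shape of the returned dictionaries are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

# Default namespace of the searchRetrieveResponse, as shown in this guide.
SRW_NS = {"srw": "http://www.loc.gov/zing/srw/"}


def parse_records(response_xml: str) -> List[Dict[str, Any]]:
    """Extract schema, packing, position and payload of each <record>."""
    root = ET.fromstring(response_xml)
    records = []
    for record in root.findall("srw:records/srw:record", SRW_NS):
        position = record.findtext("srw:recordPosition", namespaces=SRW_NS)
        data = record.find("srw:recordData", SRW_NS)
        records.append({
            "schema": record.findtext("srw:recordSchema", namespaces=SRW_NS),
            "packing": record.findtext("srw:recordPacking", namespaces=SRW_NS),
            # recordPosition is optional, so it may be absent.
            "position": int(position) if position is not None else None,
            # The payload is the first child element of <recordData>.
            "data": data[0] if data is not None and len(data) else None,
        })
    return records
```

The "data" entry is left as a parsed element because its structure depends on the record schema; for the Dublin Core record shown above, the title could be read from it with the http://purl.org/dc/elements/1.1/ namespace.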

3.3 Browser Elements
The optional <echoedSearchRetrieveRequest> element is provided solely to support thin
clients and its presence is strongly recommended. See the section below on the
importance of thin clients.

  <echoedSearchRetrieveRequest>
    <version>1.1</version>
    <query>cql.any = &quot;dog&quot;</query>
    </echoedSearchRetrieveRequest>
The <echoedSearchRetrieveRequest> element contains all the parameters that the client
specified in its request. Since the request has two mandatory parameters, version and
query, the <echoedSearchRetrieveRequest> element has two mandatory subelements,
<version> and <query>. Their values should be identical to the values provided in the
request, but can be modified to make them safe for inclusion in XML. In the example
above, notice that the quotes around the search term “dog” were replaced with the &quot;
entity.

The <startRecord> and <maximumRecords> elements are optional in the schema, but
must be returned if their corresponding parameters were specified in the request URL.
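
For a server that hand-constructs its XML, the echoed request could be generated along these lines. This is a sketch under assumptions: echoed_request_xml is a hypothetical name, the fragment is meant to be placed inside a searchRetrieveResponse that already declares the default namespace, and the element order simply follows the order in which this guide introduces the parameters (it should be checked against the schema).

```python
from typing import Mapping
from xml.sax.saxutils import escape

# Parameters defined by this guide, in the order they are emitted here.
ECHOED_ORDER = ("version", "query", "startRecord", "maximumRecords")


def echoed_request_xml(params: Mapping[str, object]) -> str:
    """Render the request parameters as an <echoedSearchRetrieveRequest>."""
    lines = ["<echoedSearchRetrieveRequest>"]
    for name in ECHOED_ORDER:
        if name in params:
            # escape() handles &, < and >; double quotes are additionally
            # mapped to &quot; to match the example in this guide.
            value = escape(str(params[name]), {'"': "&quot;"})
            lines.append(f"  <{name}>{value}</{name}>")
    lines.append("</echoedSearchRetrieveRequest>")
    return "\n".join(lines)
```

Optional parameters appear in the output only when they were present in the request, which matches the rule stated above.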

3.4 Diagnostic Elements
Any response may include any number of diagnostic messages. The presence of
diagnostics does not imply anything about the success or failure of the request; they may
be purely informational.

  <diagnostics>
    <diagnostic xmlns="http://www.loc.gov/zing/srw/diagnostic/">
      <uri>info:srw/diagnostic/1/51</uri>
      <details>66ntqk</details>
      </diagnostic>
    </diagnostics>

The first element is the wrapper element, <diagnostics>. As its name implies, it can
hold more than one <diagnostic> element.

The only legal element within the <diagnostics> element is the <diagnostic> element. It
may repeat any number of times. There is one mandatory element within the
<diagnostic> element, <uri>, and two optional elements, <details> and <message>.

The mandatory <uri> element contains the unique identifier for the diagnostic. A list of
standard diagnostics can be found at:
http://www.loc.gov/z3950/agency/zing/srw/diagnostics-list.html. The server can define
its own diagnostics, if necessary, but it is strongly recommended that standard diagnostics
be used. (It isn't hard to get new diagnostics added to the standard set.)

The optional <details> element contains extra information associated with the diagnostic.
In the example above, diagnostic info:srw/diagnostic/1/51 says that the resultset does not
exist. The accompanying <details> element contains the name of the resultset that was
specified in the request. The details for a diagnostic are defined by the diagnostic. The
table of diagnostics linked above includes the optional details for those cases where they
are defined.

The optional <message> element can contain a human readable message. This is a
limited capability as the message element can only occur once, so the message can only
be in a single language. It is believed that the diagnostic URI and accompanying details
should be sufficient to formulate language-appropriate messages.
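
A client can collect the diagnostics from a response with a namespace-aware parser. The following Python sketch is illustrative: extract_diagnostics is a hypothetical name, and the dictionary layout is an assumption. Note that the diagnostic elements live in their own namespace, separate from the response namespace.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Namespace of the <diagnostic> element, as shown in the example above.
DIAG_NS = "http://www.loc.gov/zing/srw/diagnostic/"


def extract_diagnostics(response_xml: str) -> List[Dict[str, Optional[str]]]:
    """Return uri, details and message for every <diagnostic> in a response."""
    root = ET.fromstring(response_xml)
    ns = {"diag": DIAG_NS}
    found = []
    for diagnostic in root.iter(f"{{{DIAG_NS}}}diagnostic"):
        found.append({
            "uri": diagnostic.findtext("diag:uri", namespaces=ns),
            # <details> and <message> are optional, so either may be None.
            "details": diagnostic.findtext("diag:details", namespaces=ns),
            "message": diagnostic.findtext("diag:message", namespaces=ns),
        })
    return found
```

Since diagnostics do not by themselves signal failure, a client would typically inspect each uri before deciding how to react.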

4 NISO Metasearch XML Gateway Compliance
4.1 On the URL
A compliant server must require the version parameter with a value of 1.1. The server
must accept the query, startRecord and maximumRecords parameters. The server may
accept other parameters. The server may ignore any other parameters on the request, but
it is strongly recommended that diagnostic messages be issued for each ignored
parameter.

Strictly speaking, the query must be a compliant CQL query. CQL (Common Query
Language) is a feature of NISO SRW/U and supports high functionality, highly
interoperable searches. (See the section below on the relationship between MXG and
NISO SRW/U.) It is believed by the NISO Metasearch Initiative Committee that full
CQL support is beyond the means/needs of many of the members of the content provider
community. Therefore, a trivial subset of CQL has been specified that will support
passing non-CQL queries in the CQL context.

Queries can be of the form query=<word> or query="<phrase>" where <word> is any
string of characters with no embedded blanks (' '). An example of a compliant <word>
query is: query=book

A <phrase> is any string of characters except the unescaped quote ('"') and backslash
('\'). Quotes and backslashes can be escaped in a <phrase> by preceding them with a
backslash ('\'). An example of a compliant <phrase> query is: query="find cat w/1
house"
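
A server could recognize this trivial subset with two checks, one per query form. The Python sketch below is one strict reading of the rules above, not a normative grammar: parse_simple_query is a hypothetical name, a <word> is assumed not to begin with a double quote, and a backslash in a <phrase> is accepted only when it escapes a quote or another backslash.

```python
import re

# A phrase: double quotes around characters that are either ordinary
# (not a quote or backslash) or a backslash-escaped quote/backslash.
PHRASE = re.compile(r'"((?:[^"\\]|\\["\\])*)"')


def parse_simple_query(query: str) -> str:
    """Return the search text of a <word> or <phrase> query.

    Raises ValueError if the (already URL-decoded) value fits neither form.
    """
    match = PHRASE.fullmatch(query)
    if match:
        # Remove the escaping backslashes to recover the original text.
        return re.sub(r'\\(["\\])', r"\1", match.group(1))
    if query and " " not in query and not query.startswith('"'):
        return query  # a single word with no embedded blanks
    raise ValueError(f"not a compliant MXG query: {query!r}")
```

The returned text is what the server would hand to its native search engine; how that engine interprets it is outside the scope of MXG.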

4.2 On the Response
The server must provide a well-formed XML response whose root element is
<searchRetrieveResponse>. The XML record must be valid according to the schema at
http://www.loc.gov/z3950/agency/zing/srw/srw-types.xsd.

5 Relationship of this standard with NISO SRW/U
The NISO Metasearch XML Gateway is a non-conformant subset of the NISO SRW/U
standard (http://www.loc.gov/z3950/agency/zing/srw/). The features missing from MXG
that are necessary for SRU conformance are support for an Explain record and rich CQL
support.

MXG has been designed to provide a low implementation barrier to content providers
that want to make their databases available to metasearch engines. Interoperability across
content providers was explicitly not a goal of MXG. The features of SRU that are
missing from MXG are necessary for interoperability.

5.1 Explain
An MXG client learns about the capabilities of an MXG server through out-of-band
agreements. SRU requires that servers provide information about their capabilities
through a standard mechanism: the Explain record. The Explain record contains a list of
the indexes that can be used in a CQL search and a list of the record schemas that can be
used when requesting database records. It provides a human readable description of the
database and contact information. It can describe default behaviors.

More information about Explain can be found at:
http://www.loc.gov/z3950/agency/zing/srw/explain.html.

5.2 CQL
MXG does not define a standard query grammar. It was agreed that converting the query
grammar of the metasearch engine into the query grammar of the content provider was a
task that the metasearch engine would do. The SRW/U community is committed to
search engines supporting a standard query grammar: CQL. Supporting a standard query
grammar reduces the complexity of the client and allows more clients and metasearch
engines to search the server's content. More information about CQL can be found at:
http://www.loc.gov/z3950/agency/zing/cql/.

6 Thin Clients
“Thin Clients” are clients with little or no application specific intelligence. The
“thinnest” client is a web browser. SRU was designed to support browser based clients.
It does this with three mechanisms: The server can deliver a stylesheet along with its
response, the client can request that a specific stylesheet be returned with the response
and the server can provide context information within the response via the echoed request
element.

6.1 Stylesheets
Stylesheets allow a browser to render the server's XML into HTML. A simple example
can be found at http://alcme.oclc.org/MXG/search/ORPubs. This is a request with no
parameters and results in an Explain record being returned. Along with the Explain
record is a stylesheet reference. (Click View/Source to see the raw XML and stylesheet
reference.) The browser combines the XML response with the stylesheet to render a
simple search interface. More complicated, vendor specific, interfaces can be generated.
A search for “levan” in the cql.any index will result in a <searchRetrieveResponse>
along with a different stylesheet reference. This stylesheet generates a search results
page including the rendered document.

6.2 Client Specified Stylesheets
An SRU parameter, stylesheet=, allows the client to specify the stylesheet to be returned.
This feature is not as useful as originally intended. The intent was that a user could
specify a stylesheet that the user owned, allowing the user to control the interface to all
SRU servers.
It turns out that browser security features limit this capability. Default browser security
prohibits fetching stylesheets from any server other than the one that the XML comes
from. Therefore, the client is limited to stylesheets that the server owns. Some servers
are willing to cache user stylesheets so that users can later request them with a
response, but this is not widely implemented.

6.3 Echoed Context Information
Thin clients are context free. Once they've sent off a search, they've forgotten what the
search was. It is typically thought useful for a result set display to include the query that
generated the result set. For that to happen with a thin client, the query must be sent back
to the client within the response. That is the purpose of the echoed request element. All
of the parameters on the original request can (and should) be sent back to the client for
potential display through the stylesheet.

				