NISO Metasearch XML Gateway Implementors Guide
Version 0.1
March 17, 2010
Ralph LeVan
OCLC Research

1 Overview

The NISO Metasearch Initiative Committee has been charged with identifying and/or developing standards and best practices to improve interoperability between metasearch engines and content providers. One of the goals of the committee has been to identify a search-and-retrieve standard. The requirement for the search-and-retrieve standard was for an “XML gateway”. Functionally, this was taken to mean that URLs with a standard format would be submitted to the content provider and an XML response, conformant to a standard schema, would be returned. A standard query grammar was explicitly declared out of scope.

This document describes how to construct the NISO Metasearch XML Gateway (MXG) URL and response.

2 Incoming URL

URLs will be of the form:

http://server/context?version=<version>&query=<query>&<optionalParms>

For example,

http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query=book

is a compliant URL.

2.1 Parameters

2.1.1 Structure

URL parameters are of the form name=value and are separated by '&'. The example above has two parameters, version and query.

2.1.2 <version>

The value of the version parameter is defined as 1.1. The version parameter is mandatory in all MXG URLs.

2.1.3 <query>

The value of the query parameter can be an arbitrary string within double quotes or a single word/token (without blanks). Double quotes needed as part of the search must be escaped with a preceding backslash. The query parameter is mandatory in all MXG URLs.

2.1.3.1 Result Set IDs

The server may provide a resultSetId. This is a name that allows a resultset to be referenced after the query response. Not all servers support resultsets. ResultSetIds are typically used to retrieve more records from an existing result set. The mechanism for doing this is to submit a new URL with the resultset ID in the query.
The syntax for this query is:

query=cql.resultSetId=<resultSetId>

where <resultSetId> is replaced with the value of the resultSetId element in the searchRetrieveResponse. An example of a URL with a resultset reference is:

http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query=cql.resultSetId=abc123&startRecord=2&maximumRecords=1

This example asks for the second record in a resultset.

2.1.4 <optionalParms>

There are two optional parameters defined in the MXG URL, startRecord and maximumRecords.

2.1.4.1 startRecord

The client can specify which record from the result set should be the first record returned. The value 1 specifies the first record. If omitted, the server may choose any value, but the recommended default value is 1.

2.1.4.2 maximumRecords

The client can specify the maximum number of records to be returned. The server may return fewer records. If omitted, the server may choose any value, including zero.

2.2 Examples

The URL

http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query=book

sends a single-term search and doesn't specify how many records should be returned. In this case, the server defaults to zero records returned.

To ask for a single record from that server, send the URL:

http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query=book&maximumRecords=1

The next record in the result set can be retrieved with the URL:

http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query=book&startRecord=2&maximumRecords=1

A more complex query can be seen in this URL:

http://alcme.oclc.org/MXG/search/ORPubs?version=1.1&query="book publishing"

3 Outgoing XML record

The XML record returned by MXG is an instance of a searchRetrieveResponse as defined in the NISO SRW/U standard. The schema that defines the searchRetrieveResponse, SRW Types, can be found at http://www.loc.gov/z3950/agency/zing/srw/srw-types.xsd. It imports a schema that defines the structure of diagnostic messages.
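Before looking at the response in detail, the request rules from Section 2 can be sketched in code. The following is a minimal sketch, not part of the guide: the function names (mxg_query_value, result_set_query, build_mxg_url) are hypothetical, and the choice to percent-encode blanks and quotes while leaving '=' literal is an assumption made so that the output matches the resultset URL shown in Section 2.1.3.1.

```python
from urllib.parse import quote, urlencode


def mxg_query_value(text):
    """Return a bare word, or a double-quoted phrase with escapes (Section 2.1.3)."""
    if text and not any(ch in text for ch in ' "\\'):
        return text  # a single token without blanks needs no quoting
    # Escape backslashes first, then quotes, so added backslashes are not doubled.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def result_set_query(result_set_id):
    """Query value that refers to an existing resultset (Section 2.1.3.1)."""
    return f"cql.resultSetId={result_set_id}"


def build_mxg_url(base_url, query_value, start_record=None, maximum_records=None):
    """Assemble an MXG URL: mandatory version and query, then the optional parameters."""
    params = [("version", "1.1"), ("query", query_value)]
    if start_record is not None:
        params.append(("startRecord", str(start_record)))
    if maximum_records is not None:
        params.append(("maximumRecords", str(maximum_records)))
    # Assumption: '=' is left literal inside values; other reserved characters are encoded.
    return base_url + "?" + urlencode(params, safe="=", quote_via=quote)


if __name__ == "__main__":
    base = "http://alcme.oclc.org/MXG/search/ORPubs"
    print(build_mxg_url(base, mxg_query_value("book"), maximum_records=1))
    print(build_mxg_url(base, mxg_query_value("book publishing")))
    print(build_mxg_url(base, result_set_query("abc123"), 2, 1))
```

A phrase query is percent-encoded on the wire here, so the server is assumed to decode the URL before interpreting the quotes and backslashes.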
There is also a Web Service Description Language (WSDL) definition of the SRW service (http://www.loc.gov/z3950/agency/zing/srw/srw-bindings.wsdl). If you are comfortable with tools that automatically generate XML from schemas, you may want to look into them. This guide assumes that you are constructing the XML response by hand.

Here is a searchRetrieveResponse that includes optional resultset information and a returned Dublin Core record:

<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="/MXG/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <version>1.1</version>
  <numberOfRecords>30</numberOfRecords>
  <resultSetId>717zar</resultSetId>
  <resultSetIdleTime>30</resultSetIdleTime>
  <records>
    <record>
      <recordSchema>info:srw/schema/1/dc-v1.1</recordSchema>
      <recordPacking>xml</recordPacking>
      <recordData>
        <srw_dc:dc xmlns="http://www.w3.org/TR/xhtml1/strict" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:srw_dc="info:srw/schema/1/dc-v1.1">
          <dc:identifier>rrl1234</dc:identifier>
          <dc:title>Dog and Cat</dc:title>
        </srw_dc:dc>
      </recordData>
      <recordPosition>1</recordPosition>
    </record>
  </records>
  <nextRecordPosition>2</nextRecordPosition>
  <echoedSearchRetrieveRequest>
    <version>1.1</version>
    <query>cql.any = &quot;dog&quot;</query>
  </echoedSearchRetrieveRequest>
</searchRetrieveResponse>

The first line of the response is the declaration that this is an XML record. It is mandatory.

The second line is an optional stylesheet reference. Its presence allows browsers to render the XML response into something more pleasant. In the case of the example above, a searchable user interface is rendered. The complete URL for that stylesheet is http://alcme.oclc.org/MXG/searchRetrieveResponse.xsl. You'll need to provide a pointer to your own stylesheet if you use this feature.

The third line is the actual searchRetrieveResponse.
It includes a namespace attribute and specifies that that namespace (http://www.loc.gov/zing/srw/) is the default namespace for this element and all subelements. You can change the namespace prefix in this attribute if you want to. (If you don't know what a namespace prefix is, don't worry about it.)

The first subelement is the version element. It is mandatory and its only legal value is “1.1” (without the quotes).

The next subelement is the numberOfRecords element. It is mandatory and its value should be a non-negative integer. It contains the count of records that satisfy the query.

3.1 Result Set Elements

Next come a pair of optional elements for result sets.

<resultSetId>717zar</resultSetId>
<resultSetIdleTime>30</resultSetIdleTime>

If the server generates resultsets that can be referenced after the query is complete, then this is where it specifies the name of the resultset and indicates how long the resultset will remain available.

The first element is <resultSetId>. This is an arbitrary string. It can contain anything that is valid in XML content. (Avoid angle brackets, quotes, apostrophes and ampersands.)

The second element is <resultSetIdleTime>. If omitted, the server is not making any promises about how long the result set will be available. If present, it contains a non-negative number whose value is the number of seconds that the resultset will remain available after its last use. Essentially, a countdown timer is started each time the resultset is used. When it reaches zero, the resultset can be thrown away. Whenever the client references the resultset before the timer reaches zero, the timer should be reset.

A resultset idle time is not a guarantee; it is a promise of best effort. The server is always permitted, as necessary, to throw resultsets away arbitrarily. If a resultset that no longer exists is later referenced, then the server should issue a diagnostic. (See the section below on diagnostics.)
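The countdown behaviour described for <resultSetIdleTime> can be illustrated with a small in-memory cache. This is only a sketch under stated assumptions: the class name ResultSetCache and its methods are hypothetical, resultsets are held in a plain dictionary, and the clock is injectable purely so the behaviour can be exercised without waiting.

```python
import time


class ResultSetCache:
    """Keeps resultsets alive for idle_time seconds after their last use."""

    def __init__(self, idle_time, clock=time.monotonic):
        self.idle_time = idle_time  # value reported in <resultSetIdleTime>
        self._clock = clock
        self._sets = {}  # resultSetId -> (records, expiry timestamp)

    def store(self, result_set_id, records):
        """Save a new resultset and start its countdown."""
        self._sets[result_set_id] = (records, self._clock() + self.idle_time)

    def fetch(self, result_set_id):
        """Return the records and reset the countdown, or None if the set is gone."""
        entry = self._sets.get(result_set_id)
        if entry is None:
            return None
        records, expires_at = entry
        now = self._clock()
        if now >= expires_at:
            # The timer reached zero: the resultset may be thrown away.
            del self._sets[result_set_id]
            return None
        # Each use before expiry restarts the countdown.
        self._sets[result_set_id] = (records, now + self.idle_time)
        return records


if __name__ == "__main__":
    cache = ResultSetCache(idle_time=30)
    cache.store("717zar", ["record 1", "record 2"])
    print(cache.fetch("717zar"))
    print(cache.fetch("no-such-set"))
```

When fetch returns None, the caller would be expected to answer with a diagnostic rather than records, in line with the best-effort promise described above.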
3.2 Record Elements

Next come the elements needed to return records.

<records>
  <record>
    <recordSchema>info:srw/schema/1/dc-v1.1</recordSchema>
    <recordPacking>xml</recordPacking>
    <recordData>
      <srw_dc:dc xmlns="http://www.w3.org/TR/xhtml1/strict" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:srw_dc="info:srw/schema/1/dc-v1.1">
        <dc:identifier>rrl1234</dc:identifier>
        <dc:title>Dog and Cat</dc:title>
      </srw_dc:dc>
    </recordData>
    <recordPosition>1</recordPosition>
  </record>
</records>

The first element is the wrapper element, <records>. As its name implies, it can hold more than one <record> element. The only legal element within the <records> element is the <record> element. It may repeat any number of times.

There are three mandatory elements within the <record> element, <recordSchema>, <recordPacking> and <recordData>, and one optional element, <recordPosition>.

The mandatory <recordSchema> element contains the identifier of the schema used for the returned record. There are no restrictions on the schemas used. The only requirement is that you specify which schema is being used. Unique data may require the creation of schemas unique to the server. This is perfectly legal.

The mandatory <recordPacking> element specifies how the content in the <recordData> element is structured. For MXG, the only legal value is “xml” (without the quotes). See http://www.loc.gov/z3950/agency/zing/srw/records.html for more information on this subject.

The mandatory <recordData> element contains the record itself. In the example above, this is a Dublin Core record, but the server can use any record schema to encode its data.

The optional <recordPosition> element contains a positive integer whose value is the position of the returned record in the result set. It is present to support thin clients and its presence is strongly recommended. See the section below on the importance of thin clients.
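The record structure above can also be produced with an XML library instead of string concatenation, which takes care of escaping characters such as '&' in record content. The sketch below uses Python's standard xml.etree.ElementTree; the helper names (qname, build_record, build_records) and the two-field row format are illustrative assumptions, not something the guide prescribes.

```python
import xml.etree.ElementTree as ET

SRW_NS = "http://www.loc.gov/zing/srw/"
DC_NS = "http://purl.org/dc/elements/1.1/"
SRW_DC_NS = "info:srw/schema/1/dc-v1.1"

# Serialize SRW elements in the default namespace and reuse the guide's prefixes.
ET.register_namespace("", SRW_NS)
ET.register_namespace("dc", DC_NS)
ET.register_namespace("srw_dc", SRW_DC_NS)


def qname(namespace, tag):
    """Build the {namespace}tag form that ElementTree expects."""
    return f"{{{namespace}}}{tag}"


def build_record(position, identifier, title):
    """Return one <record> holding a two-field Dublin Core record."""
    record = ET.Element(qname(SRW_NS, "record"))
    ET.SubElement(record, qname(SRW_NS, "recordSchema")).text = SRW_DC_NS
    ET.SubElement(record, qname(SRW_NS, "recordPacking")).text = "xml"
    data = ET.SubElement(record, qname(SRW_NS, "recordData"))
    dc = ET.SubElement(data, qname(SRW_DC_NS, "dc"))
    ET.SubElement(dc, qname(DC_NS, "identifier")).text = identifier
    ET.SubElement(dc, qname(DC_NS, "title")).text = title
    ET.SubElement(record, qname(SRW_NS, "recordPosition")).text = str(position)
    return record


def build_records(rows, start_record=1):
    """Wrap (identifier, title) rows in <records>, numbering from start_record."""
    records = ET.Element(qname(SRW_NS, "records"))
    for offset, (identifier, title) in enumerate(rows):
        records.append(build_record(start_record + offset, identifier, title))
    return records


if __name__ == "__main__":
    element = build_records([("rrl1234", "Dog and Cat")])
    print(ET.tostring(element, encoding="unicode"))
```

The serialized output may place namespace declarations differently from the hand-written sample, but it describes the same elements in the same namespaces.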
3.3 Browser Elements

The optional <echoedSearchRetrieveRequest> element is provided solely to support thin clients and its presence is strongly recommended. See the section below on the importance of thin clients.

<echoedSearchRetrieveRequest>
  <version>1.1</version>
  <query>cql.any = &quot;dog&quot;</query>
</echoedSearchRetrieveRequest>

The <echoedSearchRetrieveRequest> element contains all the parameters that the client specified in its request. Since the request has two mandatory parameters, version and query, the <echoedSearchRetrieveRequest> element has two mandatory subelements, <version> and <query>. Their values should be identical to the values provided in the request, but can be modified to make them safe for inclusion in XML. In the example above, notice that the quotes around the search term “dog” were replaced with the &quot; entity.

The <startRecord> and <maximumRecords> elements are optional in the schema, but must be returned if their corresponding parameters were specified in the request URL.

3.4 Diagnostic Elements

Any response may include any number of diagnostic messages. The presence of diagnostics does not imply anything about the success or failure of the request; they may be purely informational.

<diagnostics>
  <diagnostic xmlns="http://www.loc.gov/zing/srw/diagnostic/">
    <uri>info:srw/diagnostic/1/51</uri>
    <details>66ntqk</details>
  </diagnostic>
</diagnostics>

The first element is the wrapper element, <diagnostics>. As its name implies, it can hold more than one <diagnostic> element. The only legal element within the <diagnostics> element is the <diagnostic> element. It may repeat any number of times.

There is one mandatory element within the <diagnostic> element, <uri>, and two optional elements, <details> and <message>.

The mandatory <uri> element contains the unique identifier for the diagnostic. A list of standard diagnostics can be found at: http://www.loc.gov/z3950/agency/zing/srw/diagnostics-list.html.
The server can define its own diagnostics, if necessary, but it is strongly recommended that standard diagnostics be used. (It isn't hard to get new diagnostics added to the standard set.)

The optional <details> element contains extra information associated with the diagnostic. In the example above, diagnostic info:srw/diagnostic/1/51 says that the resultset does not exist. The accompanying <details> element contains the name of the resultset that was specified in the request. The details for a diagnostic are defined by the diagnostic. The table of diagnostics linked above includes the optional details for those cases where they are defined.

The optional <message> element can contain a human-readable message. This is a limited capability: the message element can occur only once, so the message can be in only a single language. It is believed that the diagnostic URI and accompanying details should be sufficient to formulate language-appropriate messages.

4 NISO Metasearch XML Gateway Compliance

4.1 On the URL

A compliant server must require the version parameter with a value of 1.1. The server must accept the query, startRecord and maximumRecords parameters. The server may accept other parameters. The server may ignore any other parameters on the request, but it is strongly recommended that a diagnostic message be issued for each ignored parameter.

Strictly speaking, the query must be a compliant CQL query. CQL (Common Query Language) is a feature of NISO SRW/U and supports high-functionality, highly interoperable searches. (See the section below on the relationship between MXG and NISO SRW/U.) It is believed by the NISO Metasearch Initiative Committee that full CQL support is beyond the means or needs of many members of the content provider community. Therefore, a trivial subset of CQL has been specified that will support passing non-CQL queries in the CQL context.
Queries can be of the form query=<word> or query="<phrase>", where <word> is any string of characters with no embedded blanks (' '). An example of a compliant <word> query is:

query=book

A <phrase> is any string of characters except the unescaped quote ('"') and backslash ('\'). Quotes and backslashes can be escaped in a <phrase> by preceding them with a backslash ('\'). An example of a compliant <phrase> query is:

query="find cat w/1 house"

4.2 On the Response

The server must provide a well-formed XML response whose root element is <searchRetrieveResponse>. The XML record must be valid according to the schema at http://www.loc.gov/z3950/agency/zing/srw/srw-types.xsd.

5 Relationship of this standard with NISO SRW/U

The NISO Metasearch XML Gateway is a non-conformant subset of the NISO SRW/U standard (http://www.loc.gov/z3950/agency/zing/srw/). The features missing from MXG that are necessary for SRU conformance are support for an Explain record and rich CQL support.

MXG has been designed to provide a low implementation barrier to content providers that want to make their databases available to metasearch engines. Interoperability across content providers was explicitly not a goal of MXG. The features of SRU that are missing from MXG are necessary for interoperability.

5.1 Explain

An MXG client learns about the capabilities of an MXG server through out-of-band agreements. SRU requires that servers provide information about their capabilities through a standard mechanism: the Explain record. The Explain record contains a list of the indexes that can be used in a CQL search and a list of the record schemas that can be used when requesting database records. It provides a human-readable description of the database and contact information. It can describe default behaviors. More information about Explain can be found at: http://www.loc.gov/z3950/agency/zing/srw/explain.html.

5.2 CQL

MXG does not define a standard query grammar.
It was agreed that converting the query grammar of the metasearch engine into the query grammar of the content provider was a task that the metasearch engine would do. The SRW/U community is committed to search engines supporting a standard query grammar: CQL. Supporting a standard query grammar reduces the complexity of the client and allows more clients and metasearch engines to search the server's content. More information about CQL can be found at: http://www.loc.gov/z3950/agency/zing/cql/.

6 Thin Clients

“Thin Clients” are clients with little or no application-specific intelligence. The “thinnest” client is a web browser. SRU was designed to support browser-based clients. It does this with three mechanisms: the server can deliver a stylesheet along with its response, the client can request that a specific stylesheet be returned with the response, and the server can provide context information within the response via the echoed request element.

6.1 Stylesheets

Stylesheets allow a browser to render the server's XML into HTML. A simple example can be found at http://alcme.oclc.org/MXG/search/ORPubs. This is a request with no parameters and results in an Explain record being returned. Along with the Explain record is a stylesheet reference. (Click View/Source to see the raw XML and stylesheet reference.) The browser combines the XML response with the stylesheet to render a simple search interface. More complicated, vendor-specific interfaces can be generated.

A search for “levan” in the cql.any index will result in a <searchRetrieveResponse> along with a different stylesheet reference. This stylesheet generates a search results page including the rendered document.

6.2 Client Specified Stylesheets

An SRU parameter, stylesheet=, allows the client to specify the stylesheet to be returned. This feature is not as useful as originally intended.
The intent was that a user could specify a stylesheet that the user owned, allowing the user to control the interface to all SRU servers. It turns out that browser security features limit this capability. Default browser security prohibits fetching stylesheets from any server other than the one that the XML comes from. Therefore, the client is limited to stylesheets that the server owns. Some servers are willing to cache user stylesheets so that users can later request them with a response, but this is not widely implemented.

6.3 Echoed Context Information

Thin clients are context free. Once they've sent off a search, they've forgotten what the search was. It is typically thought useful for a result set display to include the query that generated the result set. For that to happen with a thin client, the query must be sent back to the client within the response. That is the purpose of the echoed request element. All of the parameters on the original request can (and should) be sent back to the client for potential display through the stylesheet.
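
As an illustration of the echoed request element, the sketch below turns the query string of an incoming MXG URL into an <echoedSearchRetrieveRequest> fragment by hand, escaping values so they are safe inside XML. The function name echoed_request and the emitted element order (version, query, startRecord, maximumRecords) are assumptions made for this example; the fragment is meant to be placed inside a <searchRetrieveResponse>, which supplies the default namespace.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape

# Assumed order: the two mandatory parameters, then the optional ones.
ECHO_ORDER = ("version", "query", "startRecord", "maximumRecords")


def echoed_request(query_string):
    """Build the echoed request fragment from the part of the URL after '?'."""
    params = parse_qs(query_string, keep_blank_values=True)
    lines = ["<echoedSearchRetrieveRequest>"]
    for name in ECHO_ORDER:
        if name in params:
            # escape() handles &, < and >; the extra mapping turns quotes into &quot;.
            value = escape(params[name][0], {'"': "&quot;"})
            lines.append(f"  <{name}>{value}</{name}>")
    lines.append("</echoedSearchRetrieveRequest>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(echoed_request('version=1.1&query=%22book%20publishing%22&maximumRecords=1'))
```

Only parameters that were present on the request are echoed, so a stylesheet can show the user exactly what was asked for.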