Using TEXTML Server for XML Content Management

					TEXTML Server
Native XML Storage and Information Retrieval


January 2005

Why Use TEXTML Server?
    New Database Needs for Document-Centric Applications
    One XML Back-End Server for Various Solutions
    An Embeddable Component

TEXTML Server Structure Overview

Close-up on TEXTML Server
    Document Base
    TEXTML Server’s Indexing Engine – a Novel Approach to Indexing
    Search Engine
    TEXTML RSA - Replication Service Agent
    TEXTML FTS - Fault Tolerance Server

Summary

Other Related Documentation
    TEXTML Server User Documentation
    Additional Product Information

TEXTML Server Whitepaper


Why Use TEXTML Server?
This whitepaper discusses the architecture and approach used to develop TEXTML Server, the industry-leading native XML repository and search engine.

New Database Needs for Document-Centric Applications
XML is everywhere
With the proliferation of XML-based applications, XML has raised expectations of how information can be leveraged to increase productivity and efficiency. Whether in Publishing, Aerospace, Financial Services, Life Sciences or Health Care, XML has become a standard technology in a growing number of industries, where it is used in a wide variety of document-centric applications to optimize document management:

Publishing
- Editorial content management
- Online archiving
- Digital asset management
- Ad management
- Content syndication

Aerospace
- Production of technical documentation
- Interactive Electronic Technical Manuals (IETM)
- Knowledge management

Financial Services
- Data exchange
- Web content management
- Standardized business reporting processes
- Business process management

Life Sciences
- E-Learning
- Knowledge management

Health Care
- Online archiving
- Knowledge management

XML challenges
As organizations embrace XML and the concept of "write content once, re-use at will", IT professionals charged with delivering the next generation of XML applications have come to realize that traditional database models are ill-suited for storing, indexing and retrieving rich XML content:

- Content is no longer purely transactional, purely multimedia or solely textual, but rather a hybrid of the three forms of content. Existing relational databases, which were conceived to support data-centric applications, are not suited to managing the kind of XML content that document management applications must handle.
- XML content is of an unpredictable, semi-structured nature and subject to change at any point. Databases must adapt to support such content. XML, by its very nature, presents many challenges to the traditional RDBMS model. It allows deeply hierarchical, multi-tiered structures with nested values that vary in length and type. It is important to be able to manage, in a streamlined way, a structure that contains empty or missing elements and whose ordering is important.
- XML content does not map well to existing object-oriented and relational database models. Storing an XML document in a traditional database requires that the XML structure be mapped to a predefined database schema, so the document must be decomposed into a series of inter-related tables. This process is often resource intensive and results in the loss of some data, such as processing instructions and comments, as well as the notion of element and attribute ordering, which makes the XML document hierarchy irrelevant. Why bother creating XML content if you are just going to destroy it upon storage? In addition, if your XML Schema changes even slightly, your database structure is disrupted, often requiring massive updates to hundreds of tables.

Native XML Storage is Key
TEXTML Server is a native XML content server, with three main advantages over most traditional databases which may have been “XML-enabled”:

- The XML document is the basic unit of storage. When an XML document is inserted into TEXTML Server, it is preserved as a separate, unique object without any modification. Element and attribute ordering is preserved; the original document, including processing instructions, comments, etc., is completely preserved. When the document is retrieved it is 100% intact. TEXTML Server uses the markup in XML to design and build its indexes. The result is a streamlined repository structure that gives rise to superior content search and retrieval performance.
- TEXTML Server is DTD and Schema independent. In fact, TEXTML Server is designed specifically to manage XML collections that are heterogeneous in structure. In other words, TEXTML Server can easily manage XML that comes from multiple schemas or DTDs. Any well-formed XML document can be stored and queried in TEXTML Server. No Schema or DTD is required, making mapping of the XML structure to a predetermined database structure and all the limitations associated with that procedure unnecessary.
- TEXTML Server uses an XML-specific query language. TEXTML Server’s query language is designed specifically to query the hierarchical nature of XML documents. It enables extremely fast retrieval of content and can search any element or attribute of XML documents even if they come from different Schemas/DTDs.
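The contrast between decomposing a document and storing it whole can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: `shred` stands in for a relational-style mapping, and `NativeStore` is a toy class, not TEXTML Server's API. The sketch relies on the fact that Python's default `xml.etree.ElementTree` parser discards comments and processing instructions.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet href="story.css" type="text/css"?>\n'
    '<story id="s1">\n'
    '  <!-- embargoed until Friday -->\n'
    '  <title>Harvest</title>\n'
    '  <abstract/>\n'
    '</story>\n'
)


def shred(xml_text):
    """Flatten a document into (path, value) rows, as a relational mapping might."""
    rows = []

    def walk(elem, path):
        here = path + "/" + elem.tag
        for name, value in elem.attrib.items():
            rows.append((here + "/@" + name, value))
        rows.append((here, (elem.text or "").strip()))
        for child in elem:
            walk(child, here)

    # The default parser silently drops comments and processing instructions.
    walk(ET.fromstring(xml_text), "")
    return rows


class NativeStore:
    """Hypothetical store that keeps each document as one unmodified unit."""

    def __init__(self):
        self._docs = {}

    def insert(self, name, xml_text):
        self._docs[name] = xml_text

    def retrieve(self, name):
        return self._docs[name]


rows = shred(DOC)
store = NativeStore()
store.insert("story.xml", DOC)
```

In this sketch the rows no longer contain the comment or the stylesheet instruction, while `store.retrieve("story.xml")` returns the original text unchanged.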

One XML Back-End Server for Various Solutions
TEXTML Server is a back-end server designed to store, index and retrieve information contained in large repositories. It can be seen as a building block for applications that need to manage large amounts of XML information, or for any application which contains loosely structured information that cannot be efficiently handled by traditional relational database systems. The fact that it is designed with XML at its core makes TEXTML Server extremely versatile and ideally suited to a wide variety of applications.

Using TEXTML Server at the Heart of a Solution
As an example, TEXTML Server may be used as the central content repository in a multichannel publishing application.

Figure 1 - TEXTML Server as centralized repository

Using TEXTML Server Alongside Specialized Back-End Systems
In a complex integrated application, such as a commercial Internet site, TEXTML Server may be used alongside a relational database such as Microsoft SQL Server, a billing server, a mail server or other specialized back-end systems.

Figure 2 – TEXTML Server with specialized back-end systems

An Embeddable Component
TEXTML Server has been designed to embed seamlessly in third party applications. It is aimed at developers of document-centric applications who need to add the efficiency of native XML storage and a powerful search engine to their applications.

Through its extensive set of APIs, which cover all server-based functions, TEXTML Server is easily integrated into the development environment of your choice. It drives application development time and cost down and accelerates time-to-market.

Since TEXTML Server was designed with OEM integration in mind, it can be installed automatically as an embedded component of a third party application and remain totally invisible to the end-user.

TEXTML Server uses hardware and software resources very efficiently. It takes advantage of computers equipped with multiple processors and large amounts of RAM. However, it is also very efficient on machines with limited resources, such as laptops.

TEXTML Server’s OEM-geared pricing model is very flexible and accommodates various deployment scenarios. OEM contracts proposed by IXIASOFT grant our OEM partners development licenses and integration support as well as training on TEXTML Server’s architecture and APIs.

TEXTML Server Structure Overview
TEXTML Server consists of:

- A series of document bases, each comprising a repository and indexes. The repository is designed to store XML documents, but it can also accommodate any binary file format. Content stored in the repository may be arranged in collections that can be secured with access rights. The repository features various functions such as document check-in/check-out, version control and replication. TEXTML Server allows five types of indexes to be defined: word, string, date, time and numeric.
- A powerful indexing engine, which creates custom, dynamic indexes based on specific application and business requirements. Thanks to this dynamic indexing engine, TEXTML Server can parse documents and update indexes as soon as the documents are added, modified or deleted.
- A search engine for efficient content retrieval and sorting of results. The search engine enables users to perform powerful searches on any type of index. It allows them to use search operators, such as Boolean, proximity, frequency and priority operators, as well as wildcards and right/left truncation.
- Various APIs, such as COM, Java, .Net, WebDAV and OLE-DB, through which applications can communicate with TEXTML Server.

Figure 3 - TEXTML Server’s Document Base structure

Close-up on TEXTML Server
Document Base
Each Document Base, at the core of TEXTML Server, contains the repository and indexes. It can store any XML or non-XML document. A single Document Base can efficiently store millions of objects in a secure environment. Its flexible structure lets it accommodate any binary object, making it an ideal central repository for all types of information. One TEXTML Server instance may have multiple Document Bases, each with its own security, hierarchical collections and indexes, enabling TEXTML Server to serve multiple applications simultaneously.

Document Bases feature:

- Document check-in/check-out to control concurrent modifications to documents. While a user is modifying a document, no other user can modify it. However, the document remains available in read-only mode for searching.
- Version control to keep track of all the versions of a document and roll back to previous versions. A complete version control API provides the flexibility to manage multiple versions of a document, access previous versions and control the number of versions maintained.
- Collections to organize documents hierarchically. Arranging documents into collections offers a flexible way to control access to documents and allow users to perform searches on specific collections. Collections can contain other collections and documents.
- Collection level security to control access to specific document collections through access rights. Access to documents is ruled by permissions set at the collection level. A user is allowed to view, modify and/or delete the documents contained in the collections for which he/she has proper access rights. The system automatically hides from search results the documents that a user is not allowed to view; he/she will never know the documents even exist in the repository.
- Repository replication to distribute content and enhance system reliability. The Replication Service Agent (RSA) synchronizes repository content between several servers and Document Bases, for example to balance search loads, distribute content among remote servers or distribute content on-demand to stand-alone computers.
- Integrity checks to ensure repository consistency. TEXTML Server automatically detects corruption that may occur if the system fails and is able to make the necessary corrections automatically.
- Document indexed content to allow for complete indexing of non-XML documents. The repository allows an XML document that can be entirely indexed to be associated with a non-XML document. For example, an XMP file, which is expressed as XML and completely indexable, can be embedded into a JPEG file to allow applications to manage images via embedded metadata. The JPEG file’s indexed content is the embedded XMP file. Note that all documents, whatever their type, have their properties indexed automatically, which allows searches on properties such as document name, author, creation date, etc.
- Plug-in support to allow specific processing upon document storage. For example, it is possible to develop plug-ins that dynamically create a document’s indexed content. Such a plug-in is provided free of charge with TEXTML Server to extract XMP metadata from a non-XML document.

TEXTML Universal Converter is a rich conversion feature enabling the conversion of over 225 file formats (such as word processor, spreadsheet, presentation, drawing and bitmap) to XML. Completely integrated into TEXTML Server (version 3.5 and above), TEXTML Universal Converter provides all information on a document’s contents, presentation information and metadata. As opposed to traditional text extractors, TEXTML Universal Converter gives structure to converted documents and enables users to take full advantage of TEXTML Server’s advanced searching capabilities.

TEXTML Universal Converter normalizes all of the information to an XML schema provided in the form of a DTD (SearchML). The application developed with TEXTML Server can then either directly consume the XML or further transform it to a different schema more appropriate to the ISV’s usage, providing flexibility and performance. IXIASOFT integrated Stellent’s proven Outside In XML Export technology into TEXTML Server to provide users with the most reliable XML conversion technology available on the market today.
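The check-in/check-out and version control behaviour described in the feature list above can be modelled in a few lines of Python. This is a minimal in-memory sketch under stated assumptions: the `Repository` class and its method names are hypothetical and do not represent TEXTML Server's version control API.

```python
class Repository:
    """Hypothetical in-memory model of check-out locking and versioning."""

    def __init__(self):
        self._versions = {}  # document name -> list of contents, oldest first
        self._locks = {}     # document name -> user holding the check-out

    def add(self, name, content):
        self._versions[name] = [content]

    def read(self, name, version=-1):
        # Reading never requires a lock; -1 means the latest checked-in version.
        return self._versions[name][version]

    def check_out(self, name, user):
        holder = self._locks.get(name)
        if holder is not None and holder != user:
            raise PermissionError(f"{name} is checked out by {holder}")
        self._locks[name] = user

    def check_in(self, name, user, content):
        if self._locks.get(name) != user:
            raise PermissionError(f"{user} has not checked out {name}")
        self._versions[name].append(content)
        del self._locks[name]


repo = Repository()
repo.add("manual.xml", "<manual>v1</manual>")
repo.check_out("manual.xml", "alice")
# Readers still see the last checked-in version while alice is editing.
current = repo.read("manual.xml")
repo.check_in("manual.xml", "alice", "<manual>v2</manual>")
```

Keeping every checked-in version in a list makes roll-back a simple read of an earlier index, for example `repo.read("manual.xml", version=0)`.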

TEXTML Server’s Indexing Engine – a Novel Approach to Indexing
Traditional Indexing
Most XML-enabled database systems provide some sort of basic indexing features. Index structures tend to be closely linked to the Schemas of the XML documents that they contain. In turn, the application logic that queries these databases must be intimately tied to the underlying Schema. This entails that application logic must sometimes be modified when the Schema of the XML documents evolves or is modified. These modifications are expensive in time and resources, especially considering the fact that in real life, a Schema is likely to evolve and be modified a number of times during development, or even later, after deployment of an application. Unfortunately, the one-to-one mapping of indexes to elements in most relational databases does not allow for much flexibility.

Conceptual Indexing, a Flexible Approach
TEXTML Server uses a rather different approach to indexing and querying, which we call “conceptual” indexing. In TEXTML Server, the index structure is designed by the database designer. The designer has the choice of deciding which elements and/or attributes will be indexed or not, and which type of index will be created on the content of these elements and attributes. At the heart of this approach is TEXTML Server’s “Index Definition”, a powerful concept that sets TEXTML Server apart. An index can accommodate multiple elements and attributes. For example, you may create an index called “Story Summary” which will enable you to search on the values contained in the elements <abstract>, <lead> and <summary>.

The Index Definition at the Heart of Conceptual Indexing
The indexing process relies on a configuration file, called “Index Definition” (which is itself an XML document stored in TEXTML Server). The Index Definition contains the declaration of the elements and attributes that must be indexed. The Index Definition uses XPath expressions and Namespaces to declare precisely what should be indexed. The Index Definition gives the database designer the ability to index elements or attributes from different XML Schemas/DTDs in the same index. For example, you may want to create an index called “Author” that allows you to search the contents of the element <StoryAuthor> in one document and <AuthorName> in a second. TEXTML Server’s Index Definition allows you to logically group these semantically similar elements into one unique index. The Index Definition is a powerful feature of TEXTML Server: it constitutes a layer of abstraction above the XML content, and this is what sets TEXTML Server’s indexing strategy apart.

Figure 4 - TEXTML Server Index Definition
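The idea of grouping semantically similar elements from different Schemas into one index can be sketched in Python. The dictionary below is only a stand-in for an Index Definition: its layout, the `extract_index_values` helper and the sample documents are assumptions made for illustration, and the real Index Definition is an XML document whose format is not shown here. The paths use the limited XPath subset supported by `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an Index Definition:
# index name -> element paths whose content feeds that index.
INDEX_DEFINITION = {
    "Author": [".//StoryAuthor", ".//AuthorName"],
    "Story Summary": [".//abstract", ".//lead", ".//summary"],
}


def extract_index_values(xml_text, definition):
    """Collect, per index, the text of every element the definition names."""
    root = ET.fromstring(xml_text)
    values = {}
    for index_name, paths in definition.items():
        found = []
        for path in paths:
            for elem in root.findall(path):
                text = "".join(elem.itertext()).strip()
                if text:  # empty or missing elements contribute nothing
                    found.append(text)
        values[index_name] = found
    return values


# Two documents that follow different, made-up structures.
doc_a = "<story><StoryAuthor>Paul Smith</StoryAuthor><lead>Rain expected.</lead></story>"
doc_b = (
    "<article><meta><AuthorName>Ann Lee</AuthorName></meta>"
    "<summary>Markets rose.</summary></article>"
)

values_a = extract_index_values(doc_a, INDEX_DEFINITION)
values_b = extract_index_values(doc_b, INDEX_DEFINITION)
```

Both documents end up contributing to the same "Author" and "Story Summary" indexes, so application code can query by index name without knowing which element each document used.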

Indexes at the Heart of the Index Definition
Indexes power all of TEXTML Server’s search capabilities. TEXTML Server indexes store the exact position of each occurrence of each word or string in each document. During the indexing process, the indexing engine analyzes the Index Definition, fetches the relevant information in each document and references that information in the appropriate indexes. Should the indexing process come across empty or missing elements and attributes, no additional space is required. In an RDBMS, by contrast, an empty element still takes up storage space, which is inefficient. TEXTML Server’s indexing process ensures that indexes do not grow exponentially as new documents are added to the Document Base.

Figure 5 - Indexing process and index structures

TEXTML Server supports five types of indexes:

- Word indexes, which allow searches for any word in XML elements.
- String indexes, which allow searches for the exact sequences of characters in elements or attributes.
- Numeric indexes, which allow searches for integers, floats or numeric ranges in elements or attributes.
- Date indexes, which allow searches for dates or date ranges in elements or attributes.
- Time indexes, which allow searches for times or time ranges in elements or attributes.

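A word index that records the position of every occurrence is commonly built as a positional inverted index. The Python sketch below shows that general technique; it is an assumption about the broad shape of such an index, not a description of TEXTML Server's internal structures, and `build_word_index` is a hypothetical helper.

```python
import re
from collections import defaultdict


def build_word_index(documents):
    """Map each lower-cased word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(position)
    return index


documents = {
    "doc1": "California wine and California sun",
    "doc2": "Sun over Quebec",
    "doc3": "",  # empty content adds no entries to the index
}
index = build_word_index(documents)
```

Because only words that actually occur are recorded, an empty element costs nothing, and the stored positions are what make proximity and adjacency searches possible.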
Another example of the flexibility of TEXTML Server’s indexing strategy is the date index. TEXTML Server’s date indexes may accommodate more than 90 different date formats in a single index, enabling an application to search a single index for dates expressed in U.S. and British English, Canadian French and Swiss German.

Since indexes update dynamically upon each document modification, addition or deletion, they already contain the terms and values, and their positions within each document, by the time a search is run. When a user performs a search, TEXTML Server queries only the appropriate indexes and returns precise results quickly, using minimal resources.
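One plausible way to let a single date index accept several notations is to normalize each value to a common representation before indexing it. The following Python sketch assumes a small, hypothetical list of formats (four rather than the 90-plus mentioned above) and English month names in the active locale; it does not reflect how TEXTML Server itself parses dates.

```python
from datetime import date, datetime

# Hypothetical subset of accepted formats; the first one that matches wins.
DATE_FORMATS = [
    "%Y-%m-%d",   # 2003-07-31
    "%B %d, %Y",  # July 31, 2003 (U.S. English)
    "%d %B %Y",   # 31 July 2003 (British English)
    "%d.%m.%Y",   # 31.07.2003 (numeric form common in Swiss German usage)
]


def normalize_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def in_range(text, start, end):
    """Range test on the normalized value, independent of the input notation."""
    value = normalize_date(text)
    return value is not None and start <= value <= end
```

Once every value is stored in one normalized form, a range query such as "January 1, 2003 to July 31, 2003" can be answered with plain comparisons regardless of how each document wrote its date.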

Conceptual Indexing Benefits
- Ease of maintenance. When the XML document structure changes (and it frequently does!) the only changes to be made are in the Index Definition. The application code need not be touched at all. In fact, as the indexing process in TEXTML Server is dynamic, the application does not even have to be stopped! Modifications may be performed while users are still querying TEXTML Server.
- Support of heterogeneous content is made easy. In real life situations, developers often come across the need to store, index and search XML documents that come from a variety of sources and that use different Schemas/DTDs. The Index Definition enables applications to accommodate XML content with completely different structures in a central repository where the structure of indexes can be updated easily, with minimal disruption to the application. According to our customer base, the simplicity of maintenance through the ability to update the Index Definition is the single most popular feature of TEXTML Server.
- Rich, typed indexes allow for efficient searches and result sorting. TEXTML Server indexes allow for word search, in full-text mode, as well as string, date, time and numeric value/range search. The ability to logically group related elements and attributes into indexes enables precise queries and optimized performance. Since TEXTML Server indexes may also be used for sorting of result sets, extensive multi-criteria sorting on any element or attribute provides maximum flexibility in ordering results, even in a result set of millions of documents.

Search Engine
With the use of indexes, sort and search performance in TEXTML Server remains constant.

Advanced Search Capabilities
TEXTML Server enables multi-index searches and also uses an extensive set of search operators to provide the most accurate results. Search operators include And, Or, And Not (exclusion), Near (unordered adjacency), adjacency, frequency, single character and multi-character wildcards… Advanced full-text search can apply to a complete document or to sections of a document as defined in the Index Definition. TEXTML Server’s search engine takes access rights into account and automatically removes from the search results any document the user is not allowed to view.
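Operators such as And, Or, And Not and Near map naturally onto set operations over a positional index. The Python sketch below illustrates that general technique on a tiny hand-built index; the data, function names and the distance parameter of `near` are all assumptions for illustration and say nothing about how TEXTML Server evaluates queries internally.

```python
# Hand-built positional index: word -> {doc_id: [word positions]}.
INDEX = {
    "native": {"d1": [0], "d2": [5]},
    "xml": {"d1": [1], "d2": [0], "d3": [2]},
    "relational": {"d3": [0]},
}


def docs(word):
    """Set of documents containing the word."""
    return set(INDEX.get(word, {}))


def and_(a, b):
    return docs(a) & docs(b)


def or_(a, b):
    return docs(a) | docs(b)


def and_not(a, b):
    return docs(a) - docs(b)


def near(a, b, distance):
    """Documents where a and b occur within `distance` words, in either order."""
    hits = set()
    for doc_id in docs(a) & docs(b):
        if any(
            abs(pos_a - pos_b) <= distance
            for pos_a in INDEX[a][doc_id]
            for pos_b in INDEX[b][doc_id]
        ):
            hits.add(doc_id)
    return hits
```

For example, `near("native", "xml", 1)` keeps only documents where the two words are adjacent, which is why the index must store positions and not just document identifiers.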

XML Query Language
TEXTML Server’s powerful query language allows for the creation of complex multi-criteria searches that combine full-text search with search for any non-full-text value, such as character strings, dates or numeric values/ranges. TEXTML Server’s query language is expressed as an XML document, making it ideal for both humans to understand and applications to generate programmatically. Below is a sample query to search for “all documents written by Paul Smith published between January 1st, 2003 and July 31st, 2003 that include the word California”.

<?xml version = "1.0"?>
<query Version = "3.5" RESULTSPACE = "Results">
  <andkey>
    <key NAME = "Full Text">
      <elem>California</elem>
    </key>
    <key NAME = "Author">
      <elem>Paul Smith</elem>
    </key>
    <key NAME = "PublicationDate">
      <start>January 1, 2003</start>
      <end>July 31, 2003</end>
    </key>
  </andkey>
</query>

Figure 6 - Sample TEXTML Server query
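Because a query is itself an XML document, an application can assemble one with any XML library. The Python sketch below rebuilds the sample query of Figure 6 using only the element and attribute names that appear in that sample; the `build_query` function and its parameters are hypothetical, and the sketch makes no claim about other parts of the query language or about how a query is submitted to the server.

```python
import xml.etree.ElementTree as ET


def build_query(full_text, author, start, end):
    """Return a query document shaped like the Figure 6 sample."""
    query = ET.Element("query", {"Version": "3.5", "RESULTSPACE": "Results"})
    andkey = ET.SubElement(query, "andkey")

    text_key = ET.SubElement(andkey, "key", {"NAME": "Full Text"})
    ET.SubElement(text_key, "elem").text = full_text

    author_key = ET.SubElement(andkey, "key", {"NAME": "Author"})
    ET.SubElement(author_key, "elem").text = author

    date_key = ET.SubElement(andkey, "key", {"NAME": "PublicationDate"})
    ET.SubElement(date_key, "start").text = start
    ET.SubElement(date_key, "end").text = end

    # ElementTree escapes special characters in the search terms for us.
    return ET.tostring(query, encoding="unicode")


query_xml = build_query("California", "Paul Smith", "January 1, 2003", "July 31, 2003")
```

Building the document through an XML API rather than string concatenation keeps the query well-formed even when a search term contains characters such as `<` or `&`.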

TEXTML RSA - Replication Service Agent
TEXTML RSA, TEXTML Server’s Replication Service Agent, allows distributing search load on multiple servers by exposing a TEXTML Server Document Base as the Publisher of content to one or many separate Document Bases, which act as Subscribers. The Publisher and its Subscribers may be on a single machine or distributed across multiple servers on a LAN or over the Internet. The Publisher is the Document Base that holds the main source of information. Subscribers are Document Bases that contain a copy of the Publisher and may subscribe in either Push or Pull mode.

Because it creates copies of a Document Base’s content, TEXTML RSA enables the development of applications that are more reliable in case of failure, since these applications can fall back on the copies for content. TEXTML Server replication can run continuously, according to a predetermined schedule, or on-demand.

Below are a few examples of how TEXTML RSA may be used in specific scenarios where replication may be desired.

Replicating content to distribute search load among several TEXTML Server instances, in the case of a high-traffic public website with extremely heavy query loads.

Figure 7 - Search load balancing architecture

Distributing content through the Internet or a WAN to remote servers, mostly used by large organizations with facilities spread out geographically.

Figure 8 - Distribution of content to multiple sites

Distributing content on-demand to stand-alone computers that connect to the network from time to time to get the latest updates.

Figure 9 - On-demand distribution of content to stand-alone computers
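The difference between Push and Pull subscription in the scenarios above can be shown with a toy Python model. The classes and method names are hypothetical, the copies are naive full copies, and nothing here describes TEXTML RSA's actual protocol or configuration.

```python
class DocumentBase:
    """Toy document base: just a name -> content mapping."""

    def __init__(self):
        self.docs = {}


class Publisher(DocumentBase):
    """Holds the main source of content and pushes changes to subscribers."""

    def __init__(self):
        super().__init__()
        self.push_subscribers = []

    def subscribe_push(self, subscriber):
        subscriber.docs = dict(self.docs)  # initial full copy
        self.push_subscribers.append(subscriber)

    def put(self, name, content):
        self.docs[name] = content
        for subscriber in self.push_subscribers:
            subscriber.docs[name] = content  # Push: the Publisher initiates


class PullSubscriber(DocumentBase):
    """Stand-alone copy that refreshes only when it asks to."""

    def pull(self, publisher):
        self.docs = dict(publisher.docs)  # Pull: the Subscriber initiates


publisher = Publisher()
publisher.put("a.xml", "<a/>")

mirror = DocumentBase()       # e.g. a search server kept continuously in sync
publisher.subscribe_push(mirror)

laptop = PullSubscriber()     # e.g. a computer that connects from time to time
publisher.put("b.xml", "<b/>")
```

In this model the push subscriber already holds both documents, while the pull subscriber stays empty until it calls `laptop.pull(publisher)`, mirroring the on-demand scenario.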

TEXTML FTS - Fault Tolerance Server
In order to meet the requirements imposed by mission critical systems that cannot afford downtime, TEXTML Server can be used in two failover configurations: using Microsoft Windows Clustering or by deploying TEXTML FTS (Fault Tolerance Server).

Using TEXTML Server with Microsoft Windows Clustering
In such a configuration several servers share the same storage resources. Windows Clustering directs transactions to server A. If server A fails, Windows Clustering moves the transactions to server B. This is called failover. During failover, if the Document Base is stable, the service is brought back online in less than a minute. If the failure of server A left the Document Base unstable, server B detects the corruption and automatically repairs the Document Base.

Figure 10 - Windows Clustering and TEXTML Server

Using TEXTML Server with TEXTML FTS
For Document Bases that require high availability and minimum downtime, IXIASOFT proposes the use of TEXTML FTS, a fault tolerance server. This server ensures that two or more servers and Document Bases installed in parallel are synchronized so that one of the servers can take over almost instantly if the other one fails.

Figure 11 - TEXTML FTS (Fault Tolerance Server)

For more information about Windows Clustering services and TEXTML FTS, please contact IXIASOFT.
Summary
TEXTML Server has been specifically designed to meet new requirements stemming from widespread XML content management. Each of TEXTML Server’s functions has been developed with XML in mind to enable users to make full use of rich XML content with superior content indexing and retrieval performance. Thanks to its innovative technology, TEXTML Server reaches unmatched efficiency and performance levels in the world of XML databases, while needing minimum hardware and software resources:

- TEXTML Server stores, indexes and searches rich XML documents containing structured, semi-structured or unstructured content, as well as non-XML documents.
- Thanks to the Index Definition, developers are no longer limited by the structure of the documents. They are totally free to choose what content they want indexed and can adapt the indexes to the application functions they want to offer their end-users. They are no longer compelled to adapt the application to varied and variable document structures.
- The dynamic indexing engine allows for real-time content modifications that can be viewed instantly by end-users. The content published by the application is always up-to-date and reliable, and the end-user does not notice any service interruption while content is being updated.
- Check-in/check-out and security functions ensure content integrity and document confidentiality when needed. Since these features are built into TEXTML Server, no additional development is required to implement them in third party applications.
- Automated Document Base recovery, replication and fault tolerance features make the final application reliable and robust as far as content availability and integrity are concerned.

In a nutshell, developers of document-centric applications can concentrate their efforts on developing features of their application while relying on TEXTML Server for storage, indexing and content search and retrieval.

Other Related Documentation
TEXTML Server User Documentation
- Introducing TEXTML Server
- Installation Guide
- Sample Client Application Guided Tour
- Administrator’s Guide
- Administration Console Reference
- Creating Client Applications for TEXTML Server – Programmer’s Guide
- API and DTD Help

These documents are installed with TEXTML Server and also available online at

Additional Product Information
Please visit our Web site for case studies, technical papers, press releases and more:
