   Composite Data Virtualization
       And NOSQL Data Stores
               Composite Software

                     October 2010
TABLE OF CONTENTS
  INTRODUCTION
  BUSINESS AND IT DRIVERS
  NOSQL DATA STORES LANDSCAPE
    TABULAR / COLUMNAR DATA STORES
    DOCUMENT STORES
    GRAPH DATABASES
    KEY/VALUE STORES
    OBJECT AND MULTI-VALUE DATABASES
    MISCELLANEOUS NOSQL SOURCES
  INTEGRATING NOSQL DATA STORES USING DATA VIRTUALIZATION
    TABULAR/COLUMNAR DATA STORES
    XML DOCUMENT STORES
    KEY/VALUE STORES
  SUMMARY




                                                                                                                             Composite Software | 2
INTRODUCTION
There is a trend in the data storage and management arena to consider data storage options
beyond the traditional SQL-based relational database. The overall movement began in 2009
and was known as NoSQL (meaning “no SQL”), but that label has since evolved into NOSQL
(meaning “not only SQL”). Unfortunately, both labels say more about what these data stores are
not than what they are, and this is the source of ongoing confusion for this whole class of data stores.

The general definition of a NOSQL data store is that it manages data that is not strictly tabular
and relational, so it does not make sense to use SQL for the creation and retrieval of the data.
More specifically, NOSQL data stores are usually non-relational, distributed, open-source, and
horizontally scalable, although there are exceptions to each of these for specific NOSQL data
stores.

While NOSQL access standards have yet to fully develop, each implementation provides some
sort of Java-based development API appropriate for accessing that type of NOSQL data. The
Composite Data Virtualization Platform typically uses these APIs to access and integrate
NOSQL data, and three kinds of NOSQL data sources are a natural fit for integration.

This paper describes the primary NOSQL data sources in the market today and how to
integrate them with other sources using the Composite Data Virtualization Platform.




BUSINESS AND IT DRIVERS
The main driver for the creation of NOSQL data stores was the emergence of “web-scale” data
— i.e., massive amounts of data — at the large web sites and services like Amazon, Google,
Yahoo!, Facebook, etc. A number of NOSQL data stores emerged from custom engineering
development done at these large companies. More recently, predictive analytics,
voice-of-the-customer, churn, fraud and other “big data” use cases have emerged to further
accelerate demand.

Storing and processing this data revealed several specific motivations for these new data stores
including:

•   Cost per Terabyte: Many of the NOSQL data sources were invented to handle web-scale
    data that is created in enormous volumes (e.g., web site click streams), and storing this
    much data in a traditional relational database would be expensive and inefficient. Many of
    the NOSQL data sources are open source and run on commodity hardware, making them
    considerably less expensive per terabyte than traditional databases from vendors like
    Oracle and Teradata.
•   Distributed Processing: Web-scale data is so large that the traditional database approach
    to storage, indexing, and retrieval does not work very well with this class of data. NOSQL
    data sources introduce storage architectures that scale horizontally, together with parallel algorithms
    designed to efficiently process the distributed data (“map-reduce” being the most prominent
    example).
•   Data Shape Appropriateness: Many successful web-based services have introduced data
    that is not efficiently represented as relational, motivating new data structures more
    appropriate to the data. For example, social media web sites employ graph databases to
    represent the social relationships inherent in these services.
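
The map-reduce pattern mentioned under Distributed Processing can be illustrated with a minimal single-process sketch. It is not tied to any particular NOSQL product: the click-stream data and the function names (map_phase, shuffle, reduce_phase) are assumptions made up for this illustration, and a real system would run the map step on many machines in parallel.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical click-stream records, split across three "nodes" (partitions).
partitions = [
    ["/home", "/cart", "/home"],
    ["/home", "/checkout"],
    ["/cart", "/cart", "/home"],
]

def map_phase(partition):
    """Map: emit a (page, 1) pair for every click in one partition."""
    return [(page, 1) for page in partition]

def shuffle(mapped_pairs):
    """Shuffle: group the emitted values by key across all partitions."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}

# Each partition could be mapped on a separate machine; here it runs in a loop.
mapped = chain.from_iterable(map_phase(p) for p in partitions)
click_counts = reduce_phase(shuffle(mapped))
print(click_counts)
```

Because each partition is mapped independently, adding machines adds capacity, which is the horizontal scaling property described above.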




NOSQL DATA STORES LANDSCAPE
Although the original emergence of NOSQL data stores was motivated by web-scale data, the
movement has grown to encompass a wide variety of data stores that just happen to not use
SQL as their processing language (making it difficult to characterize exactly what a NOSQL
data store is). There is no general agreement on the taxonomy of NOSQL data stores, but the
categories below capture much of the landscape.

Tabular / Columnar Data Stores
Storing sparse tabular data, these stores look most like traditional tabular databases. Examples
include Hadoop/HBase (Yahoo!), BigTable (Google), Hypertable and VoltDB. Their primary
data retrieval paradigm utilizes column filters, generally leveraging hand-coded map-reduce
algorithms.

Document Stores
These NOSQL data sources store unstructured (e.g., text) or semi-structured (e.g., XML)
documents. Examples include MongoDB, MarkLogic and CouchDB. Their data retrieval
paradigm varies widely, but documents can always be retrieved by unique handle. XML data
sources leverage XQuery. Text documents are indexed, facilitating keyword search-like
retrieval.

Graph Databases
These NOSQL sources store graph-oriented data with nodes, edges, and properties and are
commonly used to store associations in social networks. Examples include Neo4j,
AllegroGraph and FlockDB. Data retrieval focuses on retrieving associations from a particular
node.
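
As a rough illustration of the node, edge, and property model, the sketch below keeps a tiny graph in memory and retrieves the associations of one node. The class TinyGraph and its methods are hypothetical and do not correspond to the API of any product named above.

```python
from collections import defaultdict

class TinyGraph:
    """In-memory property graph: nodes and labeled edges, each with properties."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.edges = defaultdict(list)  # node id -> [(label, target id, properties)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target, **props):
        self.edges[source].append((label, target, props))

    def associations(self, node_id, label=None):
        """Ids of nodes one outgoing edge away, optionally filtered by edge label."""
        return [target for (edge_label, target, _) in self.edges[node_id]
                if label is None or edge_label == label]

graph = TinyGraph()
for person in ("alice", "bob", "carol"):
    graph.add_node(person, kind="person")
graph.add_edge("alice", "FOLLOWS", "bob", since=2009)
graph.add_edge("alice", "FOLLOWS", "carol", since=2010)
graph.add_edge("bob", "FOLLOWS", "alice", since=2010)

print(graph.associations("alice", "FOLLOWS"))
```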

Key/Value Stores
These sources store simple key/value pairs like a traditional hash table. They are further
subdivided into in-memory and disk-based solutions. This category of NOSQL systems
probably has the largest number of members, each embodying slightly different characteristics.
Examples include Memcached, Cassandra (Facebook), SimpleDB, Dynamo (Amazon),
Voldemort (LinkedIn) and Kyoto Cabinet. Their data retrieval paradigm is simple: given a key,
return the value. Some offer more complex “querying” mechanisms that can look inside the
value, but normally the value is considered opaque.

Object and Multi-value Databases
These types of stores preceded the NOSQL movement, but they have found new life as part of
the movement. Object databases store objects (as in object-oriented programming). Multi-value
databases store tabular data, but individual cells can store multiple values. Examples include
Objectivity, GemStone and Unidata. Proprietary query languages are used to retrieve data.

Miscellaneous NOSQL Sources
Several other data stores can be classified as NOSQL stores, but they don’t fit into any of the
categories above. Examples include: GT.M, IBM Lotus/Domino, and the ISIS family.



INTEGRATING NOSQL DATA STORES USING DATA VIRTUALIZATION
The Composite Data Virtualization Platform provides a complete development and runtime
environment for discovering, accessing, federating, abstracting and delivering data from diverse
sources. Access is typically done via standards-based protocols and APIs, for example JDBC
and ODBC for SQL-based sources, HTTP and SOAP for Web services, JMS for messages, and
APIs for enterprise and cloud-based applications. Through these methods, source data is
securely exposed from a single virtual location, regardless of how and where it is physically
stored.

While NOSQL access standards have yet to fully develop, each implementation provides some
sort of Java-based development API appropriate for accessing that type of NOSQL data. The
Composite Data Virtualization Platform uses these APIs as well as Composite’s Custom Java
Procedure (CJP) resource to access and integrate NOSQL data. Three kinds of NOSQL
systems are a particularly natural fit for this integration approach. These include
Tabular/Columnar Data Stores, XML Document Stores, and Key/Value Stores. A more detailed
integration approach for each of these is outlined below.

Over time, as NOSQL leaders emerge and usage patterns solidify, Composite may elect to
provide more in-depth integrations with particular NOSQL data stores through the creation of
fully supported adapters.

Tabular/Columnar Data Stores
Because the original implementation of the Composite Data Virtualization Platform integrated
tabular data, retrieving and processing data from this category of NOSQL data store is an easy
fit. This approach leverages Composite’s ability to incorporate “table functions” in the FROM
clause of a SQL statement. That is, any Composite procedure resource that returns a cursor
can be dropped into the View editor as a table, where it will show up in the FROM clause of the
SQL statement.

For a specific NOSQL data store, a collection of CJP table functions can be implemented that
leverage the NOSQL system’s Java API. Each CJP would provide access to a different table in
the underlying NOSQL data store. The CJPs can take input arguments to filter the data from the
table, further leveraging the NOSQL system’s processing capability. The values of the filters
can even be specified at run-time from a client query by leveraging the “virtual column”
capability of Views.
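
The shape of such a table function can be sketched as follows. This is only an illustration of the idea: an actual CJP would be written in Java against the NOSQL system's Java API, and the in-memory dictionary CLICKS, the column list COLUMNS, and the function clicks_table are hypothetical stand-ins written in Python for brevity.

```python
from typing import Iterator, Optional

# Stand-in for one sparse table in a columnar store: row key -> {column: value}.
# The data and all names below are hypothetical.
CLICKS = {
    "r1": {"page": "/home", "country": "US", "ms": 120},
    "r2": {"page": "/cart", "country": "DE"},  # sparse row: no "ms" column
    "r3": {"page": "/home", "country": "DE", "ms": 95},
}

COLUMNS = ("row_key", "page", "country", "ms")  # fixed shape of the returned cursor

def clicks_table(page: Optional[str] = None,
                 country: Optional[str] = None) -> Iterator[tuple]:
    """Table function: yields rows as fixed-width tuples, like a cursor.

    The filter arguments are applied while scanning the store, so rows that
    do not match never leave the data source.
    """
    for row_key, cols in CLICKS.items():
        if page is not None and cols.get("page") != page:
            continue
        if country is not None and cols.get("country") != country:
            continue
        # Columns missing from a sparse row surface as None (SQL NULL).
        yield (row_key,) + tuple(cols.get(c) for c in COLUMNS[1:])

rows = list(clicks_table(page="/home", country="DE"))
print(rows)
```

One function per underlying table keeps each cursor's column list fixed, which is what allows the result to be treated like a table by the consuming query.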

It is worth remembering that these tabular/columnar NOSQL data sources store very large data
sets, so caution must be used on large queries. The table function implementation should
ensure sufficient data reduction in the target data source by leveraging input parameters. Also,
the processing of requests to these data sources can take a very long time (more like batch
jobs than live queries), so employing some form of caching would probably be prudent.
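
One simple form of caching is to keep each result for a fixed time, keyed by the filter values that produced it. The sketch below assumes a time-to-live policy, which this paper does not prescribe; the class TtlCache and the function slow_scan are hypothetical names and do not describe Composite's own caching feature.

```python
import time

class TtlCache:
    """Caches results of a slow call, keyed by its arguments, for ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # args -> (expiry time, result)
        self.misses = 0

    def get(self, *args):
        now = self._clock()
        entry = self._entries.get(args)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the slow call
        self.misses += 1
        result = self._fetch(*args)
        self._entries[args] = (now + self._ttl, result)
        return result

def slow_scan(country):
    # Placeholder for a long-running scan of the NOSQL store.
    return [("r3", "/home", country)]

cache = TtlCache(slow_scan, ttl_seconds=300)
cache.get("DE")  # first call runs the scan
cache.get("DE")  # second call within 300 seconds is served from the cache
print(cache.misses)
```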

This approach provides full access to the data in the underlying NOSQL system and it will likely
meet most near-term needs. There are, however, some disadvantages and inefficiencies in this
approach. For example, all the columns specified in the CJP’s cursor would always be
retrieved, even if they weren’t all necessary for the current query. Also, more generic filtering
and aggregation might be possible with the underlying system, but the CJP provides only a
limited interface to expose that capability to Composite. If a particular NOSQL Tabular data
store becomes quite popular, it would be an ideal candidate for Composite to develop a custom
adapter that would fully integrate and leverage that specific data source’s capabilities.

XML Document Stores
Because XML document stores utilize XQuery as their preferred data retrieval paradigm, the
Composite Data Virtualization Platform leverages its embedded XQuery engines and XML
native data type to easily retrieve and further process documents from this category of NOSQL
data store.

For a specific NOSQL XML document store with a Java API, a minimum of two CJPs
is required. Both CJPs return an XML document that can be further manipulated by any of the
upstream XML manipulation functionality (e.g., XSLT Transformations). The first CJP would
take a document handle (unique identifier) as its only input argument, and then leverage the
API to retrieve and return that document. The second CJP would take an XQuery specification
as its only input argument, and then leverage the API to execute the query and return the
results as a single document. Of course, additional CJPs accepting more specific parameters
could also be implemented, facilitating easier integration into multiple views.
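
The two procedures can be sketched as follows. This is an illustration under stated assumptions: a real CJP would be written in Java and would pass an XQuery expression to the document store's API, whereas this Python sketch uses an in-memory dictionary (STORE) as the store and the limited XPath subset of the standard library's ElementTree module in place of XQuery. The function names get_document and query_documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for an XML document store: handle -> serialized document (hypothetical data).
STORE = {
    "order-1": "<order id='1'><item sku='A1' qty='2'/><item sku='B7' qty='1'/></order>",
    "order-2": "<order id='2'><item sku='A1' qty='5'/></order>",
}

def get_document(handle: str) -> ET.Element:
    """Procedure 1: fetch one document by its unique handle."""
    return ET.fromstring(STORE[handle])

def query_documents(path: str) -> ET.Element:
    """Procedure 2: run a path query over every document and wrap all
    matches in a single <results> document."""
    results = ET.Element("results")
    for text in STORE.values():
        for match in ET.fromstring(text).findall(path):
            results.append(match)
    return results

doc = get_document("order-1")
hits = query_documents(".//item[@sku='A1']")
print(ET.tostring(hits, encoding="unicode"))
```

Returning the query matches under one root element keeps the second procedure's output a single well-formed document, so it can be handed to downstream XML processing in the same way as the output of the first.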

This approach provides full access to the data in the underlying XML data source, and it will
likely be sufficient for most needs.

Key/Value Stores
The Composite Data Virtualization Platform can integrate key/value stores in two ways.

The first is through a custom SQL function. That is, a function can be created that takes the key
as a parameter, and returns the value. This function can then be used in multiple SQL
statements throughout Composite.
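
The idea of a key-to-value function applied once per row can be sketched as below. The dictionary KV_STORE stands in for a key/value store client, and kv_lookup and the sample order rows are hypothetical; the SQL in the comment is only an informal analogy, not Composite syntax.

```python
# Stand-in for a key/value store client (hypothetical data); values are opaque strings.
KV_STORE = {"cust:1": "Gold", "cust:2": "Silver"}

def kv_lookup(key, default=None):
    """Scalar function: given a key, return the stored value (or default)."""
    return KV_STORE.get(key, default)

# Rows from some other source, for example a relational ORDERS table.
orders = [(101, "cust:1"), (102, "cust:2"), (103, "cust:9")]

# Equivalent in spirit to: SELECT order_id, kv_lookup(cust_key) FROM orders
enriched = [(order_id, kv_lookup(cust_key)) for order_id, cust_key in orders]
print(enriched)
```

A key that is absent from the store yields the default (None here), which corresponds to a NULL in the result column.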

In the second, Composite leverages the in-memory key/value store as a cache target. This is
the primary use-case typically described by our enterprise customers. This approach is best for
small data sets or procedure results, but it doesn’t work as well for large tabular data sets.
Further, this form of cache integration is often challenged by the impedance mismatch between
cached tabular data and cached key/value data (the cached data is opaque inside the key/value
store), so the entire set must be retrieved for processing. This form of integration is available
today from our professional services organization.
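
The impedance mismatch can be seen in a small sketch. It assumes the result set is serialized as JSON under a single key, which is one possible encoding and not necessarily the one used in practice; kv_cache, cache_result_set and read_result_set are hypothetical names.

```python
import json

kv_cache = {}  # stand-in for an in-memory key/value store

def cache_result_set(cache_key, columns, rows):
    """Store a whole tabular result as one opaque serialized value."""
    kv_cache[cache_key] = json.dumps({"columns": columns, "rows": rows})

def read_result_set(cache_key):
    """Fetch and deserialize the entire result set; the store cannot filter inside it."""
    payload = json.loads(kv_cache[cache_key])
    return payload["columns"], payload["rows"]

cache_result_set(
    "view:top_customers",
    ["id", "region"],
    [[1, "EMEA"], [2, "APAC"], [3, "EMEA"]],
)

# Filtering happens only after the full value has been pulled back.
columns, rows = read_result_set("view:top_customers")
region = columns.index("region")
emea = [row for row in rows if row[region] == "EMEA"]
print(emea)
```

Even though only two rows are wanted, all three travel from the cache to the caller, which is why this approach suits small result sets better than large tables.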




SUMMARY
NOSQL data stores are proliferating as a means of supporting web-scale data. Recently
predictive analytics, voice-of-the-customer, churn, fraud and other “big data” use cases have
emerged to further accelerate demand.

There is a wide variety of NOSQL systems, each with its own set of use-cases and
advantages. Each NOSQL data store has a unique and non-standard API that can be used to
access and integrate that source.

The Composite Data Virtualization Platform is well suited for integrating data from these
NOSQL sources with other data within and outside the enterprise.

This paper describes integrations for three flavors of NOSQL data stores: Tabular/Columnar
Data Stores, XML Document Stores, and In-Memory Key/Value Stores. Today, Composite can
provide basic access to data from any of these NOSQL data stores with minimal programming,
using standard resources. In the longer term, when leaders in particular areas of the NOSQL
landscape emerge, Composite may provide deeper integrations through standard product
adapters within the Composite Application Data Services product line.




ABOUT COMPOSITE SOFTWARE

Composite Software, Inc.® is the data virtualization gold standard at ten of the top 20 banks, six of the top ten
pharmaceutical companies, four of the top five energy firms, major media and technology organizations, and
multiple government agencies.

These are among the hundreds of global organizations with disparate, complex information environments that count
on Composite to increase their data agility, cut costs and reduce risk. Backed by nearly a decade of pioneering
R&D, Composite is the data virtualization performance leader, scaling from project to enterprise for data federation,
data warehouse extension, enterprise data sharing, real-time and cloud computing data integration.

Founded in 2002, Composite Software is a privately held, venture-funded corporation based in Silicon Valley. For
more information, please visit www.compositesw.com.

				