Converging Text and BI--The Case for a Content Mining Platform




Although enterprises commonly utilize business intelligence (BI) tools

against structured data for analysis and decision making, leading

organizations recognize that they must take a more holistic view of their

information assets and find ways to creatively analyze the exponentially

growing universe of unstructured content - contracts, press releases, filings,

forms, call center notes, medical records, insurance claims, web content,

emails, etc. This white paper describes how the Clarabridge Content Mining

Platform™ avoids the pitfalls of previous approaches to unstructured

analysis, and capitalizes on lessons learned from solving similar problems

in the structured domain. A platform approach enables enterprises to

efficiently and effectively source, transform, store, and analyze unstructured

data alongside structured data – in a way that is easy to manage. The result

is broader business understanding, the ability to leverage existing

resources, and the freedom to rapidly apply the most appropriate decision

support interface.

WHITE PAPER



March 6, 2006

Converging Text and BI: the Case for an Unstructured Intelligence Platform









Executive Summary

It is becoming ever more important for today's agile enterprise to use the best available data to drive strategic and

operational business decisions. Although most companies deploy business intelligence (BI) tools against structured

data to answer a wide variety of questions, leading organizations are increasingly recognizing that they must take a

more holistic view of their information assets. They find creative ways to analyze the exponentially growing universe

of unstructured content – contracts, press releases, research papers, filings, call center notes, medical records,

insurance claims, web content, emails, etc. This content, when understood and analyzed alongside structured data,

provides business insight that enables organizations to better serve customers, control cost and risk, compete

effectively, and drive profitability.



Text processing technologies are rapidly maturing to enable concept/entity extraction, relationship tagging, and other paradigms that allow more structure to be applied to unstructured data. Search technologies are evolving to provide end users with better ways to retrieve text, but provide limited to no analytic insight, which makes the determination of precise answers to questions time consuming, tedious, and increasingly difficult. Even more advanced implementations of text processing technologies require complex programming work and are, like search engine technologies, totally disconnected from time-tested analysis approaches used in the BI world.

[Figure 1 – Currently structured and unstructured analysis are done in different ways and with different tools. The diagram shows a knowledge worker using Search against unstructured data and BI, OLAP, and reporting against structured data. Currently end users have separate interfaces for structured and unstructured data: Search for unstructured, and BI for structured. Example pairings from the figure, as unstructured source / business question / structured source:

• Call center notes / How can we improve satisfaction? / Customer demographics

• Warranty repair notes / What is root cause of problem? / Service ticket & outcome

• Clinical notes / How do symptoms change over time? / Patient records]

So how can enterprises better enable users to spend their days making informed decisions versus gathering data? Fortunately, there is much to be learned from two decades of struggling with similar problems in the structured data world. We now know that as needs change and evolve, organizations will require the flexibility to integrate the most appropriate text processing technologies to extract desired information. They must enable users to apply time-tested analytical approaches that can be modified or expanded upon as understanding of issues and opportunities emerges from the data itself. For example, a call center should be able to apply a multi-dimensional analysis (i.e., “slice and dice”) to call center logs and email text for assessing trends, root causes, and relationships between issues, people, time to resolution, etc. Organizations should have the infrastructure, storage, and user interfaces to process and efficiently explore large volumes of data. And they need to easily leverage their existing BI and data warehousing (DW) tools, presently used only for structured data analyses, to analyze unstructured data alongside structured data.
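The “slice and dice” idea can be illustrated with a minimal Python sketch. It assumes a text-processing step has already tagged each call center note with an issue; the records, field names, and the helper functions `slice_and_dice` and `average_minutes` are hypothetical and are not part of any product described here.

```python
from collections import Counter

# Hypothetical call records: structured fields (region, month, minutes) plus
# an "issue" tag assumed to have been extracted from the free-text notes.
CALLS = [
    {"region": "East", "month": "2006-01", "issue": "billing", "minutes": 12},
    {"region": "East", "month": "2006-01", "issue": "outage", "minutes": 30},
    {"region": "West", "month": "2006-01", "issue": "billing", "minutes": 8},
    {"region": "West", "month": "2006-02", "issue": "outage", "minutes": 25},
    {"region": "East", "month": "2006-02", "issue": "outage", "minutes": 35},
]


def slice_and_dice(records, dimensions, where=None):
    """Count records grouped by the given dimensions, after an optional filter."""
    where = where or {}
    counts = Counter()
    for rec in records:
        if all(rec[key] == value for key, value in where.items()):
            counts[tuple(rec[dim] for dim in dimensions)] += 1
    return dict(counts)


def average_minutes(records, dimension):
    """Average time to resolution (in minutes) per value of one dimension."""
    totals, seen = Counter(), Counter()
    for rec in records:
        totals[rec[dimension]] += rec["minutes"]
        seen[rec[dimension]] += 1
    return {key: totals[key] / seen[key] for key in totals}


by_issue = slice_and_dice(CALLS, ["issue"])
january = slice_and_dice(CALLS, ["region", "issue"], where={"month": "2006-01"})
resolution = average_minutes(CALLS, "issue")
```

Once the issue tag sits next to the structured fields, the same grouping and filtering used for ordinary BI reporting applies to the text-derived dimension without any special handling.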



As organizations adopt analytical approaches to unstructured data, they will need to address a number of challenges:



• Data comes from multiple unstructured repositories (file servers, document management systems, intranet sites,

internet sites, database notes fields, etc.)



• Data in unstructured documents is of widely varying quality (often much more so than structured)



• The use of different types of unstructured data tools varies greatly from environment to environment and from

problem to problem.









© 2006 Clarabridge, Inc. All Rights Reserved. Page 2 of 23

All other trademarks and logos are property of their respective owners.










• In many cases maximum value in analyzing unstructured data comes from analyzing it alongside existing

structured data in data marts or data warehouses



Fortunately, many of the challenges with unstructured data analytics can be overcome by applying lessons from the BI and DW sectors. Over the past 10 years, the departmental point solutions of the early 1990s rapidly evolved into more robust solutions that leveraged enterprise data warehousing platforms, an extract, transform, and load (ETL) infrastructure, and scalable, server-based BI or reporting solutions.



To be successful in the unstructured world, organizations need a platform to leverage their existing BI investments

and also efficiently and effectively source, transform, store, and analyze unstructured data – and do so in a way that

is easy to manage and scale. That was the vision behind the Clarabridge Content Mining Platform™. The Clarabridge

Content Mining Platform enables enterprises to:



• Source. The Clarabridge platform connects to a variety of source systems and data types.



• Transform. Once Clarabridge sources the unstructured data, a variety of out-of-the-box and third-party tools help

to ensure it is understood, merged, and integrated with other structured and unstructured data sources.



• Store. As unstructured data volumes explode, Clarabridge responds with a highly scalable architecture that

utilizes proven data warehousing techniques and platforms.



• Analyze. End-users are able to efficiently analyze large volumes of data, using whatever analytical technique or

tool they feel is appropriate for the problem at hand.



• Manage. As the application evolves and grows, the IT organization does not have to maintain lots of custom

coding or extensions with Clarabridge. Further, the architecture scales and integrates into existing efforts.



Using the Clarabridge Content Mining Platform enables users to directly mine text alongside existing structured data,

using standard BI tools and analysis techniques, to address a host of real-world business needs. The benefits are

enormous and include:



• Broader analysis capabilities. Users spend more time analyzing and less time retrieving text. They are free to

apply proven analysis approaches to virtually unlimited data to detect trends, issues, and opportunities revealed

by their unstructured data. Further, users enjoy a holistic view of their information assets to enable analytical

discovery across multiple problem domains, data types, and source systems.



• Faster ROI. Organizations have made significant investments in BI and DW solutions. Customers are able to

rapidly extend the value of those investments and their trained staff rather than deploying new tools and analysis

approaches.



• Rapid time-to-value. Rather than “re-invent the wheel”, organizations are able to rapidly integrate and leverage

leading unstructured and structured tools, including: statistical analysis, reporting, visualization, search, data

mining, and text processing engines. Leveraging a platform allows less up-front work when developing an

application and delivers fast data access, data quality, and time to analysis.


















Tapping Unstructured Data to Drive Business Value

Organizations today are buried in unstructured content such as contracts, press releases, research papers, forms,

filings, call center notes, medical records, insurance claims, web content, emails, etc. Experts agree this content

represents more than 80% of an organization’s data. And the amount is growing every day. Furthermore, in an

increasingly services-based economy, unstructured transcripts, notes and documents describing business activity

provide important insights about customers’ habits, tastes, product use and support requirements, employee work

habits and performance, and business process efficiencies and failures.



It is becoming ever more important for today's agile enterprise to utilize the best available data to drive strategic and

operational business decisions. Although most companies utilize business intelligence (BI) tools against structured

data to answer a wide variety of questions, leading organizations are increasingly recognizing that they must find

ways to creatively analyze the exponentially growing universe of unstructured content.



Unfortunately, the structured and unstructured data analysis domains have traditionally been separated along a

number of dimensions including analysis approaches, storage, and staff. Typically analysis of unstructured data

involves using a search tool to find documents containing information you are looking for, whereas structured data

involves using BI or data mining tools to report on performance indicators, trends, changes over time or other

quantitative metrics of business activity. Unstructured data is typically stored in file-based servers (such as web

servers, document management servers, etc.), while structured data is almost always stored in relational database

management systems (RDBMS). Lastly, staff trained in BI are typically not skilled in the linguistic and other

specialized techniques required for analyzing unstructured content and thus rarely use the tools and technologies associated with unstructured data analysis. What is needed is a way to converge the two domains, leveraging the best from both, to unlock the true potential of unstructured data.

[Figure 2 – Unstructured and Structured information have traditionally been separated along a number of dimensions. “Unstructured” Information: Web Content/Intranets, Documents, Spreadsheets, Emails, Web Content, Paper Files. “Structured” Information: Data Warehouses, Data Marts, CRM, ERP, etc. Systems, Operational Data Stores. A “Metadata” label also appears in the diagram.]

When understood and analyzed alongside structured data, unstructured data provides business insight that enables organizations to better serve customers, control cost and risk, and identify opportunities for increased efficiency. Example use scenarios include:



• Automatically identifying top issues in (unstructured) call center logs and proactively routing calls to the right

person based on the issue can save millions through reduced call time, not to mention improved customer

service.



• Identifying and addressing the top types of problems encountered by the most profitable customers can help

reduce loyal customer churn.


















• Rapidly detecting emerging product trends in problem reports coming in from all over the globe can avoid recalls and lawsuits, potentially saving companies millions of dollars.



• Analyzing patient comments, doctor notes, and symptom data can lead to better disease management and

identification of new uses for drugs.



• Capitalizing on customer feedback following a product launch can help adjust marketing campaigns months

ahead of competitors.



• Reducing hundreds of boxes of documents down to the two that are relevant as part of the legal discovery

process reveals previously hidden information in less time than if all documents were read by human beings, and

focuses critical resources on higher value tasks.



• Analyzing communications patterns, claims data and patient records to identify insurance fraud can significantly

reduce fraudulent claims.



• Automatically mining thousands of SEC reports to predict poor corporate governance can help identify issues

before they turn into major crises.





The Evolution of Unstructured Analytics

If the potential is so great, why aren’t more organizations employing unstructured analytics? In large part the

underlying technologies for unstructured analytics have only recently matured to support the types of analysis

suggested above. Text processing technology has progressed from first generation keyword search to second

generation point text analysis applications. Reviewing these first two generations of solutions, we begin to see the

potential of unstructured data, but we also see that these technologies are inadequate for true unstructured analytics.

This has driven the need for a third generation solution: a Content Mining Platform. With a Content Mining Platform,

organizations can finally unlock the value of all their information assets, revealing a wealth of intelligence about their customers, competitors, suppliers, and internal operations.

First Generation Solutions: Search

The first generation text analysis technologies performed “search” – providing keyword search to help a user find documents containing the searched-for words and concepts described by the keywords. While great for retrieving and grouping keywords within documents, these tools have many well-known problems that make them impractical to use for unstructured analytics:



• Inability to store and quantify changes over time. They are unable to

easily integrate with databases to efficiently store results and changes over time, and thus are unable to track or

quantify the evolution of ideas, or the changes in activity levels of tracked people, processes, or organizations

that may be searched.



• Simple interface. Search tools were designed to be easy to use, which restricts their analysis capability to simple

Boolean (not/and/or) expressions.



• Inability to extract meaning. Although great at rapidly returning documents, a user still must take the time to read

through the returned documents to extract meaning from them.


















• Use keyword matching, not semantic understanding. A search tool relies on the user to identify the right

combination of keywords to extract the desired information. Unfortunately, keywords can have different meanings

in different contexts, which results in many irrelevant responses while at the same time excludes relevant ones

that don’t happen to contain the original keywords.
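The keyword-matching limitation can be shown with a small Python sketch. The three notes and the `keyword_search` helper are made up for illustration; the helper is a deliberately simple whole-word match and does not represent any particular search engine.

```python
import re

# Hypothetical call center notes. Only notes 2 and 3 concern billing.
DOCS = {
    1: "Customer reports the battery will not hold a charge.",
    2: "Disputed charge on the March invoice; customer wants a refund.",
    3: "Customer was billed twice for the same order.",
}


def keyword_search(docs, keyword):
    """Return the ids of documents that contain the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sorted(doc_id for doc_id, text in docs.items() if pattern.search(text))


# A user looking for billing problems tries the keyword "charge".
hits = keyword_search(DOCS, "charge")
```

Here `hits` contains note 1, which is about a battery rather than billing, and misses note 3, which describes a billing problem without using the word. The user is left to read the results and try more keywords to close both gaps.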



As a result of these limitations, search typically requires a great deal of effort on the part of the user to manually sift

through documents and connect bits and pieces of information to make decisions from unstructured data. Further,

there is no way to analyze large volumes of unstructured search results alongside structured data. For example,

many law firms hire paralegals or junior lawyers to manually sift through documents using search interfaces during a

discovery process to tag those that are relevant.



Although suitable for some applications, first generation search technology is time consuming and imprecise when

used for complex business decision making. And it can be very expensive as highly compensated individuals are

used for lower-level tasks like reading through documents to look for routine information.





Second Generation Solutions: Point Text Analysis Applications

The limitations of the keyword search applications led to a second generation technology, point text analysis

applications. Many tools exist to solve a variety of problems related to understanding the true meaning of a

document. These tools can scan a text document, for example, and pull out chemical names and their interactions, or

identify events, locations, products, opinions about products, problems, methods, etc. Vendors may call their products

“Entity Extraction”, “Concept Extraction”, or “Name Matching” products. The technologies all tend to be stove-piped in

that they solve a specific problem or work in a specific functional domain.



While these tools perform the valuable task of helping to resolve documents to a more granular level – for example

identifying and linking actors and events with each other by intelligently parsing and organizing the concepts

contained in a document – these tools still present a number of challenges when organizations try to use them for

analysis and decision making, including:



• Stove-piped. A solution involving such technologies may not be integrated with standard document management, database management, data migration, metadata management, and data analysis tools. Further, any integration with these platforms can require custom programming to create an enterprise application integration (EAI) solution.

• Offer varying degrees of insights. Each tool tends to be good at specific functions, and at extracting specific types of information, but not others. One tool may be very good at extracting names, while another may be good at extracting events, but no one tool can perform both tasks well. To get the full value of all tools, a customer is faced with the task of again manually integrating the products with each other.

• Questionable scalability. These tools are unable to easily integrate with databases to efficiently store results and changes over time, and many are unable to handle volumes of data in the 50GB – 100TB range.



• Different types of users. Users must have a linguistic background or have specialized training that doesn’t

commonly exist in an organization.


















• Require manual rule building up-front. Many tools require entity extraction rules for the text being analyzed to be defined up-front, which presumes that the problem is fully defined in advance – which is rarely the case. Further, this means that it takes months to even begin an analysis while rules are defined, coded, and refined.



• Incomplete analysis. Because each tool is meant for a specific purpose, there are limits to the types of analysis

that can be performed. Once each analysis is complete, the whole system is often shut down and the data is

thrown away.



• Re-invent the wheel. Although analysis interfaces surpass those of search, they lack the interface or analytical

maturity of existing structured analysis tools, and are not easily compatible with market leading business

intelligence, statistical, or visualization tools.



• Multi-vendor solutions require custom coding. Combining two or more solutions typically requires a great

deal of custom development. For example, the FBI has notoriously had a difficult time combining the results of

multiple extraction solutions into a single consolidated view of a situation. A recent CNN article reported, “The

current program requires FBI personnel to manually enter, print, sign and scan their information into the

investigative data warehouse.”



In short, the problem with these second generation approaches is that they require a precise understanding of the

problem up-front and an enterprise architecture that never changes. In the real world, this never happens. Evolving

requirements ultimately drive the need for custom modifications and extensions, which are increasingly costly, less

scalable, and less maintainable over time. Further, the approach requires training employees on new analysis techniques, even though many of them have spent years with traditional BI tools and would prefer to leverage the strengths of those existing tools against unstructured data.



Not surprisingly, second generation applications are analogous to point BI applications of the early 90s, as described

in the next section. Although it is possible to perform limited unstructured analytics with these point-solution tools,

forward-looking organizations understand the pitfalls of being strictly locked into a point-application.





Lessons Learned from the BI and DW Worlds

Nearly two decades of BI and DW experience have taught us quite a bit that can be applied to the unstructured analytics domain.



AN “ETL” PARADIGM FOR UNSTRUCTURED ANALYTICS

First, by looking at the history of data integration we know that enterprises will need an “ETL” paradigm for the data that will drive unstructured analytics. In the early days of data warehousing in the early 1990s, organizations created “stove-piped” or “point reporting” solutions that reported against a single data source under a well-defined set of business rules. In many organizations, for instance, financial reporting solutions were built as point applications that extracted general ledger and accounts payable/receivable tables from a financial system, applied business rules to the data, and loaded the resulting information into a reporting database, with reporting applications created against that database. Similar applications were created for sales reporting, customer relationship analysis, inventory/supply chain reporting, and other business departments or functions. Most enterprises had dozens of these applications.


















This approach became an issue for several reasons. First, as organizations evolved to cross-functional views of their

enterprise to focus on business drivers, such as “customer intelligence” and “risk” or “costs”, they needed to merge

data from multiple sources. Because modifying one application caused a downstream and upstream impact on all

others, point solutions became too difficult to maintain and costs quickly spiraled out of control. Second, as

organizations realized that the original problem they were trying to solve had changed, they inevitably needed to go

back and add more business rules or extract more data. Clearly, maintaining custom code becomes very difficult.

Finally, these two issues magnify one another, creating an exponentially growing problem.



In the structured data realm, integrating stove-piped data to support aggregated and well supported decisions is not a

new problem. Over the past 10+ years this problem has been a central focus of large organizations building data

warehouses, and decision support, or “business intelligence” applications to analyze structured data. Many data

warehouse technologies are now mature enough to support high-quality extract, transformation and loading (ETL), or

“fusion” of information for analysis. Rather than re-coding, the ETL platform can be simply re-configured to

accommodate new business rules and data sources. However, these technologies almost exclusively focus on the

integration of structured data.



Unstructured data requires additional pre-processing before it can be loaded into a repository for analysis – data must

be extracted from documents, tagged by entity, concept, and relationship, cleansed and transformed to ensure data

quality, and finally loaded into a repository so that analytical tools can be used against it. Complicating matters

further, unstructured data tools and technologies are not well integrated, and often require custom integration with

each other in order to create a robust solution. To perform robust processing and analysis of unstructured data,

multiple products may be required to perform functions such as:



• Entity Extraction

• Concept Determination

• Industry specific thesaurus matching

• Data quality (to resolve varying forms of the same word, typos, and “dirty data”)

• Name matching (to resolve identities, products, foreign word spellings)

• Statistical reporting

• Business Intelligence/data mining

• Link analysis

• Visualization and spatial imaging and mapping

• Ad-hoc query creation



To merge structured and unstructured information for analysis a solution must address integration, administration,

and presentation of the data through a variety of technologies and processes. In short, in the unstructured world, we

see an even more compelling need for an “ETL” approach to enable analysis because there are:



• More data sources and types;



• More data transformations; and



• More types of analysis possible.
















This creates a many-to-many-to-many situation among sources, transformations, and targets, which will

make point-solutions very difficult to maintain over time. Further, as companies start getting insights from

unstructured data, a wealth of new opportunities will emerge making it nearly impossible to pre-determine the

questions you will need to answer using your unstructured data.



Clearly an “ETL” paradigm is critical to deploy enterprise class unstructured analytics.
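The extract, tag, cleanse, and load steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Clarabridge implementation: the documents, the `TAGGERS` lookup tables (standing in for entity and concept extraction products), and the `doc_entity` table are all hypothetical. The point is that extraction rules live in configuration, so a new entity type is a configuration change rather than new code.

```python
import sqlite3

# Hypothetical extracted documents; in practice these would come from source systems.
DOCUMENTS = [
    {"doc_id": "call-001", "text": "Customer reports Router X200 overheating."},
    {"doc_id": "call-002", "text": "router x200 drops wifi; customer upset."},
]

# Hypothetical configuration: entity type -> {term to match: canonical value}.
TAGGERS = {
    "product": {"router x200": "Router X200"},
    "issue": {"overheating": "overheating", "drops wifi": "connectivity"},
}


def transform(doc, taggers):
    """Tag and cleanse: lowercase the text, then emit canonical entity rows."""
    text = doc["text"].lower()
    return [
        (doc["doc_id"], entity_type, canonical)
        for entity_type, terms in taggers.items()
        for term, canonical in terms.items()
        if term in text
    ]


def load(rows, conn):
    """Load: append entity rows to a relational table that BI tools can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_entity "
        "(doc_id TEXT, entity_type TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO doc_entity VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
for doc in DOCUMENTS:
    load(transform(doc, TAGGERS), conn)

# Ordinary SQL now works against content that started as free text.
issue_counts = conn.execute(
    "SELECT value, COUNT(*) FROM doc_entity "
    "WHERE entity_type = 'issue' GROUP BY value ORDER BY value"
).fetchall()
product_mentions = conn.execute(
    "SELECT COUNT(*) FROM doc_entity WHERE entity_type = 'product'"
).fetchone()[0]
```

Because the tagged output lands in a relational table, any number of sources and transformations can feed the same target, which is what keeps the many-to-many-to-many situation manageable.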



LEARNING FROM ENTERPRISE BI DEPLOYMENTS

Besides the need for an “ETL” paradigm, lessons learned from world class enterprise BI and DW implementations tell

us that:



• Data volume / scalability will be important. Structured databases are commonly multiple terabytes in size.

Unstructured data, which is five times more prevalent, must also be stored in a highly scalable architecture.



• Users need flexibility to choose any analysis tools. We know from experience with structured analytics that

users will want to analyze unstructured documents using popular BI, Search, or other analysis interfaces.



• A relational data warehouse is required to do real analytics. Relational data warehouses are highly scalable

and have the most analytical flexibility, which will be as important for unstructured analytics as it is in structured.



• Analytical approaches applied to structured data also apply to unstructured content. Users will need the

ability to utilize existing analytical approaches, such as: multi-dimensional analysis, time-series analysis, ranking

analysis, market-basket analysis, and anomaly analysis.



• Leveraging best-of-breed is important. As is true in the structured world, a next generation unstructured

analysis solution must integrate a variety of independently developed technologies. These must be integrated to

perform a comprehensive analysis task and then their results must be funneled into systems that allow users to

rapidly find and exploit the discovered knowledge, for example, search engines, databases and/or knowledge

bases.


















Ushering in the Third Generation of Unstructured Analytics

We can see that a new approach to unstructured analytics is needed. This new approach is innovative, but

fortunately the underlying technologies are proven. The Clarabridge Content Mining Platform is designed to avoid the

pitfalls of the first two generations, capitalize on lessons learned, and be successful in the unstructured analysis

future. Clarabridge effectively converges the text and business intelligence worlds. Clarabridge takes “ETL” a step

further by providing a pre-packaged analytical database as well as connectors to various analytical front ends.

Specifically, Clarabridge employs a platform approach to efficiently and effectively source, transform, store, and

analyze unstructured data – and do so in a way that is easy to manage, as depicted in the figure below.









Figure 3 – Components of the Clarabridge Content Mining Platform.


















• Source. The Clarabridge platform connects to a variety of unstructured and structured source systems and data

types, such as file servers, web servers, enterprise content management systems, document management

systems, enterprise applications, and many other systems containing structured and unstructured data.



• Transform. Once data is sourced, it must be understood, merged, and integrated with other structured and

unstructured content. Clarabridge performs a variety of transformations to unstructured data (e.g., concept

extraction, natural language processing, data matching, table extraction, etc.), either out-of-the-box or through

integration with other technologies.



• Store. As data volumes explode, the Clarabridge platform responds with a highly scalable architecture that

utilizes proven data warehousing techniques and platforms. Further, it uses a pre-defined schema for staging

and storing the data extracted from unstructured sources and includes a process to easily integrate transformed

data into a structured data mart, or warehouse for analysis.



• Analyze. Using Clarabridge, end-users are able to efficiently analyze large volumes of data, using whatever

analytical technique or tool they feel is appropriate for the problem at hand. The platform has connectors to a

variety of analysis interfaces, such as BI tools, data mining, visualization, and statistical analysis tools.



• Manage. As the application evolves and grows, there is little need for the Information Technology (IT) shop to

maintain lots of custom coding or extensions. Further, the architecture scales and integrates into existing efforts.



The Clarabridge Content Mining Platform takes raw content and enables that content to be directly mined using any type of analytical interface. The following sections describe the key capabilities delivered through the platform to accomplish this result.





Source

There are many unstructured data sources, and a primary objective of any business application will be to achieve fast data access to any of those sources, as well as a clear understanding of the original sources of that data as downstream analysis progresses. Examples of unstructured sources include text fields (e.g., BLOB fields) in databases, FTP servers, Web servers, e-mail servers, file servers, document management systems, knowledge management systems, enterprise search tool repositories, or scanned and OCRed paper files.

Clarabridge contains Source Connectors that interface with various sources and repositories of unstructured data and manage the process of extracting data from these various sources. Concurrent communication with a wide variety of source systems occurs via APIs, web services, or other methods. This allows the sources to be treated as a “black box” by the rest of the process components.

The Content Mining Platform processes the text, data, and metadata returned from the Source Connectors. It then converts the outputs from the various unstructured source systems into a consistent schema and format. At the same time, the pieces of metadata that are also extracted from the source systems are assembled into a common metadata format. The Clarabridge Extraction Connectors assign a unique index key to each extracted source document, which allows it to be consistently traced as it moves through the rest of the system. This key, together with the associated metadata stored regarding the source location of the text, also provides a mechanism to link back to the original text when desired during the course of analysis.
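
The normalization step described above can be illustrated with a minimal Python sketch. It is not Clarabridge code: the `SourceDocument` record, the `make_doc_key` hashing scheme, and the example FTP location are all hypothetical, chosen only to show how a stable key plus common metadata lets a document be traced back to its source.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceDocument:
    """One extracted document in a common, source-independent shape (hypothetical)."""
    doc_key: str          # unique index key used to trace the document downstream
    source_system: str    # illustrative label such as "ftp" or "file_server"
    source_location: str  # where the original text lives, for linking back
    text: str
    metadata: dict = field(default_factory=dict)


def make_doc_key(source_system: str, source_location: str) -> str:
    """Derive a stable key from the document's origin (an assumed scheme)."""
    raw = f"{source_system}|{source_location}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def normalize(source_system: str, source_location: str, text: str,
              **native_metadata) -> SourceDocument:
    """Convert one connector's output into the common record and metadata format."""
    metadata = {name.lower(): value for name, value in native_metadata.items()}
    metadata["captured_at"] = datetime.now(timezone.utc).isoformat()
    return SourceDocument(
        doc_key=make_doc_key(source_system, source_location),
        source_system=source_system,
        source_location=source_location,
        text=text,
        metadata=metadata,
    )


location = "ftp://example.org/filings/10k-2005.txt"
doc = normalize("ftp", location, "Annual report ...", Author="Jane Doe")
same_doc = normalize("ftp", location, "Annual report ...")
```

Because the key in this sketch depends only on the source system and location, extracting the same document twice yields the same key, which is one simple way to keep downstream findings linked to a single original.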









Transform

Once Clarabridge sources the unstructured data, its next function is to extract meaning, thereby transforming the raw

content into rich, structured data. This process is termed Semantic Extraction. Next, Clarabridge applies a series of

data quality and staging processes to ensure that data is of high quality and ready for analysis. Accomplishing both

requires a layer of transformation components as well as the ability to consolidate all structured and semi-structured

data, metadata and other findings into a single repository.



SEMANTIC EXTRACTION

Many technologies are being developed in industry and academia for semantic extraction of content. Some, for example, specialize in part-of-speech detection, grammatical parsing, and named-entity recognition, where proper names, organizations, and locations are identified – usually in combination with a dictionary or thesaurus. Other technologies specialize in detecting events and times, and others work on detecting relationships between these elements. Still others are good at extracting objects from content, such as logos, signatures, or tables.



The Clarabridge third generation solution has the flexibility to extract meaning from any content through embedded functionality, integration with existing commercial or open source technologies (such as the second generation point analysis tools), or custom-developed components that are plugged in. As technologies become obsolete or irrelevant, it is easy to swap out older transformation components in favor of better ones without requiring new coding or breaking the rest of the application components.



Clarabridge Transformation Components provide a variety of value-added semantic extraction capabilities, as demonstrated by the components listed below. Note that Clarabridge provides all of this transformation functionality natively within the application. However, for specific applications, it may be necessary to leverage the second generation point analysis applications described previously from within the Clarabridge environment. For those situations, Clarabridge has built-in connectors to all of the best-of-breed text processing and transformation components.



• Document segmentation and categorization. Transformations that are applied at the document level. For example, one process groups documents into buckets such as “violent events”, “press release”, or “financial transaction” through statistical analysis of the underlying content. Another process can automatically segment the document to identify various headers, sections, or objects within the document, such as signatures or logos, using rules or other semantic approaches.

• Entity extraction. Functionality to determine all people, places, dates, financial amounts, objects, etc. within a portion of text. This involves identifying and categorizing the various proper nouns and objects mentioned in the text.

• Event, relationship, & fact extraction. The process of determining how various entities relate to one another. For example, if “Person A took a trip to visit Person B”, the relationship between the two entities is “A visited B.” Relationships can also include attributes or facts about specific entities, such as the fact that the number 35 refers to “dollars” in a particular sentence.

Figure 4 – Example transformation workflow



• Table data extraction. Takes any text table that a human can read and converts it into a structured representation of rows, columns, cells, headers, and multipliers that can then be used for further structured analysis.



• Image extraction. Takes any embedded object, such as a logo, photo, or signature, within a document and tries

to interpret that object, typically by matching it to a database of possible objects.
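
As one illustration of the table data extraction item in the list above, the following Python sketch turns a small text table into structured rows, headers, and a multiplier. It is a toy under stated assumptions, not the platform's method: columns are assumed to be separated by two or more spaces, cells are assumed to be integers, and the multiplier is assumed to appear as a leading "(in thousands)" style note. The function name `extract_table` is hypothetical.

```python
import re

MULTIPLIERS = {"thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000}


def split_columns(line: str) -> list[str]:
    """Split one table line on runs of two or more spaces (assumed layout)."""
    return re.split(r"\s{2,}", line.strip())


def extract_table(text: str) -> dict:
    """Convert a human-readable text table into headers, rows, and a multiplier."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]

    # Detect a leading "(in thousands)" style note and remember its multiplier.
    multiplier = 1
    note = re.search(r"\(in (thousands|millions|billions)\)", lines[0], re.IGNORECASE)
    if note:
        multiplier = MULTIPLIERS[note.group(1).lower()]
        lines = lines[1:]

    headers = split_columns(lines[0])
    rows = []
    for line in lines[1:]:
        label, *cells = split_columns(line)
        values = [int(cell.replace(",", "")) * multiplier for cell in cells]
        rows.append({"label": label, **dict(zip(headers[1:], values))})
    return {"headers": headers, "multiplier": multiplier, "rows": rows}


SAMPLE = """
(in thousands)
Item          2004    2005
Revenue       1,200   1,500
Net income      150     210
"""

table = extract_table(SAMPLE)
```

Applying the multiplier at extraction time means downstream tools see comparable absolute figures even when source tables report in different units.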



DATA QUALITY AND STAGING

A final but critical transformation step involves ensuring that the data is of the proper quality and is enriched for analysis. The following capabilities are provided within the Clarabridge Content Mining Platform, although third party data quality components can also be integrated into the platform:



• Data Matching. Achieves better data quality by applying various cleansing techniques to resolve, for example,

different spellings of a single entity across two different documents.



• Data Merging. This involves combining like data from structured and unstructured data sources or enriching data by associating additional attributes about that data from across documents.

• Dimensions / Hierarchy. Organizes data along relevant dimensions, such as time or geography, to allow data to be “sliced and diced” along a number of dimensions. This may also include the integration of ontologies, dictionaries, taxonomies, or thesauri, which are used to organize data into various hierarchies and relationships for further analysis.



• Confidence Level. Since unstructured data is often imprecise, the ability to understand the confidence level of any finding is critical. Building confidence analysis into the platform allows users not only to see and analyze data within structured analysis tools, but also to calculate a numeric confidence level for each data element or aggregate data calculation. The platform joins many data points that are captured throughout the flow of data to create a weighted, statistically-oriented calculation of the confidence that can be assigned to any point of data.
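
The paper does not specify how the confidence calculation is weighted, so the following Python sketch simply assumes a weighted average of per-step scores for a single data element, and a value-weighted average for an aggregate. The step names, weights, and function names are hypothetical.

```python
def element_confidence(step_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the confidence scores recorded at each processing step."""
    total_weight = sum(weights[step] for step in step_scores)
    weighted_sum = sum(score * weights[step] for step, score in step_scores.items())
    return weighted_sum / total_weight


def aggregate_confidence(values_with_confidence: list[tuple[float, float]]) -> float:
    """Confidence of a summed metric, weighting each element by its magnitude."""
    total = sum(abs(value) for value, _ in values_with_confidence)
    return sum(abs(value) * conf for value, conf in values_with_confidence) / total


# Assumed weights: extraction quality counts twice as much as the other steps.
weights = {"ocr": 1.0, "entity_extraction": 2.0, "data_matching": 1.0}
scores = {"ocr": 0.9, "entity_extraction": 0.8, "data_matching": 0.7}

per_element = element_confidence(scores, weights)             # (0.9 + 1.6 + 0.7) / 4
per_total = aggregate_confidence([(100, 0.9), (300, 0.5)])    # (90 + 150) / 400
```

Other combination rules, such as multiplying step scores, would be equally plausible; the point is only that each data element carries enough recorded scores to compute a number an analyst can see next to the metric.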



TRANSFORMATION WORKFLOW

Clarabridge’s third generation solution includes a transformation workflow engine which manages the process of

taking the collected unstructured data and passing it through one or more transformation components.

Transformations are run in a coordinated process, as the results of one or more transformations may serve as an

input to downstream transformations. Further, Clarabridge provides a common API to a wide variety of custom, open source, or third party unstructured data transformation technologies, so each of those transformations can operate as a “black box” abstraction from the rest of the system. Clarabridge retains complete metadata and links back to the original source data, which allows end users to trace back through the transformations that took place and from there back to the original source of unstructured data.
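
A coordinated workflow of this kind can be sketched in a few lines of Python. This is an illustration under assumptions, not the Clarabridge engine or its API: each component is wrapped as a plain function over a shared results dictionary, dependencies are declared by name, and the standard library's `graphlib.TopologicalSorter` decides the run order. The three toy components are stand-ins for real extraction technologies.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Every component, whatever technology sits behind it, is exposed through the
# same call shape: it reads earlier results and returns its own output.
Component = Callable[[dict], object]


def run_workflow(text: str, components: dict[str, Component],
                 depends_on: dict[str, set[str]]) -> dict:
    """Run components so that each one executes after the components it depends on."""
    results: dict = {"text": text}
    for name in TopologicalSorter(depends_on).static_order():
        results[name] = components[name](results)
    return results


def segment(results: dict) -> list[str]:
    """Toy segmenter: split the text into sentences on periods."""
    return [s.strip() for s in results["text"].split(".") if s.strip()]


def entities(results: dict) -> list[list[str]]:
    """Toy entity extractor: treat capitalized words as entities."""
    return [[w for w in s.split() if w[0].isupper()] for s in results["segment"]]


def relationships(results: dict) -> list[tuple[str, ...]]:
    """Toy relationship extractor: pair up sentences that mention two entities."""
    return [tuple(found) for found in results["entities"] if len(found) == 2]


components = {"segment": segment, "entities": entities, "relationships": relationships}
depends_on = {"segment": set(), "entities": {"segment"}, "relationships": {"entities"}}

results = run_workflow("Alice visited Bob. It rained.", components, depends_on)
```

Swapping in a better entity extractor only means replacing one entry in `components`; the declared dependencies and the other components stay unchanged, which mirrors the "black box" property described above.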



As an example, assume a regulatory organization wanted to mine SEC filings to identify related party transactions that may indicate fraudulent activity. Using an FTP source connector to Edgar Online, the thousands of pages of publicly available SEC filings could be accessed via the platform. Once loaded into the Capture Schema, those documents could be run through a transformation workflow to extract sections, headers, tables, and related party transaction information from the documents. Each of these transformations requires different technologies and configurations, but data and metadata are managed through the transformation workflow process to allow those components to be accessed without coding or understanding their technical underpinnings. Since the resulting data is in a database, it is easily merged with structured financial data. Using the analysis components described in the next section, investigators could easily see trends, statistical anomalies, and “slice and dice” companies, industries, and transaction types to root out suspicious behavior.



SEMANTIC DISTILLATION: WHAT IT IS AND WHY IT IS IMPORTANT

For many applications involving unstructured data analysis, it is important to perform a “semantic distillation” of the relevant source information. Essentially, this means that all available information is sourced from the content as efficiently as possible using a number of pre-configured transformations. Although a certain threshold of quality is required, quantity is typically desired over exacting quality for these applications. This makes sense for two reasons.

First, for many applications, finding a precise answer to a question is not as important as finding anomalies over large data sets. Continuing the above example, if an investigator is reviewing SEC filings to detect clues related to poor corporate governance, he would be interested in reviewing all related party transactions identified across those filings. He might be interested in seeing which seem unusual when compared to industry mean activity. In this situation, the investigator is not actually looking for a specific example of poor governance. He is looking through large volumes of data to find unusual patterns that could guide further investigation that may reveal poor governance. In other words, he is using the unstructured data to narrow the scope of his analysis and increase his odds of finding an issue. Many unstructured applications follow this same logic.

Second, typically an analyst is not aware of what he is looking for when looking at unstructured data until “he sees it first.” Further, one analysis can lead to another. In both cases, the analyst requires more data to complete the investigation than was originally contemplated. There are two ways to deal with this situation. Additional “rules” can be built into the various approaches used for data transformation. Unfortunately, those tools require a linguistic understanding, and valuable analysis time is expended while attempting to extract the precise information needed. A better solution is to extract more data up front and utilize the analysis tools to make the increased volume of data manageable. This is the essence of data mining in the structured world, where the goal is to analyze large volumes of data to find interesting information. Unstructured data simply requires the additional step of converting the content into structured data first. Fortunately, data mining tools are very sophisticated at filtering, sorting, and grouping large volumes of data so that it is easy for an end-user to manage. This allows the user to pursue an “analytical discovery” process versus fully defining the problem up-front.

THE ESSENCE OF DATA MINING

• Examine many combinations of parameters/variables; not just “obvious” ones

• Churn through millions of calculations searching for patterns, relationships, anomalies

• Apply multiple algorithms: linear regression, trees, neural networks, graphs

• Present results to user for evaluation; user keeps interesting results, discards the trivial

• Iterate
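
The investigator's comparison against "industry mean activity" can be sketched with a simple z-score test in Python. The data, the field names, and the threshold of two standard deviations are all assumptions made for illustration; the paper does not prescribe a particular statistic.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_unusual(transactions: list[dict], threshold: float = 2.0) -> list[tuple[str, float]]:
    """Flag transactions whose amount is far from the mean of their own industry."""
    by_industry = defaultdict(list)
    for t in transactions:
        by_industry[t["industry"]].append(t["amount"])

    flagged = []
    for t in transactions:
        amounts = by_industry[t["industry"]]
        spread = pstdev(amounts)
        if spread == 0:
            continue  # every amount in this industry is identical; nothing stands out
        z = (t["amount"] - mean(amounts)) / spread
        if abs(z) >= threshold:
            flagged.append((t["company"], round(z, 2)))
    return flagged


# Hypothetical related party transaction amounts extracted from filings.
transactions = [
    {"company": "Retailer A", "industry": "retail", "amount": 10},
    {"company": "Retailer B", "industry": "retail", "amount": 12},
    {"company": "Retailer C", "industry": "retail", "amount": 11},
    {"company": "Retailer D", "industry": "retail", "amount": 9},
    {"company": "Retailer E", "industry": "retail", "amount": 10},
    {"company": "Retailer F", "industry": "retail", "amount": 95},
    {"company": "Bank X", "industry": "banking", "amount": 50},
    {"company": "Bank Y", "industry": "banking", "amount": 52},
    {"company": "Bank Z", "industry": "banking", "amount": 48},
]

flagged = flag_unusual(transactions)
```

The output is a short list of candidates for further review rather than a verdict, which matches the "narrow the scope" use of the data described above.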





Store

With unstructured data greatly eclipsing structured data in terms of volume, it is essential that any application be built on top of a highly scalable analytical architecture. The third generation Clarabridge Content Mining Platform architecture is designed using lessons learned over the last 15 years in the large, dynamic data warehouse space. When dealing with data volumes ranging from 100 gigabytes to hundreds of terabytes (or more), special approaches need to be used at all phases of the data sourcing, transformation, and loading process.









Clarabridge employs “hooks” back to underlying source data within its Capture Schema, a pre-designed layout of data tables and the relationships between them that is specifically designed to serve as a repository for data captured by the various source connectors and transformations. The Capture Schema is designed in an application-independent manner so that it can hold any type of source unstructured text without being custom designed for each application.



The Clarabridge Analysis Schema provides a data schema that can be used to perform a wide range of analysis types for a wide variety of applications based on data extracted from unstructured text. It supports commercially available analysis such as business intelligence, data mining, link analysis, mapping, visualization, reporting, and statistical analysis. Further, Clarabridge is designed in an open manner to support various types of analytical applications.
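
To make the two schemas concrete, the following Python sketch uses the standard `sqlite3` module to build a tiny capture layout, a tiny analysis layout, and a load step that moves data from one to the other. Every table and column name here is hypothetical; the actual Clarabridge schemas are not described in this paper, and SQLite stands in for whatever DBMS is preferred.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Capture side: one row per document and one row per extracted finding.
CREATE TABLE capture_document (doc_key TEXT PRIMARY KEY, source_location TEXT, captured_on TEXT);
CREATE TABLE capture_finding (doc_key TEXT, finding_type TEXT, value TEXT, confidence REAL);

-- Analysis side: a small dimension and fact layout that structured tools can query.
CREATE TABLE dim_entity (entity_id INTEGER PRIMARY KEY, entity_type TEXT, entity_name TEXT,
                         UNIQUE (entity_type, entity_name));
CREATE TABLE fact_mention (entity_id INTEGER, doc_key TEXT, mention_date TEXT, confidence REAL);
""")

conn.executemany("INSERT INTO capture_document VALUES (?, ?, ?)", [
    ("d1", "ftp://example.org/filings/a.txt", "2006-01-15"),
    ("d2", "ftp://example.org/filings/b.txt", "2006-02-01"),
])
conn.executemany("INSERT INTO capture_finding VALUES (?, ?, ?, ?)", [
    ("d1", "company", "Acme", 0.9),
    ("d2", "company", "Acme", 0.8),
    ("d2", "person", "J. Smith", 0.7),
])


def load_analysis_schema(db: sqlite3.Connection) -> None:
    """Move captured findings into the analysis layout, keeping doc_key as the hook back."""
    db.execute("""INSERT OR IGNORE INTO dim_entity (entity_type, entity_name)
                  SELECT DISTINCT finding_type, value FROM capture_finding""")
    db.execute("""INSERT INTO fact_mention
                  SELECT e.entity_id, f.doc_key, d.captured_on, f.confidence
                  FROM capture_finding f
                  JOIN capture_document d ON d.doc_key = f.doc_key
                  JOIN dim_entity e ON e.entity_type = f.finding_type
                                   AND e.entity_name = f.value""")
    db.commit()


load_analysis_schema(conn)
rows = conn.execute("""SELECT e.entity_name, COUNT(*)
                       FROM fact_mention m JOIN dim_entity e USING (entity_id)
                       GROUP BY e.entity_name ORDER BY e.entity_name""").fetchall()
```

Because neither layout mentions a specific application, the same tables and load step could serve filings, call center notes, or claims text, which is the application-independence the section describes.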



A packaged ETL layer provides a mapping and loading routine to automatically migrate data and metadata from the Capture Schema to the Analysis Schema. This is a general-purpose ETL layer for the two general-purpose schemas that it moves data between. This ETL layer can be integrated with existing structured data and applications using commercially available ETL tools.

The platform has the ability to update data as required, including incremental, real-time, streaming, or batch updates. Further, it allows any preferred database management system (DBMS) to be utilized.

Analyze

The Clarabridge third generation solution enables users to rapidly apply proven analysis techniques and tools to unstructured data, effectively asking “Any Question”, utilizing “Any Analysis Technique” against “Any Data Source.” To accomplish this, it enables structured tools to connect via connectors and perform analysis on the transformed data, allowing structured analysis of initially unstructured data.

Analysis tools can access the Analysis Schema using a standard web services approach, so that structured analysis tools can analyze the results of transformations applied to unstructured data. It also allows data to be joined to other existing structured data that may, for example, reside in a data warehouse. By allowing the analysis of structured data and unstructured data together, new insights and findings can be found that would not be possible from structured data alone.

The Clarabridge Content Mining Platform enables various structured data analysis tools to analyze the data present in the Analysis Schema. Analysis tools may include search, visualization, BI, data mining, mapping, link analysis, and statistical analysis technologies. A key principle is that a user should not have to select the tool based on the type of data being analyzed. Rather, he or she should be free to select the tool based on the specifics of the problem or task at hand.

TYPES OF ANALYSIS TOOLS

• Business Intelligence (BI) Tools. Technologies to enable raw data to be transformed into valuable information for analysis and decision making. Features usually include dashboards, reports, ad-hoc analysis, and OLAP analysis.

• Data Mining Tools. Tools used for pattern detection, anomaly detection, and data prediction against large sets of numerical data.

• Data Visualization, Link Analysis, and Mapping Tools. Tools used for visually describing, presenting, and analyzing data such as connections between various people, events, and places.

• Statistical Analysis Tools. Tools useful for the collection, analysis, interpretation, and presentation of masses of numerical data. Statistical calculations, such as linear regressions, variances, and means, are typically applied to the data.

• Search Tools. Technology used to rapidly query large volumes of data, usually unstructured, and quickly return relevant documents.

Clarabridge has the capability to pre-populate the metadata of the analysis tool utilizing the tables, columns, attributes, facts, and metrics in the Analysis Schema. Further, report templates for leading BI tools are available out of the box for solution areas such as CRM and Investigation. This allows users to immediately begin analyzing the data present in the Analysis Schema without performing tool customization or any application specific setup.



Users are now free to apply proven analytical approaches that were previously impossible to perform against unstructured data, such as:



• Multi-dimensional analysis. Slicing or filtering data according to various dimensions, such as time or location



• Time-series analysis. Tracking how things have changed over time or determining the evolution of concepts



• Ranking analysis. Focusing on most critical items by ranking the top-10, bottom-10, etc.



• Market-basket analysis. Identifying what types of things typically are found with others or finding unexpected

relationships between people, places, or objects



• Anomaly analysis. Determining what events are unusual when compared with others or what items

unexpectedly disappear
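
Two of the approaches listed above, market-basket analysis and ranking analysis, can be combined in a short Python sketch that counts which extracted entities appear together in the same document and ranks the most frequent pairs. The entity lists and the function name `top_cooccurrences` are hypothetical, and real market-basket tools typically add measures such as support and lift.

```python
from collections import Counter
from itertools import combinations


def top_cooccurrences(doc_entities: dict[str, list[str]], n: int = 10):
    """Count entity pairs that share a document and return the n most frequent."""
    pair_counts = Counter()
    for found in doc_entities.values():
        # Sort and de-duplicate so each pair is counted once per document.
        for pair in combinations(sorted(set(found)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(n)


# Hypothetical entities extracted from four documents.
doc_entities = {
    "d1": ["Acme", "Bolt Ltd", "J. Smith"],
    "d2": ["Acme", "J. Smith"],
    "d3": ["Bolt Ltd", "J. Smith", "Acme"],
    "d4": ["Acme", "J. Smith"],
}

top_pair = top_cooccurrences(doc_entities, n=1)
```

Once entities are stored as structured rows, the same counting could equally be expressed as a grouped query in a BI or data mining tool.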



Clarabridge provides the ability to drill through to the original unstructured source document. This allows analysts to completely understand the genesis of any result that they see in the structured analysis tool, to know exactly where the data came from and how it was calculated, and to drill all the way back to the original document or documents to confirm and validate any element of the resulting structured analysis. And since the confidence information is propagated through the system as described previously, analysts understand how reliable certain metrics are. This provides quality-level context while analyzing data generated by Clarabridge.





Manage

To achieve the lowest total cost of ownership, the Content Mining Platform is easy to manage. Some of the keys to

manageability include:



• Eliminating custom coding and application customization. As you add more sources, Clarabridge can be easily (or automatically) reconfigured to accommodate those sources.

• Open platform. To ensure extensibility, the Clarabridge platform is open and standards based, using a service-oriented architecture (SOA), and is built with modern J2EE technology. It also leverages emerging standards, such as the IBM Unstructured Information Management Architecture (UIMA).

• Scalable architecture. The platform is designed in a multi-threaded, grid-friendly distributed manner to allow for the parallel processing of extremely large amounts of data through the system on a continuous real-time high throughput basis.

• Scheduling capabilities. Applications can be executed without human intervention.



• Intuitive Clarabridge Analyst interface. Clarabridge leverages a patent-pending workflow paradigm to

integrate the various transformation approaches, utilizing a graphical user interface that requires minimal coding

or understanding of the underlying transformation technologies for a system administrator. This means it is

straightforward to modify an application over time.









In addition to the above, the Clarabridge Content Mining Platform supports data lineage – or the analytical “path” that

led to a certain conclusion. Data lineage is implemented by capturing, in the solution, all the necessary information

required to help an analyst understand the sources of data presented in an analysis, the transformations performed

on an original data element (such as entity extraction, data quality processes, matching processes), the confidence

factor of transformations, and the dates and times of all processing steps. This information can be critical to the

proper understanding of the information being analyzed, and can help analysts “trace” reporting insights back to the

original systems of record.



The platform contains metadata, or “data about the data” permitting the presentation of data lineage insights for all

data that is processed. Because the metadata is built in, implementing data lineage functionality does not require any

special application development or configuration – it is a natural byproduct of the application creation process.
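
The lineage information listed above (source, transformations performed, confidence factors, and processing times) can be pictured as a small record that travels with each data element. The Python sketch below is illustrative only; the class names, fields, and example values are hypothetical and do not describe the platform's internal metadata model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageStep:
    """One processing step applied to a data element."""
    transformation: str
    confidence: float
    processed_at: datetime


@dataclass
class DataElement:
    """A value plus everything needed to trace it back to its source."""
    value: str
    doc_key: str
    source_location: str
    lineage: list = field(default_factory=list)

    def apply(self, transformation: str, new_value: str, confidence: float) -> "DataElement":
        """Return a new element with this step appended to its lineage."""
        step = LineageStep(transformation, confidence, datetime.now(timezone.utc))
        return DataElement(new_value, self.doc_key, self.source_location,
                           self.lineage + [step])

    def trace(self) -> list[str]:
        """The analytical path: original source first, then each transformation in order."""
        return [self.source_location] + [step.transformation for step in self.lineage]


raw = DataElement("ACME Corp.", "doc-001", "file://filings/acme-10k.txt")
matched = (raw.apply("entity_extraction", "ACME Corp.", 0.92)
              .apply("data_matching", "Acme Corporation", 0.85))
path = matched.trace()
```

Because each step is recorded as a side effect of applying the transformation, lineage in this sketch needs no separate bookkeeping, which is the "natural byproduct" property the text describes.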





Benefits of This Approach

The benefits of the above approach are many and include broader analysis capabilities, faster ROI, and rapid time-to-

value.





Broader Analysis Capabilities

By having access to all available data in an analytical framework, users spend more time analyzing and less time retrieving and piecing together bits and pieces of unstructured information. They are free to apply proven analysis approaches, such as anomaly analysis, to virtually unlimited data to detect trends, issues, and opportunities revealed by their unstructured data. Further, users enjoy a holistic view of their information assets to enable analytical discovery across multiple problem domains, data types, and source systems.



This enables enterprises to create entirely new business applications to better serve customers, control cost and risk,

compete effectively, and drive profitability as demonstrated by the following simple examples:



Industry: Insurance, Application: Claims Fraud Detection



“What types of claims text, descriptions, comments, incidents, damage reports, scenarios, etc. indicate potential

anomalous or fraudulent claims that may merit further investigation?”



Data sources: Call center notes, claims historical archives, claims case files, structured claims and cost data



Industry: Insurance/Telecom/Financial, Application: Customer Retention



“How do particular patterns of communication over time between my company and its customers lead to either the retention or loss of customers?”



Data sources: Call center notes, emails, claims case files, structured claims and customer service data



Industry: Healthcare, Application: Drug Efficacy and Side Effects



“What symptoms disappear over time most often for patients prescribed Prozac for more than 24 months? What

new conditions or symptoms tend to appear most often with long-term use?”



Data sources: Doctor visit notes, hospital visit notes, claims history



Finally, managing the metadata about the sourced and transformed data allows an analyst to completely understand the genesis of any result that they see in the structured analysis tool. They know exactly where the data came from and how it was calculated, and they can drill all the way back to the original document or documents to confirm and validate any element of the resulting structured analysis.





Faster ROI

Organizations have made significant investments in business intelligence and data warehousing solutions. Many

have implemented multiple tools across their various divisions and functional areas. It makes little sense to deploy

another analytical tool into that environment. Those same organizations have spent hundreds of thousands, if not

millions, acquiring and training the staff necessary to run and utilize those tools. By leveraging a Content Mining Platform, organizations are able to rapidly extend the value of existing investments and their trained staff rather than deploying new tools and analysis approaches.



The ability to extend the solution over time and swap in and out various components without any custom coding

provides much simpler ongoing administration and ultimately lower total cost of ownership. Further, the scalable and

open architecture will support evolving needs and growing data volumes.



Finally, source connectors to a wide variety of source systems deliver fast, standard data access when creating a new application. This becomes increasingly important as the number of data sources inevitably expands, making it increasingly difficult to maintain point extraction code.

Rapid Time-to-Value

Rather than “re-invent the wheel”, enterprises are able to rapidly integrate and leverage leading structured tools, including statistical analysis, reporting, visualization, search, and data mining. Further, best-of-breed text processing technologies can be rapidly incorporated into the Content Mining Platform. This is important because the best unstructured and structured tools vary from application to application and evolve over time.

The platform requires less up-front work when developing an application for a number of reasons. First, the various components are treated as a “black box”, which makes them easy to leverage without understanding their inherent complexity. Second, a great deal of the difficulty in creating a BI application is related to the ETL process and generating the target analytical data model as described above. By creating a universal capture schema, pre-defined ETL process, and an analytical schema, much of this effort is eliminated every time a new application is created. Third, since reporting metadata is automatically generated, users can immediately begin analyzing the data without performing tool customization or any application specific setup.

Finally, integrating multiple transformation approaches using a configurable workflow delivers greater data quality and gets end-users quickly using unstructured content to address real-world business problems, rather than spending months trying to perfectly design a solution up-front.

VALUE FROM TEXT

• Financial Services. Leverage customer interactions to optimize product features

• Healthcare. Leverage clinical notes to improve disease management

• Insurance. Leverage claims text to detect fraudulent activities

• Manufacturing. Leverage warranty claims text to uncover liabilities and trends

• Public Sector. Leverage public filings to expose suspicious activity

• Retail. Leverage customer return information to change product mix

• Telecommunications. Leverage call center notes to reduce customer churn









Appendix A. Suggested Evaluation Checklist

The decision to invest in technology to support a business intelligence application is an important one for any organization – this is especially true for those that involve unstructured data. As you prepare a request for proposal (RFP) for your unstructured data analysis needs, consider including the following suggested criteria in your evaluation.





Source Criteria

√ Connect to a broad variety of data sources (FTP servers, Web servers, email servers, file servers, scanned

and OCRed paper files) without custom coding

√ Manage the process of extracting data from various sources

√ Enable concurrent communication with source systems

√ Integrate with existing commercial technologies to extract meaning from text

√ Allow for custom-developed technologies to extract meaning from text

√ Abstract underlying sourcing technologies as a “black box”

√ Capture text, data, and metadata that are generated by sourcing and extraction services

√ Assemble common metadata format across all sources and transformations

√ Enable link back to original source document





Transform Criteria

√ Allow unstructured data to be captured and passed through one or more embedded, custom, open-source,

or commercial transformation components

√ Provide coordinated workflow process, so the results of one or more transformations may serve as an input

to downstream transformations

√ Provide integration to wide variety of data transformation technologies

√ Capture results of transformation in a database

√ Retain complete metadata and links back to the original source data

√ Allow analyst to drill into the genesis of an analytical result or metric

√ Enable semantic distillation of unstructured data





Store Criteria

√ Ensure architecture is highly scalable to 100s of terabytes or more of data

√ Utilize capture schema with hooks back to underlying source data

√ Capture schema is application-independent and does not require custom design for every new application

√ Utilize analysis schema that supports commercially available analysis such as BI, data mining, link analysis, mapping, visualization, reporting, and statistical analysis

√ Enable access to analysis schema via other applications (e.g., open)

√ Provide ETL to move data automatically from capture schema to analysis schema

√ Allow data to be updated as required, including incremental, real-time, and batch updates

√ Support any commercially available DBMS









Analyze Criteria

√ Provide a web services framework for structured analysis tools to access and analyze data contained in the

analysis schema

√ Allow data to be easily joined with existing structured data sources

√ Provide direct access via a variety of commercially available analysis tools, such as BI, data mining, link

analysis, mapping, visualization, reporting, and statistical tools, without custom coding

√ Automatically pre-populate the metadata of the analysis tool utilizing the tables, columns, attributes, facts,

and metrics contained in the analysis schema

√ Enable application of proven analysis techniques, such as multi-dimensional analysis, time-series analysis,

ranking analysis, market-basket analysis, and anomaly analysis

√ Provide the ability to drill through to the original unstructured source document

√ Join together many data points that are captured throughout the flow of data to create a weighted, statistically

oriented calculation of the confidence that can be assigned to any point of data
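The last criterion above does not specify how confidence is calculated. One simple possibility, shown here purely as an assumption, is a weighted average of per-step confidence scores; the function name `combined_confidence`, the step labels, and the weights are all hypothetical.

```python
def combined_confidence(scores):
    """Weighted average of (confidence, weight) pairs, each confidence in [0, 1]."""
    total_weight = sum(weight for _, weight in scores)
    if total_weight == 0:
        raise ValueError("at least one score must have a non-zero weight")
    return sum(conf * weight for conf, weight in scores) / total_weight


# Hypothetical confidence scores captured at each stage for one data point.
steps = [
    (0.9, 1.0),  # sourcing
    (0.8, 2.0),  # entity extraction (weighted more heavily in this example)
    (0.5, 1.0),  # relationship extraction
]
score = combined_confidence(steps)
print(round(score, 2))
```

Other schemes, such as multiplying step confidences or fitting weights statistically, would be equally consistent with the criterion as stated.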





Manage Criteria

√ Enable reconfiguring of platform as new data sources or transformations are added without custom coding

√ Support open standards, including a service-oriented architecture

√ Support multi-threaded, grid-friendly, distributed computing architecture

√ Enable scheduling of applications

√ Support metadata management throughout the analytical lifecycle

√ Provide user-friendly administrative interface


















Appendix B: Technical Glossary

• Anaphora Resolution. Anaphora resolution refers to linking pronouns such as “his”, “her”, “their”, and “it” to the

correct people, places, or things mentioned earlier in a piece of text.



• Business Intelligence (BI) Tools. Technologies to enable raw data to be transformed into valuable information

for analysis and decision making. Features usually include dashboards, reports, ad-hoc analysis, and OLAP

analysis. Tools include Cognos, MicroStrategy, Business Objects, Actuate, and Pentaho.



• Data Lineage. The analytical “path” that led to a certain conclusion. This information can be critical to the proper

understanding of information being analyzed. Also refers to the path that a certain data element or value took all

the way from source(s), through various transformations, to the resulting analysis.



• Data Mining Tools. Tools used for pattern detection, anomaly detection, and data prediction against large sets

of numerical data. Example tool vendors in this area include Angoss, IBM Intelligent Miner, and SAS.



• Concept extraction. Also known as topic extraction, concept extraction involves understanding the underlying

concept that a document or section of a document is describing. Techniques such as automatically applying

categorizations can be used for concept extraction.



• Categorization and topic extraction. This is the process of grouping various documents or entities within

documents into various buckets. Typically this is done with a dictionary, thesaurus, ontology, or taxonomy.



• Data Visualization and Mapping Tools. Tools used for visually describing, presenting, and analyzing data. For

example, a link analysis tool, such as I2, would be used to visually show the connections between various

people, events, and places. Mapping tools such as ESRI and MapInfo show spatial interrelations between data

on maps.



• Data Warehouse (DW). A data warehouse is a database, designed for efficient data analysis and reporting, that

contains a record of an organization’s past transactional and operational information.



• Entity extraction. Determining all people, places, objects, etc. within a document. Tools include Inxight,

AeroText, LingPipe, GATE, and NetOwl.



• Extract, Transform, Load (ETL). The process whereby structured data is sourced from multiple data

repositories; transformed to allow it to be cleansed, merged with other data, or manipulated for analytical

purposes; and loaded into a data warehouse for analysis.



• Natural Language Processing (NLP). The understanding and manipulation of natural language to enable

computers to "understand" the meaning of written human languages.



• Online Analytical Processing (OLAP). Sometimes called dimensional analysis, OLAP is the process of slicing

data by various dimensions (time, dollars, product line, etc.) to see summary and detailed data for decision

making.



• Ontology. An understanding of the “fundamental categories of things in the world.” For analysis purposes,

ontologies can help to group various entities to better understand the context of that entity within a broader

domain and how it relates to other entities. They can also be used to filter out certain types of entities that are

undesired for a particular analysis.


















• Relationship extraction. Also called event extraction, transaction extraction, and fact extraction, relationship

extraction is the process of determining how various entities relate to one another. For example, if “Person A

took a trip to visit Person B”, the relationship between the two entities is “A visited B.” Relationships can also

include attributes or facts about specific entities, such as the fact that the number 35 refers to “dollars” in a particular

sentence. Tools include Attensity and ClearForest.



• Semantic Distillation. Essentially, this means that all available information is sourced from the content as

efficiently as possible using a number of pre-configured transformations. Although a certain threshold of quality is

required, quantity is typically desired over exacting quality for applications using semantic distillation.



• Statistical Analysis Tools. Tools useful for the collection, analysis, interpretation and presentation of masses of

numerical data. Statistical calculations, such as linear regressions, variances, and means are typically applied to

the data using tools such as SAS or SPSS to test various hypotheses about those data.



• Structured Data. Content which has structure that is easily interpreted by a machine, commonly in a database

or XML format.



• Taxonomy. A hierarchy of “things.” For example, a geographical taxonomy may include relationships between a

city, state, and country. This is useful for “drilling into” data from a higher level down to lower levels of detail.



• Text analysis application. An application used for extracting meaning from content, for example through

part-of-speech detection, grammatical parsing, and named-entity recognition. There are many of these applications, and

each performs well at particular functions within particular information domains. However, when used independently,

they are not useful for analysis.



• Text processing. See text analysis application.



• Unstructured Data. Content which does not have a structure that is easily interpreted by a machine. Examples

of unstructured data may include audio, video and unstructured text such as the body of an email or word

processor document.
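To make the Taxonomy and OLAP entries above more concrete, the following sketch rolls city-level counts up to the state and country levels of a small geographical taxonomy. The taxonomy contents, the counts, and the `roll_up` function are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical geographical taxonomy: city -> (state, country).
TAXONOMY = {
    "Reston":   ("Virginia", "USA"),
    "Richmond": ("Virginia", "USA"),
    "Austin":   ("Texas", "USA"),
}

# Hypothetical counts of documents mentioning each city.
mentions = {"Reston": 5, "Richmond": 2, "Austin": 4}


def roll_up(mentions, level):
    """Sum city-level counts up to the 'state' or 'country' level."""
    index = {"state": 0, "country": 1}[level]
    totals = defaultdict(int)
    for city, count in mentions.items():
        totals[TAXONOMY[city][index]] += count
    return dict(totals)


by_state = roll_up(mentions, "state")
by_country = roll_up(mentions, "country")
print(by_state, by_country)
```

Drilling down is the reverse operation: starting from a country or state total and listing the lower-level rows that contribute to it.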


















Founded in 2005 by leading experts in the Business Intelligence (BI) industry and

backed by a premier venture capital investment partner, Clarabridge is an emerging

leader in helping private and public sector enterprises leverage unstructured content

to provide critical operational and strategic business insight. Unlike traditional

approaches that are inflexible, expensive, and time-consuming, Clarabridge’s

patent-pending software uniquely combines the best of the structured and

unstructured analysis worlds, allowing enterprises to greatly extend the value of their

existing BI investments. Clarabridge is the only enterprise-class solution that rapidly

enables users to directly mine text alongside existing structured data, using

standard BI tools and analysis techniques, to address a host of real-world business

needs.



© 2006 Clarabridge, Inc. All Rights Reserved.

All other trademarks and logos are property of their respective owners.









11400 Commerce Park Drive, Suite 500

Reston, VA 20191

P: 703.663.2500 │ F: 703.269.1505





www.clarabridge.com


