Converging Text and BI:
The Case for a Content Mining Platform
Although enterprises commonly utilize business intelligence (BI) tools
against structured data for analysis and decision making, leading
organizations recognize that they must take a more holistic view of their
information assets and find ways to creatively analyze the exponentially
growing universe of unstructured content - contracts, press releases, filings,
forms, call center notes, medical records, insurance claims, web content,
emails, etc. This white paper describes how the Clarabridge Content Mining
Platform™ avoids the pitfalls of previous approaches to unstructured
analysis, and capitalizes on lessons learned from solving similar problems
in the structured domain. A platform approach enables enterprises to
efficiently and effectively source, transform, store, and analyze unstructured
data alongside structured data – in a way that is easy to manage. The result
is broader business understanding, the ability to leverage existing
resources, and the freedom to rapidly apply the most appropriate decision
support interface.
WHITE PAPER
March 6, 2006
Converging Text and BI: the Case for an Unstructured Intelligence Platform
Executive Summary
It is becoming ever more important for today's agile enterprise to use the best available data to drive strategic and
operational business decisions. Although most companies deploy business intelligence (BI) tools against structured
data to answer a wide variety of questions, leading organizations are increasingly recognizing that they must take a
more holistic view of their information assets. They find creative ways to analyze the exponentially growing universe
of unstructured content – contracts, press releases, research papers, filings, call center notes, medical records,
insurance claims, web content, emails, etc. This content, when understood and analyzed alongside structured data,
provides business insight that enables organizations to better serve customers, control cost and risk, compete
effectively, and drive profitability.
Text processing technologies are rapidly maturing to enable concept/entity extraction, relationship tagging, and other
paradigms to allow more structure to be applied to unstructured data. Search technologies are evolving to provide end
users with better ways to retrieve text, but provide limited to no analytic insight, which makes the determination of
precise answers to questions time consuming, tedious, and increasingly more difficult. Even more advanced
implementations of text processing technologies require complex programming work and are, like search engine
technologies, totally disconnected from time-tested analysis approaches used in the BI world.
Figure 1 – Currently structured and unstructured analysis are done in different ways and with different tools.
(The figure shows a knowledge worker using Search for unstructured data and BI, OLAP, and reporting for structured data,
with three example questions: “How can we improve satisfaction?” spans call center notes and customer demographics;
“What is root cause of problem?” spans warranty repair notes and service ticket & outcome; “How do symptoms change over
time?” spans clinical notes and patient records.)
So how can enterprises better enable users to spend their days making informed decisions versus gathering data?
Fortunately, there is much
to be learned from two decades of struggling with similar problems in the structured data world. We now know that, as
needs change and evolve, organizations will require the flexibility to integrate the most appropriate text processing
technologies to extract desired information. They must enable users to apply time-tested analytical approaches that
can be modified or expanded upon as understanding of issues and opportunities emerges from the data itself. For
example, a call center should be able to apply a multi-dimensional analysis (i.e., “slice and dice”) to call center logs
and email text for assessing trends, root causes, and relationships between issues, people, time to resolution, etc.
Organizations should have the infrastructure, storage, and user interfaces to process and efficiently explore large
volumes of data. And they need to easily leverage their existing BI and data warehousing (DW) tools, presently used
only for structured data analyses, to analyze unstructured data alongside structured data.
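To make the “slice and dice” idea concrete, here is a minimal sketch in Python. It assumes each call note has already been tagged with an issue by some text processing step and joined to structured fields. The records, field names, and the helper functions `slice_rows` and `dice_counts` are hypothetical illustrations, not part of any product.

```python
from collections import Counter

# Hypothetical records: each call note is assumed to have been tagged with an
# issue by a text processing step and joined to structured fields.
calls = [
    {"month": "2006-01", "region": "East", "issue": "billing", "minutes_to_resolve": 12},
    {"month": "2006-01", "region": "West", "issue": "billing", "minutes_to_resolve": 20},
    {"month": "2006-01", "region": "East", "issue": "outage", "minutes_to_resolve": 45},
    {"month": "2006-02", "region": "East", "issue": "billing", "minutes_to_resolve": 10},
    {"month": "2006-02", "region": "West", "issue": "outage", "minutes_to_resolve": 50},
]


def slice_rows(rows, **filters):
    """Keep only rows matching every dimension=value filter (the "slice")."""
    return [r for r in rows if all(r[k] == v for k, v in filters.items())]


def dice_counts(rows, *dimensions):
    """Count rows grouped by the chosen dimensions (the "dice")."""
    return Counter(tuple(r[d] for d in dimensions) for r in rows)


# Slice to one region, then break the result down by issue and month.
east = slice_rows(calls, region="East")
by_issue_month = dice_counts(east, "issue", "month")
```

Once text has been reduced to tagged rows like these, the same grouping can be expressed in any BI tool or SQL engine; the in-memory version is used here only to keep the sketch self-contained.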
As organizations adopt analytical approaches to unstructured data, they will need to address a number of challenges:
• Data comes from multiple unstructured repositories (file servers, document management systems, intranet sites,
internet sites, database notes fields, etc.)
• Data in unstructured documents is of widely varying quality (often much more so than structured)
• The use of different types of unstructured data tools varies greatly from environment to environment and from
problem to problem.
© 2006 Clarabridge, Inc. All Rights Reserved. Page 2 of 23
All other trademarks and logos are property of their respective owners.
• In many cases maximum value in analyzing unstructured data comes from analyzing it alongside existing
structured data in data marts or data warehouses
Fortunately, many of the challenges with unstructured data analytics can be overcome by applying lessons from the
BI and DW sectors. Over the past 10 years, the departmental point solutions of the early 1990s rapidly evolved into more
robust solutions that leveraged enterprise data warehousing platforms, an extract, transform, and load (ETL)
infrastructure, and scalable, server-based BI or reporting solutions.
To be successful in the unstructured world, organizations need a platform to leverage their existing BI investments
and also efficiently and effectively source, transform, store, and analyze unstructured data – and do so in a way that
is easy to manage and scale. That was the vision behind the Clarabridge Content Mining Platform™. The Clarabridge
Content Mining Platform enables enterprises to:
• Source. The Clarabridge platform connects to a variety of source systems and data types.
• Transform. Once Clarabridge sources the unstructured data, a variety of out-of-the-box and third-party tools help
to ensure it is understood, merged, and integrated with other structured and unstructured data sources.
• Store. As unstructured data volumes explode, Clarabridge responds with a highly scalable architecture that
utilizes proven data warehousing techniques and platforms.
• Analyze. End-users are able to efficiently analyze large volumes of data, using whatever analytical technique or
tool they feel is appropriate for the problem at hand.
• Manage. As the application evolves and grows, the IT organization does not have to maintain lots of custom
coding or extensions with Clarabridge. Further, the architecture scales and integrates into existing efforts.
Using the Clarabridge Content Mining Platform enables users to directly mine text alongside existing structured data,
using standard BI tools and analysis techniques, to address a host of real-world business needs. The benefits are
enormous and include:
• Broader analysis capabilities. Users spend more time analyzing and less time retrieving text. They are free to
apply proven analysis approaches to virtually unlimited data to detect trends, issues, and opportunities revealed
by their unstructured data. Further, users enjoy a holistic view of their information assets to enable analytical
discovery across multiple problem domains, data types, and source systems.
• Faster ROI. Organizations have made significant investments in BI and DW solutions. Customers are able to
rapidly extend the value of those investments and their trained staff rather than deploying new tools and analysis
approaches.
• Rapid time-to-value. Rather than “re-invent the wheel”, organizations are able to rapidly integrate and leverage
leading unstructured and structured tools, including: statistical analysis, reporting, visualization, search, data
mining, and text processing engines. Leveraging a platform allows less up-front work when developing an
application and delivers fast data access, data quality, and time to analysis.
Tapping Unstructured Data to Drive Business Value
Organizations today are buried in unstructured content such as contracts, press releases, research papers, forms,
filings, call center notes, medical records, insurance claims, web content, emails, etc. Experts agree this content
represents more than 80% of an organization’s data. And the amount is growing every day. Furthermore, in an
increasingly services-based economy, unstructured transcripts, notes and documents describing business activity
provide important insights about customers’ habits, tastes, product use and support requirements, employee work
habits and performance, and business process efficiencies and failures.
It is becoming ever more important for today's agile enterprise to utilize the best available data to drive strategic and
operational business decisions. Although most companies utilize business intelligence (BI) tools against structured
data to answer a wide variety of questions, leading organizations are increasingly recognizing that they must find
ways to creatively analyze the exponentially growing universe of unstructured content.
Unfortunately, the structured and unstructured data analysis domains have traditionally been separated along a
number of dimensions including analysis approaches, storage, and staff. Typically analysis of unstructured data
involves using a search tool to find documents containing information you are looking for, whereas structured data
involves using BI or data mining tools to report on performance indicators, trends, changes over time or other
quantitative metrics of business activity. Unstructured data is typically stored in file-based servers (such as web
servers, document management servers, etc.), while structured data is almost always stored in relational database
management systems (RDBMS). Lastly, staff trained in BI are typically not skilled in the linguistic and other
specialized techniques required for analyzing unstructured content and thus rarely use the tools and technologies
associated with unstructured data analysis. What is needed is a way to converge the two domains, leveraging the best
from both, to unlock the true potential of unstructured data.
Figure 2 – Unstructured and Structured information have traditionally been separated along a number of dimensions.
(Labels shown in the figure: “Unstructured” Information, “Structured” Information, Web Content/Intranets, Documents,
Spreadsheets, Emails, Web Content, Paper Files, Data Warehouses, Data Marts, CRM, ERP, etc. Systems, Operational Data
Stores, and Metadata.)
When understood and analyzed alongside structured data, unstructured data provides business insight that enables
organizations to better serve customers, control cost and risk, and identify opportunities for increased efficiency.
Example use scenarios include:
• Automatically identifying top issues in (unstructured) call center logs and proactively routing calls to the right
person based on the issue can save millions through reduced call time, not to mention improved customer
service.
• Identifying and addressing the top types of problems encountered by the most profitable customers can help
reduce loyal customer churn.
• Rapidly detecting emerging product trends in problem reports coming in from all over the globe can avoid recalls
and lawsuits, potentially saving companies millions of dollars.
• Analyzing patient comments, doctor notes, and symptom data can lead to better disease management and
identification of new uses for drugs.
• Capitalizing on customer feedback following a product launch can help adjust marketing campaigns months
ahead of competitors.
• Reducing hundreds of boxes of documents down to the two that are relevant as part of the legal discovery
process reveals previously hidden information in less time than if all documents were read by human beings, and
focuses critical resources on higher value tasks.
• Analyzing communications patterns, claims data and patient records to identify insurance fraud can significantly
reduce fraudulent claims.
• Automatically mining thousands of SEC reports to predict poor corporate governance can help identify issues
before they turn into major crises.
The Evolution of Unstructured Analytics
If the potential is so great, why aren’t more organizations employing unstructured analytics? In large part the
underlying technologies for unstructured analytics have only recently matured to support the types of analysis
suggested above. Text processing technology has progressed from first generation keyword search to second
generation point text analysis applications. Reviewing these first two generations of solutions, we begin to see the
potential of unstructured data, but we also see that these technologies are inadequate for true unstructured analytics.
This has driven the need for a third generation solution: a Content Mining Platform. With a Content Mining Platform,
organizations can finally unlock the value of all their information assets, revealing a wealth of intelligence about
their customers, competitors, suppliers, and internal operations.
First Generation Solutions: Search
The first generation text analysis technologies performed “search” – providing keyword search to help a user find
documents containing the searched for words and concepts described by the keywords. While great for retrieving and
grouping keywords within documents, these tools have many well-known problems that make them impractical to use for
unstructured analytics:
• Inability to store and quantify changes over time. They are unable to
easily integrate with databases to efficiently store results and changes over time, and thus are unable to track or
quantify the evolution of ideas, or the changes in activity levels of tracked people, processes, or organizations
that may be searched.
• Simple interface. Search tools were designed to be easy to use, which restricts their analysis capability to simple
Boolean (not/and/or) expressions.
• Inability to extract meaning. Although great at rapidly returning documents, a user still must take the time to read
through the returned documents to extract meaning from them.
• Use keyword matching, not semantic understanding. A search tool relies on the user to identify the right
combination of keywords to extract the desired information. Unfortunately, keywords can have different meanings
in different contexts, which results in many irrelevant responses while at the same time excluding relevant ones
that don’t happen to contain the original keywords.
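The keyword-matching limitation can be illustrated with a small Python sketch. The three documents and the `keyword_search` helper are hypothetical. The search is a plain whole-word match, which is an assumption made for illustration and not a description of any particular search product.

```python
import re

# Hypothetical documents: 1 is relevant and contains the keyword, 2 contains
# the keyword in an unrelated sense, 3 is relevant but uses different words.
docs = {
    1: "The laptop battery died after two weeks.",
    2: "Battery Park was crowded during the product launch.",
    3: "The power cell stopped holding a charge.",
}


def keyword_search(documents, keyword):
    """Return ids of documents containing the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sorted(doc_id for doc_id, text in documents.items() if pattern.search(text))


hits = keyword_search(docs, "battery")
```

Here `hits` contains documents 1 and 2. Document 2 is an irrelevant match on a place name, and document 3, which describes the same product problem in other words, is missed entirely.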
As a result of these limitations, search typically requires a great deal of effort on the part of the user to manually sift
through documents and connect bits and pieces of information to make decisions from unstructured data. Further,
there is no way to analyze large volumes of unstructured search results alongside structured data. For example,
many law firms hire paralegals or junior lawyers to manually sift through documents using search interfaces during a
discovery process to tag those that are relevant.
Although suitable for some applications, first generation search technology is time consuming and imprecise when
used for complex business decision making. And it can be very expensive as highly compensated individuals are
used for lower-level tasks like reading through documents to look for routine information.
Second Generation Solutions: Point Text Analysis Applications
The limitations of the keyword search applications led to a second generation technology, point text analysis
applications. Many tools exist to solve a variety of problems related to understanding the true meaning of a
document. These tools can scan a text document, for example, and pull out chemical names and their interactions, or
identify events, locations, products, opinions about products, problems, methods, etc. Vendors may call their products
“Entity Extraction”, “Concept Extraction”, or “Name Matching” products. The technologies all tend to be stove-piped in
that they solve a specific problem or work in a specific functional domain.
While these tools perform the valuable task of helping to resolve documents to a more granular level – for example
identifying and linking actors and events with each other by intelligently parsing and organizing the concepts
contained in a document – these tools still present a number of challenges when organizations try to use them for
analysis and decision making, including:
• Stove-piped. A solution involving such technologies may not be integrated with standard document
management, database management, data migration, metadata management, and data analysis tools. Further,
any integration to these platforms can require custom programming to create an enterprise application
integration (EAI) solution.
• Offer varying degrees of insights. Each tool tends to be good at specific functions, and at extracting specific
types of information, but not others. One tool may be very good at extracting names, while another may be
good at extracting events, but no one tool can perform both tasks well. To get the full value of all tools a
customer is faced with the task of again manually integrating the products to each other.
• Questionable scalability. These tools are unable to easily integrate with a database to efficiently store results
and changes over time, and many are unable to handle volumes of data in the 50GB – 100TB range.
• Different types of users. Users must have a linguistic background or have specialized training that doesn’t
commonly exist in an organization.
• Require manual rule building up-front. Many tools require entity extraction rules to be defined up-front about
the text being analyzed which presumes that the problem is fully defined up-front – which is rarely the case.
Further, this means that it takes months to even begin an analysis while rules are defined, coded, and refined.
• Incomplete analysis. Because each tool is meant for a specific purpose, there are limits to the types of analysis
that can be performed. Once each analysis is complete, the whole system is often shut down and the data is
thrown away.
• Re-invent the wheel. Although analysis interfaces surpass those of search, they lack the interface or analytical
maturity of existing structured analysis tools, and are not easily compatible with market leading business
intelligence, statistical, or visualization tools.
• Multi-vendor solutions require custom coding. Combining two or more solutions typically requires a great
deal of custom development. For example, the FBI has notoriously had a difficult time combining the results of
multiple extraction solutions into a single consolidated view of a situation. A recent CNN article reported, “The
current program requires FBI personnel to manually enter, print, sign and scan their information into the
investigative data warehouse.”
In short, the problem with these second generation approaches is that they require a precise understanding of the
problem up-front and an enterprise architecture that never changes. In the real world, this never happens. Evolving
requirements ultimately drive the need for custom modifications and extensions, which are increasingly costly, less
scalable, and less maintainable over time. Further, the approach requires training employees on new analysis
techniques, even though many of them have spent years with traditional BI tools and would prefer to leverage the
strengths of existing tools against unstructured data.
Not surprisingly, second generation applications are analogous to point BI applications of the early 90s, as described
in the next section. Although it is possible to perform limited unstructured analytics with these point-solution tools,
forward-looking organizations understand the pitfalls of being strictly locked into a point-application.
Lessons Learned from the BI and DW Worlds
Nearly two decades of BI and DW experience have taught us quite a bit that can be applied to the
unstructured analytics domain.
AN “ETL” PARADIGM FOR UNSTRUCTURED ANALYTICS
First, by looking at the history of data integration we know that enterprises will need an “ETL” paradigm for the data
that will drive unstructured analytics. In the ‘early days’ of data warehousing in the early 1990s, organizations
originally created “stove piped” or “point reporting” solutions that reported against a single data source under a
well-defined set of business rules. In many organizations, for instance, financial reporting solutions were created by
creating point applications to extract general ledger and accounts payables/receivables tables from a financial
system, applying business rules to the data, and loading the resulting information into a reporting database – with
reporting applications created against the reporting database. Similar applications were created for sales reporting,
customer relationship analysis, inventory/supply chain reporting, and other business departments or functions. Most
enterprises had dozens of these applications.
This approach became an issue for several reasons. First, as organizations evolved to cross-functional views of their
enterprise to focus on business drivers, such as “customer intelligence” and “risk” or “costs”, they needed to merge
data from multiple sources. Because modifying one application caused a downstream and upstream impact on all
others, point solutions became too difficult to maintain and costs quickly spiraled out of control. Second, as
organizations realized that the original problem they were trying to solve had changed, they inevitably needed to go
back and add more business rules or extract more data. Clearly, maintaining custom code becomes very difficult.
Finally, these two issues magnify one another, creating an exponentially growing problem.
In the structured data realm, integrating stove-piped data to support aggregated and well supported decisions is not a
new problem. Over the past 10+ years this problem has been a central focus of large organizations building data
warehouses, and decision support, or “business intelligence” applications to analyze structured data. Many data
warehouse technologies are now mature enough to support high-quality extract, transformation and loading (ETL), or
“fusion” of information for analysis. Rather than re-coding, the ETL platform can be simply re-configured to
accommodate new business rules and data sources. However, these technologies almost exclusively focus on the
integration of structured data.
Unstructured data requires additional pre-processing before it can be loaded into a repository for analysis – data must
be extracted from documents, tagged by entity, concept, and relationship, cleansed and transformed to ensure data
quality, and finally loaded into a repository so that analytical tools can be used against it. Complicating matters
further, unstructured data tools and technologies are not well integrated, and often require custom integration with
each other in order to create a robust solution. To perform robust processing and analysis of unstructured data,
multiple products may be required to perform functions such as:
• Entity Extraction
• Concept Determination
• Industry specific thesaurus matching
• Data quality (to resolve varying forms of the same word, typos, and “dirty data”)
• Name matching (to resolve identities, products, foreign word spellings)
• Statistical reporting
• Business Intelligence/data mining
• Link analysis
• Visualization and spatial imaging and mapping
• Ad-hoc query creation
To merge structured and unstructured information for analysis a solution must address integration, administration,
and presentation of the data through a variety of technologies and processes. In short, in the unstructured world, we
see an even more compelling need for an “ETL” approach to enable analysis because there are:
• More data sources and types;
• More data transformations; and
• More types of analysis possible.
This creates a many-to-many-to-many situation between source, transformation, and targets respectively, which will
make point-solutions very difficult to maintain over time. Further, as companies start getting insights from
unstructured data, a wealth of new opportunities will emerge, making it nearly impossible to pre-determine the
questions you will need to answer using your unstructured data.
Clearly, an “ETL” paradigm is critical to deploying enterprise-class unstructured analytics.
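The extract, transform, and load steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how any product implements them. The `ISSUE_TERMS` lookup table stands in for real extraction tools, the `doc_issue` table and the sample sources are hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import sqlite3

# Hypothetical issue vocabulary. A real deployment would rely on dedicated
# extraction tools rather than a lookup table like this one.
ISSUE_TERMS = {"refund": "billing", "overcharged": "billing", "outage": "service"}


def extract(sources):
    """Extract: yield (doc_id, text) pairs from every source."""
    for source_name, documents in sources.items():
        for i, text in enumerate(documents):
            yield f"{source_name}:{i}", text


def transform(doc_id, text):
    """Transform: normalize the text and tag it with any known issues."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    issues = sorted({ISSUE_TERMS[w] for w in words if w in ISSUE_TERMS})
    return [(doc_id, issue) for issue in issues]


def load(conn, rows):
    """Load: store the tags in a relational table for later analysis."""
    conn.executemany("INSERT INTO doc_issue (doc_id, issue) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_issue (doc_id TEXT, issue TEXT)")

# Two hypothetical sources feeding the same transformation and target.
sources = {
    "email": ["I was overcharged and want a refund."],
    "call_notes": ["Customer reported an outage.", "Asked about store hours."],
}
for doc_id, text in extract(sources):
    load(conn, transform(doc_id, text))

issue_counts = conn.execute(
    "SELECT issue, COUNT(*) FROM doc_issue GROUP BY issue ORDER BY issue"
).fetchall()
```

Because the three stages only share simple row structures, a new source or a new tagging rule can be added by changing configuration-like data (`sources`, `ISSUE_TERMS`) rather than rewriting the pipeline, which is the maintenance argument the text makes for an “ETL” approach.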
LEARNING FROM ENTERPRISE BI DEPLOYMENTS
Besides the need for an “ETL” paradigm, lessons learned from world class enterprise BI and DW implementations tell
us that:
• Data volume / scalability will be important. Structured databases are commonly multi-terabyte in size.
Unstructured data, which is five times more prevalent, must also be stored in a highly scalable architecture.
• Users need flexibility to choose any analysis tools. We know from experience with structured analytics that
users will want to analyze unstructured documents using popular BI, Search, or other analysis interfaces.
• A relational data warehouse is required to do real analytics. Relational data warehouses are highly scalable
and have the most analytical flexibility, which will be as important for unstructured analytics as it is in structured.
• Analytical approaches applied to structured data also apply to unstructured content. Users will need the
ability to utilize existing analytical approaches, such as: multi-dimensional analysis, time-series analysis, ranking
analysis, market-basket analysis, and anomaly analysis.
• Leveraging best-of-breed is important. As is true in the structured world, a next generation unstructured
analysis solution must integrate a variety of independently developed technologies. These must be integrated to
perform a comprehensive analysis task and then their results must be funneled into systems that allow users to
rapidly find and exploit the discovered knowledge, for example, search engines, databases and/or knowledge
bases.
Ushering in the Third Generation of Unstructured Analytics
We can see that a new approach to unstructured analytics is needed. This new approach is innovative, but
fortunately the underlying technologies are proven. The Clarabridge Content Mining Platform is designed to avoid the
pitfalls of the first two generations, capitalize on lessons learned, and be successful in the unstructured analysis
future. Clarabridge effectively converges the text and business intelligence worlds. Clarabridge takes “ETL” a step
further by providing a pre-packed analytical database as well as connectors to various analytical front ends.
Specifically, Clarabridge employs a platform approach to efficiently and effectively source, transform, store, and
analyze unstructured data, and to do so in a way that is easy to manage, as depicted in the figure below.
Figure 3 – Components of the Clarabridge Content Mining Platform.
• Source. The Clarabridge platform connects to a variety of unstructured and structured source systems and data
types, such as file servers, web servers, enterprise content management systems, document management
systems, enterprise applications, and many other systems containing structured and unstructured data.
• Transform. Once data is sourced, it must be understood, merged, and integrated with other structured and
unstructured content. Clarabridge performs a variety of transformations to unstructured data (e.g., concept
extraction, natural language processing, data matching, table extraction, etc.), either out-of-the-box or through
integration with other technologies.
• Store. As data volumes explode, the Clarabridge platform responds with a highly scalable architecture that
utilizes proven data warehousing techniques and platforms. Further, it uses a pre-defined schema for staging
and storing the data extracted from unstructured sources and includes a process to easily integrate transformed
data into a structured data mart, or warehouse for analysis.
• Analyze. Using Clarabridge, end-users are able to efficiently analyze large volumes of data, using whatever
analytical technique or tool they feel is appropriate for the problem at hand. The platform has connectors to a
variety of analysis interfaces, such as BI tools, data mining, visualization, and statistical analysis tools.
• Manage. As the application evolves and grows, there is little need for the Information Technology (IT) shop to
maintain lots of custom coding or extensions. Further, the architecture scales and integrates into existing efforts.
The Clarabridge Content Mining Platform takes raw content and enables that content to be directly mined using any type
of analytical interface. The following sections describe the key capabilities delivered through the Clarabridge Content
Mining Platform to accomplish this result.
Source
There are many unstructured data sources and a primary objective of any business application will be to achieve fast
data access to any of those sources, as well as a clear understanding of the original sources of that data as
downstream analysis progresses. Examples of unstructured sources include text fields (e.g., BLOB fields) in databases,
FTP servers, Web servers, e-mail servers, file servers, document management systems, knowledge management systems,
enterprise search tool repositories, or scanned and OCRed paper files.
Clarabridge contains Source Connectors that interface with various sources and repositories of unstructured data and
manage the process of extracting data from these various sources. Concurrent communication with a wide variety of
sources occurs via APIs, web services, or other methods. This allows the sources to be treated as a “black box” by the
rest of the process components.
The Content Mining Platform processes the text, data, and metadata returned
from the Source Connectors. It then converts the various outputs from the various unstructured source systems into a
consistent schema and format. At the same time, the various pieces of metadata that are also extracted from the
source systems are assembled into a common metadata format. The Clarabridge Extraction Connectors assign a
unique index key to each extracted source document, which allows it to be consistently traced as it moves through
the rest of the system. This key, and the associated metadata stored regarding the source location of the text, also
provides a mechanism to link back to the original text when desired during the course of analysis.
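As a rough illustration of the sourcing step described above, the following Python sketch shows one way a connector
could hide its source behind a small interface while every extracted document receives a unique key and a common
metadata record. All names here (SourceDocument, InMemoryConnector, extract) are hypothetical and are not part of
the Clarabridge product; the in-memory connector simply stands in for an FTP, e-mail, or file-server connector.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SourceDocument:
    doc_key: str                                   # unique index key assigned at extraction time
    text: str
    metadata: dict = field(default_factory=dict)   # common metadata format for all sources


class InMemoryConnector:
    """Hypothetical stand-in for an FTP, e-mail, or file-server connector."""

    def __init__(self, source_type, items):
        self.source_type = source_type
        self.items = items                         # maps source location -> raw text

    def fetch(self):
        """Yield (location, raw_text) pairs; callers never see how the source is read."""
        return self.items.items()


def extract(connector):
    """Convert connector output into a consistent document format with traceable keys."""
    docs = []
    for location, text in connector.fetch():
        # Assumption: a hash of source type and location is a good enough stable key for a sketch.
        digest = hashlib.sha1(f"{connector.source_type}:{location}".encode()).hexdigest()
        metadata = {"source_type": connector.source_type, "location": location}
        docs.append(SourceDocument(digest[:12], text, metadata))
    return docs
```

Because the key is derived from the source type and location, the same document keeps the same key on every run,
which is what makes the link back to the original text possible during analysis.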
Transform
Once Clarabridge sources the unstructured data, its next function is to extract meaning, thereby transforming the raw
content into rich, structured data. This process is termed Semantic Extraction. Next, Clarabridge applies a series of
data quality and staging processes to ensure that data is of high quality and ready for analysis. Accomplishing both
requires a layer of transformation components as well as the ability to consolidate all structured and semi-structured
data, metadata and other findings into a single repository.
SEMANTIC EXTRACTION
There are all kinds of technologies being developed in industry and academia for semantic extraction of content.
Some, for example, specialize in part-of-speech detection, grammatical parsing and named-entity recognition where
proper names, organizations and locations are identified – usually when combined with a dictionary or thesaurus.
Other technologies may specialize in detecting events and times and then others work on detecting relationships
between these elements. Still others are good at extracting objects from content, such as logs, signatures, or tables.
The Clarabridge third generation solution has the flexibility to extract meaning from any content through embedded
functionality, integration with existing commercial or open source technologies (such as the second generation point
analysis tools), or custom-developed components that are plugged in. As technologies become obsolete or
irrelevant, it is easy to swap out older transformation components in favor of better ones without requiring new coding
or breaking the rest of the application components.
Clarabridge Transformation Components provide a range of value-added semantic extraction capabilities, as
demonstrated by the following components. Note that Clarabridge provides all of the transformation functionality
below natively within the application. However, for specific applications, it may be necessary to leverage the
second generation point analysis applications described previously from within the Clarabridge environment. For
those situations, Clarabridge has built-in connectors to all of the best-of-breed text processing and transformation
components.
• Document segmentation and categorization. Transformations that are applied at the document level. For
example, one process groups documents into buckets such as “violent events”, “press release”, or
“financial transaction” through statistical analysis of the underlying content. Another process can automatically
segment the document to identify headers, sections, or objects within the document, such as signatures or logos,
using rules or other semantic approaches.
• Entity extraction. Functionality to determine all people, places, dates, financial amounts, objects, etc. within a
portion of text. Basically, this involves identifying and categorizing the various proper nouns and objects
mentioned in the text.
• Event, relationship, & fact extraction. The process of determining how various entities relate to one another.
For example, if “Person A took a trip to visit Person B”, the relationship between the two entities is “A visited B.”
Relationships can also include attributes or facts about specific entities, such as that the number 35 refers to
“dollars” in a particular sentence.
• Table data extraction. Takes any type of text table that is readable by a human and converts it into a
representation of structured rows, columns, cells, headers, and multipliers that can then be used for further
structured analysis.
• Image extraction. Takes any embedded object, such as a logo, photo, or signature, within a document and tries
to interpret that object, typically by matching it to a database of possible objects.
Figure 4 – Example transformation workflow
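To make entity and relationship extraction concrete, here is a deliberately simple Python sketch based on regular
expressions. It is an illustration only: the patterns, the function names, and the output layout are assumptions
made for this example, and production extraction components rely on much richer linguistic models than pattern
matching.

```python
import re

MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"

# Illustrative patterns only; they cover just the shapes used in the example sentence.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?"),
    "DATE": re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b"),
}

# A capitalized name is one or more capitalized words, e.g. "Person A".
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
VISIT = re.compile(rf"({NAME}) took a trip to visit ({NAME})")


def extract_entities(text):
    """Return typed entities with their character offsets, in reading order."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({"type": label, "text": match.group(), "start": match.start()})
    return sorted(found, key=lambda entity: entity["start"])


def extract_relationships(text):
    """Turn 'A took a trip to visit B' into an (A, 'visited', B) triple."""
    return [(m.group(1), "visited", m.group(2)) for m in VISIT.finditer(text)]
```

Keeping the character offset with each entity is what later allows an analyst to jump from a structured value back
to the exact place in the source text.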
DATA QUALITY AND STAGING
A final, but critical transformation step involves ensuring that the data has the proper quality and is enriched for
analysis. The following capabilities are provided within the Clarabridge Content Mining Platform, although third party
data quality components can also be integrated into the platform:
• Data Matching. Achieves better data quality by applying various cleansing techniques to resolve, for example,
different spellings of a single entity across two different documents.
• Data Merging. This involves combining like data from structured and unstructured data sources or enriching data
by associating additional attributes about that data from across documents.
• Dimensions / Hierarchy. Organizes data along relevant dimensions, such as time or geography, to allow data to be
“sliced and diced” along a number of dimensions. This may also include the integration of ontologies, dictionaries,
taxonomies, or thesauri, which are used to organize data into various hierarchies and relationships for further
analysis.
• Confidence Level. Since unstructured data is often imprecise, the ability to understand the confidence level of
any finding is critical. Building confidence analysis into the platform allows users not only to see and analyze
data within structured analysis tools, but also to calculate a numeric confidence level for each data element or
aggregate data calculation. The platform joins many data points that are captured throughout the flow of data to
create a weighted, statistically oriented calculation of the confidence that can be assigned to any point of data.
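The paper does not specify how the confidence calculation is weighted, so the following Python sketch shows only
one plausible reading: a weighted average of the confidence scores recorded at each processing step, and a
value-weighted average when several elements are rolled up into an aggregate. The function names, the step names,
and the choice of weighted averages are all assumptions for illustration.

```python
def element_confidence(step_scores, weights=None):
    """Weighted average of per-step confidence scores, each in the range 0..1.

    step_scores maps a step name (e.g. "ocr", "entity") to its score.
    weights maps the same names to relative importance; equal weights by default.
    """
    if weights is None:
        weights = {name: 1.0 for name in step_scores}
    total_weight = sum(weights[name] for name in step_scores)
    return sum(step_scores[name] * weights[name] for name in step_scores) / total_weight


def aggregate_confidence(elements):
    """Confidence of a summed metric: each element counts in proportion to its value.

    elements is a list of (value, confidence) pairs.
    """
    total = sum(abs(value) for value, _ in elements)
    if total == 0:
        return 0.0
    return sum(abs(value) * confidence for value, confidence in elements) / total
```

For instance, a total built from a 100 with confidence 0.9 and a 300 with confidence 0.5 is dominated by the less
reliable figure, and the aggregate confidence reflects that.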
TRANSFORMATION WORKFLOW
Clarabridge’s third generation solution includes a transformation workflow engine which manages the process of
taking the collected unstructured data and passing it through one or more transformation components.
Transformations are run in a coordinated process, as the results of one or more transformations may serve as an
input to downstream transformations. Further, Clarabridge provides a common API to a wide variety of custom, open
source, or third party unstructured data transformation technologies so each of those transformations can operate as
a “black box” abstraction from the rest of the system. Clarabridge retains complete metadata and links back to the
original source data, which allows end users to trace back through the transformation that took place and from there
back to the original source of unstructured data.
As an example, assume a regulatory organization wanted to mine SEC filings to identify related party transactions
that may indicate fraudulent activity. Using an FTP source connector to Edgar Online, the thousands of pages of
publicly available SEC filings could be accessed via the platform. Once loaded into the Capture Schema, those
documents could be run through a transformation workflow to extract sections, headers, tables, and related party
transaction information from the documents. Each of these transformations requires different technologies and
configurations, but data and metadata are managed through the transformation workflow process to allow those
components to be accessed without coding or understanding their technical underpinnings. Since the resulting data is
in a database, it is easily merged with structured financial data. Using the analysis components described in the next
section, investigators could easily see trends, statistical anomalies, and “slice and dice” companies, industries, and
transaction types to root out suspicious behavior.
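The coordinated workflow described above can be pictured with a short Python sketch. It assumes a hypothetical
common interface in which every transformation component is a function that receives the working record and returns
its findings; the component names and the record layout are invented for this example and do not describe the
actual Clarabridge API.

```python
def segment(record):
    """Split the raw text into sections on blank lines."""
    sections = [part.strip() for part in record["text"].split("\n\n") if part.strip()]
    return {"sections": sections}


def find_related_party(record):
    """Downstream step: consumes the sections produced by `segment`."""
    hits = [s for s in record["sections"] if "related party" in s.lower()]
    return {"related_party_sections": hits}


def run_workflow(doc_key, text, steps):
    """Run components in order; each one sees the findings of all earlier ones."""
    record = {"doc_key": doc_key, "text": text, "trace": []}
    for name, component in steps:
        findings = component(record)
        record.update(findings)
        # Keep a trace so a result can be followed back through each transformation.
        record["trace"].append({"step": name, "produced": sorted(findings)})
    return record


filing = "Item 1. Business overview.\n\nItem 13. Related party transactions: loan to a director."
result = run_workflow("d1", filing, [("segment", segment), ("related_party", find_related_party)])
```

Swapping a component for a better one only requires changing the entry in the step list, which is the practical
benefit of treating each transformation as a “black box”.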
SEMANTIC DISTILLATION: WHAT IT IS AND WHY IT IS IMPORTANT
For many applications involving unstructured data analysis, it is important to perform a “semantic distillation” of
the relevant source information. Essentially, this means that all available information is sourced from the content
as efficiently as possible using a number of pre-configured transformations. Although a certain threshold of quality
is required, quantity is typically desired over exacting quality for these applications. This makes sense for two
reasons.
First, for many applications, finding a precise answer to a question is not as important as finding anomalies over
large data sets. Continuing the above example, if an investigator is reviewing SEC filings to detect clues related to
poor corporate governance, he would be interested in reviewing all related party transactions identified across those
filings. He might be interested in seeing which seem unusual when compared to industry mean activity. In this
situation, the investigator is not actually looking for a specific example of poor governance. He is looking through
large volumes of data to find unusual patterns that could guide further investigation that may reveal poor
governance. In other words, he is using the unstructured data to narrow the scope of his analysis and increase his
odds of finding an issue. Many unstructured applications follow this same logic.
Second, an analyst typically is not aware of what he is looking for in unstructured data until “he sees it first.”
Further, one analysis can lead to another. In both cases, the analyst requires more data to complete the
investigation than was originally contemplated. There are two ways to deal with this situation. One is to build
additional “rules” into the various approaches used for data transformation. Unfortunately, those tools require a
linguistic understanding, and valuable analysis time is expended while attempting to extract the precise information
needed. A better solution is to extract more data up front and utilize the analysis tools to make the increased
volume of data manageable. This is the essence of data mining in the structured world, where the goal is to analyze
large volumes of data to find interesting information. Unstructured data simply requires the additional step of
converting the content into structured data first. Fortunately, data mining tools are very sophisticated at
filtering, sorting, and grouping large volumes of data so that it is easy for an end user to manage. This allows the
user to pursue an “analytical discovery” process versus fully defining the problem up-front.
THE ESSENCE OF DATA MINING
• Examine many combinations of parameters/variables; not just “obvious” ones
• Churn through millions of calculations searching for patterns, relationships, anomalies
• Apply multiple algorithms: linear regression, trees, neural networks, graphs
• Present results to user for evaluation; user keeps interesting results, discards the trivial
• Iterate
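The idea of comparing related party transactions with industry mean activity can be sketched in a few lines of
Python. The example below uses a simple z-score against the industry mean as the anomaly test; that choice, the
threshold of 2.0, the field names, and the sample figures are assumptions made for illustration, not a method
described in this paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_unusual(transactions, threshold=2.0):
    """Return (company, z_score) for amounts far from their industry mean."""
    by_industry = defaultdict(list)
    for t in transactions:
        by_industry[t["industry"]].append(t["amount"])

    flagged = []
    for t in transactions:
        amounts = by_industry[t["industry"]]
        spread = pstdev(amounts)
        if spread == 0:
            continue  # every amount in the industry is identical, so nothing stands out
        z_score = (t["amount"] - mean(amounts)) / spread
        if abs(z_score) >= threshold:
            flagged.append((t["company"], round(z_score, 2)))
    return flagged


# Made-up extracted amounts (in millions) for eight companies in one industry.
amounts = [10, 12, 11, 9, 10, 11, 10, 95]
transactions = [
    {"company": name, "industry": "retail", "amount": amount}
    for name, amount in zip("ABCDEFGH", amounts)
]
```

The point of the sketch is the workflow rather than the statistic: broad extraction produces many rows, and a
cheap filter narrows them to the few that merit an investigator's attention.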
Store
With unstructured data greatly eclipsing structured in terms of data volumes, it is essential that any application is built
on top of a highly scalable analytical architecture. The third generation Clarabridge Content Mining Platform
architecture is designed using lessons learned over the last 15 years in the large, dynamic data warehouse space.
When dealing with data volumes ranging from 100 gigabytes to hundreds of terabytes (or more), special approaches need
to be used at all phases of the data sourcing, transformation, and loading process.
Clarabridge employs “hooks” back to underlying source data within its data Capture Schema (a pre-designed layout of
data tables and the relationships between the tables) that is specifically designed to serve as a repository for data
captured by the various source connectors and transformations. Clarabridge is designed in an application-independent
manner so that it can hold any type of source unstructured text without being custom designed for each
application.
The Clarabridge Analysis Schema provides a data schema that can be used to perform a wide range of differing
types of analysis for a wide variety of applications based on data extracted from unstructured text. It supports
commercially available analysis such as business intelligence, data mining, link analysis, mapping, visualization,
reporting, and statistical analysis. Further, Clarabridge is designed in an open manner to support various types of
analytical applications.
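A minimal sketch of the two-schema idea, using Python's built-in sqlite3 module: generic capture tables hold
documents and findings regardless of application, and a load step reshapes them into an analysis table. The table
names, column names, and the single-fact layout are hypothetical and far simpler than any real schema; the sample
location uses a placeholder address.

```python
import sqlite3


def build_schemas(conn):
    """Create hypothetical capture tables and one analysis table."""
    conn.executescript("""
        CREATE TABLE capture_document (doc_key TEXT PRIMARY KEY, source_location TEXT, raw_text TEXT);
        CREATE TABLE capture_finding (doc_key TEXT, finding_type TEXT, value TEXT, confidence REAL);
        CREATE TABLE analysis_fact (doc_key TEXT, source_location TEXT, company TEXT,
                                    amount REAL, confidence REAL);
    """)


def load_analysis_schema(conn):
    """Pivot generic findings into one analysis row per document, keeping the link to the source."""
    conn.execute("DELETE FROM analysis_fact")
    conn.execute("""
        INSERT INTO analysis_fact
        SELECT d.doc_key, d.source_location, c.value, CAST(a.value AS REAL),
               MIN(c.confidence, a.confidence)
        FROM capture_document d
        JOIN capture_finding c ON c.doc_key = d.doc_key AND c.finding_type = 'company'
        JOIN capture_finding a ON a.doc_key = d.doc_key AND a.finding_type = 'amount'
    """)


conn = sqlite3.connect(":memory:")
build_schemas(conn)
conn.execute(
    "INSERT INTO capture_document VALUES (?, ?, ?)",
    ("d1", "ftp://example.invalid/filing1.txt", "Acme Corp lent $35 to a director."),
)
conn.executemany(
    "INSERT INTO capture_finding VALUES (?, ?, ?, ?)",
    [("d1", "company", "Acme Corp", 0.9), ("d1", "amount", "35", 0.8)],
)
load_analysis_schema(conn)
```

Because the capture tables store findings as generic type and value pairs, a new application changes only the load
step, not the capture layout.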
A packaged ETL layer provides a mapping and loading routine to automatically migrate data and metadata from the
Capture Schema to the Analysis Schema. This is a general-purpose ETL layer for the two general-purpose schemas that
it moves data between. This ETL layer can be integrated with existing structured data and applications using
commercially available ETL tools.
The platform has the ability to update data as required, including: incremental, real-time, streaming, or batch
updates. Further, it allows any preferred database management system (DBMS) to be utilized.
Analyze
The Clarabridge third generation solution enables users to rapidly apply proven analysis techniques and tools to
unstructured data, effectively asking “Any Question”, utilizing “Any Analysis Technique” against “Any Data Source.”
To accomplish this, it enables structured tools to connect via connectors and perform analysis on the transformed
data to allow for structured analysis of initially unstructured data.
Analysis tools can access the Analysis Schema using a standard web services approach, so that structured analysis
tools can analyze the results of transformations applied to unstructured data. It also allows data to be joined to
other existing structured data that may, for example, reside in a data warehouse. By allowing the analysis of
structured data and unstructured data together, new insights and findings can be found that would not be possible
from structured data alone.
The Clarabridge Content Mining Platform enables various structured data analysis tools to analyze the data present
in the analysis schema. Analysis tools may include search, visualization, BI, data mining, mapping, link analysis,
and statistical analysis technologies. A key principle is that a user should not have to select the tool based on the
type of data he or she is analyzing. Rather, he or she should be free to select the tool based on the specifics of the
problem or task at hand.
TYPES OF ANALYSIS TOOLS
• Business Intelligence (BI) Tools. Technologies to enable raw data to be transformed into valuable information
for analysis and decision making. Features usually include dashboards, reports, ad-hoc analysis, and OLAP analysis.
• Data Mining Tools. Tools used for pattern detection, anomaly detection, and data prediction against large sets
of numerical data.
• Data Visualization, Link Analysis, and Mapping Tools. Tools used for visually describing, presenting, and
analyzing data such as connections between various people, events, and places.
• Statistical Analysis Tools. Tools useful for the collection, analysis, interpretation and presentation of
masses of numerical data. Statistical calculations, such as linear regressions, variances, and means are typically
applied to the data.
• Search Tools. Technology used to rapidly query large volumes of data, usually unstructured, and quickly return
relevant documents.
Clarabridge has the capability to pre-populate the metadata of the analysis tool utilizing the tables, columns,
attributes, facts, and metrics in the analysis schema. Further, report templates are available out of the box for
solution areas, such as CRM and Investigation, for leading BI tools. This allows users to immediately begin analyzing
the data present in the Analysis Schema without performing tool customization or any application specific setup.
Users are now free to apply proven analytical approaches that were previously impossible to perform against
unstructured data, such as:
• Multi-dimensional analysis. Slicing or filtering data according to various dimensions, such as time or location
• Time-series analysis. Tracking how things have changed over time or determining the evolution of concepts
• Ranking analysis. Focusing on most critical items by ranking the top-10, bottom-10, etc.
• Market-basket analysis. Identifying what types of things typically are found with others or finding unexpected
relationships between people, places, or objects
• Anomaly analysis. Determining what events are unusual when compared with others or what items
unexpectedly disappear
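Three of the approaches listed above (multi-dimensional slicing, ranking, and market-basket analysis) can be
illustrated with a short Python sketch over a handful of extracted facts. The record layout, the function names,
and the sample entities are hypothetical; in practice these operations would be carried out by a BI or data mining
tool reading the Analysis Schema.

```python
from collections import Counter
from itertools import combinations

# Made-up facts: the entities extracted from each document, plus a time dimension.
facts = [
    {"doc": "d1", "year": 2005, "entities": ["Acme", "Bolt", "Crane"]},
    {"doc": "d2", "year": 2005, "entities": ["Acme", "Bolt"]},
    {"doc": "d3", "year": 2006, "entities": ["Acme", "Crane"]},
]


def slice_by(rows, **filters):
    """Multi-dimensional analysis: keep rows matching every dimension filter."""
    return [row for row in rows if all(row[key] == value for key, value in filters.items())]


def rank_entities(rows, top=10):
    """Ranking analysis: the most frequently mentioned entities."""
    return Counter(entity for row in rows for entity in row["entities"]).most_common(top)


def co_occurrence(rows):
    """Market-basket analysis: how often two entities appear in the same document."""
    pairs = Counter()
    for row in rows:
        pairs.update(combinations(sorted(set(row["entities"])), 2))
    return pairs
```

Each function accepts the output of `slice_by`, so the techniques compose: for example, ranking can be restricted
to a single year before counting.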
Clarabridge provides the ability to drill through to the original unstructured source document. This allows an analyst to
completely understand the genesis of any result that they see in the structured analysis tool, to know exactly where
the data came from and how it was calculated, and to be able to drill all the way back to the original document or
documents to confirm and validate any element of the resulting structured analysis. And since the confidence
information is propagated through the system as described previously, analysts understand how reliable certain
metrics are. This provides for quality level context while analyzing data generated by Clarabridge.
Manage
To achieve the lowest total cost of ownership, the Content Mining Platform is easy to manage. Some of the keys to
manageability include:
• Eliminating custom coding and application customization. As you add more sources, Clarabridge can be easily (or
automatically) reconfigured to accommodate those sources.
• Open platform. To ensure extensibility, the Clarabridge platform is open and standards based, using a
service-oriented architecture (SOA), and is built with modern J2EE technology. It also leverages emerging standards,
such as the IBM Unstructured Information Management Architecture (UIMA).
• Scalable architecture. The platform is designed in a multi-threaded, grid-friendly distributed manner to allow for
the parallel processing of extremely large amounts of data through the system on a continuous, real-time,
high-throughput basis.
• Scheduling capabilities. Applications can be executed without human intervention.
• Intuitive Clarabridge Analyst interface. Clarabridge leverages a patent-pending workflow paradigm to
integrate the various transformation approaches, utilizing a graphical user interface that requires minimal coding
or understanding of the underlying transformation technologies for a system administrator. This means it is
straightforward to modify an application over time.
In addition to the above, the Clarabridge Content Mining Platform supports data lineage – or the analytical “path” that
led to a certain conclusion. Data lineage is implemented by capturing, in the solution, all the necessary information
required to help an analyst understand the sources of data presented in an analysis, the transformations performed
on an original data element (such as entity extraction, data quality processes, matching processes), the confidence
factor of transformations, and the dates and times of all processing steps. This information can be critical to the
proper understanding of the information being analyzed, and can help analysts “trace” reporting insights back to the
original systems of record.
The platform contains metadata, or “data about the data” permitting the presentation of data lineage insights for all
data that is processed. Because the metadata is built in, implementing data lineage functionality does not require any
special application development or configuration – it is a natural byproduct of the application creation process.
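The kind of information that data lineage captures (source, transformations, confidence factors, and processing
times) can be sketched as a small Python record. The class and method names are hypothetical and the structure is
only an illustration of the concept, not the platform's actual metadata model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageStep:
    transformation: str       # e.g. "entity_extraction" or "data_matching"
    confidence: float         # confidence factor reported by the transformation
    processed_at: datetime    # when the step ran


@dataclass
class DataElement:
    value: str
    source_location: str      # where the original document lives
    steps: list = field(default_factory=list)

    def record(self, transformation, confidence):
        """Append one processing step with the current UTC time."""
        self.steps.append(LineageStep(transformation, confidence, datetime.now(timezone.utc)))

    def lineage(self):
        """The path from the system of record through every transformation."""
        return [self.source_location] + [step.transformation for step in self.steps]

    def weakest_step(self):
        """The transformation with the lowest confidence, or None if no steps ran."""
        if not self.steps:
            return None
        return min(self.steps, key=lambda step: step.confidence).transformation


element = DataElement("Acme Corp", "ftp://example.invalid/filing1.txt")
element.record("entity_extraction", 0.9)
element.record("data_matching", 0.8)
```

An analyst questioning a result could read `lineage()` to see how the value was produced and `weakest_step()` to
see which stage deserves the closest review.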
Benefits of This Approach
The benefits of the above approach are many and include broader analysis capabilities, faster ROI, and rapid time-to-
value.
Broader Analysis Capabilities
By having access to all available data in an analytical framework, users spend more time analyzing and less time
retrieving and piecing together bits and pieces of unstructured information. They are free to apply proven analysis
approaches, such as anomaly analysis, to virtually unlimited data to detect trends, issues, and opportunities revealed
by their unstructured data. Further, users enjoy a holistic view of their information
assets to enable analytical discovery across multiple problem domains, data types, and source systems.
This enables enterprises to create entirely new business applications to better serve customers, control cost and risk,
compete effectively, and drive profitability as demonstrated by the following simple examples:
Industry: Insurance, Application: Claims Fraud Detection
“What types of claims text, descriptions, comments, incidents, damage reports, scenarios, etc. indicate potential
anomalous or fraudulent claims that may merit further investigation?”
Data sources: Call center notes, claims historical archives, claims case files, structured claims and cost data
Industry: Insurance/Telecom/Financial, Application: Customer Retention
“How do particular patterns of communication over time between my company and its customers lead to either
the retention or loss of customers?”
Data sources: Call center notes, emails, claims case files, structured claims and customer service data
Industry: Healthcare, Application: Drug Efficacy and Side Effects
“What symptoms disappear over time most often for patients prescribed Prozac for more than 24 months? What
new conditions or symptoms tend to appear most often with long-term use?”
Data sources: Doctor visit notes, hospital visit notes, claims history
Finally, managing the metadata about the sourced and transformed data allows an analyst to completely understand
the genesis of any result that they see in the structured analysis tool. They know exactly where the data came from
and how it was calculated, and they can drill all the way back to the original document or documents to confirm and
validate any element of the resulting structured analysis.
Faster ROI
Organizations have made significant investments in business intelligence and data warehousing solutions. Many
have implemented multiple tools across their various divisions and functional areas. It makes little sense to deploy
another analytical tool into that environment. Those same organizations have spent hundreds of thousands, if not
millions, acquiring and training the staff necessary to run and utilize those tools. By leveraging a Content Mining
Platform, organizations are able to rapidly extend the value of existing investments and their trained staff rather
than deploying new tools and analysis approaches.
The ability to extend the solution over time and swap in and out various components without any custom coding
provides much simpler ongoing administration and ultimately lower total cost of ownership. Further, the scalable and
open architecture will support evolving needs and growing data volumes.
Finally, source connectors to a wide variety of source systems deliver fast, standard data access when creating a
new application. This becomes increasingly important as the number of data sources inevitably expands, making it
increasingly difficult to maintain point extraction code.
Rapid Time-to-Value
Rather than “re-invent the wheel”, enterprises are able to rapidly integrate and leverage leading structured tools,
including: statistical analysis, reporting, visualization, search, and data mining. Further, best-of-breed text
processing technologies can be rapidly incorporated into the Content Mining Platform. This is important because the
best unstructured and structured tools vary from application to application and evolve over time.
The platform allows less up-front work when developing an application for a number of reasons. First, the various
components are treated as a “black box”, which makes them easy to leverage without understanding their inherent
complexity. Second, a great deal of the difficulty in creating a BI application is related to the ETL process and
generating the target analytical data model as described above. By creating a universal capture schema, pre-defined
ETL process, and an analytical schema, much of this effort is eliminated every time a new application is created.
Third, since reporting metadata is automatically generated, users can immediately begin analyzing the data without
performing tool customization or any application specific setup.
Finally, integrating multiple transformation approaches using a configurable workflow delivers greater data quality
and gets end-users quickly using unstructured content to address real-world business problems, rather than
spending months trying to perfectly design a solution up-front.
VALUE FROM TEXT
• Financial Services. Leverage customer interactions to optimize product features
• Healthcare. Leverage clinical notes to improve disease management
• Insurance. Leverage claims text to detect fraudulent activities
• Manufacturing. Leverage warranty claims text to uncover liabilities and trends
• Public Sector. Leverage public filings to expose suspicious activity
• Retail. Leverage customer return information to change product mix
• Telecommunications. Leverage call center notes to reduce customer churn
Appendix A. Suggested Evaluation Checklist
The decision to invest in technology to support a business intelligence application is an important one for any
organization – this is especially true for those that involve unstructured data. As you prepare a request for proposal
(RFP) for your unstructured data analysis needs, consider including the following criteria in your evaluation.
Source Criteria
√ Connect to a broad variety of data sources (FTP servers, Web servers, email servers, file servers, scanned
and OCRed paper files) without custom coding
√ Manage the process of extracting data from various sources
√ Enable concurrent communication with source systems
√ Integrate with existing commercial technologies to extract meaning from text
√ Allow for custom-developed technologies to extract meaning from text
√ Abstract underlying sourcing technologies as a “black box”
√ Capture text, data, and metadata that are generated by sourcing and extraction services
√ Assemble common metadata format across all sources and transformations
√ Enable link back to original source document
Transform Criteria
√ Allow unstructured data to be captured and passed through one or more embedded, custom, open-source,
or commercial transformation components
√ Provide coordinated workflow process, so the results of one or more transformations may serve as an input
to downstream transformations
√ Provide integration to wide variety of data transformation technologies
√ Capture results of transformation in a database
√ Retain complete metadata and links back to the original source data
√ Allow analyst to drill into the genesis of an analytical result or metric
√ Enable semantic distillation of unstructured data
Store Criteria
√ Ensure architecture is highly scalable to 100s of terabytes or more of data
√ Utilize capture schema with hooks back to underlying source data
√ Capture schema is application-independent and does not require custom design for every new application
√ Utilize analysis schema that supports commercially available analysis such as BI, data mining, link analysis,
mapping, visualization, reporting and statistical
√ Enable access to analysis schema via other applications (i.e., open)
√ Provide ETL to move data automatically from capture schema to analysis schema
√ Allow data to be updated as required, including: incremental, real-time, and batch.
√ Support any commercially available DBMS
Analyze Criteria
√ Provide a web services framework for structured analysis tools to access and analyze data contained in the
analysis schema
√ Allow data to be easily joined with existing structured data sources
√ Provide direct access via a variety of commercially available analysis tools, such as BI, data mining, link
analysis, mapping, visualization, reporting and statistical, without custom coding
√ Automatically pre-populate the metadata of the analysis tool utilizing the tables, columns, attributes, facts,
and metrics contained in the analysis schema
√ Enable application of proven analysis techniques, such as multi-dimensional analysis, time-series analysis,
ranking analysis, market-basket analysis, and anomaly analysis
√ Provide the ability to drill through to the original unstructured source document
√ Join together many data points that are captured throughout the flow of data to create a weighted, statistically
oriented calculation of the confidence that can be assigned to any point of data
Manage Criteria
√ Enable reconfiguring of platform as new data sources or transformations are added without custom coding
√ Support open standards, including a service-oriented architecture
√ Support multi-threaded, grid-friendly, distributed computing architecture
√ Enable scheduling of applications
√ Support meta-data management throughout analytical lifecycle
√ Provide user-friendly administrative interface
Appendix B: Technical Glossary
• Anaphora Resolution. Anaphora resolution refers to linking pronouns such as “his”, “her”, “their”, and “it” to the
correct people, places, or things mentioned earlier in a piece of text.
• Business Intelligence (BI) Tools. Technologies that transform raw data into valuable information
for analysis and decision making. Features usually include dashboards, reports, ad-hoc analysis, and OLAP
analysis. Tools include Cognos, MicroStrategy, Business Objects, Actuate, and Pentaho.
• Data Lineage. The analytical “path” that led to a certain conclusion. This information can be critical to the proper
understanding of information being analyzed. Also refers to the path that a certain data element or value took all
the way from source(s), through various transformations, to the resulting analysis.
• Data Mining Tools. Tools used for pattern detection, anomaly detection, and data prediction against large sets
of numerical data. Example tool vendors in this area include Angoss, IBM Intelligent Miner, and SAS.
• Concept extraction. Also known as topic extraction, concept extraction involves understanding the underlying
concept that a document or section of a document is describing. Techniques such as automatically applying
categorizations can be used for concept extraction.
• Categorization and topic extraction. This is the process of grouping various documents or entities within
documents into various buckets. Typically this is done with a dictionary, thesaurus, ontology, or taxonomy.
• Data Visualization and Mapping Tools. Tools used for visually describing, presenting, and analyzing data. For
example, a link analysis tool, such as I2, would be used to visually show the connections between various
people, events, and places. Mapping tools such as ESRI and MapInfo show spatial interrelations between data
on maps.
• Data Warehouse (DW). A data warehouse is a database that contains a record of an organization’s past
transactional and operational information designed for efficient data analysis and reporting.
• Entity extraction. Determining all people, places, objects, etc. within a document. Tools include Inxight,
Aerotext, Lingpipe, GATE, and NetOwl.
• Extract, Transform, Load (ETL). The process whereby structured data is sourced from multiple data
repositories; transformed to allow it to be cleansed, merged with other data, or manipulated for analytical
purposes; and loaded into a data warehouse for analysis.
• Natural Language Processing (NLP). The understanding and manipulation of natural language to enable
computers to "understand" the meaning of written human languages.
• Online Analytical Processing (OLAP). Sometimes called dimensional analysis, OLAP is the process of slicing
data by various dimensions (time, dollars, product line, etc.) to see summary and detailed data for decision
making.
• Ontology. An understanding of the “fundamental categories of things in the world.” For analysis purposes,
ontologies can help to group various entities to better understand the context of that entity within a broader
domain and how it relates to other entities. They can also be used to filter out certain types of entities that are
undesired for a particular analysis.
• Relationship extraction. Also called event extraction, transaction extraction, and fact extraction, relationship
extraction is the process of determining how various entities relate to one another. For example, if “Person A
took a trip to visit Person B”, the relationship between the two entities is “A visited B.” Relationships can also
include attributes or facts about specific entities, such as the fact that the number 35 refers to “dollars” in a
particular sentence. Tools include Attensity and Clearforest.
• Semantic Distillation. Essentially, this means that all available information is sourced from the content as
efficiently as possible using a number of pre-configured transformations. Although a certain threshold of quality is
required, quantity is typically desired over exacting quality for applications using semantic distillation.
• Statistical Analysis Tools. Tools useful for the collection, analysis, interpretation and presentation of masses of
numerical data. Statistical calculations, such as linear regressions, variances, and means are typically applied to
the data using tools such as SAS or SPSS to test various hypotheses about those data.
• Structured Data. Content which has structure that is easily interpreted by a machine, commonly in a database
or XML format.
• Taxonomy. A hierarchy of “things.” For example, a geographical taxonomy may include relationships between a
city, state, and country. This is useful for “drilling into” data from a higher level down to lower levels of detail.
• Text analysis application. An application used for extracting meaning from content. Typical functions include
part-of-speech detection, grammatical parsing, and named-entity recognition. There are many of these
applications, and each is good at particular functions within particular information domains. However, when
used independently, they are not useful for analysis.
• Text processing. See text analysis application.
• Unstructured Data. Content which does not have a structure that is easily interpreted by a machine. Examples
of unstructured data may include audio, video and unstructured text such as the body of an email or word
processor document.
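Several of the terms above (entity extraction, taxonomy, and OLAP) can be seen working together in a short example. The following is a minimal sketch, assuming that city names have already been extracted from a set of documents. The taxonomy contents, the sample records, and the function names are hypothetical and serve only to illustrate rolling up along a taxonomy and slicing by a dimension.

```python
from collections import Counter

# Hypothetical geographic taxonomy: each city maps to its (state, country).
TAXONOMY = {
    "Reston": ("Virginia", "USA"),
    "Richmond": ("Virginia", "USA"),
    "Austin": ("Texas", "USA"),
}

# Hypothetical extracted facts: (document id, city mentioned, month).
MENTIONS = [
    ("doc1", "Reston", "2006-01"),
    ("doc2", "Richmond", "2006-01"),
    ("doc3", "Austin", "2006-02"),
    ("doc4", "Reston", "2006-02"),
]


def roll_up(mentions, level):
    """Count mentions at a taxonomy level: 'city', 'state', or 'country'."""
    index = {"city": None, "state": 0, "country": 1}[level]
    counts = Counter()
    for _doc_id, city, _month in mentions:
        key = city if index is None else TAXONOMY[city][index]
        counts[key] += 1
    return counts


def slice_by_month(mentions, month):
    """OLAP-style slice: keep only the facts for one value of the time dimension."""
    return [m for m in mentions if m[2] == month]


by_state = roll_up(MENTIONS, "state")  # higher level of the taxonomy
january_cities = roll_up(slice_by_month(MENTIONS, "2006-01"), "city")  # drill into detail
```

Moving from `by_state` to a city-level count for a single month mirrors “drilling into” data from a higher level of the taxonomy down to a lower level of detail.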
Founded in 2005 by leading experts in the Business Intelligence (BI) industry and
backed by a premier venture capital investment partner, Clarabridge is an emerging
leader in helping private and public sector enterprises leverage unstructured content
to provide critical operational and strategic business insight. Unlike traditional
approaches that are inflexible, expensive, and time-consuming, Clarabridge’s
patent-pending software uniquely combines the best of the structured and
unstructured analysis worlds, allowing enterprises to greatly extend the value of their
existing BI investments. Clarabridge is the only enterprise-class solution that rapidly
enables users to directly mine text alongside existing structured data, using
standard BI tools and analysis techniques, to address a host of real-world business
needs.
© 2006 Clarabridge, Inc. All Rights Reserved.
All other trademarks and logos are property of their respective owners.
11400 Commerce Park Drive, Suite 500
Reston, VA 20191
P: 703.663.2500 │ F: 703.269.1505
www.clarabridge.com