Converging Text and BI:
The Case for a Content Mining Platform
Although enterprises commonly utilize business intelligence (BI) tools against structured data for analysis and decision making, leading organizations recognize that they must take a more holistic view of their information assets and find ways to creatively analyze the exponentially growing universe of unstructured content - contracts, press releases, filings, forms, call center notes, medical records, insurance claims, web content, emails, etc. This white paper describes how the Clarabridge Content Mining Platform™ avoids the pitfalls of previous approaches to unstructured analysis, and capitalizes on lessons learned from solving similar problems in the structured domain. A platform approach enables enterprises to efficiently and effectively source, transform, store, and analyze unstructured data alongside structured data – in a way that is easy to manage. The result is broader business understanding, the ability to leverage existing resources, and the freedom to rapidly apply the most appropriate decision support interface.
WHITE PAPER March 6, 2006
Converging Text and BI: the Case for an Unstructured Intelligence Platform
Executive Summary
It is becoming ever more important for today's agile enterprise to use the best available data to drive strategic and operational business decisions. Although most companies deploy business intelligence (BI) tools against structured data to answer a wide variety of questions, leading organizations are increasingly recognizing that they must take a more holistic view of their information assets. They find creative ways to analyze the exponentially growing universe of unstructured content – contracts, press releases, research papers, filings, call center notes, medical records, insurance claims, web content, emails, etc. This content, when understood and analyzed alongside structured data, provides business insight that enables organizations to better serve customers, control cost and risk, compete effectively, and drive profitability.
Text processing technologies are rapidly maturing to enable concept/entity extraction, relationship tagging, and other paradigms that allow more structure to be applied to unstructured data. Search technologies are evolving to provide end users with better ways to retrieve text, but they provide little or no analytic insight, which makes determining precise answers to questions time consuming, tedious, and increasingly difficult. Even more advanced implementations of text processing technologies require complex programming work and are, like search engine technologies, totally disconnected from the time-tested analysis approaches used in the BI world.
Figure 1 – Currently structured and unstructured analysis are done in different ways and with different tools. (Figure labels: a Knowledge Worker uses Search for Unstructured Data, such as call center notes, warranty repair notes, and clinical notes, and BI, OLAP, and Reporting for Structured Data, such as customer demographics, service tickets and outcomes, and patient records. Sample questions: How can we improve satisfaction? What is the root cause of the problem? How do symptoms change over time?)
So how can enterprises better enable users to spend their days making informed decisions versus gathering data? Fortunately, there is much to be learned from two decades of struggling with similar problems in the structured data world. We now know as needs change and evolve, organizations will require the flexibility to integrate the most appropriate text processing technologies to extract desired information. They must enable users to apply time-tested analytical approaches that can be modified or expanded upon as understanding of issues and opportunities emerges from the data itself. For example, a call center should be able to apply a multi-dimensional analysis (i.e., “slice and dice”) to call center logs and email text for assessing trends, root causes, and relationships between issues, people, time to resolution, etc. Organizations should have the infrastructure, storage, and user interfaces to process and efficiently explore large volumes of data. And they need to easily leverage their existing BI and data warehousing (DW) tools presently used only for structured data analyses, to analyze unstructured data alongside structured data.
As organizations adopt analytical approaches to unstructured data, they will need to address a number of challenges:
• Data comes from multiple unstructured repositories (file servers, document management systems, intranet sites, internet sites, database notes fields, etc.)
• Data in unstructured documents is of widely varying quality (often much more so than structured)
• The use of different types of unstructured data tools varies greatly from environment to environment and from problem to problem.
© 2006 Clarabridge, Inc. All Rights Reserved. All other trademarks and logos are property of their respective owners.
• In many cases maximum value in analyzing unstructured data comes from analyzing it alongside existing structured data in data marts or data warehouses.
Fortunately, many of the challenges with unstructured data analytics can be overcome by applying lessons from the BI and DW sectors. Over the past 10 years, the departmental point solutions of the early 1990s rapidly evolved into more robust solutions that leveraged enterprise data warehousing platforms, an extract, transform, and load (ETL) infrastructure, and scalable, server-based BI or reporting solutions. To be successful in the unstructured world, organizations need a platform to leverage their existing BI investments and also efficiently and effectively source, transform, store, and analyze unstructured data – and do so in a way that is easy to manage and scale. That was the vision behind the Clarabridge Content Mining Platform™.
The Clarabridge Content Mining Platform enables enterprises to:
• Source. The Clarabridge platform connects to a variety of source systems and data types.
• Transform. Once Clarabridge sources the unstructured data, a variety of out-of-the-box and third party tools help to ensure it is understood, merged, and integrated with other structured and unstructured data sources.
• Store. As unstructured data volumes explode, Clarabridge responds with a highly scalable architecture that utilizes proven data warehousing techniques and platforms.
• Analyze. End-users are able to efficiently analyze large volumes of data, using whatever analytical technique or tool they feel is appropriate for the problem at hand.
• Manage. As the application evolves and grows, the IT organization does not have to maintain lots of custom coding or extensions with Clarabridge. Further, the architecture scales and integrates into existing efforts.
Using the Clarabridge Content Mining Platform enables users to directly mine text alongside existing structured data, using standard BI tools and analysis techniques, to address a host of real-world business needs. The benefits are enormous and include:
• Broader analysis capabilities. Users spend more time analyzing and less time retrieving text. They are free to apply proven analysis approaches to virtually unlimited data to detect trends, issues, and opportunities revealed by their unstructured data. Further, users enjoy a holistic view of their information assets to enable analytical discovery across multiple problem domains, data types, and source systems.
• Faster ROI. Organizations have made significant investments in BI and DW solutions. Customers are able to rapidly extend the value of those investments and their trained staff rather than deploying new tools and analysis approaches.
• Rapid time-to-value. Rather than “re-invent the wheel”, organizations are able to rapidly integrate and leverage leading unstructured and structured tools, including: statistical analysis, reporting, visualization, search, data mining, and text processing engines. Leveraging a platform allows less up-front work when developing an application and delivers fast data access, data quality, and time to analysis.
Tapping Unstructured Data to Drive Business Value
Organizations today are buried in unstructured content such as contracts, press releases, research papers, forms, filings, call center notes, medical records, insurance claims, web content, emails, etc. Experts agree this content represents more than 80% of an organization’s data. And the amount is growing every day. Furthermore, in an increasingly services-based economy, unstructured transcripts, notes and documents describing business activity provide important insights about customers’ habits, tastes, product use and support requirements, employee work habits and performance, and business process efficiencies and failures.
It is becoming ever more important for today's agile enterprise to utilize the best available data to drive strategic and operational business decisions. Although most companies utilize business intelligence (BI) tools against structured data to answer a wide variety of questions, leading organizations are increasingly recognizing that they must find ways to creatively analyze the exponentially growing universe of unstructured content.
Unfortunately, the structured and unstructured data analysis domains have traditionally been separated along a number of dimensions including analysis approaches, storage, and staff. Typically, analysis of unstructured data involves using a search tool to find documents containing the information you are looking for, whereas analysis of structured data involves using BI or data mining tools to report on performance indicators, trends, changes over time or other quantitative metrics of business activity. Unstructured data is typically stored in file-based servers (such as web servers, document management servers, etc.), while structured data is almost always stored in relational database management systems (RDBMS).
Lastly, staff trained in BI are typically not skilled in the linguistic and other specialized techniques required for analyzing unstructured content and thus rarely use the tools and technologies associated with unstructured data analysis. What is needed is a way to converge the two domains, leveraging the best from both, to unlock the true potential of unstructured data.
Figure 2 – Unstructured and Structured information have traditionally been separated along a number of dimensions. (The figure contrasts “Unstructured” Information with “Structured” Information; labeled items include web content/intranets, documents, emails, spreadsheets, paper files, data marts, data warehouses, CRM, ERP, etc. systems, operational data stores, and metadata.)
When understood and analyzed alongside structured data, unstructured data provides business insight that enables organizations to better serve customers, control cost and risk, and identify opportunities for increased efficiency. Example use scenarios include:
• Automatically identifying top issues in (unstructured) call center logs and proactively routing calls to the right person based on the issue can save millions through reduced call time, not to mention improved customer service.
• Identifying and addressing the top types of problems encountered by the most profitable customers can help reduce loyal customer churn.
• Rapidly detecting emerging product trends in problem-reports coming in from all over the globe can avoid recalls and lawsuits, potentially saving companies millions of dollars.
• Analyzing patient comments, doctor notes, and symptom data can lead to better disease management and identification of new uses for drugs.
• Capitalizing on customer feedback following a product launch can help adjust marketing campaigns months ahead of competitors.
• Reducing hundreds of boxes of documents down to the two that are relevant as part of the legal discovery process reveals previously hidden information in less time than if all documents were read by human beings, and focuses critical resources on higher value tasks.
• Analyzing communications patterns, claims data and patient records to identify insurance fraud can significantly reduce fraudulent claims.
• Automatically mining thousands of SEC reports to predict poor corporate governance can help identify issues before they turn into major crises.
The Evolution of Unstructured Analytics
If the potential is so great, why aren’t more organizations employing unstructured analytics? In large part the underlying technologies for unstructured analytics have only recently matured to support the types of analysis suggested above. Text processing technology has progressed from first generation keyword search to second generation point text analysis applications. Reviewing these first two generations of solutions, we begin to see the potential of unstructured data, but we also see that these technologies are inadequate for true unstructured analytics. This has driven the need for a third generation solution: a Content Mining Platform. With a Content Mining Platform, organizations can finally unlock the value of all their information assets, revealing a wealth of intelligence about their customers, competitors, suppliers, and internal operations.
First Generation Solutions: Search
The first generation text analysis technologies performed “search” – providing keyword search to help a user find documents containing the searched-for words and concepts described by the keywords. While great for retrieving and grouping keywords within documents, these tools have many well-known problems that make them impractical to use for unstructured analytics:
• Inability to store and quantify changes over time. They are unable to easily integrate with databases to efficiently store results and changes over time, and thus are unable to track or quantify the evolution of ideas, or the changes in activity levels of tracked people, processes, or organizations that may be searched.
• Simple interface. Search tools were designed to be easy to use, which restricts their analysis capability to simple Boolean (not/and/or) expressions.
• Inability to extract meaning. Although great at rapidly returning documents, a user still must take the time to read through the returned documents to extract meaning from them.
• Use keyword matching, not semantic understanding. A search tool relies on the user to identify the right combination of keywords to extract the desired information. Unfortunately, keywords can have different meanings in different contexts, which results in many irrelevant responses while at the same time excluding relevant ones that don’t happen to contain the original keywords.
As a result of these limitations, search typically requires a great deal of effort on the part of the user to manually sift through documents and connect bits and pieces of information to make decisions from unstructured data. Further, there is no way to analyze large volumes of unstructured search results alongside structured data. For example, many law firms hire paralegals or junior lawyers to manually sift through documents using search interfaces during a discovery process to tag those that are relevant. Although suitable for some applications, first generation search technology is time consuming and imprecise when used for complex business decision making. And it can be very expensive as highly compensated individuals are used for lower-level tasks like reading through documents to look for routine information.
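The keyword-matching limitation described above can be illustrated with a minimal sketch. It assumes a toy Boolean AND search over three invented call center notes; the notes, the `tokenize` helper, and the `keyword_search` function are hypothetical and do not represent any particular search product.

```python
import re

# Hypothetical call center notes, invented for illustration only.
NOTES = {
    1: "Customer says the battery drains within an hour.",
    2: "Phone will not hold a charge after the update.",
    3: "Customer asked how to charge a purchase to a different card.",
}


def tokenize(text):
    """Lowercase a note and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_search(query, notes):
    """Return the ids of notes that contain every query word (Boolean AND)."""
    terms = tokenize(query)
    return sorted(
        doc_id for doc_id, text in notes.items() if terms <= tokenize(text)
    )


# Note 2 describes a battery problem but never uses the word "battery".
print(keyword_search("battery", NOTES))  # [1]

# "charge" matches the battery note 2 and the unrelated billing note 3.
print(keyword_search("charge", NOTES))   # [2, 3]
```

The first query misses a relevant note and the second returns an irrelevant one, which is why a user is left to read and sort the results by hand.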
Second Generation Solutions: Point Text Analysis Applications
The limitations of the keyword search applications led to a second generation technology, point text analysis applications. Many tools exist to solve a variety of problems related to understanding the true meaning of a document. These tools can scan a text document, for example, and pull out chemical names and their interactions, or identify events, locations, products, opinions about products, problems, methods, etc. Vendors may call their products “Entity Extraction”, “Concept Extraction”, or “Name Matching” products. The technologies all tend to be stove-piped in that they solve a specific problem or work in a specific functional domain. While these tools perform the valuable task of helping to resolve documents to a more granular level – for example identifying and linking actors and events with each other by intelligently parsing and organizing the concepts contained in a document – these tools still present a number of challenges when organizations try to use them for analysis and decision making, including:
• Stove-piped. A solution involving such technologies may not be integrated with standard document management, database management, data migration, metadata management, and data analysis tools. Further, any integration to these platforms can require custom programming to create an enterprise application integration (EAI) solution.
• Offer varying degrees of insights. Each tool tends to be good at specific functions, and at extracting specific types of information, but not others. One tool may be very good at extracting names, while another may be good at extracting events, but no one tool can perform both tasks well. To get the full value of all tools a customer is faced with the task of again manually integrating the products with each other.
• Questionable scalability. These tools are unable to easily integrate with databases to efficiently store results and changes over time, and many are unable to handle volumes of data in the 50GB – 100TB range.
• Different types of users. Users must have a linguistic background or have specialized training that doesn’t commonly exist in an organization.
• Require manual rule building up-front. Many tools require entity extraction rules to be defined up-front about the text being analyzed, which presumes that the problem is fully defined up-front – which is rarely the case. Further, this means that it takes months to even begin an analysis while rules are defined, coded, and refined.
• Incomplete analysis. Because each tool is meant for a specific purpose, there are limits to the types of analysis that can be performed. Once each analysis is complete, the whole system is often shut down and the data is thrown away.
• Re-invent the wheel. Although analysis interfaces surpass those of search, they lack the interface or analytical maturity of existing structured analysis tools, and are not easily compatible with market leading business intelligence, statistical, or visualization tools.
• Multi-vendor solutions require custom coding. Combining two or more solutions typically requires a great deal of custom development. For example, the FBI has notoriously had a difficult time combining the results of multiple extraction solutions into a single consolidated view of a situation. A recent CNN article reported, “The current program requires FBI personnel to manually enter, print, sign and scan their information into the investigative data warehouse.”
In short, the problem with these second generation approaches is that they require a precise understanding of the problem up-front and an enterprise architecture that never changes. In the real world, this never happens. Evolving requirements ultimately drive the need for custom modifications and extensions, which become increasingly costly, less scalable, and less maintainable over time. Further, the approach requires training employees on new analysis techniques, even though many of them have spent years with traditional BI tools and would prefer to leverage the strengths of existing tools against unstructured data. Not surprisingly, second generation applications are analogous to the point BI applications of the early 90s, as described in the next section. Although it is possible to perform limited unstructured analytics with these point-solution tools, forward-looking organizations understand the pitfalls of being strictly locked into a point-application.
Lessons Learned from the BI and DW Worlds
We have learned quite a bit from nearly two decades of BI and DW experience that can be applied to the unstructured analytics domain.
AN “ETL” PARADIGM FOR UNSTRUCTURED ANALYTICS
First, by looking at the history of data integration we know that enterprises will need an “ETL” paradigm for the data that will drive unstructured analytics. In the early days of data warehousing in the early 1990s, organizations originally created “stove piped” or “point reporting” solutions that reported against a single data source under a well-defined set of business rules. In many organizations, for instance, financial reporting solutions were created by building point applications to extract general ledger and accounts payables/receivables tables from a financial system, apply business rules to the data, and load the resulting information into a reporting database – with reporting applications created against the reporting database. Similar applications were created for sales reporting, customer relationship analysis, inventory/supply chain reporting, and other business departments or functions. Most enterprises had dozens of these applications.
This approach became an issue for several reasons. First, as organizations evolved to cross-functional views of their enterprise to focus on business drivers, such as “customer intelligence” and “risk” or “costs”, they needed to merge data from multiple sources. Because modifying one application caused a downstream and upstream impact on all others, point solutions became too difficult to maintain and costs quickly spiraled out of control. Second, as organizations realized that the original problem they were trying to solve had changed, they inevitably needed to go back and add more business rules or extract more data. Clearly, maintaining custom code becomes very difficult. Finally, these two issues magnify one another, creating an exponentially growing problem.
In the structured data realm, integrating stove-piped data to support aggregated and well supported decisions is not a new problem. Over the past 10+ years this problem has been a central focus of large organizations building data warehouses and decision support, or “business intelligence”, applications to analyze structured data. Many data warehouse technologies are now mature enough to support high-quality extract, transformation and loading (ETL), or “fusion” of information for analysis. Rather than re-coding, the ETL platform can be simply re-configured to accommodate new business rules and data sources. However, these technologies almost exclusively focus on the integration of structured data.
Unstructured data requires additional pre-processing before it can be loaded into a repository for analysis – data must be extracted from documents, tagged by entity, concept, and relationship, cleansed and transformed to ensure data quality, and finally loaded into a repository so that analytical tools can be used against it. Complicating matters further, unstructured data tools and technologies are not well integrated, and often require custom integration with each other in order to create a robust solution.
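The extract, tag, cleanse, and load sequence described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any product: the sample documents, the `PRODUCT_TERMS` lookup (standing in for a real entity extraction engine), and the `mentions` table are all hypothetical.

```python
import sqlite3

# Hypothetical source documents; a real deployment would extract these from
# file servers, document management systems, and similar repositories.
DOCUMENTS = [
    {"doc_id": "d1", "text": "Acme  Router keeps rebooting after firmware update"},
    {"doc_id": "d2", "text": "acme router overheating; ACME Modem works fine"},
]

# Assumed lookup table standing in for a real entity extraction engine.
PRODUCT_TERMS = {"acme router": "Acme Router", "acme modem": "Acme Modem"}


def cleanse(text):
    """Normalize case and whitespace so variant spellings match."""
    return " ".join(text.lower().split())


def tag_entities(text):
    """Return the canonical product names mentioned in cleansed text."""
    return sorted(name for term, name in PRODUCT_TERMS.items() if term in text)


def run_pipeline(documents, transforms, tagger, conn):
    """Apply each configured transform, tag entities, and load the rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS mentions (doc_id TEXT, product TEXT)")
    for doc in documents:
        text = doc["text"]
        for transform in transforms:
            text = transform(text)
        for product in tagger(text):
            conn.execute(
                "INSERT INTO mentions VALUES (?, ?)", (doc["doc_id"], product)
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
# The steps are passed in as configuration, so adding a rule means changing
# this list rather than rewriting the pipeline.
run_pipeline(DOCUMENTS, [cleanse], tag_entities, conn)

rows = conn.execute(
    "SELECT doc_id, product FROM mentions ORDER BY doc_id, product"
).fetchall()
print(rows)
```

Because the transforms and the tagger are arguments rather than hard-coded calls, a new business rule or a better extraction component could be swapped in without touching the loading logic, which is the re-configure-rather-than-re-code property the text attributes to ETL platforms.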
To perform robust processing and analysis of unstructured data, multiple products may be required to perform functions such as:
• Entity Extraction
• Concept Determination
• Industry specific thesaurus matching
• Data quality (to resolve varying forms of the same word, typos, and “dirty data”)
• Name matching (to resolve identities, products, foreign word spellings)
• Statistical reporting
• Business Intelligence/data mining
• Link analysis
• Visualization and spatial imaging and mapping
• Ad-hoc query creation
To merge structured and unstructured information for analysis a solution must address integration, administration, and presentation of the data through a variety of technologies and processes. In short, in the unstructured world, we see an even more compelling need for an “ETL” approach to enable analysis because there are:
• More data sources and types;
• More data transformations; and
• More types of analysis possible.
This creates a many-to-many-to-many situation between source, transformation, and targets respectively, which will make point-solutions very difficult to maintain over time. Further, as companies start getting insights from unstructured data, a wealth of new opportunities will emerge making it nearly impossible to pre-determine the questions you will need to answer using your unstructured data. Clearly an “ETL” paradigm is critical to deploy enterprise class unstructured analytics.
LEARNING FROM ENTERPRISE BI DEPLOYMENTS
Besides the need for an “ETL” paradigm, lessons learned from world class enterprise BI and DW implementations tell us that:
• Data volume / scalability will be important. Structured databases are commonly in the multi-terabyte sizes. Unstructured data, which is five times more prevalent, must also be stored in a highly scalable architecture.
• Users need flexibility to choose any analysis tools. We know from experience with structured analytics that users will want to analyze unstructured documents using popular BI, Search, or other analysis interfaces.
• A relational data warehouse is required to do real analytics. Relational data warehouses are highly scalable and have the most analytical flexibility, which will be as important for unstructured analytics as it is in structured.
• Analytical approaches applied to structured data also apply to unstructured content. Users will need the ability to utilize existing analytical approaches, such as: multi-dimensional analysis, time-series analysis, ranking analysis, market-basket analysis, and anomaly analysis.
• Leveraging best-of-breed is important. As is true in the structured world, a next generation unstructured analysis solution must integrate a variety of independently developed technologies. These must be integrated to perform a comprehensive analysis task and then their results must be funneled into systems that allow users to rapidly find and exploit the discovered knowledge, for example, search engines, databases and/or knowledge bases.
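The point that familiar analytical approaches carry over once text has been turned into rows can be illustrated with a minimal sketch. It assumes issue labels have already been extracted from call notes into a relational table; the `customers` and `call_issues` tables and all of their contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Structured data: one row per customer (hypothetical).
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
    -- Derived from unstructured call notes: one row per tagged issue (hypothetical).
    CREATE TABLE call_issues (customer_id INTEGER, call_month TEXT, issue TEXT);

    INSERT INTO customers VALUES (1, 'premium'), (2, 'premium'), (3, 'basic');
    INSERT INTO call_issues VALUES
        (1, '2006-01', 'billing error'),
        (2, '2006-01', 'billing error'),
        (3, '2006-01', 'dropped call'),
        (1, '2006-02', 'dropped call'),
        (3, '2006-02', 'dropped call');
""")


def issues_by_segment(conn):
    """Multi-dimensional view: count tagged issues per customer segment."""
    return conn.execute("""
        SELECT c.segment, i.issue, COUNT(*) AS calls
        FROM call_issues AS i
        JOIN customers AS c ON c.customer_id = i.customer_id
        GROUP BY c.segment, i.issue
        ORDER BY c.segment, i.issue
    """).fetchall()


def monthly_trend(conn, issue):
    """Time-series view: count calls per month for one tagged issue."""
    return conn.execute("""
        SELECT call_month, COUNT(*) AS calls
        FROM call_issues
        WHERE issue = ?
        GROUP BY call_month
        ORDER BY call_month
    """, (issue,)).fetchall()


print(issues_by_segment(conn))
print(monthly_trend(conn, "dropped call"))
```

Both queries are ordinary SQL of the kind any BI tool can generate, which is the practical meaning of analyzing unstructured content alongside structured data in a relational warehouse.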
Ushering in the Third Generation of Unstructured Analytics
We can see that a new approach to unstructured analytics is needed. This new approach is innovative, but fortunately the underlying technologies are proven. The Clarabridge Content Mining Platform is designed to avoid the pitfalls of the first two generations, capitalize on lessons learned, and be successful in the unstructured analysis future. Clarabridge effectively converges the text and business intelligence worlds. Clarabridge takes “ETL” a step further by providing a pre-packaged analytical database as well as connectors to various analytical front ends. Specifically, Clarabridge employs a platform approach to efficiently and effectively source, transform, store, and analyze unstructured data – and do so in a way that is easy to manage, as depicted in the figure below.
Figure 3 – Components of the Clarabridge Content Mining Platform.
• Source. The Clarabridge platform connects to a variety of unstructured and structured source systems and data types, such as file servers, web servers, enterprise content management systems, document management systems, enterprise applications, and many other systems containing structured and unstructured data.
• Transform. Once data is sourced, it must be understood, merged, and integrated with other structured and unstructured content. Clarabridge performs a variety of transformations to unstructured data (e.g., concept extraction, natural language processing, data matching, table extraction, etc.), either out-of-the-box or through integration with other technologies.
• Store. As data volumes explode, the Clarabridge platform responds with a highly scalable architecture that utilizes proven data warehousing techniques and platforms. Further, it uses a pre-defined schema for staging and storing the data extracted from unstructured sources and includes a process to easily integrate transformed data into a structured data mart or warehouse for analysis.
• Analyze. Using Clarabridge, end-users are able to efficiently analyze large volumes of data, using whatever analytical technique or tool they feel is appropriate for the problem at hand. The platform has connectors to a variety of analysis interfaces, such as BI tools, data mining, visualization, and statistical analysis tools.
• Manage. As the application evolves and grows, there is little need for the Information Technology (IT) shop to maintain lots of custom coding or extensions. Further, the architecture scales and integrates into existing efforts.
The Clarabridge Content Mining Platform takes raw content and enables that content to be directly mined using any type of analytical interface. The following sections describe the key capabilities delivered through the Clarabridge Content Mining Platform to accomplish this result.
Source
There are many unstructured data sources, and a primary objective of any business application will be to achieve fast data access to any of those sources, as well as a clear understanding of the original sources of that data as downstream analysis progresses. Examples of unstructured sources include text fields (e.g., BLOB fields) in databases, FTP servers, Web servers, e-mail servers, file servers, document management systems, knowledge management systems, enterprise search tool repositories, or scanned and OCRed paper files. Clarabridge contains Source Connectors that interface with various sources and repositories of unstructured data and manage the process of extracting data from these various sources. Concurrent communication with a wide variety of sources occurs via APIs, web services, or other methods. This allows the sources to be treated as a “black box” by the rest of the process components.
The Content Mining Platform processes the text, data, and metadata returned from the Source Connectors. It then converts the various outputs from the various unstructured source systems into a consistent schema and format. At the same time, the various pieces of metadata that are also extracted from the source systems are assembled into a common metadata format. The Clarabridge Extraction Connectors assign a unique index key to each extracted source document, which allows it to be consistently traced as it moves through the rest of the system. This key, and the associated metadata stored regarding the source location of the text, also provides a mechanism to link back to the original text when desired during the course of analysis.
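The normalization step described above can be pictured with a minimal sketch. It is not Clarabridge's implementation: the `CapturedDocument` record, the `make_doc_key` hashing scheme, and the example locations are all hypothetical names chosen for illustration. The sketch only shows the idea of mapping every connector's output to one record format that carries a unique key and the source location needed to link back to the original text.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class CapturedDocument:
    """Hypothetical common record produced for every source document."""
    doc_key: str       # unique index key used to trace the document
    source_type: str   # e.g. "file", "ftp", "email"
    location: str      # where the original text lives, for linking back
    text: str
    metadata: dict = field(default_factory=dict)


def make_doc_key(source_type: str, location: str) -> str:
    """Derive a stable key from source type and location (an assumed scheme)."""
    digest = hashlib.sha256(f"{source_type}:{location}".encode("utf-8"))
    return digest.hexdigest()[:16]


def capture(source_type: str, location: str, text: str, **metadata) -> CapturedDocument:
    """Normalize one connector result into the common record format."""
    return CapturedDocument(
        doc_key=make_doc_key(source_type, location),
        source_type=source_type,
        location=location,
        text=text.strip(),
        metadata=dict(metadata),
    )


note = capture("file", "/claims/2006/note-17.txt",
               "  Customer reported water damage. ", author="agent42")
email = capture("email", "imap://mail.example.com/inbox/981",
                "Please cancel my policy.")
```

Because the key is derived from the source location in this sketch, capturing the same document twice yields the same key, which is one simple way to keep downstream findings traceable to their origin.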
© 2006 Clarabridge, Inc. All Rights Reserved. All other trademarks and logos are property of their respective owners.
Transform
Once Clarabridge sources the unstructured data, its next function is to extract meaning, thereby transforming the raw content into rich, structured data. This process is termed Semantic Extraction. Next, Clarabridge applies a series of data quality and staging processes to ensure that the data is of high quality and ready for analysis. Accomplishing both requires a layer of transformation components as well as the ability to consolidate all structured and semi-structured data, metadata, and other findings into a single repository.
SEMANTIC EXTRACTION
Many technologies are being developed in industry and academia for semantic extraction of content. Some, for example, specialize in part-of-speech detection, grammatical parsing, and named-entity recognition, where proper names, organizations, and locations are identified, usually in combination with a dictionary or thesaurus. Others specialize in detecting events and times, and still others detect relationships between these elements. Yet others are good at extracting objects from content, such as logs, signatures, or tables. The Clarabridge third generation solution has the flexibility to extract meaning from any content through embedded functionality, through integration with existing commercial or open source technologies (such as the second generation point analysis tools), or through custom-developed components that are plugged in. As technologies become obsolete or irrelevant, it is easy to swap out older transformation components in favor of better ones without requiring new coding or breaking the rest of the application components.
Various Clarabridge Transformation Components provide value-added semantic extraction capabilities, as demonstrated by the components listed below. Note that Clarabridge provides all of the transformation functionality below natively within the application.
However, for specific applications, it may be necessary to leverage the second generation point analysis applications described previously from within the Clarabridge environment. For those situations, Clarabridge has built-in connectors to all of the best-of-breed text processing and transformation components.
• Document segmentation and categorization. Transformations that are applied at the document level. For example, one process groups documents into buckets such as “violent events”, “press release”, or “financial transaction” through statistical analysis of the underlying content. Another process can automatically segment the document to identify headers, sections, or objects within the document, such as signatures or logos, using rules or other semantic approaches.
• Entity extraction. Functionality to determine all people, places, dates, financial amounts, objects, etc. within a portion of text. This involves identifying and categorizing the proper nouns and objects mentioned in the text.
• Event, relationship, & fact extraction. The process of determining how various entities relate to one another. For example, if “Person A took a trip to visit Person B”, the relationship between the two entities is “A visited B.” Relationships can also include attributes or facts about specific entities, such as the number 35 referring to “dollars” in a particular sentence.
• Table data extraction. Takes any type of text table that is readable by a human and converts it into a structured representation of rows, columns, cells, headers, and multipliers that can then be used for further structured analysis.
• Image extraction. Takes any embedded object within a document, such as a logo, photo, or signature, and tries to interpret that object, typically by matching it to a database of possible objects.
Figure 4 – Example transformation workflow
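The entity and relationship extraction described above can be illustrated with a deliberately simple sketch. It assumes hand-written regular expressions for dollar amounts, ISO-style dates, and the single “took a trip to visit” pattern from the example; the pattern names and the `extract` function are hypothetical, and production extractors would rely on dictionaries, grammatical parsing, and statistical models rather than three patterns.

```python
import re

# Illustrative patterns only; real extractors use dictionaries and NLP models.
AMOUNT = re.compile(r"\$(\d+(?:,\d{3})*(?:\.\d+)?)")
DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
VISIT = re.compile(r"\b([A-Z][a-z]+) took a trip to visit ([A-Z][a-z]+)\b")


def extract(text: str) -> dict:
    """Return amounts, dates, and 'visited' relationships found in the text."""
    return {
        # Fact extraction: the number is tied to the unit it refers to.
        "amounts": [
            {"value": float(m.group(1).replace(",", "")), "unit": "dollars"}
            for m in AMOUNT.finditer(text)
        ],
        # Entity extraction: dates written as YYYY-MM-DD.
        "dates": DATE.findall(text),
        # Relationship extraction: (subject, relation, object) triples.
        "relationships": [
            (m.group(1), "visited", m.group(2)) for m in VISIT.finditer(text)
        ],
    }


result = extract(
    "On 2006-03-06 Alice took a trip to visit Bob and paid $35 for parking."
)
```

The output is already structured: a list of typed amounts, a list of dates, and a list of triples that could each become a row in a database table.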
DATA QUALITY AND STAGING
A final but critical transformation step involves ensuring that the data has the proper quality and is enriched for analysis. The following capabilities are provided within the Clarabridge Content Mining Platform, although third party data quality components can also be integrated into the platform:
• Data Matching. Achieves better data quality by applying various cleansing techniques to resolve, for example, different spellings of a single entity across two different documents.
• Data Merging. Combines like data from structured and unstructured data sources, or enriches data by associating additional attributes with it from across documents.
• Dimensions / Hierarchy. Organizes data along relevant dimensions, such as time or geography, to allow data to be “sliced and diced” along a number of dimensions. This may also include the integration of ontologies, dictionaries, taxonomies, or thesauri, which are used to organize data into various hierarchies and relationships for further analysis.
• Confidence Level. Since unstructured data is often imprecise, the ability to understand the confidence level of any finding is critical. Building confidence analysis into the platform allows users not only to see and analyze data within structured analysis tools, but also to calculate a numeric confidence level for each data element or aggregate data calculation. The platform joins many data points that are captured throughout the flow of data to create a weighted, statistically oriented calculation of the confidence that can be assigned to any point of data.
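The paper does not give the formula behind the confidence calculation, so the following is only one plausible sketch under stated assumptions: each processing step reports a confidence between 0 and 1, steps are treated as independent so their confidences multiply, and an aggregate weights each finding's confidence by the size of the value it contributes. The function names are hypothetical.

```python
from math import prod


def finding_confidence(step_confidences):
    """Combine per-step confidences (0..1) for one data element.

    Assumes the steps are independent, so the combined value is their product.
    """
    return prod(step_confidences)


def aggregate_confidence(findings):
    """Weighted mean confidence for an aggregate calculation.

    `findings` is a list of (value, confidence) pairs; each confidence is
    weighted by the absolute value it contributes to the total.
    """
    total_weight = sum(abs(value) for value, _ in findings)
    if total_weight == 0:
        return 0.0
    return sum(abs(value) * conf for value, conf in findings) / total_weight


# One amount passed through OCR (0.9), entity extraction (0.8), matching (1.0).
element = finding_confidence([0.9, 0.8, 1.0])

# A sum over two extracted amounts with different confidences.
total = aggregate_confidence([(100.0, 0.5), (300.0, 0.9)])
```

Under these assumptions the single element carries a confidence of about 0.72, and the aggregate leans toward the confidence of the larger amount.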
TRANSFORMATION WORKFLOW
Clarabridge’s third generation solution includes a transformation workflow engine, which manages the process of taking the collected unstructured data and passing it through one or more transformation components. Transformations are run in a coordinated process, as the results of one or more transformations may serve as input to downstream transformations. Further, Clarabridge provides a common API to a wide variety of custom, open source, or third party unstructured data transformation technologies, so each of those transformations can operate as a “black box” abstracted from the rest of the system. Clarabridge retains complete metadata and links back to the original source data, which allows end users to trace back through the transformations that took place and from there back to the original source of unstructured data.
As an example, assume a regulatory organization wanted to mine SEC filings to identify related party transactions that may indicate fraudulent activity. Using an FTP source connector to Edgar Online, the thousands of pages of publicly available SEC filings could be accessed via the platform. Once loaded into the Capture Schema, those documents could be run through a transformation workflow to extract sections, headers, tables, and related party transaction information from the documents. Each of these transformations requires different technologies and configurations, but data and metadata are managed through the transformation workflow process to allow those components to be accessed without coding or understanding their technical underpinnings. Since the resulting data is in a database, it is easily merged with structured financial data. Using the analysis components described in the Analyze section below, investigators could easily see trends and statistical anomalies, and “slice and dice” companies, industries, and transaction types to root out suspicious behavior.
SEMANTIC DISTILLATION: WHAT IT IS AND WHY IT IS IMPORTANT
For many applications involving unstructured data analysis, it is important to perform a “semantic distillation” of the relevant source information. Essentially, this means that all available information is sourced from the content as efficiently as possible using a number of pre-configured transformations. Although a certain threshold of quality is required, quantity is typically desired over exacting quality for these applications. This makes sense for two reasons.
First, for many applications, finding a precise answer to a question is not as important as finding anomalies over large data sets. Continuing the above example, if an investigator is reviewing SEC filings to detect clues related to poor corporate governance, he would be interested in reviewing all related party transactions identified across those filings. He might be interested in seeing which seem unusual when compared to mean activity for the industry. In this situation, the investigator is not actually looking for a specific example of poor governance. He is looking through large volumes of data to find unusual patterns that could guide further investigation that may reveal poor governance. In other words, he is using the unstructured data to narrow the scope of his analysis and increase his odds of finding an issue. Many unstructured applications follow this same logic.
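The comparison against mean industry activity in the investigator example can be sketched as a simple z-score test. The company labels, the counts, and the threshold of two standard deviations are invented for illustration; the paper does not prescribe a particular statistic.

```python
from statistics import mean, pstdev


def flag_unusual(counts_by_company, threshold=2.0):
    """Return companies whose count lies more than `threshold` standard
    deviations above the group mean (a simple z-score test)."""
    values = list(counts_by_company.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every company looks the same, nothing stands out
    return sorted(
        name for name, count in counts_by_company.items()
        if (count - mu) / sigma > threshold
    )


# Hypothetical counts of related party transactions extracted per company.
industry = {"A": 2, "B": 3, "C": 2, "D": 4, "E": 3, "F": 2, "G": 3, "H": 21}
unusual = flag_unusual(industry)
```

Here only company "H" is flagged. The result is not proof of poor governance; it narrows the set of filings that merit a closer read, which is the point of the example.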
THE ESSENCE OF DATA MINING
• Examine many combinations of parameters/variables; not just “obvious” ones
• Churn through millions of calculations searching for patterns, relationships, anomalies
• Apply multiple algorithms: linear regression, trees, neural networks, graphs
• Present results to user for evaluation; user keeps interesting results, discards the trivial
• Iterate
Second, an analyst typically does not know what he is looking for in unstructured data until he sees it. Further, one analysis can lead to another. In both cases, the analyst requires more data to complete the investigation than was originally contemplated. There are two ways to deal with this situation. One is to build additional “rules” into the various approaches used for data transformation. Unfortunately, those tools require a linguistic understanding, and valuable analysis time is expended while attempting to extract the precise information needed. A better solution is to extract more data up front and utilize the analysis tools to make the increased volume of data manageable. This is the essence of data mining in the structured world, where the goal is to analyze large volumes of data to find interesting information. Unstructured data simply requires the additional step of converting the content into structured data first. Fortunately, data mining tools are very sophisticated at filtering, sorting, and grouping large volumes of data so that it is easy for an end user to manage. This allows the user to pursue an “analytical discovery” process rather than fully defining the problem up front.
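The transformation workflow described earlier in this section, in which the output of one transformation feeds the next and each component is treated as a “black box”, can be sketched as a list of functions applied in order. The step functions, the dictionary layout, and the keyword test below are assumptions made for illustration, not the platform's actual API.

```python
def segment(doc):
    """Split raw text into sections on blank lines."""
    doc["sections"] = [s.strip() for s in doc["text"].split("\n\n") if s.strip()]
    return doc


def tag_related_party(doc):
    """Mark sections mentioning related party transactions (keyword match)."""
    doc["related_party_sections"] = [
        s for s in doc["sections"] if "related party" in s.lower()
    ]
    return doc


def run_workflow(doc, steps):
    """Run each step in order, recording which steps touched the document."""
    doc = dict(doc, history=[])  # work on a copy so the input is left unchanged
    for step in steps:
        doc = step(doc)
        doc["history"].append(step.__name__)
    return doc


filing = {
    "doc_key": "filing-001",
    "text": "Item 1. Business overview.\n\n"
            "Item 13. Related party transactions: a loan to the CEO.",
}
result = run_workflow(filing, [segment, tag_related_party])
```

Swapping in a better segmenter means replacing one function in the list; the workflow runner and the downstream steps do not change, which mirrors the component-swapping argument made above.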
Store
With unstructured data greatly eclipsing structured data in volume, it is essential that any application be built on top of a highly scalable analytical architecture. The third generation Clarabridge Content Mining Platform architecture is designed using lessons learned over the last 15 years in the large, dynamic data warehouse space. When dealing with data volumes ranging from 100 gigabytes to hundreds of terabytes (or more), special approaches need to be used at all phases of the data sourcing, transformation, and loading process.
Clarabridge employs a data Capture Schema (a pre-designed layout of data tables and the relationships between them) that is specifically designed to serve as a repository for data captured by the various source connectors and transformations, and that includes “hooks” back to the underlying source data. The Capture Schema is designed in an application-independent manner so that it can hold any type of unstructured source text without being custom designed for each application. The Clarabridge Analysis Schema provides a data schema that can be used to perform many different types of analysis for a wide variety of applications based on data extracted from unstructured text. It supports commercially available analysis tools for business intelligence, data mining, link analysis, mapping, visualization, reporting, and statistical analysis. Further, Clarabridge is designed in an open manner to support various types of analytical applications. A packaged ETL layer provides a mapping and loading routine to automatically migrate data and metadata from the Capture Schema to the Analysis Schema. This is a general-purpose ETL layer that moves data between the two general-purpose schemas. The ETL layer can be integrated with existing structured data and applications using commercially available ETL tools. The platform can update data as required, through incremental, real-time, streaming, or batch updates. Further, it allows any preferred database management system (DBMS) to be utilized.
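As a rough illustration of moving data from a capture-style table to an analysis-style table, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names, columns, and sample rows are invented for the example and do not reflect the actual Capture Schema or Analysis Schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Capture side: one row per extracted finding, keyed back to its document.
    CREATE TABLE capture_finding (
        doc_key TEXT, entity TEXT, entity_type TEXT, amount REAL, confidence REAL
    );
    -- Analysis side: a small fact table suited to BI-style queries.
    CREATE TABLE analysis_fact (
        entity TEXT, entity_type TEXT, total_amount REAL,
        finding_count INTEGER, avg_confidence REAL
    );
""")

conn.executemany(
    "INSERT INTO capture_finding VALUES (?, ?, ?, ?, ?)",
    [
        ("doc-1", "Acme Corp", "organization", 100.0, 0.9),
        ("doc-2", "Acme Corp", "organization", 50.0, 0.7),
        ("doc-2", "Jane Doe", "person", 35.0, 0.8),
    ],
)

# The "ETL" step: aggregate captured findings into the analysis table.
conn.execute("""
    INSERT INTO analysis_fact
    SELECT entity, entity_type, SUM(amount), COUNT(*), AVG(confidence)
    FROM capture_finding
    GROUP BY entity, entity_type
""")

facts = conn.execute(
    "SELECT entity, total_amount, finding_count FROM analysis_fact ORDER BY entity"
).fetchall()
```

Because `doc_key` stays in the capture table, any aggregate in the analysis table can still be traced back to the findings, and from there to the documents, that produced it.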
TYPES OF ANALYSIS TOOLS
• Business Intelligence (BI) Tools. Technologies to enable raw data to be transformed into valuable information for analysis and decision making. Features usually include dashboards, reports, ad-hoc analysis, and OLAP analysis.
• Data Mining Tools. Tools used for pattern detection, anomaly detection, and data prediction against large sets of numerical data.
• Data Visualization, Link Analysis, and Mapping Tools. Tools used for visually describing, presenting, and analyzing data such as connections between various people, events, and places.
• Statistical Analysis Tools. Tools useful for the collection, analysis, interpretation and presentation of masses of numerical data. Statistical calculations, such as linear regressions, variances, and means are typically applied to the data.
• Search Tools. Technology used to rapidly query large volumes of data, usually unstructured, and quickly return relevant documents.
Analyze
The Clarabridge third generation solution enables users to rapidly apply proven analysis techniques and tools to unstructured data, effectively asking “Any Question”, utilizing “Any Analysis Technique” against “Any Data Source.” To accomplish this, it lets structured analysis tools connect via connectors and analyze the transformed data, allowing structured analysis of initially unstructured data. Analysis tools can access the Analysis Schema using a standard web services approach, so that structured analysis tools can analyze the results of transformations applied to unstructured data. It also allows data to be joined to other existing structured data that may, for example, reside in a data warehouse. By allowing structured and unstructured data to be analyzed together, the platform makes possible new insights and findings that could not be obtained from structured data alone.
The Clarabridge Content Mining Platform enables various structured data analysis tools to analyze the data present in the Analysis Schema. Analysis tools may include search, visualization, BI, data mining, mapping, link analysis, and statistical analysis technologies. A key principle is that a user should not have to select the tool based on the type of data he or she is analyzing. Rather, he or she should be free to select the tool based on the specifics of the problem or task at hand.
Clarabridge has the capability to pre-populate the metadata of the analysis tool utilizing the tables, columns, attributes, facts, and metrics in the Analysis Schema. Further, report templates for leading BI tools are available out of the box for solution areas such as CRM and Investigation. This allows users to immediately begin analyzing the data present in the Analysis Schema without performing tool customization or any application-specific setup. Users are now free to apply proven analytical approaches that were previously impossible to perform against unstructured data, such as:
• Multi-dimensional analysis. Slicing or filtering data according to various dimensions, such as time or location
• Time-series analysis. Tracking how things have changed over time or determining the evolution of concepts
• Ranking analysis. Focusing on the most critical items by ranking the top-10, bottom-10, etc.
• Market-basket analysis. Identifying what types of things typically are found with others or finding unexpected relationships between people, places, or objects
• Anomaly analysis. Determining what events are unusual when compared with others or what items unexpectedly disappear
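Three of the approaches listed above can be shown on a handful of rows once text has been reduced to structured records. The rows, dimension names, and helper functions below are hypothetical; a BI tool would run equivalent queries against the analysis schema rather than Python lists.

```python
from collections import Counter

# Hypothetical rows from an analysis schema: (year, region, concept).
rows = [
    (2005, "East", "billing complaint"),
    (2005, "West", "billing complaint"),
    (2005, "East", "cancellation"),
    (2006, "East", "billing complaint"),
    (2006, "East", "billing complaint"),
    (2006, "West", "outage"),
]


def slice_rows(rows, year=None, region=None):
    """Multi-dimensional analysis: keep rows matching the given dimensions."""
    return [
        r for r in rows
        if (year is None or r[0] == year) and (region is None or r[1] == region)
    ]


def top_concepts(rows, n=10):
    """Ranking analysis: the n most frequent concepts."""
    return Counter(r[2] for r in rows).most_common(n)


def concept_by_year(rows, concept):
    """Time-series analysis: how often a concept appears in each year."""
    return dict(sorted(Counter(r[0] for r in rows if r[2] == concept).items()))


east_top = top_concepts(slice_rows(rows, region="East"), n=1)
trend = concept_by_year(rows, "billing complaint")
```

The same three functions work regardless of whether the rows came from call center notes, emails, or a structured system, which is the tool-independence principle stated above.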
Clarabridge provides the ability to drill through to the original unstructured source document. This allows analysts to completely understand the genesis of any result they see in the structured analysis tool, to know exactly where the data came from and how it was calculated, and to drill all the way back to the original document or documents to confirm and validate any element of the resulting structured analysis. And since confidence information is propagated through the system as described previously, analysts understand how reliable certain metrics are. This provides quality-level context for analyzing data generated by Clarabridge.
Manage
To achieve the lowest total cost of ownership, the Content Mining Platform is easy to manage. Some of the keys to manageability include:
• Eliminating custom coding and application customization. As you add more sources, Clarabridge can be easily (or automatically) reconfigured to accommodate those sources.
• Open platform. To ensure extensibility, the Clarabridge platform is open and standards based, using a service-oriented architecture (SOA), and is built with modern J2EE technology. It also leverages emerging standards, such as the IBM Unstructured Information Management Architecture (UIMA).
• Scalable architecture. The platform is designed in a multi-threaded, grid-friendly distributed manner to allow for the parallel processing of extremely large amounts of data through the system on a continuous real-time high throughput basis.
• Scheduling capabilities. Applications can be executed without human intervention.
• Intuitive Clarabridge Analyst interface. Clarabridge leverages a patent-pending workflow paradigm to integrate the various transformation approaches, utilizing a graphical user interface that requires minimal coding or understanding of the underlying transformation technologies for a system administrator. This means it is straightforward to modify an application over time.
In addition to the above, the Clarabridge Content Mining Platform supports data lineage – or the analytical “path” that led to a certain conclusion. Data lineage is implemented by capturing, in the solution, all the necessary information required to help an analyst understand the sources of data presented in an analysis, the transformations performed on an original data element (such as entity extraction, data quality processes, matching processes), the confidence factor of transformations, and the dates and times of all processing steps. This information can be critical to the proper understanding of the information being analyzed, and can help analysts “trace” reporting insights back to the original systems of record. The platform contains metadata, or “data about the data” permitting the presentation of data lineage insights for all data that is processed. Because the metadata is built in, implementing data lineage functionality does not require any special application development or configuration – it is a natural byproduct of the application creation process.
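The kind of information data lineage captures (source, transformations applied, confidence factors, and processing times) can be pictured as a small record attached to each data element. The `LineageStep` and `DataElement` classes, the step names, and the example location are hypothetical and serve only to make the idea concrete.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageStep:
    name: str               # e.g. "entity_extraction"
    confidence: float       # confidence factor reported by the step
    processed_at: datetime  # when the step ran


@dataclass
class DataElement:
    value: str
    source_location: str    # where the original document lives
    steps: list = field(default_factory=list)

    def record(self, name, confidence, when=None):
        """Append one processing step to this element's lineage."""
        when = when or datetime.now(timezone.utc)
        self.steps.append(LineageStep(name, confidence, when))
        return self

    def trace(self):
        """Return the path from source to result as readable strings."""
        path = [f"source: {self.source_location}"]
        path += [f"{s.name} (confidence {s.confidence:.2f})" for s in self.steps]
        return path


element = DataElement("Acme Corp", "ftp://filings.example.com/10k/acme.txt")
element.record("entity_extraction", 0.85).record("data_matching", 0.95)
lineage = element.trace()
```

Because each step is recorded as it runs, the lineage is a byproduct of processing rather than something an application developer has to build separately.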
Benefits of This Approach
The benefits of the above approach are many and include broader analysis capabilities, faster ROI, and rapid time-to-value.
Broader Analysis Capabilities
By having access to all available data in an analytical framework, users spend more time analyzing and less time retrieving and piecing together fragments of unstructured information. They are free to apply proven analysis approaches, such as anomaly analysis, to virtually unlimited data to detect trends, issues, and opportunities revealed by their unstructured data. Further, users enjoy a holistic view of their information assets to enable analytical discovery across multiple problem domains, data types, and source systems. This enables enterprises to create entirely new business applications to better serve customers, control cost and risk, compete effectively, and drive profitability, as demonstrated by the following simple examples:
Industry: Insurance, Application: Claims Fraud Detection
“What types of claims text, descriptions, comments, incidents, damage reports, scenarios, etc. indicate potential anomalous or fraudulent claims that may merit further investigation?”
Data sources: Call center notes, claims historical archives, claims case files, structured claims and cost data
Industry: Insurance/Telecom/Financial, Application: Customer Retention
“How do particular patterns of communication over time between my company and its customers lead to either the retention or loss of customers?”
Data sources: Call center notes, emails, claims case files, structured claims and customer service data
Industry: Healthcare, Application: Drug Efficacy and Side Effects
“What symptoms disappear over time most often for patients prescribed Prozac for more than 24 months? What new conditions or symptoms tend to appear most often with long-term use?”
Data sources: Doctor visit notes, hospital visit notes, claims history
Finally, managing the metadata about the sourced and transformed data allows analysts to completely understand the genesis of any result they see in the structured analysis tool. They know exactly where the data came from and how it was calculated, and they can drill all the way back to the original document or documents to confirm and validate any element of the resulting structured analysis.
Faster ROI
Organizations have made significant investments in business intelligence and data warehousing solutions. Many have implemented multiple tools across their various divisions and functional areas. It makes little sense to deploy another analytical tool into that environment. Those same organizations have spent hundreds of thousands, if not millions, acquiring and training the staff necessary to run and utilize those tools. By leveraging a Content Mining Platform, organizations are able to rapidly extend the value of existing investments and their trained staff rather than deploying new tools and analysis approaches. The ability to extend the solution over time and swap various components in and out without any custom coding provides much simpler ongoing administration and ultimately lower total cost of ownership. Further, the scalable and open architecture will support evolving needs and growing data volumes. Finally, source connectors to a wide variety of source systems deliver fast, standard data access when creating a new application. This becomes increasingly important as the number of data sources inevitably expands, making it ever more difficult to maintain point extraction code.
Rapid Time-to-Value
Rather than “re-invent the wheel”, enterprises are able to rapidly integrate and leverage leading structured tools, including statistical analysis, reporting, visualization, search, and data mining tools. Further, best-of-breed text processing technologies can be rapidly incorporated into the Content Mining Platform. This is important because the best unstructured and structured tools vary from application to application and evolve over time. The platform requires less up-front work when developing an application for a number of reasons. First, the various components are treated as a “black box”, which makes them easy to leverage without understanding their inherent complexity. Second, a great deal of the difficulty in creating a BI application is related to the ETL process and generating the target analytical data model as described above. By providing a universal capture schema, a pre-defined ETL process, and an analytical schema, the platform eliminates much of this effort every time a new application is created. Third, since reporting metadata is automatically generated, users can immediately begin analyzing the data without performing tool customization or any application-specific setup. Finally, integrating multiple transformation approaches using a configurable workflow delivers greater data quality and gets end users quickly using unstructured content to address real-world business problems, rather than spending months trying to perfectly design a solution up front.
VALUE FROM TEXT
• Financial Services. Leverage customer interactions to optimize product features
• Healthcare. Leverage clinical notes to improve disease management
• Insurance. Leverage claims text to detect fraudulent activities
• Manufacturing. Leverage warranty claims text to uncover liabilities and trends
• Public Sector. Leverage public filings to expose suspicious activity
• Retail. Leverage customer return information to change product mix
• Telecommunications. Leverage call center notes to reduce customer churn
Appendix A. Suggested Evaluation Checklist
The decision to invest in technology to support a business intelligence application is an important one for any organization, and this is especially true for applications that involve unstructured data. As you prepare a request for proposal (RFP) for your unstructured data analysis needs, some suggested criteria to include in your evaluation are listed below.
Source Criteria
√ Connect to a broad variety of data sources (FTP servers, Web servers, email servers, file servers, scanned and OCRed paper files) without custom coding
√ Manage the process of extracting data from various sources
√ Enable concurrent communication with source systems
√ Integrate with existing commercial technologies to extract meaning from text
√ Allow for custom-developed technologies to extract meaning from text
√ Abstract underlying sourcing technologies as a “black box”
√ Capture text, data, and metadata that are generated by sourcing and extraction services
√ Assemble common metadata format across all sources and transformations
√ Enable link back to original source document
Transform Criteria
√ Allow unstructured data to be captured and passed through one or more embedded, custom, open-source, or commercial transformation components
√ Provide coordinated workflow process, so the results of one or more transformations may serve as an input to downstream transformations
√ Provide integration to wide variety of data transformation technologies
√ Capture results of transformation in a database
√ Retain complete metadata and links back to the original source data
√ Allow analyst to drill into the genesis of an analytical result or metric
√ Enable semantic distillation of unstructured data
Store Criteria
√ Ensure architecture is highly scalable to 100s of terabytes or more of data
√ Utilize capture schema with hooks back to underlying source data
√ Ensure capture schema is application-independent and does not require custom design for every new application
√ Utilize analysis schema that supports commercially available analysis such as BI, data mining, link analysis, mapping, visualization, reporting, and statistical analysis
√ Enable access to analysis schema via other applications (i.e., open)
√ Provide ETL to move data automatically from capture schema to analysis schema
√ Allow data to be updated as required, including incremental, real-time, and batch updates
√ Support any commercially available DBMS
Analyze Criteria
√ Provide a web services framework for structured analysis tools to access and analyze data contained in the analysis schema
√ Allow data to be easily joined with existing structured data sources
√ Provide direct access via a variety of commercially available analysis tools, such as BI, data mining, link analysis, mapping, visualization, reporting, and statistical analysis tools, without custom coding
√ Automatically pre-populate the metadata of the analysis tool utilizing the tables, columns, attributes, facts, and metrics contained in the analysis schema
√ Enable application of proven analysis techniques, such as multi-dimensional analysis, time-series analysis, ranking analysis, market-basket analysis, and anomaly analysis
√ Provide the ability to drill through to the original unstructured source document
√ Join together many data points that are captured throughout the flow of data to create a weighted, statistically oriented calculation of the confidence that can be assigned to any point of data
Manage Criteria
√ Enable reconfiguring of platform as new data sources or transformations are added without custom coding
√ Support open standards, including a service-oriented architecture
√ Support multi-threaded, grid-friendly, distributed computing architecture
√ Enable scheduling of applications
√ Support metadata management throughout analytical lifecycle
√ Provide user-friendly administrative interface
Appendix B: Technical Glossary
• Anaphora Resolution. Anaphora resolution refers to linking pronouns such as “his”, “her”, “their”, and “it” to the correct people, places, or things mentioned earlier in a piece of text.
• Business Intelligence (BI) Tools. Technologies to enable raw data to be transformed into valuable information for analysis and decision making. Features usually include dashboards, reports, ad-hoc analysis, and OLAP analysis. Tools include Cognos, MicroStrategy, Business Objects, Actuate, and Pentaho.
• Data Lineage. The analytical “path” that led to a certain conclusion. This information can be critical to the proper understanding of information being analyzed. Also refers to the path that a certain data element or value took all the way from source(s), through various transformations, to the resulting analysis.
• Data Mining Tools. Tools used for pattern detection, anomaly detection, and data prediction against large sets of numerical data. Example tool vendors in this area include Angoss, IBM Intelligent Miner, and SAS.
• Concept extraction. Also known as topic extraction, concept extraction involves understanding the underlying concept that a document or section of a document is describing. Techniques such as automatically applying categorizations can be used for concept extraction.
• Categorization and topic extraction. This is the process of grouping various documents or entities within documents into various buckets. Typically this is done with a dictionary, thesaurus, ontology, or taxonomy.
• Data Visualization and Mapping Tools. Tools used for visually describing, presenting, and analyzing data. For example, a link analysis tool, such as I2, would be used to visually show the connections between various people, events, and places. Mapping tools such as ESRI and MapInfo show spatial interrelations between data on maps.
• Data Warehouse (DW). A data warehouse is a database that contains a record of an organization’s past transactional and operational information, designed for efficient data analysis and reporting.
• Entity extraction. Determining all people, places, objects, etc. within a document. Tools include Inxight, Aerotext, Lingpipe, GATE, and NetOwl.
• Extract, Transform, Load (ETL). The process whereby structured data is sourced from multiple data repositories; transformed to allow it to be cleansed, merged with other data, or manipulated for analytical purposes; and loaded into a data warehouse for analysis.
• Natural Language Processing (NLP). The understanding and manipulation of natural language to enable computers to "understand" the meaning of written human languages.
• Online Analytical Processing (OLAP). Sometimes called dimensional analysis, OLAP is the process of slicing data by various dimensions (time, dollars, product line, etc.) to see summary and detailed data for decision making.
• Ontology. An understanding of the “fundamental categories of things in the world.” For analysis purposes, ontologies can help to group various entities to better understand the context of an entity within a broader domain and how it relates to other entities. They can also be used to filter out certain types of entities that are undesired for a particular analysis.
• Relationship extraction. Also called event extraction, transaction extraction, and fact extraction, relationship extraction is the process of determining how various entities relate to one another. For example, if “Person A took a trip to visit Person B”, the relationship between the two entities is “A visited B.” Relationships can also include attributes or facts about specific entities, such as the fact that the number 35 refers to “dollars” in a particular sentence. Tools include Attensity and Clearforest.
• Semantic Distillation. Essentially, this means that all available information is sourced from the content as efficiently as possible using a number of pre-configured transformations. Although a certain threshold of quality is required, quantity is typically desired over exacting quality for applications using semantic distillation.
• Statistical Analysis Tools. Tools useful for the collection, analysis, interpretation, and presentation of masses of numerical data. Statistical calculations, such as linear regressions, variances, and means, are typically applied to the data using tools such as SAS or SPSS to test various hypotheses about those data.
• Structured Data. Content which has structure that is easily interpreted by a machine, commonly in a database or XML format.
• Taxonomy. A hierarchy of “things.” For example, a geographical taxonomy may include relationships between a city, state, and country. This is useful for “drilling into” data from a higher level down to lower levels of detail.
• Text analysis application. An application used for extracting meaning from content. Examples include part-of-speech detection, grammatical parsing, and named-entity recognition. There are many of these applications, and each is good at particular functions within particular information domains. However, when used independently, they are not useful for analysis.
• Text processing. See text analysis application.
• Unstructured Data. Content which does not have a structure that is easily interpreted by a machine. Examples of unstructured data may include audio, video, and unstructured text such as the body of an email or word processor document.
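To make a few of these terms concrete, the following minimal Python sketch shows how dictionary-based categorization, a geographical taxonomy, and OLAP-style slicing can fit together. It is an illustration only and does not represent the Clarabridge software: the dictionary `CATEGORY_TERMS`, the taxonomy `PARENT`, the sample notes, and all function names are assumptions invented for this example.

```python
from collections import Counter

# Hypothetical dictionary mapping categories to keywords (illustrative only).
CATEGORY_TERMS = {
    "billing": {"invoice", "charge", "refund"},
    "service": {"outage", "technician", "wait"},
}

# Hypothetical geographical taxonomy: each place points to its parent level.
PARENT = {
    "Reston": "Virginia",
    "Richmond": "Virginia",
    "Austin": "Texas",
    "Virginia": "USA",
    "Texas": "USA",
}


def categorize(text):
    """Return the sorted categories whose dictionary terms appear in the text."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return sorted(c for c, terms in CATEGORY_TERMS.items() if words & terms)


def roll_up(place, levels=1):
    """Walk up the taxonomy by the given number of levels."""
    for _ in range(levels):
        place = PARENT.get(place, place)
    return place


def slice_counts(notes, levels):
    """Count notes per (region, category) pair, a summary by two dimensions."""
    counts = Counter()
    for city, text in notes:
        region = roll_up(city, levels)
        for category in categorize(text):
            counts[(region, category)] += 1
    return counts


# Made-up call center notes tagged with the city they came from.
NOTES = [
    ("Reston", "Customer disputed a charge and asked for a refund."),
    ("Richmond", "Long wait for a technician after the outage."),
    ("Austin", "Second invoice sent in error."),
]

by_state = slice_counts(NOTES, levels=1)    # city rolled up to state
by_country = slice_counts(NOTES, levels=2)  # city rolled up to country
```

Moving from `by_country` to `by_state` corresponds to “drilling into” the data: the same categorized notes are summarized at a lower level of the taxonomy.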
Founded in 2005 by leading experts in the Business Intelligence (BI) industry and backed by a premier venture capital investment partner, Clarabridge is an emerging leader in helping private and public sector enterprises leverage unstructured content to provide critical operational and strategic business insight. Unlike traditional approaches that are inflexible, expensive, and time consuming, Clarabridge’s patent-pending software uniquely combines the best of the structured and unstructured analysis worlds, allowing enterprises to greatly extend the value of their existing BI investments. Clarabridge is the only enterprise-class solution that rapidly enables users to directly mine text alongside existing structured data, using standard BI tools and analysis techniques, to address a host of real-world business needs.
11400 Commerce Park Drive, Suite 500 Reston, VA 20191 P: 703.663.2500 │ F: 703.269.1505 www.clarabridge.com