Data Mining by jianghongl

									Database Modeling and

 Chapter 8 (Part D)

          Data Mining Basics
    Instructor: Paul Chen
1.    How Data Mining Evolved?
2.    Decision Processing Overview and Tasks
3.    Data Mining, What’s it?
4.    Data Mining vs. Data Warehousing
5.    How Data Mining Works? And Its Applications
6.    Data Mining Operations and Associated Techniques
7.    The Data Mining Process
8.    Data Mining Tools
9.    Data Mining Applications For CRM
10.   Data Mining From Government Printing Office
11.   Data Mining Techniques- A Summary
     Topic 1: How Data Mining Evolved?

Many businesses have invested heavily in information
technology to help them manage their businesses more
effectively and gain a competitive edge. Increasingly large
amounts of critical business data are being stored
electronically and this volume is expected to continue to
grow. Data mining technology helps companies
leverage their existing data more effectively and obtain
insightful information that gives them a competitive edge.
     How Data Mining Evolved?

  Time line:
  1960s: Data Collection  ->  RDBMS  ->  1990s: OLAP and DW  ->  Late 1990s to now: Data Mining
      Topic 2: Decision Processing Overview and Tasks
   Decision processing systems, and their underlying
    analytical applications, provide business users with the
    information they need to track and analyze business
    trends, and to explore new business opportunities. As
    businesses become increasingly competitive and
    complex, effective decision processing systems are
    essential for success.
      The Next Generation of Business

   A decision processing system analyzes business
    information captured from operational systems (back-office,
    front-office, and e-business applications).
   Distribution of business information to business users is
    via corporate intranets and extranets.
   The flow of data can be thought of as an information
    supply chain whose objective is to convert operational
    data into useful business information.
        The Decision Processing Information Supply Chain

[Figure: information supply chain. Labels recovered from the diagram:
E-Business, Back-Office and Front-Office Transaction Applications,
External Data, DW, Information, Business Intelligence Tools,
Business Analytic Applications & Office Systems, Decisions.]
       Decision Processing—Four Tasks***

   Extracting and transforming information
    This involves capturing data from operational systems,
    transforming it into business information, and loading
    it into a data warehouse information store.

    Current extract templates on the market are aimed primarily at
    capturing data from ERP (Enterprise Resource Planning)
    transaction processing systems (for example, SAP Business
    Information Warehouse and PeopleSoft BPM data warehouse).
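
As a rough illustration of the extract, transform, and load task described above, the following sketch uses only Python's standard library. The order records, the field names, and the `sales_fact` table are hypothetical stand-ins for an operational system and a warehouse information store; they are not taken from any product named here.

```python
import sqlite3

# Hypothetical rows as they might arrive from an operational system.
operational_rows = [
    {"order_id": 1, "region": " north ", "amount": "120.50"},
    {"order_id": 2, "region": "SOUTH", "amount": "80.00"},
    {"order_id": 3, "region": "North", "amount": "19.50"},
]

def transform(row):
    """Clean one operational record into a consistent business record."""
    return (row["order_id"], row["region"].strip().title(), float(row["amount"]))

def load(rows):
    """Load transformed rows into an in-memory stand-in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_fact (order_id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

warehouse = load([transform(r) for r in operational_rows])

# Once loaded, the store can answer simple business questions.
totals = dict(warehouse.execute(
    "SELECT region, SUM(amount) FROM sales_fact GROUP BY region"))
print(totals)
```

The transform step is where inconsistent codes and formats are reconciled, which is why the region names are normalized before loading.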

    *** Mentioned in chapter 2
      Decision Processing—Four Tasks

Managing information

This task encompasses the maintenance of business information in
information stores, and how these information stores are processed by
business intelligence tools and analytic applications.
The cornerstone of decision processing is data warehousing, and
warehouse information stores should be organized and modeled into
relational and multidimensional database products.
      Decision Processing—Four Tasks

   Analyzing and modeling information
    The traditional approach to decision processing is to
    build a data warehouse and supply business users
    with a set of business intelligence tools (query,
    reporting, OLAP and data mining, for example) to
    process information in data warehouse information stores.
    A better approach is to employ turn-key, web-based
    analytic application packages that are
    designed to provide comprehensive analyses for the
    business area being researched. Key business
    metrics (e.g., revenue dollars per sales rep per day)
    are useful.
      Decision Processing—Four Tasks

   Distributing information

Business intelligence tools and analytic applications distribute information
and the results of analysis operations to business users via standard graphical
and Web interfaces.
To help users uncover and organize this range of business information, an
enterprise information portal (EIP) is required. An EIP provides a single
point of entry to any piece of business information, no matter where it resides.
The main components of an EIP are an information assistant (Web browser
interface), an information directory, and a subscription facility.
      Decision Making Under Risk

   Decisions are made under three sets of conditions:
      Certainty
         The decision makers know everything in advance
          of making the decision
      Uncertainty
         The decision makers know nothing about the
          probabilities or the consequences of decisions
      Risk
      Decision-Making Style

   Decision-making styles of users are categorized as
      Analytic or
      Heuristic
       Analytic and Heuristic Decision
   Analytical Decision Maker          Heuristic Decision Maker

   Learns by analyzing                Learns by acting
   Uses step-by-step procedure        Uses trial and error
   Values quantitative                Values experiences
    information and models
   Builds mathematical models         Relies on common sense
    and algorithms
   Seeks optimal solution             Seeks completely satisfying
                                       solution
      Topic 3: Data Mining, What’s it?

   Data Mining has been defined as “a decision support
    process in which a search is made for patterns of
    information in data”. To detect patterns in data, Data
    Mining uses sophisticated statistical analysis and modeling
    technologies to uncover useful relationships hidden in
    databases. It predicts future trends and behaviors,
    allowing businesses to make predictive, knowledge-driven
    decisions.
      Data Mining, What’s it?

   The process of extracting valid, previously unknown,
    comprehensible, and actionable information from large
    databases and using it to make crucial business
    decisions (Simoudis, 1996).

   Involves analysis of data and use of software techniques
    for finding hidden and unexpected patterns and
    relationships in sets of data.
      Data Mining, What’s it?

   Reveals information that is hidden and unexpected, as
    there is little value in finding patterns and relationships
    that are already intuitive.
   Patterns and relationships are identified by examining
    the underlying rules and features in the data.
   Tends to work from the data up; accurate results and
    reliable conclusions normally require large volumes
    of data.
      Data Mining, What’s it?

   Starts by developing an optimal representation of
    structure of sample data, during which time knowledge
    is acquired and extended to larger sets of data.

   Data mining can provide huge paybacks for companies
    that have made a significant investment in data
    warehousing.

   Relatively new technology, but already used in a
    number of industries.
      Topic 4: Data Mining vs. Data Warehousing
   Data Mining does not require that a Data Warehouse be
    built. Often, data can be downloaded from the operational
    files to flat files that contain the data ready for the data
    mining analysis.

   Data Mining can be implemented rapidly on existing
    software and hardware platforms. Data Mining tools can
    analyze massive databases to deliver answers to questions
    such as, “Which customers are most likely to respond to
    my next promotional mailing, and why?”
       Data Mining vs. Data Warehousing
   A major challenge in exploiting data mining is identifying suitable data
    to mine.

   Data mining requires a single, separate, clean, integrated, and
    self-consistent source of data.

   A data warehouse is well equipped for providing data for mining.

   Data quality and consistency are prerequisites for mining to
    ensure the accuracy of the predictive models. Data warehouses are
    populated with clean, consistent data.
       Data Mining vs. Data Warehousing
   Advantageous to mine data from multiple sources to discover as
    many interrelationships as possible. Data warehouses contain data
    from a number of sources.

   Selecting relevant subsets of records and fields for data mining
    requires query capabilities of the data warehouse.

   Results of a data mining study are useful if there is some way to
    further investigate the uncovered patterns. Data warehouses
    provide capability to go back to the data source.
      Topic 5: How Data Mining Works? And Its Applications
   How exactly is Data Mining able to tell you important
    things that you didn’t know, or what is going to happen
    next? The technique used in Data Mining is called predictive
    modeling, which in a broad sense is a process of knowledge
    discovery via relationships and patterns.

   Modeling is the act of building a model in one situation
    where you know the answer and then applying it to another
    situation that you don’t.
      Examples of Applications of Data
      Mining via relationships and patterns

   Retail / Marketing
      Identifying buying patterns of customers
      Finding associations among customer demographic characteristics
      Predicting response to mailing campaigns
      Market basket analysis
      Examples of Applications of Data
      Mining via relationships and patterns
   Banking
      Detecting patterns of fraudulent credit card use
      Identifying loyal customers
      Predicting customers likely to change their credit
       card affiliation
      Determining credit card spending by customer
      Examples of Applications of Data
      Mining via relationships and patterns
   Insurance
      Claims analysis
      Predicting which customers will buy new policies.

   Medicine
     Characterizing patient behaviour to predict surgery
     Identifying successful medical therapies for
      different illnesses.
      Examples of Applications of Data
      Mining via relationships and patterns
   Customer profiling: characteristics of good customers are
    identified with the goals of predicting who will become
    one and helping marketers target new prospects.

   Targeting specific marketing promotions to existing and
    potential customers offers similar benefits.

   Market-basket analysis: With Data Mining, companies can
    determine which products to stock in which stores, and
    even how to place them within a store.
      Examples of Applications of Data
      Mining via relationships and patterns
   Customer relationship management: by determining the
    characteristics of customers who are likely to leave for a
    competitor, a company can take action to retain those
    customers, because doing so is usually far less expensive
    than acquiring a new customer.

   Fraud detection: with Data Mining, companies can
    identify potentially fraudulent transactions before they
    happen.
        Topic 6: Data Mining Operations
        and Associated Techniques

In previous foils, predictive modeling in essence includes
other operations shown in the above table.
  Descriptive: The dealer sold 200 cars last month.
               (Operational, OLTP)

  Explanatory: For every 1% increase in the interest rate,
               auto sales decrease by 5%.
               (Traditional DW)

  Predictive:  Predictions about future buyer behavior.
               (Data Mining)
Level of Modeling vs. Level of Analytical Processing

   Descriptive
      & REPORTS
      Normalized tables

   Explanatory
      PROCESSING: analyze what has previously occurred to
      bring about the current state of the data
      Denormalized tables
      Roll-up; drill down

   Predictive
      Determine if any patterns exist by reviewing data relationships
      + Statistical analysis / artificial intelligence
      Classification & value prediction
      Predictive Modelling

   Similar to the human learning experience
      uses observations to form a model of the important
       characteristics of some phenomenon.

   Uses generalizations of ‘real world’ and ability to fit
    new data into a general framework.

   Can analyze a database to determine essential
    characteristics (model) about the data set.
      Predictive Modelling

   Model is developed using a supervised learning
    approach, which has two phases: training and testing.

     Training builds a model using a large sample of
      historical data called a training set.
     Testing involves trying out the model on new,
      previously unseen data to determine its accuracy
      and physical performance characteristics.
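
The two phases can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the historical records, the single attribute `years_renting`, and the simple threshold model are all hypothetical choices made to keep the example small.

```python
# Hypothetical historical records: (years_renting, bought_property)
history = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (1, 0), (7, 1), (2, 0), (8, 1)]

# Hold out the last records as previously unseen test data.
training_set, test_set = history[:7], history[7:]

def train(records):
    """Training: pick the threshold that best separates the training labels."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({years for years, _ in records}):
        correct = sum((years >= threshold) == bool(label) for years, label in records)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def predict(threshold, years):
    """Apply the model: predict a purchase when years_renting reaches the threshold."""
    return int(years >= threshold)

def accuracy(threshold, records):
    """Testing: fraction of records the model classifies correctly."""
    hits = sum(predict(threshold, years) == label for years, label in records)
    return hits / len(records)

model = train(training_set)
print("threshold:", model, "test accuracy:", accuracy(model, test_set))
```

Accuracy is measured only on the held-out records, because accuracy on the training set says little about how the model will behave on new data.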
      Predictive Modelling

   Applications of predictive modelling include customer
    retention management, credit approval, cross selling,
    and direct marketing.

   Two techniques associated with predictive modelling:
    A. classification
    B. value prediction, distinguished by nature of the
       variable being predicted.
      Statistical analysis of actual sales (dollars
      and quantities) relative to these signage
      variables: a predictive modeling example.
   Content
   Frequency
   Depth
   Focus
   Depth
   Scale
   Length
   Location

Statistical Analysis : Correlation, Regression, Experiment Design,
Optimization. Now it goes into real time analysis.

   There are two techniques associated with predictive
    modeling: classification and value prediction, which are
    distinguished by the nature of the variable being
    predicted.
      Predictive Modelling - Classification

   Used to establish a specific predetermined class for
    each record in a database from a finite set of possible
    class values.

   Two specializations of classification: tree induction and
    neural induction.
Example of Classification using
Tree Induction
     Customer renting property > 2 years?
        No:  Rent property
        Yes: Customer age > 45?
                No:  Rent property
                Yes: Buy property
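
Once induced, a tree like the one above is just a sequence of nested tests. A minimal sketch in Python, where the function and argument names are illustrative assumptions rather than part of any tool:

```python
def classify_customer(years_renting, age):
    """Walk the tree: renting more than 2 years? If so, older than 45?"""
    if years_renting <= 2:
        return "Rent property"
    if age <= 45:
        return "Rent property"
    return "Buy property"

print(classify_customer(years_renting=5, age=50))
print(classify_customer(years_renting=1, age=60))
```

Tree induction is the step that discovers which attributes and split values to test; the code shows only how the finished tree classifies a record.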
Example of Classification using
Neural Induction
   Each processing unit (circle) in one layer is connected
    to each processing unit in the next layer by a weighted
    value, expressing the strength of the relationship. The
    network attempts to mirror the way the human brain
    works in recognizing patterns by arithmetically
    combining all the variables with a given data point.

   In this way, it is possible to develop nonlinear
    predictive models that ‘learn’ by studying
    combinations of variables and how different
    combinations of variables affect different data sets.
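
The weighted connections between layers can be sketched as follows. This shows only the forward pass of a tiny network, not the learning step that adjusts the weights; the layer sizes, weights, biases, and the sigmoid activation are hypothetical choices made for illustration.

```python
import math

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1), making the model nonlinear."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, biases):
    """Each unit combines all inputs through its own weights, then applies sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
            for unit_weights, b in zip(weights, biases)]

# Hypothetical weights for a network with 2 inputs, 2 hidden units, and 1 output.
hidden_weights = [[0.5, -0.4], [0.3, 0.8]]
hidden_biases = [0.0, -0.1]
output_weights = [[1.2, -0.7]]
output_biases = [0.05]

def forward(inputs):
    """Pass one record through the hidden layer and then the output layer."""
    hidden = layer_output(inputs, hidden_weights, hidden_biases)
    return layer_output(hidden, output_weights, output_biases)[0]

score = forward([1.0, 0.5])
print(round(score, 3))
```

Training a network means repeatedly adjusting these weights so that the output score moves closer to the known answer for each training record.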
      Predictive Modelling - Value Prediction
   Used to estimate a continuous numeric value that is
    associated with a database record.

   Uses the traditional statistical techniques of linear
    regression and non-linear regression.

   Relatively easy-to-use and understand.
      Predictive Modelling - Value Prediction
   Linear regression attempts to fit a straight line through
    a plot of the data, such that the line is the best
    representation of the average of all observations at that
    point in the plot.

   Problem is that the technique only works well with
    linear data and is sensitive to the presence of outliers
    (i.e., data values that do not conform to the expected
    norm).
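
Both points can be seen in a small worked example. The sketch below fits a least-squares line in plain Python; the data values are hypothetical and chosen so that the arithmetic is easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
clean = [2, 4, 6, 8, 10]          # perfectly linear: y = 2x
with_outlier = [2, 4, 6, 8, 40]   # the same data with one outlier

print(fit_line(xs, clean))         # (2.0, 0.0)
print(fit_line(xs, with_outlier))  # (8.0, -12.0)
```

A single outlying value changes the slope from 2 to 8, so the fitted line no longer describes the other four observations well.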
      Predictive Modelling - Value Prediction
   Although non-linear regression avoids the main
    problems of linear regression, it is still not flexible enough
    to handle all possible shapes of the data plot.

   Statistical measurements are fine for building linear
    models that describe predictable data points; however,
    most data is not linear in nature.
      Predictive Modelling - Value Prediction
   Data mining requires statistical methods that can
    accommodate non-linearity, outliers, and non-numeric
    data.

   Applications of value prediction include credit card
    fraud detection or target mailing list identification.
      Database Segmentation

   Aim is to partition a database into an unknown number
    of segments, or clusters, of similar records.

   Uses unsupervised learning to discover homogeneous
    sub-populations in a database to improve the accuracy
    of the profiles.
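
As one illustration of unsupervised segmentation, the sketch below runs plain k-means on a single numeric attribute. Note the swap: k-means assumes the number of segments is fixed in advance, whereas the operation described above treats it as unknown. The income values and starting centers are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on one numeric attribute, starting from the given centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each record to the segment with the nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its segment (keep it if the segment is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

incomes = [1500, 1800, 2200, 5000, 5500, 9000, 9500, 10000]
centers, segments = k_means_1d(incomes, centers=[1000, 5000, 10000])
print(segments)
```

No class labels are supplied; the segments emerge only from how close the records are to one another.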
      Database Segmentation

   Less precise than other operations thus less sensitive to
    redundant and irrelevant features.

   Sensitivity can be reduced by ignoring a subset of the
    attributes that describe each instance or by assigning a
    weighting factor to each variable.

   Applications of database segmentation include
    customer profiling, direct marketing, and cross selling.
Example of Database Segmentation
using a Scatter plot
       Database Segmentation
   Associated with demographic or neural clustering
    techniques, distinguished by:
      Allowable data inputs
      Methods used to calculate the distance between records
      Presentation of the resulting segments for analysis.
Example of Database Segmentation
using a Visualization
      Link Analysis

   Aims to establish links (associations) between records,
    or sets of records, in a database.

   There are three specializations
      Associations discovery
      Sequential pattern discovery
      Similar time sequence discovery

   Applications include product affinity analysis, direct
    marketing, and stock price movement.
      Link Analysis - Associations Discovery
   Finds items that imply the presence of other items in
    the same event.

   Affinities between items are represented by association rules.
      e.g. ‘When customer rents property for more than 2
       years and is more than 25 years old, in 40% of cases,
       customer will buy a property. Association happens
       in 35% of all customers who rent properties’.
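
The two percentages attached to a rule are commonly called confidence and support. The sketch below computes both for a rule over hypothetical shopping baskets, assuming the usual definitions: support is the share of all records that contain both sides of the rule, and confidence is the share of records matching the "if" side that also match the "then" side.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    matches_if = [r for r in records if antecedent <= r]
    matches_both = [r for r in matches_if if consequent <= r]
    support = len(matches_both) / len(records)
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence

# Hypothetical events: each set holds the items present in one event.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

support, confidence = rule_metrics(baskets, {"bread"}, {"milk"})
print(support, confidence)  # 0.6 0.75
```

Here bread appears in four of the five baskets and three of those also contain milk, giving a confidence of 75% and a support of 60%.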
      Link Analysis - Sequential Pattern Discovery
   Finds patterns between events such that the presence of
    one set of items is followed by another set of items in a
    database of events over a period of time.

     e.g. Used to understand long-term customer buying behavior.
      Link Analysis - Similar Time
      Sequence Discovery
   Finds links between two sets of data that are time-
    dependent, and is based on the degree of similarity
    between the patterns that both time series demonstrate.
      e.g. Within three months of buying property, new
       home owners will purchase goods such as cookers,
       freezers, and washing machines.
      Deviation Detection

   Relatively new operation in terms of commercially
    available data mining tools.

   Often a source of true discovery because it identifies
    outliers, which express deviation from some previously
    known expectation and norm.
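
One simple statistical way to flag such outliers is to measure how many standard deviations each value lies from the mean. The sketch below is an assumption-laden illustration: the claim amounts and the threshold of 2.0 are hypothetical, and real tools may use quite different tests.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical claim amounts with one unusual value.
claims = [100, 110, 95, 105, 102, 98, 101, 900]
print(find_outliers(claims))  # [900]
```

The flagged value is a candidate for investigation, not proof of fraud; the norm it deviates from is only as good as the data used to compute it.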
      Deviation Detection

   Can be performed using statistics and visualization
    techniques or as a by-product of data mining.

   Applications include fraud detection in the use of credit
    cards and insurance claims, quality control, and defect
    tracing.
       A Summary: Data-Driven
   Data Visualization

   Decision Trees

   Clustering

   Factor Analysis

   Neural Network

   Association Rules

   Rule Induction

* Based on Sakhr Youness’s book “Professional Data Warehousing
  with SQL Server 7.0 and OLAP Services”
    Data Visualization
A pie chart showing the sales of a product by region is
sometimes much more effective than presenting the same
data in text or tabular form.

[Pie chart: sales of a product by region. Recovered labels: Northeast,
South, North. Recovered shares: 39%, 11%, 21%, 20%.]
Decision Tree
    Cluster Analysis
First segment (high income > 8,000)
Second segment (8,000 > middle income > 3,000)
Third segment (low income < 3,000)

[Figure labels: “Last car is a used one”; “Own car”]
       Factor Analysis
   Unlike cluster analysis, factor analysis builds a model from data.
    The technique finds underlying factors, also called “latent
    variables” and provides models for these factors based on
    variables in the data. For example, a software company is considering a
    survey to find out the nine most perceived attributes of one of its
    products. It might group these attributes into categories such
    as service for technical support, availability for training and a help

   Factor analysis is used for grouping together products based on a
    similarity of buying patterns so that vendors may bundle several
    products as one to sell them together at a lower price than their
    added individual prices.
Neural Networks
       Association Rules

   Association models are models that examine the extent to which
    values of one field depend on, or are produced by, values of
    another field. These models are often referred to as Market Basket
    Analysis when they are applied to retail industries to study the
    buying patterns of these customers, especially in grocery and retail
    stores that issue their own credit cards. Charging against these
    cards gives the store the chance to associate the purchases of
    customers with their identities, which allows them to study
    associations among other things.
       Rule Induction

   This is a powerful technique that involves a large number of rules
    expressed as a set of “if...then” statements in the pursuit of all possible
    patterns in the dataset. For example, if the customer is male,
    is between 30 and 40 years of age, and has an income of less than
    $50,000 and more than $20,000, he is likely to be driving a car that
    was bought as new.
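
A candidate rule like this one is only useful if it holds up against the data. The sketch below expresses the rule as a Python test and measures how many records it covers and how often its conclusion is correct. The customer records are hypothetical, and the terms "coverage" and "accuracy" are used here as informal labels.

```python
# Hypothetical customer records; bought_new says whether the car was bought as new.
customers = [
    {"sex": "male", "age": 35, "income": 30_000, "bought_new": True},
    {"sex": "male", "age": 38, "income": 45_000, "bought_new": True},
    {"sex": "male", "age": 33, "income": 25_000, "bought_new": False},
    {"sex": "female", "age": 36, "income": 40_000, "bought_new": True},
    {"sex": "male", "age": 52, "income": 30_000, "bought_new": False},
]

def rule_applies(c):
    """IF male AND aged 30 to 40 AND income between $20,000 and $50,000."""
    return c["sex"] == "male" and 30 <= c["age"] <= 40 and 20_000 < c["income"] < 50_000

covered = [c for c in customers if rule_applies(c)]
coverage = len(covered) / len(customers)                           # share of records the rule fires on
accuracy = sum(c["bought_new"] for c in covered) / len(covered)    # share of those it gets right

print(len(covered), coverage, round(accuracy, 2))
```

A rule induction tool generates and scores many such rules automatically, keeping those that cover enough records with acceptable accuracy.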
       A Summary: Theory-Driven
   Correlations

   T-Tests

   Analysis of Variance

   Linear Regression

   Logistic Regression

   Discriminant Analysis

   Forecasting Methods
      Topic 7: The Data Mining Process

   Define the problem.
   Select the data.
   Prepare the data.
   Mine the data.
   Deploy the model.
   Take business action.
   Are you ready for Data Mining?
       Define the problem

   A successful data mining initiative always starts with
    a well-defined project. To ensure that the project produces
    incremental value, include an assessment of the status quo
    solution and a review of technology, organization, and
    business processes.
       Select the data

   This step involves defining your data source (not every
    data source and record is required). The data is usually
    extracted from the source system to a separate server.
      Prepare the data

   This step represents up to 80 percent of the total project
    effort. For data mining, the data must reside in one flat
    table (each record has many columns). In addition to being
    the most time consuming, the step is also the most critical.
    The resulting models are only as good as the data used to
    create them.
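
Much of the preparation effort goes into collapsing related records into that single flat table. A minimal sketch, assuming hypothetical customer and transaction data and illustrative column names:

```python
from collections import defaultdict

# Hypothetical source data: customer attributes and one row per purchase.
customers = {101: {"age": 34, "region": "North"}, 102: {"age": 51, "region": "South"}}
transactions = [(101, 20.0), (101, 35.0), (102, 12.5)]

def build_flat_table(customers, transactions):
    """Aggregate transactions so that each customer becomes one row with many columns."""
    totals, counts = defaultdict(float), defaultdict(int)
    for customer_id, amount in transactions:
        totals[customer_id] += amount
        counts[customer_id] += 1
    return [
        {"customer_id": cid, **attrs, "n_purchases": counts[cid], "total_spent": totals[cid]}
        for cid, attrs in sorted(customers.items())
    ]

flat_table = build_flat_table(customers, transactions)
for row in flat_table:
    print(row)
```

Each aggregated column (number of purchases, total spent) is a design decision made during preparation, which is one reason this step has such a large effect on the resulting models.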
      Mine the data

   Typically the easiest and shortest phase, this step involves
    applying statistical and AI tools to create mathematical
    models. Data mining typically occurs on a server separate
    from the data warehousing and other corporate systems.
      Deploy the Model

   Model deployment is the process of implementing the
    mathematical models into operational systems to improve
    business results.
      Take Business Action

   Use the deployed model to achieve improved results to the
    business problem identified at the beginning of the
       Step to Implement Data Mining

[Figure labels: Discovery (patterns, relations, associations, etc.);
Prior Knowledge; Information Model]



Just because you have a data warehouse doesn’t mean
you’re necessarily ready for data mining. Much of the
work our company does in the data mining arena has
more to do with data mining readiness assessment than
with actually performing data mining.
      Metrics you can use to gauge your data
      mining readiness

   Do you have a staff of experienced knowledge workers?
   Do you have the data?
   Do you have marketing processes in place that can use this
   Do you have a business champion who can embrace the
    process and results?
   Do you have the technology infrastructure to support
    advanced analysis?
     Topic 8: Data Mining Tools

Data mining tools are typically classified by the type of
algorithm they use to identify hidden patterns. There are
many different algorithms in use, but the four most
popular are association, sequence, clustering (or
segmentation), and predictive modeling.
      Data Mining Tools

   There are a growing number of commercial data
    mining tools on the marketplace.

   Important characteristics of data mining tools include:
      Data preparation facilities
      Selection of data mining operations
      Product scalability and performance
      Facilities for visualization of results.
     Data Mining vs. OLAP

They are two separate breeds of analysis with
entirely different objectives, not to mention
tools, skill sets, and implementation methods.
     Data Mining
 With canned reports, ad hoc querying, and
 OLAP, the end user defines a hypothesis and
 determines which data to examine. With data
 mining, the tool identifies the hypothesis, and it
 actually tells the user where in the data to start
 the exploration process.
     Data Mining
Rather than using SQL to filter out values and methodically
reduce the data into a concise answer set, data mining uses
algorithms that exhaustively review the relationships among
data elements to determine if any patterns exist. The whole
purpose of data mining is to yield new business information
that a business person can act on.
       OLAP vs. Data Mining Tools
     OLAP Tools
   Are ad hoc, shrink-wrapped tools that provide an interface
    to data
   Are used when you have specific known questions
   Look and feel like a spreadsheet that allows
    rotation, slicing and graphic
   Can be deployed to a large number of users

     Data Mining Tools
   Methods for analyzing multiple data types
     -- Regression Trees
     -- Neural networks
     -- Genetic algorithms
   Are used when you don’t know what the questions are
   Usually textual in nature
   Usually deployed to a small number of analysts
      Data Mining Tools


    Association, also frequently referred to as "affinity
    analysis," reviews numerous sets of items and looks for
    common groupings. An example of association is market
    basket analysis, which involves reviewing the products that
    consumers purchase in a single trip to the grocery store.

   Finds items that imply the presence of other items
    in the same event.

   Affinities between items are represented by
    association rules.
     e.g. ‘When a customer rents property for more than 2
      years and is more than 25 years old, in 40% of cases,
      the customer will buy a property. This association
      happens in 35% of all customers who rent properties’.
       Data Mining Tools


    Sequential analysis helps data miners identify a set of
    order-specific items or events. Association identifies the
    existence of patterns or groups of items; sequential
    analysis identifies the order of those patterns or groups of
    items.

   Finds patterns between events such that the presence of
    one set of items is followed by another set of items in a
    database of events over a period of time.
    e.g. Used to understand long-term customer buying behavior.
       Link Analysis - Similar Time Sequence

   Finds links between two sets of data that are time-
    dependent, and is based on the degree of similarity
    between the patterns that both time series demonstrate.

    e.g. Within three months of buying property, new home
    owners will purchase goods such as cookers, freezers, and
    washing machines.
       Data Mining Tools


    Cluster analysis lets the data miner assemble data into
    unforeseen groups containing similar characteristics. Also
    known as "segmentation," this type of data
    mining is probably the most widely used.

   Aim is to partition a database into an unknown number of
    segments, or clusters, of similar records.

   Uses unsupervised learning to discover homogeneous
    sub-populations in a database to improve the accuracy of the
    profiles.
      Data Mining Tools


    As the name implies, predictive modeling involves
    developing a model from historical data for predicting a
    future event. The power of predictive modeling engines is
    that they can use a broad range of data attributes to identify
    future behavior. Both cluster analysis and predictive
    modeling tools identify distinct groups of items with
    common attributes; the difference is that predictive
    modeling focuses on the likelihood of a particular outcome
    for a particular group.
       Topic 9: Data Mining Applications
       for CRM
   Which customers are most profitable to me? Why?
   What promotions are most effective? For which customers?
   What kind of customers will be interested in my new product?
   What customers are at risk to defect to my competitor?
   How do I identify prospects with the greatest profit potentials?

Customer information is rapidly becoming a company’s most
important asset for answering these questions. However, answering these
questions in broad generalities is not enough. Each customer must be
analyzed and potentially treated uniquely. Customer relationship
management provides the framework for analyzing customer
profitability and improving marketing effectiveness.
      Customer Relationship
      Management -Framework
Many organizations have collected and stored a wealth of data about
their customers, suppliers, and business partners. However, the
inability to discover valuable information hidden in the data prevents
these organizations from transforming this data into knowledge. The
business desire is, therefore, to extract valid, previously unknown,
and comprehensible information from large databases and use it for
profit. To fulfill these goals, organizations need to follow these steps:

- Capture and integrate both the internal and external data into a
  comprehensive view that encompasses the whole organization.
- “Mine” the integrated data for information.
- Organize and present the information with knowledge for decision-making.
       Customer Relationship
       Management -Framework
From the architecture point of view, the entire CRM framework can
be classified into three key components:

   Operational CRM – The automation of horizontally integrated
    business processes, including customer touch-points, channels, and
    front-back office integration.

   Analytical CRM – The analysis of data created by the Operational
    CRM.

   Collaborative CRM- Applications of Collaborative services
    including e-mail, personalized publishing, e-communities, and
    similar vehicles designed to facilitate interactions between
    customers and organizations.
                  CRM Architecture
[Figure: CRM architecture diagram. Recoverable labels, with grouping inferred from the layout:]
   Spanning layers: Business Rules and Metadata Management; Workflow Management
   Data Sources: Contact History, Transaction History, External Data
   ETL Tools
   Market Data Store (Marketing Data Marts): Campaign Mgt, Analytics Data Mart, Reporting Data Mart
   Decision Support Applications: Campaign Mgt, Data Mining Analytics, Reporting
   Communication Channels: Direct Mails, Contact Mgt, Call Center, Customer Service Center, Other
       CRM -The Business Perspective
Tools and technologies will be applied to the following real CRM business
problems:

   Customer Profitability – provides a blueprint for how to define and use
    customer profitability as the bedrock for your CRM processes.

   Customer Acquisition – shows how to use data mining to acquire new
    customers in the most profitable way possible.

   Customer Cross-selling – details how the technology architecture can be
    used to increase the value of existing customers by selling more to them.

   Customer Retention – uses a case study from the telecommunications
    industry to show how to execute successful CRM systems to retain your
    profitable customers.

   Customer Segmentation – provides the business methodology of how to
    segment and manage your customers in a consistent and repeatable way
    across the enterprise.
      Information Mining and Knowledge
      Discovery for Effective CRM
In the current and emerging competitive and highly dynamic business
environment, only the most competitive companies will achieve
sustained market success. In order to capitalize on business
opportunities, these organizations will distinguish themselves by their
capacity to leverage information about their marketplace, customers,
and operations. A central part of this strategy for sustained
long-term success will be an active information repository: an
advanced data warehouse in which information from various
applications or parts of the business is coalesced and understood.
        Information Mining
The shortest path from complex data to knowledge discovery is
information mining, a term used instead of data mining to reflect the
rich variety of forms that the information required for business
intelligence can take. Information mining implies using powerful and
sophisticated tools to do the following:
     Uncover associations, patterns, and trends

     Detect deviations

     Group and classify information

     Develop predictive models
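
As a small illustration of the first item, uncovering associations, the sketch below counts how often pairs of items occur together and estimates how strongly one item implies another. The transactions and the function names pair_support and confidence are made up for this example; they do not come from the course material or from any particular tool.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions for illustration only; not data from the course.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
```

Here support maps each item pair to the share of baskets containing both items, and confidence(baskets, "butter", "bread") estimates how often a basket with butter also contains bread.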
      Information Mining

From a technical perspective, the real keys to successful information
mining are its algorithms: complex mathematical processes that
compare and correlate data. Algorithms enable an information
mining application to determine who the best customers for the
business are and what they like to buy. They can also determine at
what time of day and in what combinations customers buy, and how an
organization can optimize inventory, pricing, and merchandising in
order to retain these customers and cause them to buy more, at
increased profit margins. A large volume of information is also stored
in non-numeric forms: documents, images, and video files.
      Text Mining and Knowledge
Text mining is a subset of information mining technology that, in
turn, is a component of a more general category, Knowledge
Management (KM). Knowledge, in this case, refers to the collective
expertise, experience, know-how, and wisdom of an organization. In
the business world, knowledge is represented not only by the
structured data found in traditional databases but also by a wide
variety of unstructured sources such as word-processing documents,
memos and letters, e-mail messages, news feeds, Web pages, and so forth.
      Text Mining and Knowledge
Unlike data mining, text mining works with information stored in an
unstructured collection of text documents. Specifically, online text
mining refers to the process of searching through unstructured data
on the Internet and deriving some meaning from it. Text mining goes
beyond applying statistical models to data files: it
uncovers relationships in a text collection and leverages the
creativity of the knowledge worker to explore these relationships and
discover new knowledge.
       Text Mining Technologies

There are two key technologies that make online text mining possible:

   Internet Searching - It has been around for quite a few years.
    Yahoo, Alta Vista, and Excite are three of the earliest. Search
    engines (and discovery services) operate by indexing the content of
    a particular Web site and allowing users to search the indexes.
    Although useful, the first generation of these tools often returned
    wrong results because they did not correctly index the content they
    retrieved. Advances in text mining applied to Internet searching resulted
    in online text mining, representing the new generation of Internet
    search tools. With these products, users can gain more relevant
    information by processing a smaller number of links, pages and
       Text Mining Technologies

   Text Analysis - It has been around longer than Internet searching.
    Indeed, scientists have been trying to make computers understand
    natural languages for decades; text analysis is an integral part of
    these efforts. The automatic analysis of text information can be
    used for several different general purposes:

    1. To provide an overview of the contents of a large document
    collection; for example, finding significant clusters of documents in a
    customer feedback collection could indicate where a company’s
    products and services need improvement.

    2. To identify hidden structures between groups of objects; this
    may help to organize an intranet site so that related documents are
    all connected by hyperlinks.
   Text Mining Technologies

3. To increase the efficiency and effectiveness of a search process to
find similar or related information; for example, to search articles from
a news service and discover all unique documents that contain
hints on possible trends or technologies that have so far not been
mentioned in other articles.

4. To detect duplicate documents in an archive.
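
The fourth purpose, detecting duplicate documents, can be sketched with a simple word-overlap measure. This is a minimal illustration that assumes documents are short strings and that Jaccard similarity over word sets is an acceptable stand-in for a real text analysis engine. The function names and the 0.8 threshold are hypothetical choices for this example.

```python
import re
from itertools import combinations

def word_set(text):
    """Lower-case the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(docs, threshold=0.8):
    """Return pairs of document ids whose word sets overlap heavily."""
    words = {doc_id: word_set(text) for doc_id, text in docs.items()}
    return [
        (i, j)
        for i, j in combinations(sorted(words), 2)
        if jaccard(words[i], words[j]) >= threshold
    ]

# Made-up documents for illustration only.
docs = {
    "a": "Data mining finds hidden patterns in large databases.",
    "b": "Data mining finds hidden patterns in large databases!",
    "c": "Text mining works with unstructured documents.",
}
```

Calling near_duplicates(docs) compares every pair of documents, which is fine for a handful of texts but would need an indexing scheme for a large archive.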
       Text Mining Technologies-
1.   E-mail management. A popular use of text analysis is message routing,
      in which the computer “reads” the message to decide who should deal
      with it. (Spam control is another good example.)
2.   Document management. By mining the different documents for meaning
      as they are put into a document repository, a company can establish a
      detailed index that allows the location of relevant documents at any
      time.
3.   Automated help desk. Some companies use text mining to respond to
      customer inquiries. Customers’ letters and e-mails are processed by a
      text mining application.
4.   Market research. A market researcher can use online text mining to
      gather statistics on the occurrences of certain words, phrases, concepts,
      or themes on the World Wide Web. This information can be useful for
      establishing market demographics and demand curves.
5.   Business intelligence gathering. This is the most advanced use of text
      mining. (See next slide)
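
The message routing idea in item 1 can be sketched as a keyword count. This is only a rough illustration: the departments, keywords, and the route_message function are hypothetical, and production systems would typically use trained classifiers rather than fixed word lists.

```python
import re

# Hypothetical departments and keywords, chosen only for illustration.
ROUTES = {
    "billing": {"invoice", "refund", "payment", "charge"},
    "support": {"error", "crash", "broken", "install"},
    "sales": {"price", "quote", "upgrade", "discount"},
}

def route_message(text, routes=ROUTES, default="general"):
    """Pick the department whose keywords appear most often in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        dept: sum(word in keywords for word in words)
        for dept, keywords in routes.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to a default queue when no keyword matched at all.
    return best if scores[best] > 0 else default
```

A message with no matching keywords falls through to the default queue, so a person can still pick it up.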
Blogger is one of the most popular online blogging tools; it works with
any browser and is free, well designed, and easy to use. Millions of
people are changing their information acquisition habits, and the web
log, or “blog”, has become a popular source.

   Title- Publishing a Blog with Blogger / by Elizabeth Castro,
    Berkeley, Calif., Peachpit, 2005
   Title- Blog: Understanding the Information Reformation That’s Changing
    Your World / Hugh Hewitt, Nashville, Tenn., Nelson Books, c2005
   Webblogs (isbn 0321321235)
      CRM in the e-Business World

As e-business continues to mature and effect radical changes throughout all
aspects of business, the focus of new e-business-enabled application
software will shift away from narrowly defined commerce platforms toward
a broader vision of managing customer relationships.

A new model that Forrester Research calls eRelationship Management (eRM)
is defined as follows:

“A Web-centric approach to synchronizing customer relationships across
communication channels, business functions, and audiences”
       CRM in the e-Business World

To implement this new e-business CRM model, companies should do the following:

   Create a dynamic customer context that can address every customer
    interaction and that differs from a view of the customer constructed from
    data contained in the applications. This can be achieved by collecting and
    organizing customer data, calculating high-level metrics for each
    customer (e.g., customer profitability, satisfaction, and churn potential),
    and assembling and delivering dynamic context to customer touch points.
   Generate consistent, custom responses by delivering a consolidated rules
    engine for routing, workflow, personalization, smart navigation, and
    consistent treatment of customers.
   Build and maintain a Content Directory that points to company, product,
    and business partner content, and make it available to employees, business
    partners, and customers.
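
The first step, calculating high-level metrics for each customer, might look like the sketch below. The purchase records are made up, and treating "no purchase in the last 90 days" as churn risk is an assumption chosen for illustration, not a definition from the slides.

```python
from collections import defaultdict

# Made-up records: (customer id, revenue, cost to serve, days since the purchase).
purchases = [
    ("c1", 120.0, 70.0, 10),
    ("c1", 80.0, 50.0, 45),
    ("c2", 40.0, 55.0, 200),
    ("c3", 300.0, 180.0, 95),
]

def customer_context(purchases, churn_after_days=90):
    """Summarise each customer as total profit plus a crude churn-risk flag."""
    profit = defaultdict(float)
    last_seen = {}
    for customer, revenue, cost, days_ago in purchases:
        profit[customer] += revenue - cost
        # Track the most recent purchase (smallest "days ago") per customer.
        last_seen[customer] = min(days_ago, last_seen.get(customer, days_ago))
    return {
        customer: {
            "profit": profit[customer],
            "churn_risk": last_seen[customer] > churn_after_days,
        }
        for customer in profit
    }

context = customer_context(purchases)
```

The resulting dictionary is the kind of per-customer summary that could be delivered to a touch point such as a call center screen.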
       Topic 10: Data Mining From US
       Government Printing Office
   Washington, March 25, 2003. Subcommittee on Technology,
    Information Policy, Intergovernmental Relations and the Census
    Oversight hearing on “Data Mining: Current Applications and the
    Future Possibilities”-Available via or

   Background: The hearing will explore instances where data mining
    technology is currently employed, examine the benefits and the
    pitfalls, and discuss the potential uses of data mining at the Federal
    level of government. A specific focus is the privacy and abuse concerns
    surrounding this technology.
       Data Mining: Current Applications and
       the Future Possibilities
   Data Mining technology has been utilized successfully for many years
    in both the private and public sectors to identify and analyze useful
    data that would otherwise be overlooked or inaccessible.

   Government agencies have also used data mining techniques quite
    extensively to identify and eliminate fraud, waste and abuse. States
    work with localities by providing them access to their data sources.
    This has allowed local and state enforcement agencies to zero in on tax
    evaders, perpetrators of financial crimes or those conducting any
    number of fraudulent activities. At the federal level, the Treasury
    Department uses this technology to identify and prosecute money
    laundering schemes, the IRS to track down delinquent taxpayers, and
    the U.S. Customs to identify drug trafficking activities at U.S,
       Topic 11: Data Mining Techniques-
       A Summary
   Artificial neural networks: Non-linear predictive models that learn
    through training and resemble biological neural networks in structure.
   Decision trees: Tree-shaped structures that represent sets of
    decisions. These decisions generate rules for the classification of a
    dataset.

   Genetic algorithms: Optimization techniques that use processes
    such as genetic combination, mutation, and natural selection in a
    design based on the concepts of evolution.

   Rule induction: The extraction of useful if-then rules from data
    based on statistical significance.
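
To make the genetic algorithm entry concrete, here is a minimal sketch on a toy problem: evolving a bit string toward all 1s. The objective, the population settings, and the function names are assumptions chosen for illustration; real applications would substitute their own fitness function.

```python
import random

def fitness(bits):
    """Toy objective (OneMax): count the 1s in the bit string."""
    return sum(bits)

def evolve(length=20, population_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Evolve bit strings with selection, combination (crossover) and mutation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    history = [max(map(fitness, population))]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)
            # Combination: splice the two parents at a random cut point.
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = parents + children
        history.append(max(map(fitness, population)))
    return max(population, key=fitness), history

best, history = evolve()
```

Because the parents survive unchanged into the next generation, the best fitness recorded in history never decreases from one generation to the next.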
      Data Mining Techniques- A Summary
   Operation                   Techniques
   Predictive modeling        Classification
                               Value prediction
   Database segmentation      Demographic clustering
                               Neural clustering
   Link analysis              Association discovery
                               Sequential pattern discovery
                               Similar time sequence
   Deviation detection        Statistics
                               Visualization
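
As one illustration of the last row, deviation detection with statistics, the sketch below flags values that lie far from the mean. The data and the two-standard-deviation threshold are assumptions for this example; other statistical tests would serve equally well.

```python
from statistics import mean, stdev

def deviations(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` sample
    standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# Made-up daily transaction totals; the last value is an obvious outlier.
daily_totals = [100, 102, 98, 101, 99, 103, 97, 100, 250]
```

Calling deviations(daily_totals) picks out the final value, since the other readings all sit within a fraction of a standard deviation of the mean.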
      Two Types of Data Mining Modeling-
      Verification and Discovery

   The verification model utilizes a process that looks in a
    database to detect trends and patterns in data that will help
    answer some specific questions about the business.

   In this mode, the user generates a hypothesis about the
    data, issues a query against the data, and examines the
    results of the query, looking either for verification of the
    hypothesis or for grounds to decide that the hypothesis is not
    valid.
       Verification Model

   In this model, very little new information is created in the
    extraction process: either the hypothesis is verified or it is
    not.

   Common tools used in this mode are queries,
    multidimensional analysis, and visualization. What they all have
    in common is that the user is essentially ‘guiding’ the
    exploration of the data being inspected.
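
A verification-style session can be sketched with an ordinary query. In this hedged example the user's hypothesis is that customers who received a promotion spend more on average. The table name, columns, and rows are all made up, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Made-up sales records: (customer id, received promotion?, amount spent).
rows = [
    (1, 1, 120.0), (2, 1, 90.0), (3, 1, 150.0),
    (4, 0, 80.0), (5, 0, 60.0), (6, 0, 100.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer_id INTEGER, promoted INTEGER, amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Hypothesis: customers who received the promotion spend more on average.
averages = dict(conn.execute(
    "SELECT promoted, AVG(amount) FROM sales GROUP BY promoted"
))
hypothesis_supported = averages[1] > averages[0]
conn.close()
```

The query only answers the question the user posed; it does not surface any pattern the user did not think to ask about.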
      Discovery Model

   A more popular model is the Discovery Model that utilizes
    a process that looks in a database to discover and/or predict
    future patterns. The discovery model is divided into two
    modes: “Descriptive” and “Predictive”.
      Discovery Model- Descriptive Mode

   The Descriptive mode finds hidden patterns without a
    predetermined idea or hypothesis about what the patterns
    may be. In other words, the Data Mining software or
    program takes the initiative in finding what the interesting
    patterns are, without the user thinking of the relevant
    questions first. In this mode, information is created about
    the data with very little or no guidance from the user. The
    exploration of the data is done in such a way as to yield as
    large a number of useful facts about the data as possible in the
    shortest amount of time.
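
Clustering is one common way to find groups without a prior hypothesis. The sketch below applies plain k-means to one-dimensional, made-up spending figures. Note that the number of clusters k is still supplied by the user here, so this is only a rough stand-in for the fully unguided exploration described above.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Group numbers into k clusters with plain k-means (Lloyd's algorithm)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: take the middle element of each of k equal slices.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if empty).
        new_centres = [
            sum(cluster) / len(cluster) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

# Made-up annual spend per customer; no labels or hypotheses are supplied.
spend = [12, 15, 14, 13, 95, 102, 98, 110]
centres, clusters = kmeans_1d(spend)
```

On this data the procedure separates low spenders from high spenders even though nobody told it that such a split exists.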
       Discovery Model- Predictive Mode

   In the Predictive mode, patterns discovered from the database are used
    to predict future patterns or trends. Predictive modeling allows the
    user to submit records with some unknown field values, and the system
    will guess the unknown values based on previous patterns discovered
    from the database.

   In comparing the two models, one can state that “Verification” can be
    very inefficient, time-consuming, and costly, whereas “Discovery” modeling
    can be very efficient and cost-effective, is less dependent on user input,
    and can increase modeling accuracy.
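
The predictive mode, guessing an unknown field value from previously seen records, can be sketched with a nearest-neighbour vote. The records, field names, and the predict_field function are made up for illustration, and a real system might use any of the techniques listed in the summary instead.

```python
from collections import Counter

def predict_field(known, query, target, k=3):
    """Guess query[target] by majority vote among the k most similar
    known records, using squared distance over the other numeric fields."""
    features = [name for name in query if name != target]

    def distance(record):
        return sum((record[name] - query[name]) ** 2 for name in features)

    neighbours = sorted(known, key=distance)[:k]
    votes = Counter(record[target] for record in neighbours)
    return votes.most_common(1)[0][0]

# Made-up customer records; "churned" is the field to be predicted.
known = [
    {"age": 25, "monthly_calls": 5, "churned": "yes"},
    {"age": 27, "monthly_calls": 8, "churned": "yes"},
    {"age": 31, "monthly_calls": 6, "churned": "yes"},
    {"age": 45, "monthly_calls": 40, "churned": "no"},
    {"age": 50, "monthly_calls": 35, "churned": "no"},
    {"age": 48, "monthly_calls": 42, "churned": "no"},
]
new_customer = {"age": 29, "monthly_calls": 7, "churned": None}
```

Calling predict_field(known, new_customer, "churned") fills in the missing value from the three closest known customers.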
