Introduction to Data Mining

Rattapoom Tuchinda
*Some of the slides are from
Jaideep Srivastava @ http://www.cs.umn.edu/faculty/srivasta.html
Mike Kassoff @ http://logic.stanford.edu/classes/cs246/lectures2001/mkassoff_lecture.ppt
So far…

          Information Integration techniques
           Extraction: wrapper building
           Integration: record linkage,
           Semantic web
           Execution: streaming data flow




                    DATA
Data overload

 Gene data
 Customer/Sales data
 Astrophysics data
 Pricing
 ….

  And no one wants to stare at 100k tuples
What is data mining?

 A process that uses various techniques to
 discover “patterns” or knowledge from data
  –   Visualization..
  –   Machine learning algorithms..
Examples…

 Link analysis
 Fraud detection
 New medicines
 Revenue Management/Discriminatory pricing
 Marketing
 Stocks
 ….
Outline

 Introduction
 Data cleaning
 Data mining techniques
  –   Classification
  –   Clustering
  –   Association Rules
  –   Sequential Patterns
  –   Regression
  –   Deviation detection
  –   Meta-learning
 Case study: Biddingfortravel
Traditional Data Mining Process
Data is often of low quality

 Why?
  –   You didn’t collect it yourself!

  –   It probably was created for some other use, and
      then you came along wanting to integrate it

  –   People make mistakes (typos)

  –   People are busy (“this is good enough”)
Problems with data

 Some data have problems on their own

 Other data are problematic only when you
 want to integrate it
Data with problems on their own

 Problems due to lack of structure
 Problems not due to lack of structure (it’s in a
 database)
 Government agency data

What we want:




id   name                          city       state
1    Dept. of Transportation       New York   NY
2    Dept. of Finance              New York   NY
3    Office of Veteran's Affairs   New York   NY
First problem

What’s wrong here?

1'Dept. of Transportation'New York'NY
2'Dept. of Finance'New York'NY
3'Office of Veteran's Affairs'New York'NY

   The separator is used in the data.
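
A minimal sketch of one common fix, using Python's standard csv module: quote any field that contains the separator, so the apostrophe in "Veteran's" is no longer read as a field boundary. The choice of " as the quote character is an assumption for this example, not something the slides specify.

```python
import csv
import io

rows = [
    [1, "Dept. of Transportation", "New York", "NY"],
    [2, "Dept. of Finance", "New York", "NY"],
    [3, "Office of Veteran's Affairs", "New York", "NY"],
]

# A naive split on the separator breaks row 3 into five pieces.
naive = "3'Office of Veteran's Affairs'New York'NY".split("'")

# Writing with csv quotes any field that contains the delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="'", quotechar='"',
                    quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

# Reading with the same settings recovers four fields per row.
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter="'", quotechar='"'))
```

Choosing a separator that never occurs in the data (or always quoting text fields) avoids the problem at the source.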
Second problem

What’s wrong here?

1,Dept. of Transportation,New York City,NY
2,Dept. of Finance,City of New York,NY
3,Office of Veteran's Affairs,New York,NY

  We need standardization / naming
  conventions
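
One way to enforce a naming convention is a lookup table of known variants. This is a sketch only: the alias table and the function name standardize_city are made up for this example, and a real table would have to be built from the data.

```python
# Hypothetical alias table: every known spelling maps to one canonical name.
CITY_ALIASES = {
    "new york city": "New York",
    "city of new york": "New York",
    "new york": "New York",
}

def standardize_city(raw):
    """Return the canonical city name, or the trimmed input if unknown."""
    key = " ".join(raw.lower().split())  # lowercase, collapse whitespace
    return CITY_ALIASES.get(key, raw.strip())

cities = [standardize_city(c)
          for c in ["New York City", "City of New York", "New York"]]
```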
Third problem

What’s wrong here?

1,Dept. of Transportation,New York,NY
,Dept. of Finance,New York,NY
3,Office of Veteran's Affairs,New York,NY

  A missing required field
Fourth problem

What’s wrong here?

1,Dept. of Transportation,New York,NY
Two,Dept. of Finance,New York,NY
Office of Veteran's Affairs,3,New York,NY

    No data type constraints
    Ordering
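
The third and fourth problems (a missing required field, no type constraints, fields out of order) can be caught by validating each row before loading it. A minimal sketch follows; the specific rules (id must be an integer, state must be a two-letter uppercase code) are assumptions chosen to match the example table.

```python
def validate_row(fields):
    """Return a list of problems found in one [id, name, city, state] row."""
    if len(fields) != 4:
        return ["expected 4 fields, got %d" % len(fields)]
    row_id, name, city, state = (f.strip() for f in fields)
    problems = []
    if not row_id:
        problems.append("missing id")
    elif not row_id.isdigit():
        problems.append("id is not an integer")
    if not name:
        problems.append("missing name")
    elif name.isdigit():
        problems.append("name looks like a number")
    if not city:
        problems.append("missing city")
    if len(state) != 2 or not state.isalpha() or not state.isupper():
        problems.append("state is not a two-letter code")
    return problems

lines = [
    "1,Dept. of Transportation,New York,NY",
    ",Dept. of Finance,New York,NY",
    "Two,Dept. of Finance,New York,NY",
    "Office of Veteran's Affairs,3,New York,NY",
]
report = [validate_row(line.split(",")) for line in lines]
```

In a database the same checks would normally be declared as column types and NOT NULL constraints rather than written by hand.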
Fifth Problem

What’s wrong here?

1,Dept. of Transportation,New York,NY
2,Dept. of Finance,New York,NY
3,Dept. of Finance,New York,NY

  Redundancy!
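
A sketch of duplicate detection for this case: compare rows on everything except the id. Treating name, city and state together as the identity of a record is an assumption made for this example.

```python
def find_duplicates(rows):
    """Return (first_id, duplicate_id) pairs for rows that match on all non-id fields."""
    seen = {}
    duplicates = []
    for row_id, name, city, state in rows:
        key = (name.strip().lower(), city.strip().lower(), state.strip().upper())
        if key in seen:
            duplicates.append((seen[key], row_id))
        else:
            seen[key] = row_id
    return duplicates

rows = [
    (1, "Dept. of Transportation", "New York", "NY"),
    (2, "Dept. of Finance", "New York", "NY"),
    (3, "Dept. of Finance", "New York", "NY"),
]
duplicates = find_duplicates(rows)
```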
Problems not due to lack of structure
(it’s in a database)

 Flags: 0, 9, null, x, “no data”
 Typos:
  –   Can use constraints to catch corrupt data (e.g., weight can’t
      be negative)
  –   Or use statistical techniques to catch corrupt data
 Hidden semantics: white spaces can be important.
 Misleading data:
        building name      stories
        Guildford Plaza    9
        Hartford Apts.     35
        Braun Hotel        6
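
The three checks above (flags, constraints, statistics) can be sketched together. Everything specific here is an assumption for illustration: the weight column, the set of flag strings treated as missing, and the use of a median-based score with a 3.5 cutoff as the statistical test.

```python
import statistics

# Assumed placeholder values that mean "no data" in this hypothetical column.
MISSING_FLAGS = {"", "0", "9", "null", "x", "no data"}

def parse_weight(raw):
    """Turn a raw string into a float, or None if it is a missing-value flag."""
    text = raw.strip().lower()
    if text in MISSING_FLAGS:
        return None
    return float(text)

def violates_constraint(weight):
    """Constraint from the slide: a weight can't be negative."""
    return weight < 0

def mad_outliers(values, threshold=3.5):
    """Flag values far from the median, measured in median absolute deviations."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

weights = [70, 72, 68, 71, 69, 700]  # 700 is probably a typo for 70
suspicious = mad_outliers(weights)
```

Whether 0 or 9 is a flag or a real value depends on the field, which is exactly why flags are a problem.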
Data that is fine on its own, but
becomes problematic when you want
to integrate it
 Format
 Dynamic data
 Different granularity
 Conflicting data
Formats

 Not everyone uses the same format as you

 Dates are especially problematic:
 –   12/19/77
 –   12/19/1977
 –   12-19-77
 –   19/12/77
 –   Dec 19, 1977
 –   19 December 1977
 –   9 in Tevet, 5738
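
A sketch of normalizing several of these formats with datetime.strptime. The format lists and the day_first switch are assumptions for this example. Note that 12/19/77 and 19/12/77 cannot be told apart by parsing alone when both numbers are 12 or less, so the source's convention has to be known in advance. The Hebrew calendar date is not handled here.

```python
from datetime import date, datetime

MONTH_FIRST = ["%m/%d/%y", "%m/%d/%Y", "%m-%d-%y"]
DAY_FIRST = ["%d/%m/%y", "%d/%m/%Y", "%d-%m-%y"]
TEXTUAL = ["%b %d, %Y", "%d %B %Y"]  # English month names assumed

def parse_date(text, day_first=False):
    """Try each known format in turn and return a datetime.date."""
    numeric = DAY_FIRST if day_first else MONTH_FIRST
    for fmt in numeric + TEXTUAL:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date format: %r" % text)

samples = ["12/19/77", "12/19/1977", "12-19-77", "Dec 19, 1977", "19 December 1977"]
parsed_dates = [parse_date(s) for s in samples]
european = parse_date("19/12/77", day_first=True)
```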
Data that Moves

 You can’t store it all in the same currency
 (say, US$) because the exchange rate
 changes
 Price in foreign currency stays the same
 Must keep the data in foreign currency and
 use the current exchange rate to convert
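
A minimal sketch of that design: store the amount with its original currency and convert only when a US$ figure is requested. The Price class, the to_usd function and the exchange rates are all made up for this example.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class Price:
    amount: Decimal   # stored in the original currency, never converted
    currency: str

def to_usd(price, rates_to_usd):
    """Convert at query time using whatever rate table is current."""
    rate = rates_to_usd[price.currency]
    return (price.amount * rate).quantize(Decimal("0.01"))

room = Price(Decimal("100"), "EUR")

# Hypothetical rates on two different days; the stored price is unchanged.
usd_monday = to_usd(room, {"EUR": Decimal("1.10")})
usd_friday = to_usd(room, {"EUR": Decimal("1.25")})
```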
Data at a different level of detail than
you need

 If it is at a finer level of detail, you can
 sometimes bin it
 Example
  –   I need age ranges of 20-30, 30-40, 40-50, etc.
  –   Imported data contains birth date
  –   No problem! Divide data into appropriate
      categories
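
A sketch of the binning step. It assumes the ranges are half-open (30 falls in 30-40, not 20-30) and that ages below 20 are left unbinned; the slide does not say how the boundaries are treated.

```python
from datetime import date

def age_on(birth_date, today):
    """Age in whole years on the given day."""
    years = today.year - birth_date.year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1  # birthday has not happened yet this year
    return years

def age_bin(age, start=20, width=10):
    """Return the (low, high) range containing age, or None if below start."""
    if age < start:
        return None
    low = start + ((age - start) // width) * width
    return (low, low + width)

age = age_on(date(1977, 12, 19), date(2012, 10, 3))
bucket = age_bin(age)
```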
Data at a different level of detail than
you need (cont’d)

 Sometimes you cannot bin it
 Example
  –   I need age ranges 20-30, 30-40, 40-50 etc.
  –   Data is of age ranges 25-35, 35-45, etc.
  –   What to do?
        Ignore age ranges because you aren’t sure
        Make educated guess based on imported data (e.g.,
        assume that # people of age 25-35 are average # of
        people of age 20-30 & 30-40)
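
The educated guess can be sketched as proportional re-binning: assume people are spread evenly within each imported range, then give each target range its share of every overlapping source range. The uniform-spread assumption and the counts below are made up for illustration, and the result is an estimate, not recovered data.

```python
def rebin_uniform(source_counts, target_bins):
    """Estimate counts for target_bins from {(low, high): count} source bins."""
    result = {}
    for t_low, t_high in target_bins:
        total = 0.0
        for (s_low, s_high), count in source_counts.items():
            overlap = min(t_high, s_high) - max(t_low, s_low)
            if overlap > 0:
                # Share of the source bin that falls inside the target bin.
                total += count * overlap / (s_high - s_low)
        result[(t_low, t_high)] = total
    return result

imported = {(15, 25): 100, (25, 35): 200, (35, 45): 300}
estimate = rebin_uniform(imported, [(20, 30), (30, 40)])
```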
Conflicting Data

 Information source #1 says that George lives in
 Texas
 Information source #2 says that George lives in
 Washington, DC
 What to do?
  –   Use both (He lives in both places)
  –   Use the most recently updated piece of info
  –   Use the “most trusted” info
  –   Flag row to be investigated further by hand
  –   Use neither (We’d rather be incomplete than wrong)
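
The options above can be written as interchangeable resolution strategies. This is a sketch: the claim fields (updated, trust), the strategy names and the NEEDS_REVIEW marker are hypothetical.

```python
from datetime import date

def resolve(claims, strategy):
    """Pick a value from conflicting claims according to the chosen strategy."""
    values = {c["value"] for c in claims}
    if len(values) == 1:
        return claims[0]["value"]  # no conflict
    if strategy == "keep_all":
        return sorted(values)
    if strategy == "most_recent":
        return max(claims, key=lambda c: c["updated"])["value"]
    if strategy == "most_trusted":
        return max(claims, key=lambda c: c["trust"])["value"]
    if strategy == "flag":
        return "NEEDS_REVIEW"
    if strategy == "drop":
        return None
    raise ValueError("unknown strategy: %r" % strategy)

claims = [
    {"value": "Texas", "updated": date(2000, 5, 1), "trust": 0.9},
    {"value": "Washington, DC", "updated": date(2001, 2, 1), "trust": 0.6},
]
```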
Classification: Definition

 Given a collection of records (training set)
  –   Each record contains a set of attributes, one of the
      attributes is the class.
 Find a model for class attribute as a function of the
 values of other attributes.
 Goal: previously unseen records should be assigned
 a class as accurately as possible
  –   A test set is used to determine the accuracy of the model.
      Usually, the given data set is divided into training and test
      sets, with training set used to build the model and test set
      used to validate it.
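
A small sketch of the train/test procedure. A 1-nearest-neighbor classifier (a simple form of memory-based reasoning) stands in for the model; it is chosen here only because it fits in a few lines. The records are synthetic.

```python
import random

def split(records, test_size, seed=0):
    """Shuffle and divide records into (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_1nn(train, attributes):
    """Assign the class of the closest training record."""
    nearest = min(train, key=lambda rec: sq_dist(rec[0], attributes))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(train, attrs) == label for attrs, label in test)
    return correct / len(test)

# Synthetic records: (attributes, class). Two well-separated groups.
records = [((i, i % 3), "low") for i in range(10)]
records += [((100 + i, i % 3), "high") for i in range(10)]

train, test = split(records, test_size=6)
score = accuracy(train, test)
```

The test set is never used to build the model, so the score estimates performance on previously unseen records.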
Classification Example
Classification Techniques

 Decision Tree based Methods
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Genetic Algorithms
 Naïve Bayes and Bayesian Network
 Support Vector Machine
What is Cluster Analysis

 Finding groups of objects such that the objects in a
 group will be similar to one another and different
 from the objects in other groups.
  –   Based on information found in the data that describes the
      objects and their relationships
  –   Also known as unsupervised classification
 Many applications
  –   Understanding: group related documents for browsing
      (similar websites) or to find genes or proteins that have
      similar functionality
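
A sketch of one well-known partitional method, k-means, to make "similar within a group, different across groups" concrete. The slides do not name a specific algorithm, so k-means is a choice made for this example. Starting centroids are passed in explicitly to keep the run deterministic; practical implementations usually pick them at random.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centroids, iterations=10):
    """Return (labels, centroids) after a fixed number of k-means iterations."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
```

No class labels are supplied, which is why clustering is called unsupervised classification.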
Notion of a Cluster is Ambiguous
Partitional Clustering
Hierarchical Clustering
Mining Associations

 Given a set of records, find rules that will predict the
 occurrence of an item based on the occurrences of
 other items in the record
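
Rules of this kind are usually scored by support (how often the items occur together) and confidence (how often the rule holds when its left side occurs). A minimal sketch with made-up shopping baskets:

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def support(transactions, itemset):
    """Fraction of transactions containing the whole itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical records: each is the set of items in one basket.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
```

Association rule mining then searches for all rules whose support and confidence exceed chosen thresholds.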
Definition of Association Rule
Association Rule Mining
Meta-learning

 Learning about “learning”
 Combine multiple classifiers together to yield
 a better result.
 Simple voting, boosting, stacking
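
Simple voting is the easiest of the three to sketch: each classifier predicts a class and the most common answer wins. The three rule-of-thumb classifiers and the record fields below are made up for illustration.

```python
from collections import Counter

def majority_vote(classifiers, record):
    """Return the class predicted by the most classifiers."""
    votes = [clf(record) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical weak classifiers for a fraud-detection record.
def by_amount(r):
    return "fraud" if r["amount"] > 1000 else "ok"

def by_country(r):
    return "fraud" if r["foreign"] else "ok"

def by_hour(r):
    return "fraud" if r["hour"] < 5 else "ok"

ensemble = [by_amount, by_country, by_hour]
verdict = majority_vote(ensemble, {"amount": 1500, "foreign": False, "hour": 3})
```

Boosting and stacking replace the equal vote with learned weights or a second-level model.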
Stacking
Algorithm selection

 Given that we have a wide range of
 algorithms, which algorithm should I choose?
  –   Meta-learning approach [Brazdi 1995]
  –   Still an open-ended question
Case study: Bidding for travel




 Can we predict the winning hotel (or price)?
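 How does it work (I think..)?
 [Figure: hotels A, B and C, labeled 120, 200 and 180, send $60, $65 and $68 to Priceline; a $63 amount appears on the Priceline side; result "Winning: A".]

              A: 120       60
              B: 200       65
              C: 180       68

              120 < 200 < 180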
Biddingfortravel cleaning
 [Figure: data flow. Hotel 1 through Hotel N are combined (union), joined with Biddingfortravel postdata (area, stars, hotels), then passed through cleaning and mining.]
Prediction

 Given the area (San Diego Coastal), stars (4*),
 check-in date, check-out date, and the retail price of
 each hotel in the area, predict which
 hotel I will get from Priceline
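
A baseline sketch only, not the method used in the case study: predict the hotel most often reported as the winner for the same area and star level, ignoring dates and retail prices. The post tuples and hotel names are hypothetical.

```python
from collections import Counter, defaultdict

def train_baseline(posts):
    """Count reported winning hotels per (area, stars) from (area, stars, hotel) posts."""
    counts = defaultdict(Counter)
    for area, stars, hotel in posts:
        counts[(area, stars)][hotel] += 1
    return counts

def predict_hotel(counts, area, stars):
    """Return the most frequently reported hotel, or None if the pair is unseen."""
    if (area, stars) not in counts:
        return None
    return counts[(area, stars)].most_common(1)[0][0]

# Hypothetical cleaned posts.
posts = [
    ("San Diego Coastal", 4, "Hotel A"),
    ("San Diego Coastal", 4, "Hotel A"),
    ("San Diego Coastal", 4, "Hotel B"),
    ("San Diego Coastal", 3, "Hotel C"),
]
model = train_baseline(posts)
guess = predict_hotel(model, "San Diego Coastal", 4)
```

A classifier that also uses the dates and retail prices would be compared against this baseline.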
Ending remarks

 Data mining will always be in demand
 What makes data mining from the web so
 special?
 –   Access to real time data
 –   Pricing data
 –   Consumer aspect

				