Data mining on very large databases

Elena Baralis, Tania Cerquitelli
Data Base and Data Mining Group of Politecnico di Torino
http://dbdmg.polito.it/twiki/bin/view/Public/WebHome
Torino, May 13th 2009

Outline
  Data mining fundamentals
  Association rules fundamentals
  Disk-based pattern mining
  Other research topics




Data mining fundamentals

Data analysis
  Most companies own huge databases containing
    operational data
    textual documents
    experiment results
  These databases are a potential source of useful information




Data analysis
  Information is “hidden” in huge datasets
    not immediately evident
    human analysts need a large amount of time for the analysis
    most data is never analyzed at all

  [Figure: The Data Gap. Disk space (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000. From R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”]

Data mining
  Non-trivial extraction of
    implicit
    previously unknown
    potentially useful
  information from available data
  Extraction is automatic
    performed by appropriate algorithms
  Extracted information is represented by means of abstract models
    denoted as patterns

Example: biological data
  Microarray
    expression level of genes in a cellular tissue
    various types (mRNA, DNA)
  Patient clinical records
    personal and demographic data
    exam results
  Textual data in public collections
    heterogeneous formats, different objectives
    scientific literature (PubMed)
    ontologies (Gene Ontology)

  Microarray data sample
    CLID                    shx013:  shv060:  shq077:  shx009:  shx014:  shq082:  shq083:  shx008:
    PATIENT ID              49A34    45A9     52A28    4A34     61A31    99A6     46A15    41A31
    IMAGE:740ISG20 || int   -1.02    -2.34    1.44     0.57     -0.13    0.12     0.34     -0.51
    IMAGE:767TNFSF13 |      -0.52    -4.06    -0.29    0.71     1.03     -0.67    0.22     -0.09
    IMAGE:366LOC93343       -0.25    -4.08    0.06     0.13     0.08     0.06     -0.08    -0.05
    IMAGE:235ITGA4 || int   -1.375   -1.605   0.155    -0.015   0.035    -0.035   0.505    -0.865

Biological analysis objectives
  Clinical analysis
    detecting the causes of a pathology
    monitoring the effect of a therapy
    ⇒ diagnosis improvement and definition of new specific therapies
  Bio-discovery
    gene network discovery
    analysis of multifactorial genetic pathologies
  Pharmacogenesis
    lab design of new drugs for gene therapies
  How can data mining contribute?




Data mining contributions
  Pathology diagnosis
    classification
  Selecting genes involved in a specific pathology
    feature selection
    clustering
  Grouping genes with similar functional behavior
    clustering
  Multifactorial pathologies analysis
    association rules
  Detecting chemical components appropriate for specific therapies
    classification

Knowledge Discovery Process
  KDD = Knowledge Discovery from Data

  [Figure: KDD process]
    data → selection → selected data
    selected data → preprocessing → preprocessed data
    preprocessed data → transformation → transformed data
    transformed data → data mining → pattern
    pattern → interpretation → knowledge




Preprocessing
  selected data → preprocessing → preprocessed data
  data cleaning
    • reduces the effect of noise
    • identifies or removes outliers
    • solves inconsistencies
  data integration
    • reconciles data extracted from different sources
    • integrates metadata
    • identifies and solves data value conflicts
    • manages redundancy
  Real world data is “dirty”
  Without good quality data, no good quality pattern

Data mining origins
  Draws from
    statistics, artificial intelligence (AI)
    pattern recognition, machine learning
    database systems
  Traditional techniques are not appropriate because of
    significant data volume
    large data dimensionality
    heterogeneous and distributed nature of data

  [Figure: Data Mining at the intersection of Statistics, AI, Machine Learning, Pattern Recognition, and Database systems. From: P. Tan, M. Steinbach, V. Kumar, “Introduction to Data Mining”]

Analysis techniques
  Descriptive methods
    Extract interpretable models describing data
    Example: client segmentation
  Predictive methods
    Exploit some known variables to predict unknown or future values of (other) variables
    Example: “spam” email detection

Classification
  Objectives
    prediction of a class label
    definition of an interpretable model of a given phenomenon

  [Figure: training data → model; unclassified data → model → classified data]




Classification
  • Approaches
    – decision trees
    – Bayesian classification
    – classification rules
    – neural networks
    – k-nearest neighbours
    – SVM
  • Requirements
    – accuracy
    – interpretability
    – scalability
    – noise and outlier management

  [Figure: training data → model; unclassified data → model → classified data]
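
As an illustration of the training data → model → classified data flow, here is a minimal sketch of one of the listed approaches, k-nearest neighbours. The training points, class labels "A" and "B", the function name `knn_predict`, and the choice k=3 are all hypothetical and not taken from the slides; Euclidean distance is an assumed similarity measure.

```python
import math
from collections import Counter

def knn_predict(training_data, point, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Sort labelled examples by Euclidean distance to the query point.
    by_distance = sorted(training_data, key=lambda item: math.dist(item[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled training data: (features, class label).
training_data = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
    ((4.8, 5.1), "B"),
]

# Unclassified data goes in, classified data comes out.
unclassified = [(1.1, 0.9), (5.1, 5.0)]
classified = [(p, knn_predict(training_data, p)) for p in unclassified]
print(classified)
```

k-nearest neighbours keeps the training data itself as the "model", so it trades scalability for simplicity; the other listed approaches build a more compact model during training.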




Classification
  Applications
    detection of customer propensity to leave a company (churn or attrition)
    fraud detection
    classification of different pathology types
    …

  [Figure: training data → model; unclassified data → model → classified data]

Clustering
  Objectives
    detecting groups of similar data objects
    identifying exceptions and outliers

Clustering
  • Approaches
    – partitional (K-means)
    – hierarchical
    – density-based (DBSCAN)
    – SOM
  • Requirements
    – scalability
    – management of
      – noise and outliers
      – large dimensionality
    – interpretability

Clustering
  Applications
    customer segmentation
    clustering of documents containing similar information
    grouping genes with similar expression pattern
    …
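
The partitional approach (K-means) can be sketched as follows. This is a minimal version of the standard Lloyd iteration, assuming Euclidean distance and caller-supplied initial centroids; the sample points and the function name `kmeans` are hypothetical and not taken from the slides.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid-update steps until nothing changes."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                mean = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

The result depends on the initial centroids and on the number of clusters, both of which are inputs here; every point is assigned to some cluster, so this sketch does not address the noise and outlier requirement.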




Association rules
  Objective
    extraction of frequent correlations or patterns from a transactional database
  Applications
    market basket analysis
    cross-selling
    shop layout or catalogue design

  Tickets at a supermarket counter
    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diapers, Milk
    4    Beer, Bread, Diapers, Milk
    5    Coke, Diapers, Milk
    …    …

  Association rule
    diapers ⇒ beer
      2% of transactions contain both items
      30% of transactions containing diapers also contain beer
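
The two percentages attached to the rule diapers ⇒ beer are commonly called support and confidence; the slide itself does not name them, so the function names below are an assumption. This sketch computes both measures over only the five tickets listed in the table. Because the table is elided (…), the values obtained here differ from the 2% and 30% quoted on the slide, which presumably refer to the complete database.

```python
# The five supermarket tickets listed in the table (the elided rows are not included).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Coke", "Diapers", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(body, head, transactions):
    """Fraction of transactions containing body that also contain head.

    Assumes body occurs in at least one transaction.
    """
    body, head = set(body), set(head)
    return support(body | head, transactions) / support(body, transactions)

rule_support = support({"Diapers", "Beer"}, transactions)
rule_confidence = confidence({"Diapers"}, {"Beer"}, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

On these five tickets, two contain both items and three contain diapers, so the rule holds in two of the three diaper transactions.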




Other data mining techniques
  Sequence mining
    ordering criteria on analyzed data are taken into account
    example: motif detection in proteins
  Time series and geospatial data
    temporal and spatial information are considered
    example: sensor network data
  Regression
    prediction of a continuous value
    example: prediction of stock quotes
  Outlier detection
    example: intrusion detection in network traffic analysis

  [Figure: Sensor network]

Open issues
  Scalability to huge data volumes
  Data dimensionality
  Complex data structures, heterogeneous data formats
  Data quality
  Privacy preservation
  Streaming data
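
Regression, listed among the other data mining techniques as the prediction of a continuous value, can be illustrated with a straight-line fit. The slides do not name a specific method; ordinary least squares on one variable is an assumed choice here, and the data and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Assumes at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # predict the continuous value at x = 5
print(slope, intercept, predicted)
```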

				