INTRODUCTION TO DATA MINING

Shared by: MahmoudAlfarra
posted:
2/24/2013
							  College of Science & Technology
  Dep. Of Computer Science & IT
  BCs of Information Technology




  Data Mining
                          Chapter 1: Introduction




2013            Prepared by: Mahmoud Rafeek Al-Farra
                                                    www.cst.ps/staff/mfarra
    Lecturer
2

       Mahmoud Rafeek Alfarra
       Certificates:
           MSc Computer Science,2008, Pattern Recognition, AAST, Alexandria, Egypt.
           BSc Computer Science,2004, The Islamic University of Gaza, Palestine.
           General Secondary School Certificate,1999, Science division, Khan Younis, Gaza, Palestine.
       Currently :
           Head Of computer science & information technology department.
           Head of ITF3
           Board member of PICTA
       Past:
           Head Of Computer Center in CST (9-2009 To 10-2010)
           Head of ITF1 & ITF2
           Lecturer in QOU, UP, CST and UCAS as Part Time
       Contacts:
           E-mail: m.farra@cst.ps Site: http://www.cst.ps/staff/mfarra
           YouTube channel: mralfarra1 FaceBook Page: mahmoudRfarra
    Course’s Assignment
3
    How to be successful?!
4


       Prepare your lectures.
       Re-study them.
       Have a mood.
       Choose your friends.
       Try to understand using any tool.
       Ask Allah .
    Course Outline
5


       Introduction
       Data Preparation and Preprocessing
       Classification Methods
       Evaluation
       Clustering Methods
       Mid Exam
       Association Rules
       Knowledge Representation
       Special case study: Document clustering
       Discussion of Case studies by students
    Outline
6



       Definition of Data Mining

       Need for Data Mining

       Data Mining Tasks/Challenges

       Data Mining as an Interdisciplinary field

       Process of Data Mining
    Definition of Data Mining
7



       What is KDD or "Knowledge discovery from
        databases"?

       "A non-trivial process of identifying valid, novel,
        useful and ultimately understandable patterns in
        data".
    Data pyramid
8




         Wisdom = Knowledge + experience
         Knowledge = Information + rules
         Information = Data + context
         Data
    Definition of Data Mining (Example)
9


       Consider, for example, the following table, which
        contains data about objects: shape, color, and
        weight.

            Row #   Shape   Color   Weight
            1 ->    Box     Red     100
            2 ->    Box     Red     200
            3 ->    Box     Red     300
            4       Box     Blue    400
            5       Cone    Blue    400

       Pattern
           Most boxes are red.
       We can represent the pattern as a rule:
           If Shape = Box then Color = Red.
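
The word "most" in the pattern can be made measurable. The sketch below is an illustration only: it uses support and confidence, two standard rule measures that this slide does not name, and the function name rule_stats is hypothetical.

```python
# Rows from the example table: (shape, color, weight)
rows = [
    ("Box", "Red", 100),
    ("Box", "Red", 200),
    ("Box", "Red", 300),
    ("Box", "Blue", 400),
    ("Cone", "Blue", 400),
]


def rule_stats(rows, shape, color):
    """Return (support, confidence) for: If Shape = shape then Color = color."""
    # Rows that satisfy the "if" part of the rule.
    matches_if = [r for r in rows if r[0] == shape]
    # Rows that satisfy both the "if" part and the "then" part.
    matches_both = [r for r in matches_if if r[1] == color]
    support = len(matches_both) / len(rows)
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence


support, confidence = rule_stats(rows, "Box", "Red")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Three of the four boxes are red, so the rule holds for most boxes but not all of them.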
     Data Mining and Business Intelligence
10

     Layers, from top (increasing potential to support business decisions) to bottom:
        Decision Making
        Data Presentation: Visualization Techniques
        Data Mining: Information Discovery
        Data Exploration: Statistical Summary, Querying, and Reporting
        Data Preprocessing/Integration, Data Warehouses
        Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
     Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
     Need for Data Mining(1)
11



        Large quantities of data are being accumulated.

        Data can be large in two senses:
            in terms of size, e.g. image data
            in terms of dimensionality, e.g. gene expression
             data
     Need for Data Mining(2)
12



        There is a huge gap between the stored data and the
         knowledge that could be extracted from it.

        Data analysis techniques are needed for large data sets.

        New demands: data mining techniques are now
         being applied to all kinds of domains.
     Data Mining Tasks
13



        Data mining tasks are the kinds of data patterns
         that can be mined.

        Data Mining functionalities are used to specify the
         kind of patterns to be found in the data mining
         tasks.
     Data Mining Tasks
14



        In general, data mining tasks can be classified into
         two categories:
          Descriptive mining tasks characterize the general
           properties of the data.

          Predictive mining tasks perform inferences on the
           current data in order to make predictions.
     Data Mining Tasks
15



        The best-known data mining tasks:
          Classification [Predictive]

          Prediction [Predictive]

          Association Rules [Descriptive]

          Clustering [Descriptive]

          Outlier Analysis [Descriptive]
     Data Mining challenges
16



        Scalability: Scalable techniques are needed to
         handle the massive scale of data.

        Dimensionality: Many applications may involve a
         large number of dimensions (e.g. features or
         attributes of data).
     Data Mining challenges
17



        Heterogeneous and Complex Data: In recent
         years, complicated data types such as graph-based,
         free-text and structured data types have been introduced.
         Techniques developed for data mining must be able
         to handle the heterogeneity of the data.
     Data Mining challenges
18



        Data Quality: Many data sets are imperfect due to
         the presence of missing values and noise in the data. To
         handle the imperfection, robust data mining
         algorithms must be developed.
     Data Mining challenges
19



        Data Distribution: As the volume of data
         increases, it is no longer possible or safe to keep
         all the data in the same place. As a result, the need
         for distributed data mining techniques has
         increased over the years.
     Data Mining challenges
20



        Privacy Preservation: While privacy aims to
         prevent the disclosure of information, data mining
         attempts to reveal interesting knowledge about data.
         As a result, there is growing interest in developing
         privacy-preserving data mining algorithms.
     Data Mining as an Interdisciplinary field
21




        Data Mining (center) draws on:
           Database
           Statistics
           Machine Learning
           Visualization
           Artificial Intelligence
           Other Disciplines
     Data Mining as an Interdisciplinary field
22


        Statistics: Data mining in statistics deals with
         finding useful patterns in data sets.

        Relational Databases: The database part of data
         mining, which provides fast and reliable access to
         data.
          Databases are used for data operations (storing and
           retrieving data); data mining is used for decision making.
     Data Mining as an Interdisciplinary field
23



        Artificial Intelligence: Knowledge acquisition,
         maintenance, and application are branches of
         Artificial Intelligence that are closely related to
         databases and to data mining.
     Data Mining as an Interdisciplinary field
24



        Machine Learning: focuses on complex
         representations and search methods for specialized
         data-intensive problems.
          Data mining uses methods from machine learning
           such as decision trees and neural nets.
     Data Mining as an Interdisciplinary field
25



        Visualization: is used to gain visual insights into
         the structure of the data.
          Visualization is abundantly used as a pre- and
           post-processing tool for data mining.
     Data Mining as an Interdisciplinary field
26


        Knowledge Representation
            Knowledge representation is the framework that converts
             a large amount of data into a particular form or
             procedure that human beings can understand, based on an
             intention.

            In knowledge representation, visualization tools and
             knowledge representation techniques are used to
             present the mined knowledge to the user.
     Process of Data Mining
27


        Data mining is a process rather than a plug-and-play
         tool.
       Process of Data Mining
28


          Stages, from data sources to results:
             Databases
             Data Cleaning and Data Integration
             Data Warehouse
             Selection
             Task-relevant Data
             Data Mining
             Pattern Evaluation
     Process of Data Mining
29


     1- Data cleaning:
     Real-world data tends to be incomplete, noisy, and
       inconsistent.
        incomplete: lacking attribute values or lacking certain
         attributes of interest
          e.g., occupation=“ ” (missing data)
        noisy: containing noise, errors, or outliers
          e.g., Salary=“−10” (an error)
        inconsistent: containing discrepancies in codes or
         names
          e.g., Age=“42”, Birthday=“03/07/1997”
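
The three kinds of problems can be checked mechanically. The following is a minimal sketch, not a complete cleaning procedure. The sample records, the field names, the reference date, and the reading of 03/07/1997 as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical records, one for each kind of problem described above.
records = [
    {"occupation": "", "salary": 1200, "age": 30, "birthday": date(1983, 1, 15)},
    {"occupation": "clerk", "salary": -10, "age": 25, "birthday": date(1988, 2, 1)},
    {"occupation": "nurse", "salary": 900, "age": 42, "birthday": date(1997, 7, 3)},
]

REFERENCE_DATE = date(2013, 6, 1)  # assumed "today" for the age check


def age_on(birthday, today):
    """Age in whole years on the given day."""
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - before_birthday


def find_problems(record):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if not record["occupation"].strip():
        problems.append("incomplete: occupation is missing")
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    # Allow one year of slack, since a stored age may be slightly out of date.
    if abs(record["age"] - age_on(record["birthday"], REFERENCE_DATE)) > 1:
        problems.append("inconsistent: age does not match birthday")
    return problems


report = [find_problems(r) for r in records]
for record, problems in zip(records, report):
    print(record["occupation"] or "(missing)", problems)
```

Detection is only the first half of cleaning. Whether a flagged value is then corrected, filled in, or dropped depends on the analysis task.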
     Process of Data Mining
30


     2- Data Integration:
     Data integration is the merging of data from multiple
       sources. These sources may include multiple
       databases, data cubes, or flat files.
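
As a rough illustration of merging, the sketch below combines rows from two hypothetical sources that share a key. The source contents, the key name customer_id, and the function integrate are assumptions. Real integration also has to resolve conflicting names and values, which this sketch does not attempt.

```python
# Two hypothetical sources describing overlapping sets of customers.
database_rows = [
    {"customer_id": 1, "name": "Ali"},
    {"customer_id": 2, "name": "Mona"},
]
flat_file_rows = [
    {"customer_id": 2, "city": "Gaza"},
    {"customer_id": 3, "city": "Khan Younis"},
]


def integrate(*sources, key="customer_id"):
    """Merge rows from several sources into one record per key value."""
    merged = {}
    for source in sources:
        for row in source:
            # Later sources add attributes to (or overwrite) earlier ones.
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


integrated = integrate(database_rows, flat_file_rows)
for row in integrated:
    print(row)
```

Customer 2 appears in both sources and ends up with both a name and a city. Customers 1 and 3 each keep only the attributes their single source provides, which is one way integration can introduce missing values.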
     Process of Data Mining
31


     3- Data Selection:
     Data relevant to the analysis task are retrieved
       from the database. Irrelevant, weakly
       relevant, or redundant attributes may be detected
       and removed.
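
One simple way to detect such attributes is sketched below, under stated assumptions: an attribute with a single constant value is treated as irrelevant, and an attribute that exactly duplicates an earlier one is treated as redundant. These two heuristics, the sample table, and the function name select_attributes are illustrative choices, not a method given in this chapter.

```python
# Hypothetical table: the shape/color/weight example plus two extra attributes.
table = [
    {"shape": "Box", "color": "Red", "weight": 100, "weight_copy": 100, "warehouse": "A"},
    {"shape": "Box", "color": "Red", "weight": 200, "weight_copy": 200, "warehouse": "A"},
    {"shape": "Box", "color": "Red", "weight": 300, "weight_copy": 300, "warehouse": "A"},
    {"shape": "Box", "color": "Blue", "weight": 400, "weight_copy": 400, "warehouse": "A"},
    {"shape": "Cone", "color": "Blue", "weight": 400, "weight_copy": 400, "warehouse": "A"},
]


def select_attributes(table):
    """Return the attribute names kept after dropping constant and duplicate ones."""
    names = list(table[0])
    columns = {name: tuple(row[name] for row in table) for name in names}
    kept = []
    for name in names:
        values = columns[name]
        if len(set(values)) == 1:
            continue  # constant attribute: cannot distinguish any rows
        if any(columns[other] == values for other in kept):
            continue  # exact copy of an attribute that was already kept
        kept.append(name)
    return kept


selected = select_attributes(table)
print(selected)
```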
     Process of Data Mining
32


     4- Data Transformation
     Data are transformed or consolidated into
       forms appropriate for mining by performing
       summary or aggregation operations (for example,
       daily sales may be aggregated to monthly sales or
       annual sales) or generalization (for example, city
       may be generalized to country, or age may be
       generalized to young, middle-aged, senior).
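
Both operations can be sketched in a few lines. The sales figures and the age cut-offs (30 and 60) are assumptions chosen for illustration. The chapter names the age categories but does not define their boundaries.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales: (day, amount)
daily_sales = [
    (date(2013, 1, 5), 100.0),
    (date(2013, 1, 20), 50.0),
    (date(2013, 2, 3), 70.0),
]


def monthly_totals(sales):
    """Aggregate daily sales into totals per (year, month)."""
    totals = defaultdict(float)
    for day, amount in sales:
        totals[(day.year, day.month)] += amount
    return dict(totals)


def generalize_age(age):
    """Map a numeric age to a category; the cut-offs are assumptions."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


monthly = monthly_totals(daily_sales)
age_groups = [generalize_age(a) for a in (22, 45, 70)]
print(monthly)
print(age_groups)
```

Aggregation reduces the number of rows, while generalization reduces the number of distinct values an attribute can take. Both make patterns easier to find at the cost of detail.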
     Process of Data Mining
33


     5- Data Mining:
     An essential process where intelligent methods are
       applied to data to convert it to knowledge for
       decision making. A wide range of methods can be
       used in data mining, such as neural nets, decision trees,
       and association rules.
     Process of Data Mining
34



     6- Pattern evaluation:
     To identify the truly interesting patterns based on some
      interestingness measures. A pattern is considered
      interesting if it is:
          Valid

          Novel

          Actionable

          Understandable
     Thanks
35

						