INTRODUCTION TO DATA MINING

College of Science & Technology
Dept. of Computer Science & IT
BSc in Information Technology
Data Mining
Chapter 1: Introduction
2013 Prepared by: Mahmoud Rafeek Al-Farra
www.cst.ps/staff/mfarra
Lecturer
Mahmoud Rafeek Alfarra
Certificates:
MSc Computer Science, 2008, Pattern Recognition, AAST, Alexandria, Egypt.
BSc Computer Science, 2004, The Islamic University of Gaza, Palestine.
General Secondary School Certificate, 1999, Science division, Khan Younis, Gaza, Palestine.
Currently:
Head of the Computer Science & Information Technology Department.
Head of ITF3
Board member of PICTA
Past:
Head of the Computer Center at CST (9-2009 to 10-2010)
Head of ITF1 & ITF2
Part-time lecturer at QOU, UP, CST, and UCAS
Contacts:
E-mail: m.farra@cst.ps Site: http://www.cst.ps/staff/mfarra
YouTube channel: mralfarra1 Facebook page: mahmoudRfarra
Course’s Assignment
How to Be Successful?!
Prepare your lectures.
Re-study them.
Have a mood.
Choose your friends.
Try to understand, using any tool.
Ask Allah.
Course Outline
Introduction
Data Preparation and Preprocessing
Classification Methods
Evaluation
Clustering Methods
Mid Exam
Association Rules
Knowledge Representation
Special case study: Document clustering
Discussion of Case studies by students
Outline
Definition of Data Mining
Need for Data Mining
Data Mining Tasks/Challenges
Data Mining as an Interdisciplinary field
Process of Data Mining
Definition of Data Mining
What is KDD or "Knowledge discovery from
databases"?
"A non-trivial process of identifying valid, novel,
useful and ultimately understandable patterns in
data".
Data pyramid
From top to bottom:
Wisdom = Knowledge + experience
Knowledge = Information + rules
Information = Data + context
Data
Definition of Data Mining (Example)
Consider, for example, the following table, which
contains data about objects: shape, color, and
weight.
Row #   Shape   Color   Weight
1       Box     Red     100
2       Box     Red     200
3       Box     Red     300
4       Box     Blue    400
5       Cone    Blue    400
Pattern: Most boxes are red (rows 1 to 3).
We can represent the pattern as a rule:
If Shape = Box then Color = Red.
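How well the rule fits the five rows can be checked directly. The following is a minimal Python sketch; the function name rule_stats and the measures "support" and "confidence" are not on the slide and are introduced here only for illustration.

```python
# The five example rows from the table: (shape, color, weight).
rows = [
    ("Box", "Red", 100),
    ("Box", "Red", 200),
    ("Box", "Red", 300),
    ("Box", "Blue", 400),
    ("Cone", "Blue", 400),
]

def rule_stats(rows, shape, color):
    """Return (support, confidence) for: If Shape = shape then Color = color.

    support    = fraction of all rows that satisfy both parts of the rule
    confidence = fraction of rows with the given shape that have the color
    """
    matching_if = [r for r in rows if r[0] == shape]
    matching_both = [r for r in matching_if if r[1] == color]
    support = len(matching_both) / len(rows)
    confidence = len(matching_both) / len(matching_if) if matching_if else 0.0
    return support, confidence

support, confidence = rule_stats(rows, "Box", "Red")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Three of the four boxes are red, so the rule holds for most boxes but not for all of them, which matches the wording of the pattern.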
Data Mining and Business Intelligence
Layers of the pyramid, from top to bottom (the potential to support business decisions increases toward the top; typical role in parentheses):
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
Need for Data Mining (1)
Large quantities of data are being accumulated.
Data can be large in two senses:
In terms of size, e.g., image data,
or in terms of dimensionality, e.g., gene expression
data.
Need for Data Mining (2)
There is a huge gap between the stored data and the
knowledge that could be extracted from it.
Analysis methods are needed that can cope with large data sets.
New demands: data mining techniques are now
being applied to all kinds of domains.
Data Mining Tasks
Data mining tasks are the kinds of patterns
that can be mined.
Data mining functionalities are used to specify the
kinds of patterns to be found in data mining
tasks.
In general, data mining tasks can be classified into
two categories:
Descriptive mining tasks characterize the general
properties of the data.
Predictive mining tasks perform inference on the
current data in order to make predictions.
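The difference between the two categories can be illustrated on a small table. This is only a sketch: the records, the function names describe and predict_color, and the "most common color per shape" predictor are assumptions chosen for illustration, not methods prescribed by the course.

```python
from collections import Counter, defaultdict

# Hypothetical records: (shape, color, weight).
records = [
    ("Box", "Red", 100),
    ("Box", "Red", 200),
    ("Box", "Red", 300),
    ("Box", "Blue", 400),
    ("Cone", "Blue", 400),
]

def describe(records):
    """Descriptive task: summarize the data (mean weight per shape)."""
    weights = defaultdict(list)
    for shape, _color, weight in records:
        weights[shape].append(weight)
    return {shape: sum(w) / len(w) for shape, w in weights.items()}

def predict_color(records, shape):
    """Predictive task: guess the color of a new object from its shape,
    using the most common color already seen for that shape."""
    colors = Counter(color for s, color, _w in records if s == shape)
    return colors.most_common(1)[0][0] if colors else None

print(describe(records))              # summary of the existing data
print(predict_color(records, "Box"))  # guess for an unseen box
```

The first function only reports properties of the data that are already there; the second uses the same data to say something about an object that has not been seen yet.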
The best-known data mining tasks:
Classification [Predictive]
Prediction [Predictive]
Association Rules [Descriptive]
Clustering [Descriptive]
Outlier Analysis [Descriptive]
Data Mining challenges
Scalability: Scalable techniques are needed to
handle the massive scale of data.
Dimensionality: Many applications may involve a
large number of dimensions (e.g., features or
attributes of the data).
Heterogeneous and Complex Data: In recent
years, complicated data types such as graph-based,
free-text, and structured data have been introduced.
Techniques developed for data mining must be able
to handle the heterogeneity of the data.
Data Quality: Many data sets are imperfect due to
the presence of missing values and noise in the data. To
handle this imperfection, robust data mining
algorithms must be developed.
Data Distribution: As the volume of data
increases, it is no longer possible or safe to keep
all the data in the same place. As a result, the need
for distributed data mining techniques has
increased over the years.
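The idea of working on data that stays in different places can be sketched with a very simple statistic. In this hypothetical Python example, each site computes a small local summary and only that summary is sent to a coordinator; the site names, values, and function names are assumptions for illustration, and real distributed mining algorithms are considerably more involved.

```python
# Hypothetical data split across three sites; no site sees the others' rows.
site_a = [100, 200, 300]
site_b = [400, 400]
site_c = [250]

def local_summary(values):
    """Computed at each site: only (sum, count) leaves the site."""
    return sum(values), len(values)

def combine(summaries):
    """Computed at the coordinator: merge partial results into a global mean."""
    total = sum(s for s, _n in summaries)
    count = sum(n for _s, n in summaries)
    return total / count

summaries = [local_summary(site) for site in (site_a, site_b, site_c)]
global_mean = combine(summaries)
print(global_mean)
```

The global mean is the same as if all rows had been collected in one place, but the raw rows never had to be moved.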
Privacy Preservation: While privacy intends to
prevent the disclosure of information, data mining
attempts to reveal interesting knowledge about data.
As a result, there is growing interest in developing
privacy-preserving data mining algorithms.
Data Mining as an Interdisciplinary field
Diagram: Data Mining at the center, drawing on Database, Statistics, Machine Learning, Visualization, Artificial Intelligence, and Other Disciplines.
Statistics: Within statistics, data mining deals with
finding useful patterns in data sets.
Relational Databases: The database side of data
mining provides fast and reliable access to
data.
Databases are used for data operations (storing and retrieving data),
whereas data mining is used for decision making.
Artificial Intelligence: Knowledge acquisition,
maintenance, and application are branches of
Artificial Intelligence that are closely related to
databases and to data mining.
Machine Learning: Focuses on complex
representations and search methods for specialized
data-intensive problems.
Data mining uses methods from machine learning,
such as decision trees and neural networks.
Visualization: Used to gain visual insights into
the structure of the data.
Visualization is widely used as a pre-processing and
post-processing tool for data mining.
Knowledge Representation
Knowledge representation is the framework that converts
a large amount of data into a particular form or
procedure that a human being can understand for a
given purpose.
In knowledge representation, visualization tools and
knowledge representation techniques are used to
present the mined knowledge to the user.
Process of Data Mining
Data mining is a process rather than a plug-and-play
tool.
Diagram of the process, from source to result:
Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
1- Data cleaning:
Real-world data tends to be incomplete, noisy, and
inconsistent.
Incomplete: lacking attribute values or lacking certain
attributes of interest,
e.g., occupation=“ ” (missing data)
Noisy: containing noise, errors, or outliers,
e.g., Salary=“−10” (an error)
Inconsistent: containing discrepancies in codes or
names,
e.g., Age=“42” Birthday=“03/07/1997”
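The three kinds of problems can be detected with simple checks. The sketch below is illustrative only: the record layout, the field names, and the reference date are assumptions, and the birthday 03/07/1997 is read here as 3 July 1997 (the slide does not say which date format is meant).

```python
from datetime import date

# Hypothetical records; field names and values are illustrative.
records = [
    {"name": "A", "occupation": "", "salary": 1200, "age": 30, "birthday": date(1983, 5, 1)},
    {"name": "B", "occupation": "Teacher", "salary": -10, "age": 40, "birthday": date(1973, 2, 9)},
    {"name": "C", "occupation": "Nurse", "salary": 900, "age": 42, "birthday": date(1997, 7, 3)},
]
TODAY = date(2013, 9, 1)  # assumed reference date for computing ages

def age_on(birthday, today):
    """Age in whole years on the given day."""
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - before_birthday

def find_problems(record, today=TODAY):
    """Return a list of data quality problems found in one record."""
    problems = []
    if not record["occupation"].strip():
        problems.append("incomplete: occupation is missing")
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    if age_on(record["birthday"], today) != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems

for record in records:
    print(record["name"], find_problems(record))
```

Detecting a problem is only the first half of cleaning; what to do with a flagged record (fill in, correct, or drop it) depends on the application.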
2- Data Integration:
Data integration is the merging of data from multiple
sources. These sources may include multiple
databases, data cubes, or flat files.
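A minimal sketch of merging two sources is shown below. The two sources, their field names (cust_id in one, customer in the other), and the function integrate are hypothetical; they illustrate a typical difficulty, namely that the same entity is identified by differently named attributes in different sources.

```python
# Hypothetical sources: a database table and a flat file, both keyed by customer id.
db_rows = [
    {"cust_id": 1, "name": "Ali"},
    {"cust_id": 2, "name": "Sara"},
]
flat_file_rows = [
    {"customer": 1, "city": "Gaza"},
    {"customer": 3, "city": "Khan Younis"},
]

def integrate(db_rows, file_rows):
    """Merge both sources into one record per customer id.
    Fields that a source does not provide are left as None."""
    merged = {}
    for row in db_rows:
        merged[row["cust_id"]] = {"cust_id": row["cust_id"], "name": row["name"], "city": None}
    for row in file_rows:
        key = row["customer"]  # same entity, different attribute name
        record = merged.setdefault(key, {"cust_id": key, "name": None, "city": None})
        record["city"] = row["city"]
    return [merged[k] for k in sorted(merged)]

for record in integrate(db_rows, flat_file_rows):
    print(record)
```

Customers that appear in only one source end up with missing values, which is one reason integration and cleaning are usually done together.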
3- Data Selection:
Data relevant to the analysis task are retrieved
from the database. In the process, irrelevant, weakly
relevant, or redundant attributes may be detected
and removed.
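Two very simple selection checks are sketched below: a column that never varies is treated as irrelevant, and a column that exactly repeats an earlier one is treated as redundant. The table, the column names, and these two criteria are assumptions for illustration; detecting weakly relevant attributes normally requires a statistical measure, which this sketch does not attempt.

```python
# Hypothetical table stored as a dict of columns; all names are illustrative.
table = {
    "weight": [100, 200, 300, 400],
    "weight_copy": [100, 200, 300, 400],   # redundant: repeats "weight"
    "country": ["PS", "PS", "PS", "PS"],   # constant: carries no information
    "color": ["Red", "Red", "Red", "Blue"],
}

def select_attributes(table):
    """Keep columns that vary and are not exact copies of a column already kept."""
    selected = {}
    for name, values in table.items():
        if len(set(values)) <= 1:
            continue  # constant column: treated as irrelevant
        if any(values == kept for kept in selected.values()):
            continue  # exact duplicate of a kept column: treated as redundant
        selected[name] = values
    return selected

print(list(select_attributes(table)))
```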
4- Data Transformation:
Data are transformed or consolidated into
forms appropriate for mining by performing
summary or aggregation operations (for example,
daily sales may be aggregated to monthly or
annual sales) or generalization (for example, city
may be generalized to country, or age may be
generalized to young, middle-aged, senior).
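Both kinds of transformation can be sketched in a few lines. The sales figures are made up, and the age cut-offs of 30 and 60 are arbitrary assumptions; the slides do not define where "young" ends or "senior" begins.

```python
from collections import defaultdict

# Hypothetical daily sales: (ISO date string, amount).
daily_sales = [
    ("2013-01-05", 100),
    ("2013-01-20", 150),
    ("2013-02-03", 200),
]

def monthly_totals(daily_sales):
    """Aggregation: sum daily sales per month (key is 'YYYY-MM')."""
    totals = defaultdict(int)
    for day, amount in daily_sales:
        totals[day[:7]] += amount
    return dict(totals)

def age_group(age):
    """Generalization: map a numeric age to a category.
    The cut-off values 30 and 60 are arbitrary assumptions."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

print(monthly_totals(daily_sales))
print([age_group(a) for a in (22, 45, 71)])
```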
5- Data Mining:
An essential process where intelligent methods are
applied to data to convert it into knowledge for
decision making. A wide range of methods can be
used in data mining, such as neural networks, decision
trees, and association rules.
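As a small taste of what such a method does, the sketch below learns the simplest possible decision tree, a single test on one numeric attribute. The training data, the function names learn_stump and predict, and the choice of this particular method are assumptions for illustration; it is not the algorithm used in any specific tool.

```python
# Hypothetical training data: (weight, color).
training = [(100, "Red"), (200, "Red"), (300, "Red"), (400, "Blue"), (400, "Blue")]

def learn_stump(training):
    """Learn a one-level decision tree: 'if weight <= t then A else B'.
    Tries the midpoints between consecutive distinct weights and keeps
    the threshold and labels with the fewest training errors."""
    weights = sorted({w for w, _ in training})
    labels = sorted({c for _, c in training})
    best = None
    for lo, hi in zip(weights, weights[1:]):
        t = (lo + hi) / 2
        for below in labels:
            for above in labels:
                errors = sum((below if w <= t else above) != c for w, c in training)
                if best is None or errors < best[0]:
                    best = (errors, t, below, above)
    if best is None:
        raise ValueError("need at least two distinct weights to learn a threshold")
    return best[1:]  # (threshold, label_if_at_or_below, label_if_above)

def predict(stump, weight):
    """Apply the learned rule to a new object."""
    threshold, below, above = stump
    return below if weight <= threshold else above

stump = learn_stump(training)
print(stump)
print(predict(stump, 120), predict(stump, 500))
```

The learned rule is itself a pattern ("light objects are red, heavy objects are blue") that still has to pass the evaluation step before it counts as knowledge.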
6- Pattern evaluation:
Identify the truly interesting patterns based on some
interestingness measures. A pattern is considered
interesting if it is:
Valid
Novel
Actionable
Understandable
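Validity is the criterion that is easiest to check automatically. The sketch below filters candidate rules by two widely used objective measures, support and confidence. The candidate rules, their values, and the thresholds are assumptions for illustration; novelty, actionability, and understandability usually need human judgment and are not covered here.

```python
# Hypothetical candidate patterns with pre-computed measures.
candidates = [
    {"rule": "Shape = Box -> Color = Red", "support": 0.60, "confidence": 0.75},
    {"rule": "Shape = Cone -> Color = Blue", "support": 0.20, "confidence": 1.00},
    {"rule": "Color = Blue -> Shape = Box", "support": 0.20, "confidence": 0.50},
]

def interesting(candidates, min_support=0.3, min_confidence=0.7):
    """Keep only rules that meet both thresholds (the thresholds are arbitrary)."""
    return [
        c["rule"]
        for c in candidates
        if c["support"] >= min_support and c["confidence"] >= min_confidence
    ]

print(interesting(candidates))
```

The second rule is always true in the data but rests on a single row, so it is dropped for low support; the third is dropped because it is right only half of the time.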
Thanks