# Parallel and Distributed Computing for Data Mining

```
Parallel and Distributed Computing for Data Mining

Mohammad Alshamimi
April 28, 2011
Outline
► What is data mining
► Techniques of Data Mining
► Distributed Data Mining
► Example: Credit Card Fraud Detection
► Knowledge Grid
What is data mining?
► Data mining is the discovery of valuable information from large data volumes, using computationally efficient techniques.
► Some call data mining “knowledge discovery”.
Techniques in Data Mining
► Clustering
► Association rules
► Classification
► Sequential pattern
► Outlier detection
► Decision Trees
Techniques in Data Mining: Clustering
► Clustering is the process of partitioning a given set of data points into distinct groups (clusters) based on similarity. Similar data points end up in the same cluster.
 E.g. group companies with similar stock behavior or similar growth, or identify genes and proteins that have similar functions
Techniques in Data Mining: Association
► Association rule mining finds all rules that correlate the presence of one set of items with that of another set of items.
 E.g. identify the items that sell together in a supermarket by mining the sales transactions
Techniques in Data Mining: Classification
► Classification assigns objects to predefined categories or classes
 E.g. in credit evaluation, the class can be good or bad
Techniques in Data Mining: Sequential pattern discovery
► Sequential pattern discovery determines strong sequential dependencies among different events
 E.g. medical diagnosis, or sales transaction analysis to determine which customers are likely to buy a specific product in the future
Techniques in Data Mining: Detection of outliers
► Outlier detection finds data points that differ significantly from the majority of the data points in a given data set
 E.g. medical diagnosis and credit card fraud detection
Techniques in Data Mining: Decision Trees
► A decision tree is a classifier in which a tree is built from decision points that effectively split the remaining data
► Example: the decision to play golf
► Attributes:
 Outlook
 Windy
 Humidity
Distributed Data Mining (DDM)
► Goal:
 To promote awareness of the benefits of using parallel and distributed computing platforms to solve problems in data mining applications
 To discover hidden patterns in a complete data set that is partitioned and physically distributed
► Better than centralized DM in:
 Privacy
 Data Transmission bandwidth
Distributed Data Mining (DDM)
► DDM can be implemented as a service at each database plus a global service controller
► A global brokering service coordinates a group of expert services
 Each service performs local analysis on a particular data partition
► The global service performs further analysis on the local results to integrate them into a global result
Distributed Data Mining (DDM)
► Challenges:
 Heterogeneity:
►Each node or data source may not have the same system or the same hardware
 Autonomy
►Each node has its own control over data and
management
Credit Card Fraud Detection
► Credit cards have become the standard way to pay on the web and in e-commerce
► The risk of fraudulent transactions is very high
► Data mining is used to detect fraudulent transactions
Credit Card Fraud Detection
Challenges facing DM in this problem:
► Billions of credit card transactions processed daily
(Massive amount of data)
► Data highly skewed
 Many more transactions are legitimate than fraudulent
► Each transaction record has a different amount
 Variable potential loss, not fixed misclassification cost
Credit Card Fraud Detection
► Approach:
 Prepare a set of labeled test data
 Divide the large set of known labeled data into smaller subsets
 Distribute the subsets to different processors
 Each processor creates a local classifier
 Integrate the local classifiers into a global classifier using meta-learning
► Use the global classifier to classify new transactions
► The process is repeated periodically
► The classifier can be a decision tree, a neural network, or another type of classifier
Credit Card Fraud Detection
► This approach may result in a large number of local classifiers
► Pruning techniques are used to remove redundant classifiers
Distributed Data Mining (DDM) in the Grid
► Such an environment can be thought of as a Data Grid
► Grid:
 A geographically distributed platform of heterogeneous machines accessible through a single interface
► Data Grid:
 A grid designed to allow large data sets to be stored and moved easily
 Handles data sets without constant or repeated authentication
 Supports distributed data-intensive applications
DM in the Grid
► Motivation: to have a data grid that provides:
 High performance
 Security
 A robust data transfer mechanism
 Together with:
►A set of tools for creating and manipulating replicas of large data sets
►A mechanism for maintaining a catalog of data set replicas
► The Knowledge Grid was introduced to meet these infrastructure requirements
Knowledge Grid
► Knowledge Grid: defined on top of grid toolkits and services
► The Knowledge Grid can
 Be used to perform data mining on very large data sets available over grids
 Make scientific discoveries
 Improve industrial processes
 Uncover valuable business information
Knowledge Grid
► There are two hierarchic levels in the Knowledge Grid:
 Core K-Grid layer
 High-level K-Grid layer
Knowledge Grid
Knowledge Grid: Core K Grid
► The Core K-Grid layer offers basic services for the definition, composition and execution of distributed knowledge discovery
► Main services:
 Knowledge Directory Service
►Manages metadata and knowledge tools
 Resource Allocation & Execution Management
►Finds the best mapping between the execution plan and the available resources to meet application requirements
Knowledge Grid: High Level K Grid

► The High-level K-Grid layer includes services to:
 Compose
 Validate
 Execute
parallel and distributed knowledge discovery computations
► It also provides services to store and analyze discovered knowledge
Knowledge Grid: High Level K Grid
► Main services:
 Data Access Service (DAS)
►Search, selection, extraction, transformation and delivery of the data to be mined
 Tools and Algorithms Access Service (TAAS)
►Search, selection and downloading of data mining tools
 Execution Plan Management Service (EPMS)
►Semi-automatic tool that takes data and programs and generates different execution plans
 Results Presentation Service (RPS)
►Generate, present, visualize and store knowledge models
Conclusion
► There is a strong need to transform data into knowledge
► Data mining is about discovering valuable information
► Two main components needed by data mining:
 Data
 Efficient algorithms
► Parallel computing can lead to very efficient algorithms when used in data mining
Conclusion
► Data mining on the grid, or the Knowledge Grid, is a very efficient solution that distributes the data mining process
► It also preserves the privacy of data, since each data source is responsible for the computation on its own data
```
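
The credit card fraud approach in the slides (split the labeled data, build one local classifier per subset, integrate them into a global classifier) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy `records`, the single `amount` feature, the one-threshold `train_stump` learner, and the helper names are hypothetical and not from the slides. Simple majority voting stands in for the meta-learning step, and the local classifiers are trained sequentially here, whereas the slides distribute them across processors.

```python
from typing import Callable, List, Sequence, Tuple

# (transaction amount, label); label 1 = fraud, 0 = legitimate. Toy data, made up.
Record = Tuple[float, int]
Classifier = Callable[[float], int]


def partition(records: Sequence[Record], n_parts: int) -> List[List[Record]]:
    """Split the labeled data round-robin into n_parts subsets."""
    return [list(records[i::n_parts]) for i in range(n_parts)]


def train_stump(subset: Sequence[Record]) -> Classifier:
    """Local classifier: predict fraud when amount >= the best threshold.

    The threshold is the amount in the (non-empty) subset that gives the
    fewest misclassifications on that subset.
    """
    candidates = sorted({amount for amount, _ in subset})

    def errors(threshold: float) -> int:
        return sum((amount >= threshold) != bool(label) for amount, label in subset)

    best = min(candidates, key=errors)
    return lambda amount: int(amount >= best)


def combine_by_vote(classifiers: Sequence[Classifier]) -> Classifier:
    """Global classifier: fraud only if a strict majority of local classifiers agree.

    Majority voting is a simplification used here in place of meta-learning.
    """
    def global_classifier(amount: float) -> int:
        fraud_votes = sum(clf(amount) for clf in classifiers)
        return int(2 * fraud_votes > len(classifiers))

    return global_classifier


records: List[Record] = [
    (12.0, 0), (25.0, 0), (40.0, 0), (900.0, 1),
    (15.0, 0), (30.0, 0), (55.0, 0), (1200.0, 1),
    (20.0, 0), (35.0, 0), (60.0, 0), (1500.0, 1),
]

subsets = partition(records, 3)                      # one subset per "processor"
local_classifiers = [train_stump(s) for s in subsets]
global_clf = combine_by_vote(local_classifiers)
```

Calling `global_clf(amount)` then classifies a new transaction. A real system would use richer features, cost-sensitive learners to reflect the variable loss per transaction, and would retrain periodically as the slides describe.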