Emerging Data Mining Applications: Advantages and Threats
Narayanan Kulathuramaiyer and Hermann Maurer
Graz University of Technology nara@iicm.edu

TRIPLE-I 2007, Graz, September 5-7, 2007

Background
• What the textbook says:
– Data Mining (DM) is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

We take a broader view

Reference: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques

Life Cycle of Technology Adoption

Source: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques

Data Mining needs DATA!

The Power of Data
• Credit card companies: Fraud Detection
• Frequent Flyer Programmes: Targeted Marketing
• Walmart: Better Logistics Management
• Political Parties: Predicting Likelihood of Votes
• US Government: Counter Terrorism

Potential Applications of DM
– Heart Attack Early Prediction
– Sudden Death Syndrome
– Detect Genetic Patterns in Family Medical Data
– Avalanche Occurrence Predictions
– Radioactivity and Cancer
– Counter Terrorism

But do we understand (have a structured view of) the domain problem?

Data Mining Process

Knowledge Discovery Process (Fayyad et al. 1996)

Data → Target Data → Preprocessed Data → Transformed Data → Patterns

DM Phases
[Diagram: DM phases]
Phases: Domain Focusing, Model Construction, Decision Making
Steps shown: Data Cleaning, Data Selection, Data Preprocessing, Transaction Identification, Data Integration, Data Transformation, Transformation, Pattern Discovery, Mining, Interpretation-Eval, Pattern Analysis
Other labels: Traditional, Unstructured Domain

Abstract View based on Fayyad et al. 1996, Coole 1997, Jenssen 2002, Mobasher 2005

Data Mining Algorithms
• Extraction of classification patterns
• Clustering
• Associational pattern mining
• Mathematical data modelling
• Sequential pattern mining
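
To make one of the algorithm families listed above concrete, here is a minimal Python sketch of associational pattern mining that counts item pairs occurring together in enough transactions. The baskets, the `min_support` threshold and the function name `frequent_pairs` are assumptions made for illustration and do not come from the talk. It is a simple pair count, not a full Apriori implementation.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Deduplicate and sort so each pair is counted once per transaction
        # and always appears in the same (alphabetical) order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping baskets (illustrative data, not from the talk).
transactions = [
    ["bread", "milk"],
    ["bread", "nappies", "beer"],
    ["milk", "nappies", "beer"],
    ["bread", "milk", "nappies", "beer"],
]

pairs = frequent_pairs(transactions, min_support=3)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

With these made-up baskets only the pair (beer, nappies) reaches the threshold, which is the kind of co-occurrence pattern a retailer might act on.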

Emerging Data Mining Applications
Environmental Modeling (Protect against natural disasters)
– Early Warning Systems
– Identification of safe sites and of at-risk sites susceptible to natural disasters

Ideal instrument not found due to Lack of Understanding of Underlying Structural Patterns

Explore New Data Sources:
– Sensory inputs, including multimedia data and sensor network inputs
– After-event data from past events
– Weather parameters, terrestrial parameters and human-induced parameters such as vegetation and deforestation

Environmental Mining: Early Warning systems
• Domain Focusing
– Problem Detection
– Finding Deterministic Factors
– Hypothesize Relationships
[Diagram labels: P1, O1, O2, E1]

• Model Construction
– Validate relationships

• Decision Making
– Apply validated causal links in life situations (Beulens 2006)
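
The step from hypothesized to validated relationships can be sketched in Python as follows. This is only an illustration under stated assumptions: the factor names, the numbers and the 0.8 threshold are invented, and a plain Pearson correlation stands in for whatever validation method a real early warning system would use. A strong correlation alone does not establish a causal link.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def validate(hypotheses, observations, event, threshold=0.8):
    """Keep only hypothesised factors whose correlation with the event is strong."""
    kept = {}
    for name in hypotheses:
        r = pearson(observations[name], event)
        if abs(r) >= threshold:
            kept[name] = r
    return kept


# Hypothetical after-event records (made-up numbers, one entry per past event).
observations = {
    "rainfall_mm": [10, 40, 80, 120, 160],
    "vegetation_cover": [0.9, 0.7, 0.5, 0.3, 0.1],
    "day_of_week": [1, 5, 2, 7, 3],
}
slide_extent_m = [0, 5, 20, 45, 70]  # assumed severity measure of each event

validated = validate(observations.keys(), observations, slide_extent_m)
for name, r in validated.items():
    print(f"{name}: r = {r:+.2f}")
```

In this toy data the rainfall and vegetation factors pass the threshold while the day of the week does not, so only the first two would be carried forward to the decision making phase.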

Emerging Data Mining Applications
Medical Applications
– Medical mysteries: unresolved medical problems
– Better understanding of human system
– Reduce Deaths due to Medical Errors

Explore Non-Traditional Data Sources (non-clinical data):
– Retail data for determining drug purchase patterns
– Calls to emergency room
– Auxiliary Data such as Micro arrays in genomic databases
– Social Behavior
– Environmental Parameters
– Major Emotional States

[Diagram: Mining & Analysis Tools]
Labels: Social and Behavioural Patterns, Social Discourses, Metadata, Web Contents, Documents, Hyperlinks, Web Usage Log, Media Tags, Query Log, Community Space
Centre label: Layered personalised Maze of intrinsic Pattern discoveries

They know that...
• ...I like football, my favourite books, I don’t know plumbing, my dentist’s name, the sofa I use, the movies I watch
• ...my company, profile, projects completed, courses I teach, student appraisal, my talents, capacity
• ...my ideas, my published works, my patents, my worth
• ...my business deals, my expenditure on transactions, associates
• ...unfinished deals, intended outings, planned activities of the future
• ...intentions, life dreams, career plans
• ...what I may not know about myself and others

Intention Mining
Sample search queries:
• finance major vs. accounting major
• shoes from India
• what does vintage mean
• baby looney tunes baby shower party invitations
• how many web pages are on the internet
• feline heartworm remedies
• salaries for forensic accountants
• hamilton county chinese stores that sell chinese candy white rabbit
• how do you read the stock market or invest in stock
• on the basis of the settlement of unfortunate past and the outstanding issues of concern
• riverview towers apartment information in Pittsburg
Reference: http://www.aolsearchdatabase.com/

Heer, J., Boyd, D., 2005, Vizster: Visualising Online Social Networks, IEEE Symposium on Information Visualisation, Website: http://www.danah.org/papers/InfoViz2005.pdf

Tancer, B., 2006, July Unemployment Numbers (U.S.) - Calling All Economists, Website: http://weblogs.hitwise.com/bill-tancer/2006/08/july_unemployment_numbers_us_c.html

[Diagram: Connecting the Dots, shown for an unbounded domain (e.g. search log) and a bounded domain (e-mail content)]
Steps shown: Data Cleaning, Transaction Aggregation, Data Integration, Data Transformation, Pattern Discovery, Pattern Analysis
Other labels: Search Pattern Modelling, User Behaviour Modelling, Profile Mining, Profile Targeting, Profile Consolidation, Contentious Discoveries, User Profile, Interest Aggregation, Thread Aggregation, Transaction Cluster-Profile, Data Consolidation, Transformation

Data Mining Advantages
• Transparent Helper for Well-informed Decision Making
• Knowledge-based Healthcare, Welfare, Security
• We need not be good at everything
• Provide answers to formerly unresolved problems

Disadvantages of DM
• Privacy
• Subject to Non-planned Usage
• Over-generalization
– Generalized profiles
– False Positives
• Predictive Ability beyond our imagination
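
The false positives concern listed above can be illustrated with a short worked calculation in Python. All figures are assumptions chosen for illustration (one million profiles, 100 genuine cases, a model with 99% sensitivity and a 1% false positive rate); none of them come from the talk.

```python
def flagged_breakdown(population, true_cases, sensitivity, false_positive_rate):
    """Return (true positives, false positives, precision) for a screening rule."""
    true_positives = true_cases * sensitivity
    false_positives = (population - true_cases) * false_positive_rate
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision


# Assumed, illustrative figures: 1,000,000 profiles, 100 genuine cases,
# a model that catches 99% of them and wrongly flags 1% of everyone else.
tp, fp, precision = flagged_breakdown(1_000_000, 100, 0.99, 0.01)
print(f"true positives:  {tp:.0f}")
print(f"false positives: {fp:.0f}")
print(f"precision:       {precision:.1%}")
```

Under these assumptions roughly 99 genuine cases are flagged alongside about 9,999 innocent profiles, so only about 1% of the flagged profiles are genuine. When the pattern being mined for is rare, even an accurate model produces mostly false alarms.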

Mining Power of Web Search

Business Week Online Cover Story April 7, 2007

Search engines know so much about companies and economic developments through data mining that master miners can buy stock that they KNOW will go up. Stock trading only works because of the uncertainty involved. Google has no such uncertainty. Our economy is seriously threatened by this!

Reference: N. Kulathuramaiyer, H. Maurer Market Forces vs. Public Control of Basic Vital Services http://www.iicm.edu/market_vs_public.doc

Free Services: Maps, Mashups; Email; Financial History, Query Log; Community space; Search log; Check-in history; Base, Adverts; Web structure; Calendar; Community Software

Base Mining Results: Context Profile; Social Links & context; Trends, Financial Profile; Community Profile/Interest; Personal/Global Intent/Behaviour; User Profile; Company/Product Profile; Popular site; Events, Plans; Community Profile, Interest, Alliances

Connected Dots: Local Business Trends; Personal Customer Model; Social Networks; Event Prediction; Current trends; What’s the buzz?; Likely purchase; Intention of user; Favourite Brands; Incomplete Purchase; Trade Insights; Upcoming businesses; Connecting people, events, context; Detailed Profiles

What Can We Do: Address Privacy
• Protecting Anonymity Using:
– Anonymity Agents
– Pseudonym Agents
– Negotiation and Trust Agents [Kovatcheva, 2002]
– Intelligent intermediary agents
– Rules for permitted access to and use of data
– Data items have metadata on usage [Taipale, 2003]

Challenges in Standardization and Checking for Compliance in Back-end systems
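
Two of the ideas above, pseudonym agents and data items that carry metadata on usage, can be sketched in Python as follows. This is a minimal illustration under assumptions, not the designs described by Kovatcheva or Taipale. The names `pseudonymise`, `DataItem` and `access`, the key and the purposes are all hypothetical. The pseudonym is a keyed hash (HMAC-SHA256) from the Python standard library.

```python
import hashlib
import hmac
from dataclasses import dataclass


def pseudonymise(user_id: str, secret_key: bytes) -> str:
    """Replace a real identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass(frozen=True)
class DataItem:
    """A data value that carries its own usage metadata."""
    owner_pseudonym: str
    value: str
    permitted_purposes: frozenset


def access(item: DataItem, purpose: str) -> str:
    """Release the value only for a purpose the metadata allows."""
    if purpose not in item.permitted_purposes:
        raise PermissionError(f"purpose '{purpose}' is not permitted for this item")
    return item.value


# Hypothetical key and record (illustrative values only).
key = b"held-by-the-pseudonym-agent"
item = DataItem(
    owner_pseudonym=pseudonymise("user@example.org", key),
    value="feline heartworm remedies",
    permitted_purposes=frozenset({"search-ranking"}),
)

print(access(item, "search-ranking"))
try:
    access(item, "ad-targeting")
except PermissionError as err:
    print("refused:", err)
```

The same identifier always maps to the same pseudonym, so records can still be linked for mining, while only the holder of the key can recompute the mapping. The check in `access` is where the compliance challenge noted above arises: it only helps if every back-end system actually enforces it.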

Data Mining can be a big danger.
Action is necessary to combat this kind of phenomenon: large-scale data mining (e.g. by search engines) is an issue that cannot be left to free market forces without some regulation. It belongs to the category of activities where the public has to intervene, as it does in schooling, building roads, approving drugs, etc.
See: N. Kulathuramaiyer, H. Maurer Market Forces vs. Public Control of Basic Vital Services http://www.iicm.edu/market_vs_public.doc

Alternatives to Global Centralized Control
Organisations that conduct data mining on a very large scale (whatever that means) have to be checked using anti-trust measures.
Large data mining capabilities should be run by non-profit organisations whose objectivity can be checked by public agencies.
Distributed data mining facilities can be established to avoid control of data by a single agency.
See Kulathuramaiyer, N., Maurer, H., "Why is Fighting Plagiarism and IPR Violation Suddenly of Paramount Importance?" www.iicm.tu-graz.ac.at/iicm_papers/why_is_fighting_plagiarism_of_importance.doc

Reflections
• Data Miners need More Data
• It is easier to collect Data than to Analyse it
• Data is conferring Power on a Few
• Data is a Commodity
• Unchecked Mining Can be Dangerous

Conclusion
• Data Mining Applications are going to affect our lives in a bigger way
• We need to become aware of the potentials and threats
• We can do something about it
– Medical data collection requires permission by an ethics commission for specific purposes
– Anti-trust laws have been applied to stop unfair monopolies before
– There are alternatives to global search engines


				