Introduction to Data Mining

Predictive Tax Compliance
Presentation to the IRS

SPSS / SRA

Benjamin Chard, Senior Solution Engineer, bchard@spss.com
Sarah Mattingly, IRS Account Executive, smattingly@spss.com
Ted Fischer, Project Manager, ted_fischer@sra.com or theodore.i.fischer@irs.gov, 301-731-3534

Agenda
• Introduction to Data Mining
• Predictive Tax Compliance
• Using Clementine for Audit Selection
• What’s New in Clementine Version 11.1
• IRS Refund Fraud Detection Project Case Study

Where Does Data Mining Fit?
Existing Data
• Historical Claims
• Current Claims
Data Mining Workbench
• Build Models
Operational Setting
• Reporting
• Case Mgt
• Claim Scoring

'Data Mining' vs. 'Query/Reporting'
Reporting (Tables, Graphics, OLAP)
• Provides you with a very good view of what is happening, but within a limited view of the data and only in models defined by the user.

[Chart: incident count by year (1999, 2000, 2001) by offense type: A&B, Assault, B&E, carjacking, Larceny, Murder, MV, Rape, Robbery, other]

Incident Count - by day and shift
Shift   Sunday  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday
00-04       48      25       21         29        38      33        45
04-08       15      39       27         38        50      40        21
08-12       43     101      106        101       105      88        52
12-16       62     131      179        177       168     147        82
16-20       73     199      191        177       197     209       116
20-24       68     100      102        103       107     107       112

'Statistics' vs. 'Data Mining'
Statistics: Hypothesis Testing

What is Data Mining?
Three classes of data mining algorithms:
• Associate ("Patterns"): What events occur together? Given a series of actions, what action is likely to occur next?
• Cluster ("Differences"): Group cases that exhibit similar characteristics.
• Predict ("Relationships"): Predict who is likely to exhibit specific behavior in the future.
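As a rough illustration of the "Associate" question above ("What events occur together?"), the sketch below counts how often pairs of events co-occur across cases and estimates a simple rule confidence. It is a minimal sketch, not the method used in Clementine or in the IRS project: the event names, the `cases` data, and the helper functions `pair_support` and `confidence` are all hypothetical and exist only for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set lists events observed together on one case.
cases = [
    {"late_filing", "amended_return", "large_refund"},
    {"late_filing", "large_refund"},
    {"amended_return"},
    {"late_filing", "amended_return", "large_refund"},
    {"large_refund"},
]


def pair_support(transactions):
    """Return the fraction of transactions in which each pair of events co-occurs."""
    counts = Counter()
    for events in transactions:
        # Sort so that each pair has one canonical (alphabetical) ordering.
        for pair in combinations(sorted(events), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the observed transactions."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)


support = pair_support(cases)
top_pair, top_support = max(support.items(), key=lambda kv: kv[1])
rule_confidence = confidence(cases, "late_filing", "large_refund")

print(top_pair, top_support)  # the most frequent co-occurring pair and its support
print(rule_confidence)        # share of late_filing cases that also show large_refund
```

Support and confidence are the usual starting measures for association rules; a production workbench would add pruning thresholds and handle far larger item sets than this toy example.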
Predictive Tax Compliance
Register, Assess, Collect
Non-Filer Discovery
• Soft-Matching
• Prioritization Models
Audit Selection
• Audit Models
Tax Collection
• Risk Models
DATA MINING & PREDICTIVE ANALYTICS TOOLS
DATA WAREHOUSE
Right work to the right resources at the right time

Predictive Modeling
• Build a predictive profile of claims that, after investigation, were flagged as improper payments regardless of amount.
• Select positive investigations.
• Maximize those claims with the highest dollar adjustment found per audit hour.
• Minimize the number of no-change audits.

[Decision tree: Credit ranking (1=default)]
Credit ranking (1=default): Bad 52.01% (n=168), Good 47.99% (n=155), Total 100.00% (n=323)
Split on Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
   Weekly pay: Bad 86.67% (n=143), Good 13.33% (n=22), Total 51.08% (n=165)
   Split on Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
      Young (< 25); Middle (25-35): Bad 90.51% (n=143), Good 9.49% (n=15), Total 48.92% (n=158)
      Old (> 35): Bad 0.00% (n=0), Good 100.00% (n=7), Total 2.17% (n=7)
   Monthly salary: Bad 15.82% (n=25), Good 84.18% (n=133), Total 48.92% (n=158)
   Split on Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
      Young (< 25): Bad 48.98% (n=24), Good 51.02% (n=25), Total 15.17% (n=49)
      Split on Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
         Management; Clerical: Bad 0.00% (n=0), Good 100.00% (n=8), Total 2.48% (n=8)
         Professional: Bad 58.54% (n=24), Good 41.46% (n=17), Total 12.69% (n=41)
      Middle (25-35); Old (> 35): Bad 0.92% (n=1), Good 99.08% (n=108), Total 33.75% (n=109)

Anomaly Detection
• Find emerging trends in claims data. Use data mining to show the emerging patterns in current year data.
• Reported results will present specific cases that either:
   - Exhibit a common pattern, or
   - Exhibit an unusual pattern
• Unusual cases are deployed to the field investigators for further

Case Study: Audit Selection Goals
• Build models to predict different outcomes.
   - Positive Adjustment (Y/N).
   - DPH group membership.
   - Actual $$ Adjustment.
• Historical cases selected for model build
   - Cases with prior audit: prior audit and organizational data.
   - All cases: organizational data only.
• Deployment
   - For each outcome, combine predictions for those with and without previous audit data.
   - For each outcome, predict using organizational data only.
Clementine Workbench

Case Study: Results

Text Mining and Linguistic Extraction

Text Mining Timeline: Text Extraction
Timeline: 70's, 80's, 90's, Now
Example: “Mr. Smith aka Mr. Ahmed was seen on the corner of Church St. and Magnolia Ave. on Nov 13th”
• Bag of "Words" extraction: Mr. / Smith / aka / was / seen / with / Ahmed / on / the / corner / of / Church / etc.
• Expressions extraction: Mr. Smith / was seen / Mr. Ahmed / corner / Church St. / Magnolia Ave. / Nov 13th
• Named Entities extraction:
   - Mr. Smith -> Person
   - Mr. Ahmed -> Person
   - aka -> Alias
   - was seen -> location
   - Church St. -> Address
   - Magnolia Ave. -> Address
   - Nov 13th -> Date
• Events/Sentiment Extraction, combined with structured data:
   - Mr. Smith (Person) -> aka (Alias) -> Mr. Ahmed (Person) was seen (location) -> Church and Magnolia (address) -> November 13 (Date)
   - Mr. Ahmed in database, wanted for questioning
   - Suspect -> send agent to this location

Text Mining Management
• General Dictionaries: Organization, Location, Name, Phone Number, etc.
• Custom Built Subject Dictionaries: Tax Code, Form Names, Commodity, Business, etc.
• Interactive Synonym Dictionaries
• Exclude Dictionaries
• NEW!: Classification algorithms enable you to aggregate concepts from a wide variety of unstructured text data and group them into a small number of categories.

What's New

Binary Classifier - Automation of Many Models
• Sophisticated users: hundreds of models (scripting)
• Binary Classifier Node imitates this…
• …but easily, with a pre-built node

Time Series Algorithm
• ARIMA & Exponential Smoothing
• Expert Modeler: finds best model automatically
• Forecast multiple series at once

Data Preparation Tools
Optimal Binning
• Splitting up numeric data into sub-ranges
• New capability to make this optimal for prediction
Existing Capability: Equal bins
New Capability: Optimal bins

SPSS Reporting
• SPSS Statistics and Graphs within Clementine

Configuration Management
Predictive Enterprise Services (PES)
[Diagram labels: Audit Process, Audit Selection, Analytical Data Storage, Data Mining]

Deployment and Integration
• Configuration Management
• Exporting Data, Models and Streams
• Explore and Describe

Top Four

1. Improve Collaboration
• In a single project there is the potential to create a large number of models and versions of models:
   - different out variables
   - different algorithms
   - different settings
   - different training samples
• X # different data sets X # different users X # different locations

2. Improve Transparency
• Provide information on which models are run on which data.
• For audit standards, track who has made changes to the model and when.
• Your analytics team can see from their desktops which models were most recently run on the data, so they can provide this information for internal audits.

3. Automate Process
• Combine Clementine, SPSS, SAS & other processes
• Scheduling & notification

4. Centralize and Control Access

Contact information
Project personnel:
• Ted Fischer: ted_fischer@sra.com or theodore.i.fischer@irs.gov, 301-731-3534
• Anthony Colyandro: anthony_colyandro@sra.com or anthony.colyandro@irs.gov, 301-731-3524
• Dave Vennergrund: dave_vennergrund@sra.com, 703-803-1614
   SRA Director of Business Intelligence

How do I get SPSS software?
IRS
Cathy J. Allen
Enterprise System Management
Software Management Section
Idea Branch - MS 5850
(304) 264-7279 - voice
(304) 279-5309 - cell
(304) 260-3033 - fax
cathy.j.allen@irs.gov

SPSS Contacts:
Account Executive: Sarah Mattingly
Email: smattingly@spss.com
W: 703-740-2446
C: 703-389-6485
Account Manager: Matt Madden
W: 312 651 3894
