Predictive Tax Compliance
Presentation to the IRS
SPSS
SRA
Benjamin Chard Senior Solution Engineer bchard@spss.com
Sarah Mattingly IRS Account Executive smattingly@spss.com
Ted Fischer Project Manager ted_fischer@sra.com or theodore.i.fischer@irs.gov 301-731-3534
Agenda
Introduction to Data Mining
Predictive Tax Compliance
Using Clementine for Audit Selection
What’s New in Clementine Version 11.1
IRS Refund Fraud Detection Project Case Study
Where Does Data Mining Fit?
Existing Data: Historical Claims, Current Claims
Build Models Data Mining Workbench
Operational Setting: Reporting, Case Mgt, Claim Scoring
“Data Mining” vs. “Query/Reporting”
Reporting (Tables, Graphics, OLAP)
Reporting gives you a very good view of what is happening, but only within a limited slice of the data and only through models defined by the user.
[Chart: incident count (0 to 600) by year, 1999 to 2001, broken out by offense type: A&B, Assault, B&E, carjacking, Larceny, Murder, MV, Rape, Robbery, other]

Incident Count - by day and shift
Shift    Sunday  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday
00-04        48      25       21         29        38      33        45
04-08        15      39       27         38        50      40        21
08-12        43     101      106        101       105      88        52
12-16        62     131      179        177       168     147        82
16-20        73     199      191        177       197     209       116
20-24        68     100      102        103       107     107       112
“Statistics” vs. “Data Mining”
Statistics: Hypothesis Testing
What is Data Mining?
Three classes of data mining algorithms:
Predict (“Relationships”): Predict who is likely to exhibit specific behavior in the future.
Cluster (“Differences”): Group cases that exhibit similar characteristics.
Associate (“Patterns”): What events occur together? Given a series of actions, what action is likely to occur next?
Predictive Tax Compliance
Register Assess Collect
Non-Filer Discovery: Soft-Matching, Prioritization Models
Audit Selection: Audit Models
Tax Collection: Risk Models
DATA MINING & PREDICTIVE ANALYTICS TOOLS
DATA WAREHOUSE
Right work to the right resources at the right time
Predictive Modeling
Build a predictive profile of the claims that, after investigation, were flagged as improper payments regardless of amount.
Select positive investigations.
Favor the claims with the highest dollar adjustment found per audit hour.
Minimize the number of no-change audits.
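The selection goals above can be sketched as a simple ranking. This is a minimal illustration, not the project's method: the `Candidate` fields, the probability cutoff, and the sample figures are all hypothetical, and the scores are assumed to come from a model built elsewhere.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    claim_id: str
    p_adjustment: float      # assumed model output: probability of a positive adjustment
    expected_dollars: float  # assumed model output: adjustment size if one is found
    audit_hours: float       # assumed effort estimate

def expected_yield_per_hour(c):
    # Expected dollars recovered per hour of audit effort.
    return c.p_adjustment * c.expected_dollars / c.audit_hours

def select_for_audit(candidates, hours_available, min_probability=0.5):
    # Drop likely no-change audits, then fill the available hours
    # starting from the highest expected yield per hour.
    likely = [c for c in candidates if c.p_adjustment >= min_probability]
    ranked = sorted(likely, key=expected_yield_per_hour, reverse=True)
    selected, hours_used = [], 0.0
    for c in ranked:
        if hours_used + c.audit_hours <= hours_available:
            selected.append(c)
            hours_used += c.audit_hours
    return selected

# Illustrative data only.
candidates = [
    Candidate("A", 0.9, 10000.0, 20.0),
    Candidate("B", 0.2, 50000.0, 10.0),  # large amount, but probably a no-change audit
    Candidate("C", 0.6, 3000.0, 5.0),
    Candidate("D", 0.7, 8000.0, 40.0),   # does not fit in the remaining hours
]
selected = select_for_audit(candidates, hours_available=30.0)
print([c.claim_id for c in selected])
```

The probability cutoff addresses the no-change goal and the yield ranking addresses the dollars-per-hour goal; a real deployment would tune both against historical audit results.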
Decision tree: Credit ranking (1=default)
All cases: Bad 52.01% (168), Good 47.99% (155); Total 323 (100.00%)
Split on Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
  Weekly pay: Bad 86.67% (143), Good 13.33% (22); Total 165 (51.08%)
  Split on Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
    Young (< 25); Middle (25-35): Bad 90.51% (143), Good 9.49% (15); Total 158 (48.92%)
    Old (> 35): Bad 0.00% (0), Good 100.00% (7); Total 7 (2.17%)
  Monthly salary: Bad 15.82% (25), Good 84.18% (133); Total 158 (48.92%)
  Split on Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
    Young (< 25): Bad 48.98% (24), Good 51.02% (25); Total 49 (15.17%)
    Split on Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
      Management; Clerical: Bad 0.00% (0), Good 100.00% (8); Total 8 (2.48%)
      Professional: Bad 58.54% (24), Good 41.46% (17); Total 41 (12.69%)
    Middle (25-35); Old (> 35): Bad 0.92% (1), Good 99.08% (108); Total 109 (33.75%)
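As a rough check on how a split statistic like the one for Paid Weekly/Monthly can be computed, the sketch below evaluates a likelihood-ratio chi-square on the Bad/Good counts for weekly and monthly pay. The choice of the likelihood-ratio form is an assumption inferred from the counts, not something the slide states, and the function name is illustrative.

```python
import math

def likelihood_ratio_chi_square(table):
    """Likelihood-ratio chi-square (G statistic) for a contingency table
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # empty cells contribute nothing to the sum
                g += observed * math.log(observed / expected)
    return 2.0 * g

# Rows: Weekly pay, Monthly salary. Columns: Bad, Good (counts from the tree).
pay_split = [[143, 22], [25, 133]]
g = likelihood_ratio_chi_square(pay_split)
print(round(g, 2))
```

For this 2x2 table the result lands close to the Chi-square value shown for the first split, which suggests (but does not prove) that the tree was grown with a likelihood-ratio criterion.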
Anomaly Detection
Find emerging trends in claims data. Use data mining to show the emerging patterns in current-year data. Reported results will present specific cases that either:
exhibit a common pattern, or
exhibit an unusual pattern.
Unusual cases are sent to the field investigators for further investigation.
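One simple way to separate common from unusual cases is a robust distance from the median. This is only a sketch under stated assumptions: the single `amount` feature, the threshold of 5, and the sample claims are hypothetical, and it is not the anomaly detection algorithm used in Clementine.

```python
from statistics import median

def robust_scores(values):
    # Distance from the median, scaled by the median absolute deviation (MAD).
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    return [abs(v - med) / mad for v in values]

def split_common_unusual(cases, threshold=5.0):
    # cases: list of (case_id, amount); threshold is an assumed cutoff.
    scores = robust_scores([amount for _, amount in cases])
    common, unusual = [], []
    for (case_id, _), score in zip(cases, scores):
        (unusual if score > threshold else common).append(case_id)
    return common, unusual

# Illustrative claims only.
claims = [("c1", 1200.0), ("c2", 1350.0), ("c3", 1100.0),
          ("c4", 1280.0), ("c5", 9800.0), ("c6", 1190.0)]
common, unusual = split_common_unusual(claims)
print("common:", common)
print("unusual:", unusual)
```

Median and MAD are used instead of mean and standard deviation so that the outlier itself does not distort the baseline it is compared against.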
Case Study: Audit Selection Goals
Build models to predict different outcomes.
• Positive Adjustment (Y/N)
• DPH group membership
• Actual $$ Adjustment
Cases with prior audit: prior audit and organizational data.
All cases: organizational data only.
For each outcome, combine predictions for those with and without previous audit data.
For each outcome, predict using organizational data only.
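Combining predictions for cases with and without prior audit data can be sketched as a routing rule: use the richer model when prior audit data exists, and fall back to the organizational-data model otherwise. Both scoring functions and all field names below are hypothetical stand-ins, not the models built in the case study.

```python
def org_only_model(case):
    # Hypothetical stand-in: score from organizational data alone.
    return 0.3 if case["industry"] == "retail" else 0.1

def prior_audit_model(case):
    # Hypothetical stand-in: organizational score, raised when the
    # prior audit produced an adjustment.
    base = org_only_model(case)
    if case["prior_audit"]["adjusted"]:
        return min(1.0, base + 0.4)
    return base

def combined_score(case):
    # Route each case to the model that matches the data available for it.
    if case.get("prior_audit") is not None:
        return prior_audit_model(case)
    return org_only_model(case)

# Illustrative cases only.
cases = [
    {"id": 1, "industry": "retail", "prior_audit": {"adjusted": True}},
    {"id": 2, "industry": "retail", "prior_audit": None},
    {"id": 3, "industry": "services", "prior_audit": {"adjusted": False}},
]
scores = {case["id"]: combined_score(case) for case in cases}
print(scores)
```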
Historical Cases selected for model build
Deployment
Clementine Workbench
Case Study: Results
Text Mining and Linguistic Extraction
Text Mining Timeline: Text Extraction
Timeline: 70’s, 80’s, 90’s, Now
Source text: “Mr. Smith aka Mr. Ahmed was seen on the corner of Church St. and Magnolia Ave. on Nov 13th”
Bag of “Words” extraction: Mr. Smith aka was seen with Ahmed on the corner of Church, etc.
Expressions extraction: Mr. Smith | was seen | Mr. Ahmed | corner | Church St. | Magnolia Ave. | Nov 13th
Named Entities extraction: Mr. Smith -> Person; Mr. Ahmed -> Person; aka -> Alias; was seen -> location; Church St. -> Address; Magnolia Ave. -> Address; Nov 13th -> Date
Events/Sentiment Extraction: Mr. Smith (Person) -> aka (Alias) -> Mr. Ahmed (Person) was seen (location) -> Church and Magnolia (address) -> November 13 (Date)
Combined with structured data: Mr. Ahmed in database wanted for questioning; Suspect -> send agent to this location
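The named-entity step can be illustrated with a toy pattern-based tagger run on the example sentence. This is a sketch only: the regular expressions and labels below are assumptions chosen to fit this one sentence, and real linguistic extraction relies on dictionaries and grammar rather than a handful of patterns.

```python
import re

# Hypothetical patterns, tuned only for the example sentence.
PATTERNS = [
    ("Person", re.compile(r"\bMr\. [A-Z][a-z]+")),
    ("Alias", re.compile(r"\baka\b")),
    ("Address", re.compile(r"\b[A-Z][a-z]+ (?:St|Ave)\.")),
    ("Date", re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
        r"\d{1,2}(?:st|nd|rd|th)?")),
]

def extract_entities(text):
    # Collect every match with its position, then return them in reading order.
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0), label))
    found.sort()
    return [(value, label) for _, value, label in found]

sentence = ("Mr. Smith aka Mr. Ahmed was seen on the corner of "
            "Church St. and Magnolia Ave. on Nov 13th")
entities = extract_entities(sentence)
for value, label in entities:
    print(f"{value} -> {label}")
```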
Text Mining Management
General Dictionaries: Organization, Location, Name, Phone Number, etc.
Custom Built Subject Dictionaries: Tax Code, Form Names, Commodity, Business, etc.
Interactive Synonym Dictionaries
Exclude Dictionaries
NEW!: Classification algorithms enable you to aggregate concepts from a wide variety of unstructured text data and group them into a small number of categories.
What’s New
Binary Classifier – Automation of Many Models
Sophisticated users build hundreds of models through scripting.
The Binary Classifier node imitates this, but easily, with a pre-built node.
Time Series Algorithm
ARIMA & Exponential Smoothing
Expert Modeler – finds best model automatically
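To make the idea of automatic model selection concrete, here is a minimal sketch of simple exponential smoothing with the smoothing weight chosen by one-step-ahead error, applied to several series in one pass. It is not the Expert Modeler's procedure; the candidate weights, function names, and series values are all illustrative assumptions.

```python
def one_step_sse(series, alpha):
    # Sum of squared one-step-ahead forecast errors for a given alpha.
    level = series[0]
    sse = 0.0
    for value in series[1:]:
        sse += (value - level) ** 2
        level = alpha * value + (1 - alpha) * level
    return sse

def forecast_next(series, alpha):
    # Simple exponential smoothing: the final level is the next-period forecast.
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def pick_alpha(series, candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # Choose the candidate weight with the smallest in-sample error.
    return min(candidates, key=lambda a: one_step_sse(series, a))

# Illustrative series only.
monthly_counts = {
    "region_a": [100.0, 102.0, 101.0, 103.0, 102.0, 104.0],
    "region_b": [50.0, 55.0, 61.0, 66.0, 72.0, 77.0],
}
forecasts = {
    name: forecast_next(series, pick_alpha(series))
    for name, series in monthly_counts.items()
}
for name, value in forecasts.items():
    print(name, round(value, 1))
```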
Forecast Multiple Series at once
Data Preparation Tools
Optimal Binning
Splitting up numeric data into sub-ranges.
New capability to make this optimal for prediction.
Existing Capability – Equal bins
New Capability – Optimal bins
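The difference between equal bins and bins chosen for prediction can be sketched as follows. The supervised cut here minimizes Gini impurity against a binary target, which is an assumed stand-in; Clementine's own optimal-binning criterion is not described in these slides. The ages and default flags are illustrative.

```python
def equal_width_cuts(values, n_bins):
    # Interior cut points that divide the value range into equal-width bins.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def gini(labels):
    # Gini impurity of a list of 0/1 labels.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_supervised_cut(values, labels):
    # Single cut point that makes the two resulting bins as pure as possible.
    pairs = sorted(zip(values, labels))
    best_cut, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best_impurity:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_impurity = impurity
    return best_cut

# Illustrative data only.
ages = [19, 22, 24, 31, 38, 45, 52, 60]
defaulted = [1, 1, 1, 0, 0, 0, 0, 0]
print("equal-width cut:", equal_width_cuts(ages, 2))
print("supervised cut:", best_supervised_cut(ages, defaulted))
```

The equal-width cut ignores the target and splits the range in the middle, while the supervised cut falls where the outcome actually changes.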
SPSS Reporting
SPSS Statistics and Graphs Within Clementine
Configuration Management
Predictive Enterprise Services (PES) Top Four
Audit Process
Audit Selection
Analytical Data Storage
Data Mining
Deployment and Integration
Configuration Management
Exporting Data, Models and Streams
Explore and Describe
1. Improve Collaboration
A single project can produce a large number of models and model versions:
different output variables, different algorithms, different settings, different training samples,
multiplied by the number of data sets, the number of users, and the number of locations.
2. Improve Transparency
Provide information on which models are run on which data. For audit standards, track who has made changes to the model and when.
From their desktops, your analytics team can see which models were most recently run on which data and can provide that record for internal audits.
3. Automate Process
Combine Clementine, SPSS, SAS & other processes
Scheduling & notification
4. Centralize and Control Access
Contact information
Project personnel:
Ted Fischer – ted_fischer@sra.com or theodore.i.fischer@irs.gov, 301-731-3534
Anthony Colyandro – anthony_colyandro@sra.com or anthony.colyandro@irs.gov, 301-731-3524
Dave Vennergrund – dave_vennergrund@sra.com, 703-803-1614
SRA Director of Business Intelligence
How do I get SPSS software?
IRS:
Cathy J. Allen
Enterprise System Management Software Management Section Idea Branch - MS 5850
(304) 264-7279 - voice
(304) 279-5309 - cell
(304) 260-3033 - fax
cathy.j.allen@irs.gov
SPSS Contacts:
Account Executive – Sarah Mattingly, Email: smattingly@spss.com, W – 703-740-2446, C – 703-389-6485
Account Manager – Matt Madden, W - 312 651 3894