					Dirty Data on Both Sides of the
Pond: Results of the GIRO
Working Party on Data Quality

   2008 CAS Ratemaking Seminar
           Boston, MA

                                  1
Data Quality Working Party
Members
 Robert Campbell
 Louise Francis (chair)
 Virginia R. Prevosto
 Mark Rothwell
 Simon Sheaf




                             2
Agenda
 Literature Review
 Horror Stories
 Survey
 Experiment
 Actions
 Conclusions




                      3
 Literature Review
Data quality is maintained and improved by good
data management practices. While the vast majority
of the literature is aimed at the IT industry, the paper
highlights the following sources that are more specific
to actuaries and insurance:

Actuarial Standard of Practice #23: Data Quality
Casualty Actuarial Society White Paper on Data Quality
Insurance Data Management Association (IDMA)
Data Management Educational Materials Working Party


                                                           4
  Actuarial Standard of Practice #23

Provides descriptive standards for:
   •   selecting data,
   •   relying on data supplied by others,
   •   reviewing and using data, and
   •   making disclosures about data quality
 http://www.actuarialstandardsboard.org/pdf/asops/asop023_097.pdf




                                                                     5
 Insurance Data Management
 Association
The IDMA is an American organization which
 promotes professionalism in the Data Management
 discipline through education, certification and
 discussion forums
The IDMA web site:
   •   Suggests publications on data quality,
   •   Describes a data certification model, and
   •   Contains Data Management Value Propositions
       which document the value to various insurance
       industry stakeholders of investing in data quality
 http://www.idma.org
                                                            6
Cost of Poor Data
 Olson – poor data costs 15%–20% of operating
  profits
 IDMA – poor data costs the U.S. economy $600
  billion a year
 The IDMA believes that the true cost is higher
  than these figures reflect, as they do not
  capture the “opportunity costs of wasteful use of
  corporate assets” (IDMA Value Proposition –
  General Information).


                                                   7
 CAS Data Management Educational
 Materials Working Party
 Reviewed a shortlist of texts recommended by
  the IDMA for actuaries (9 in total)
 Publishing a review of each text in the CAS
  Actuarial Review (starting with the current
  (August) issue)
 Paper published in the Winter 2007 CAS
  Forum combines and compares reviews
 “Actuarial IQ (Information Quality)” published
  in the Winter 2008 CAS Forum

                                                   8
Agenda
 Literature Review
 Horror Stories
 Survey
 Experiment
 Actions
 Conclusions




                      9
Horror Stories – Non-Insurance
 Heart-and-Lung Transplant – wrong blood
  type
 Surgery on the wrong side – frequent but
  preventable
 Bombing of Chinese Embassy in Belgrade
 Mars Climate Orbiter – confusion between imperial
  and metric units
 Fidelity Mutual Fund – withdrawal of dividend
  estimate
 Porter County, Indiana – Tax Bill and Budget
  Shortfall
                                                  10
Horror Stories – Rating/Pricing
 Exposure recorded in units of $10,000
  instead of $1,000
 Large insurer reporting personal auto data as
  miscellaneous, so that it was omitted from
  ratemaking calculations
 One company reporting all its Florida property
  losses as fire (including hurricane years)
 Mismatched coding for policy and claims data



                                               11
Horror Stories - Reserving
 NAIC concerns over non-US country data
 Canadian federal regulator uncovered:
     Inaccurate accident year allocation
     Double-counted IBNR
     Claims notified but not properly recorded




                                                  12
Horror Stories - Reserving
In June 2001 the Independent went into liquidation,
  becoming the U.K.’s largest general insurance failure.
 A year earlier, its market valuation had reached £1B.
 Independent’s collapse came after an attempt to
  raise £180M in fresh cash by issuing new shares
  failed because of revelations that the company faced
  unquantifiable losses.
 The insurer had received claims from its customers
  that had not been entered into its accounting system,
  which contributed to the difficulty in estimating the
  company’s liabilities.
                                                      13
Horror Stories - Katrina
 U.S. weather models underestimated the cost of
  Katrina by approximately 50% (Westfall, 2005)
 A 2004 RMS study highlighted exposure data that was:
      Out-of-date
      Incomplete
      Mis-coded
 Many flood victims had no flood insurance after being
  told by agents that they were not in flood risk areas.




                                                           14
Agenda
 Literature Review
 Horror Stories
 Survey
 Experiment
 Actions
 Conclusions




                      15
Data Quality Survey of Actuaries
 Purpose: Assess the impact of data quality
  issues on the work of general insurance
  actuaries
 2 questions:
     percentage of time spent on data quality
      issues
     proportion of projects adversely affected by
      such issues


                                                     16
Results of Survey




                    17
 Survey Conclusions
 Data quality issues have a significant impact
  on the work of general insurance actuaries
     about a quarter of their time is spent on such
      issues
     about a third of projects are adversely affected
 The impact varies widely between different
  actuaries, even those working in similar
  organizations
 Limited evidence to suggest that the impact is
  more significant for consultants
                                                         18
Agenda
 Literature Review
 Horror Stories
 Survey
 Experiment
 Actions
 Conclusions




                      19
Hypothesis
The uncertainty of actuarial estimates of ultimate
incurred losses based on poor-quality data is
significantly greater than that of estimates based on
better-quality data




                                                   20
Data Quality Experiment
 Examine the impact of incomplete and/or
  erroneous data on the actuarial estimate of
  ultimate losses and the loss reserves
 Use real data with simulated limitations
  and/or errors and observe the potential error
  in the actuarial estimates




                                                  21
Data Used in Experiment
 Real data for primary private passenger
  bodily injury liability business for a single no-
  fault state
 Eighteen (18) accident years of fully
  developed data; thus, true ultimate losses are
  known




                                                  22
Actuarial Methods Used
 Paid chain ladder models
      Bornhuetter-Ferguson
      Berquist-Sherman Closing rate adjustment
 Incurred chain ladder models


 Inverse power curve for tail factors


 No judgment used in applying methods

                                                  23
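To make the mechanics concrete, here is a minimal chain-ladder sketch in Python. The four-year cumulative paid triangle is made up for illustration (it is not the working party's no-fault bodily injury dataset), and, as on the slide, no judgment is applied: volume-weighted link factors are computed and the latest diagonal is projected to ultimate.

# Minimal chain-ladder sketch (illustrative data only).
import numpy as np

# Cumulative paid losses; rows = accident years, columns = 12, 24, 36, 48 months.
# np.nan marks future (unobserved) development.
triangle = np.array([
    [1000, 1800, 2200, 2400],
    [1100, 2000, 2450, np.nan],
    [1200, 2150, np.nan, np.nan],
    [1300, np.nan, np.nan, np.nan],
])

# Volume-weighted age-to-age (link) factors from the observed part of each column.
n_dev = triangle.shape[1]
factors = []
for j in range(n_dev - 1):
    mask = ~np.isnan(triangle[:, j]) & ~np.isnan(triangle[:, j + 1])
    factors.append(triangle[mask, j + 1].sum() / triangle[mask, j].sum())

# Project each accident year to the latest development age (no tail, no judgment).
projected = triangle.copy()
for j in range(n_dev - 1):
    fill = np.isnan(projected[:, j + 1]) & ~np.isnan(projected[:, j])
    projected[fill, j + 1] = projected[fill, j] * factors[j]

print("link factors:", np.round(factors, 3))
print("estimated ultimates:", np.round(projected[:, -1], 0))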
Completeness of Data Experiments
Vary size of the sample; that is,
      1) All years
      2) Use only 6 accident years
      3) Use only last 3 diagonals




                                     24
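A sketch of how the experience period might be restricted for these experiments, assuming the triangle is a square numpy array with accident years as rows, development ages as columns, and unobserved cells set to NaN. The function names are illustrative, not the working party's code.

import numpy as np

def last_n_accident_years(triangle, n):
    """Keep only the n most recent accident years (rows)."""
    return triangle[-n:, :]

def last_n_diagonals(triangle, n):
    """Blank out all cells except the n most recent calendar-year diagonals.
    Assumes a square triangle in which observed cells satisfy i + j <= rows - 1."""
    out = triangle.astype(float).copy()
    rows, cols = out.shape
    latest = rows - 1                      # i + j == latest on the newest diagonal
    for i in range(rows):
        for j in range(cols):
            if not (latest - n < i + j <= latest):
                out[i, j] = np.nan
    return out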
Data Error Experiments
Simulated data quality issues:
1. Misclassification of losses by accident year
2. Late processing of financial information
3. Overstatements followed by corrections in the
   following period
4. Definition of reported claims changed
5. Early years unavailable



                                                  25
Measure Impact of Data Quality
•   Compare Estimated to Actual Ultimates
•   Use Bootstrapping to evaluate the effect of
    different random samples on results




                                              26
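One very simplified way to bootstrap such estimates: resample the observed age-to-age ratios in each development column, re-project the triangle, and examine the spread of the resulting ultimates. This is a sketch of the idea only, not the working party's exact resampling procedure; "triangle" is the NaN-padded cumulative array used in the earlier sketch.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ultimates(triangle, n_boot=1000):
    """Resample age-to-age ratios column by column and re-project the triangle."""
    rows, cols = triangle.shape
    results = np.empty((n_boot, rows))
    for b in range(n_boot):
        factors = []
        for j in range(cols - 1):
            mask = ~np.isnan(triangle[:, j]) & ~np.isnan(triangle[:, j + 1])
            ratios = triangle[mask, j + 1] / triangle[mask, j]
            factors.append(rng.choice(ratios, size=ratios.size, replace=True).mean())
        proj = triangle.copy()
        for j in range(cols - 1):
            fill = np.isnan(proj[:, j + 1]) & ~np.isnan(proj[:, j])
            proj[fill, j + 1] = proj[fill, j] * factors[j]
        results[b] = proj[:, -1]
    return results

# ults = bootstrap_ultimates(triangle)     # ults.std(axis=0) gives the spread by year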
Estimated Ultimates based on
Paid Losses




                               27
Est. Ults. based on Adjusted Paid




                                    28
Est. Ults. based on Incurred Losses




                                  29
Results of Adjusting Experience Period
 The adjusted paid and the incurred methods produce
   reasonable estimates for all but the most immature points.
   However, these points contribute the most dollars to the reserve
   estimate.
  The paid chain ladder method, which is based on less
   information (no case reserves, claim data or exposure
   information), produces worse estimates than the methods based
   on the incurred data or the adjusted paid data.
  Methods requiring more information, such as Bornhuetter-
   Ferguson and Berquist-Sherman, performed better
  It is not clear from this analysis that datasets with more historical
   years of experience produce better estimates than datasets with
   fewer years of experience.



                                                                      30
Experiment Part 2
Next, we introduced three changes to simulate errors
   and test how they affected estimates:

1. Losses from accident years 1983 and 1984 have
   been misclassified as 1982 and 1983 respectively.
2. Approximately half of the financial movements from
   1987 were processed late in 1988.
3. Incremental paid losses for accident year 1988,
   development period 12-24, were overstated by a
   factor of 10 and corrected in the following
   development period. An outstanding reserve for a
   claim in accident year 1985 at the end of
   development month 60 was overstated by a factor
   of 100 and corrected in the following period.
                                                       31
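A hedged sketch of how these errors could be injected into an incremental paid triangle stored as a Python dict keyed by (accident year, development month). The helper names and indexing conventions are illustrative only; the slides do not show the working party's implementation, and the reserve-overstatement half of item 3 is omitted here.

def misclassify_accident_years(tri):
    """Error 1: book accident-year 1983 and 1984 losses to 1982 and 1983."""
    out = {}
    for (ay, dev), amount in tri.items():
        new_ay = ay - 1 if ay in (1983, 1984) else ay
        out[(new_ay, dev)] = out.get((new_ay, dev), 0.0) + amount
    return out

def process_late(tri, year=1987, share=0.5):
    """Error 2: shift roughly half of one calendar year's movements to the next year.
    Assumes (for this sketch) that the movement at development month `dev`
    falls in calendar year ay + dev/12 - 1."""
    out = dict(tri)
    for (ay, dev), amount in tri.items():
        if ay + dev // 12 - 1 == year:
            out[(ay, dev)] = amount * (1 - share)
            out[(ay, dev + 12)] = out.get((ay, dev + 12), 0.0) + amount * share
    return out

def overstate_then_correct(tri, ay=1988, dev=24, factor=10):
    """Error 3 (paid part): overstate one incremental cell by `factor`,
    then reverse the overstatement in the next development period."""
    out = dict(tri)
    cell = out.get((ay, dev), 0.0)
    out[(ay, dev)] = cell * factor
    out[(ay, dev + 12)] = out.get((ay, dev + 12), 0.0) - cell * (factor - 1)
    return out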
Comparison of Actual and Estimated
Ultimate Losses
Based on Error-Modified Incurred Loss Data




                                             32
Standard Errors for Adjusted Paid Loss
Data
Modified to Incorporate Errors




                                         33
Results of Introducing Errors
 The error-modified data reflecting all changes
  results in estimates having higher standard
  errors than those based on the clean data
 For incurred ultimate losses, the clean data
  has the lowest bias and lowest standard error
 Error-modified data produced:
     More bias in estimates
     More volatile estimates


                                               34
Distribution of Reserve Errors




                                 35
Results of Bootstrapping
 Less dispersion in results for error-free data
 Standard deviation of estimated ultimate
  losses greater for the modified data (data with
  errors)
 Confirms original hypothesis:
  Errors increase the uncertainty of estimates




                                                   36
 Conclusions Resulting from
 Experiment
 Generally greater accuracy and less variability in
  actuarial estimates when:
      Quality data used
      Greater number of accident years used
 Data quality issues can erode or even reverse the
  gains of increased volumes of data:
      If errors are significant, more data may worsen
       estimates due to the propagation of errors for certain
       projection methods
 Significant uncertainty in results when:
    Data is incomplete
    Data has errors
                                                                37
Agenda
 Literature Review
 Horror Stories
 Survey
 Experiment
 Actions
 Conclusions




                      38
Actions – What can we do?
 Data Quality Advocacy
 Data Screening or Exploratory Data Analysis
  (EDA)




                                                39
Data Quality Advocacy
 Data quality – measurement
 Data quality – management issues




                                     40
DQ - Measurement
 Redman advocates a sampling approach to
  measurement; the appropriate approach depends on
  how advanced data quality at the company currently is
 Other approaches – automated validation/accuracy
  techniques




                                          41
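A minimal sketch of the sampling idea: draw a random sample of records and measure the share that fails simple validity rules. The field names and rules below are hypothetical, not taken from Redman or the working party.

import random

def sample_error_rate(records, rules, sample_size=200, seed=42):
    """Draw a random sample of records and return the share failing any rule."""
    random.seed(seed)
    sample = random.sample(records, min(sample_size, len(records)))
    failures = sum(
        1 for rec in sample if any(not rule(rec) for rule in rules.values())
    )
    return failures / len(sample)

# Hypothetical rules for a claims extract (records as a list of dicts):
rules = {
    "paid_not_negative": lambda r: r.get("paid_loss", 0) >= 0,
    "plausible_accident_year": lambda r: 1980 <= r.get("accident_year", 0) <= 2008,
}
# rate = sample_error_rate(claims, rules)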
 DQ - Management Issues
 Manage the information chain
  (from Data Quality: The Field Guide by Redman)
      establish management responsibilities
      describe the information chain
     understand customer needs
     establish measurement system
     establish control and check performance
     identify improvement opportunities
     make improvements
                                                42
Data Screening

 Visual
      Histograms
      Box and Whisker Plots
      Stem and Leaf Plots
 Statistical
    Descriptive statistics
    Multivariate screening

 Software – Excel, R, SAS, etc.


                                   43
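The slide names Excel, R and SAS; the same screening plots and statistics can be produced with pandas and matplotlib, as in this sketch. The file and column names are illustrative, loosely following the Texas closed-claim variables listed on the next slide.

import pandas as pd
import matplotlib.pyplot as plt

claims = pd.read_csv("closed_claims.csv")                  # hypothetical extract

# Descriptive statistics for a quick sanity check of the loss fields.
print(claims[["incurred_loss", "paid_loss"]].describe())

# Visual screening: histogram, box and whisker plot, and a simple multivariate screen.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
claims["paid_loss"].plot.hist(bins=40, ax=axes[0], title="Histogram")
claims.boxplot(column="paid_loss", by="injury_type", ax=axes[1])
claims.plot.scatter(x="paid_loss", y="incurred_loss", ax=axes[2])
plt.tight_layout()
plt.show()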
Example Data
 Texas Workers Compensation Closed Claims
 Some variables are:
     Incurred Losses
     Paid Losses
     Attorney Involvement
     Injury Type
     Cause Type
     County



                                         44
Box Plot




           45
Box and Whisker Plot




                       46
Histogram/Frequency Polygon




                              47
Categorical Data – Bar Plots




                               48
Descriptive Statistics
 Year of Date Licensed has minimum and
  maximum values that are impossible




                                          49
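A sketch of the kind of range check that surfaces such impossible values; the column name and plausible window are assumptions for illustration.

import pandas as pd

def flag_impossible_years(df, col="year_date_licensed", lo=1900, hi=2008):
    """Return the rows whose licensing year falls outside a plausible window."""
    return df[(df[col] < lo) | (df[col] > hi)]

# flag_impossible_years(claims) would list the records behind the impossible min/max.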
Agenda
 Literature Review
 Horror Stories
 Survey
 Experiment
 Actions
 Conclusions




                      50
Conclusions
 Anecdotal horror stories illustrate the potentially dramatic
    impact of data quality problems
   Data quality survey suggests data quality issues impose
    a significant cost on actuarial work
   Data quality experiment shows data quality issues have
    a significant effect on the accuracy of results
   Data Quality Working Party urges actuaries to
    become data quality advocates within their
    organizations
   Techniques from Exploratory Data Analysis can be
    used to detect data anomalies
                                                          51
Dirty Data on Both Sides of the
Pond: Results of the GIRO
Working Party on Data Quality


         Questions?

                              52

				