Statistical Issues and Challenges
Associated with Rapid Detection
of Bio-Terrorist Attacks

   SE Fienberg and G Shmueli

         Presented by Lisa Denogean
Detection Problems
n   Traditionally used medical and public
    health data may take months to collect,
    obtain, and analyze
    – Need better system for collection, efficient
      detection and privacy protection

n   Real-time collection often does not yield
    enough data, so the signal is too weak for detection
    – Need to be able to collect and effectively
      analyze more data from different sources

n   System and Data Requirements for Timely Detection

n   Grocery Sales Data Example: Combining
    Data Across Sources

n   Advantages and Disadvantages of
    Different Data Sources
Detection System and Data
Types of Data Available
n   Traditional data
    – ER visits, 911 calls, mortality records, veterinary reports,
      school or work absence records…
n   Non-traditional
    – To detect known agent, e.g. anthrax
    – OTC medication sales, grocery (e.g. OJ and soup) sales

Initial Data Requirements
n   Frequently collected
    – Real-time, frequent non-traditional data, or improved
n   Fast transfer
    – Electronic recording and data conversion
Essential Data Features
n   Early signature of the outbreak
    – Data allow detection of a disease signature a day or
      a week before the disease is apparent
    – OTC sales, website searches, bio-sensors

n   Sufficient amounts of data
    – Lack of sufficient data leads to under-detection
    – Temporal or spatial aggregation, but could slow
      detection or dampen a signal

n   Local, not regional or national data
    – Improves sensitivity and timeliness
Detection System Requirements
n   Immediate analysis of incoming data
    – Resources for quick storage and efficient
      detection algorithms
n   Immediate output
    – Output an operational decision-making
      conclusion in a user-friendly transferable format
n   Flexibility
    – Almost or fully automated for different outbreak

    – Number of false alarms vs. speed of true
      detection rate
    – Expense of false alarms vs. risk of not detecting
      true outbreak

n   NYC syndromic surveillance system*
    – Track 911 calls, OTC sales, ER admissions, absenteeism
      (weekly false alarms)
n   Real-time outbreak and disease surveillance
    (RODS) system
    – Real-time collection of ER visits in Western Pennsylvania
      (including retailer data)
n   National Electronic Disease Surv. System
    – CDC initiative for electronic transfer of health
n   New sources (not yet available)
    – Track medical web searches, body tracking devices,
      biosensor data
Grocery Sales Example

Inhalational Anthrax
n   First stage
    – A few hours to a few days (assume within 3 days)
    – Nonspecific symptoms: fever, sweat, fatigue,
      cough, sore throat, nausea, headache
    – Similar to flu symptoms, except no runny nose
    – Rapid treatment improves survival
n   Second stage
    – Develops rapidly
    – Extreme symptoms
    – At least 80% fatality rate within 2 – 48 hours
Sales Data Features
n   Data electronically recorded in real-time
n   Large amounts of data at rich levels of detail
n   Processing time vs. level of detail considerations
n   Aggregated level of daily sales for each item
    and hourly basket-level data
n   Purchase data are localized, useful for detecting
    large-scale outbreaks in small areas
n   OTC and grocery sales can show an early
    signature of symptoms of an outbreak
n   Dependence between sales within neighboring
    periods of time due to fine time scale
n   Smaller ratio between signal and noise
Statistical Detection System
n   Decide which items to monitor
    – Epidemiological and statistical analysis of
      information contained in different sales
n   Model the “no-outbreak” sales baseline
    – Account for promotions, sales, season, etc., that
      would add noise (clean data)
n   Simulate an outbreak signature
    – Footprint of anthrax known in traditional data,
      consult with outside experts for new data
n   Develop a roll-forward algorithm
    – Integrate previous data for detection in new data
n   Test system for real and false alarms
    – Select threshold based on simulations

n   Nasal symptoms are unrelated to anthrax
n   Focus on cough meds (daily) and tissues, OJ
    and soup
n   Data indicate a seasonal effect in overall sales
    and include flu cases
n   Assume cough meds insensitive to promotions
n   Smoothing methods applied
n   Estimate baseline variability
n   False alarms near holidays for all methods
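
The smoothing and baseline-variability steps above can be pictured with a minimal sketch. The slides do not say which smoothers were applied, so the trailing moving average, the 7-day window, and the function names below are all assumptions made for illustration.

```python
from statistics import mean, stdev

def trailing_moving_average(series, window=7):
    """Smooth a daily series with a trailing mean over `window` days.

    The first few points use a shorter window because less history exists.
    """
    smoothed = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        smoothed.append(mean(series[start:i + 1]))
    return smoothed

def baseline_variability(series, window=7):
    """Standard deviation of the residuals around the smoothed baseline."""
    smoothed = trailing_moving_average(series, window)
    residuals = [x - s for x, s in zip(series, smoothed)]
    return stdev(residuals)
```

A trailing window only uses past days, which matches a system that must decide as each day's sales arrive. Holiday weeks would inflate the residuals, which is one way to read the false alarms noted above.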

n   Epidemiologist opinions on how anthrax is
    manifested in cough medication sales
n   Sales increase linearly over 3 day period
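
A hedged sketch of how such a footprint might be injected into baseline sales for testing. The slides state only that sales increase linearly over 3 days; the multiplicative form, the `scale` parameter, and the name `inject_footprint` are assumptions.

```python
def inject_footprint(baseline, start, scale, days=3):
    """Return a copy of `baseline` with a linearly increasing spike.

    On day k of the footprint (k = 1..days) sales are multiplied by
    1 + (scale - 1) * k / days, so the final day reaches `scale` times
    the baseline value. Days past the end of the series are ignored.
    """
    out = list(baseline)
    for k in range(1, days + 1):
        idx = start + k - 1
        if idx < len(out):
            out[idx] = baseline[idx] * (1 + (scale - 1) * k / days)
    return out
```

Running a detector over many such injected series, with `start` and `scale` varied, gives an estimate of how often a footprint of a given size is caught.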
Detection System Formulation
Detail from reference [12]

n   Clean data
      – Preprocess: Account for store level sales
      – Filter/De-noise: Decompose series into cosine
        waves, retain those with large magnitudes
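
The de-noising step can be sketched as follows, assuming the cosine decomposition is a discrete cosine transform (DCT); the slide does not name the transform, and the `keep` parameter and function names are illustrative only.

```python
import math

def dct(x):
    """Orthonormal DCT-II: express the series as a sum of cosine waves."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        total = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        coeffs.append(scale * total)
    return coeffs

def idct(coeffs):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    series = []
    for t in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt((1 if k == 0 else 2) / n)
            total += scale * coeffs[k] * math.cos(math.pi * (t + 0.5) * k / n)
        series.append(total)
    return series

def denoise(x, keep=3):
    """Keep only the `keep` largest-magnitude cosine components."""
    coeffs = dct(x)
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    largest = set(order[:keep])
    kept = [c if k in largest else 0.0 for k, c in enumerate(coeffs)]
    return idct(kept)
```

This direct implementation costs O(n^2) operations; a production system would more likely use a fast transform from a numerical library.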

n   Forecast via wavelet approach
      – Efficient and tractable for non-stationary series
      – Autoregressive moving average model not flexible
        to data type, user intervention required
      – Decompose series into resolutions of different
      – For each resolution, use autoregressive model for
        forecasting the next point
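
A rough sketch of the forecasting idea: split the series into additive resolution levels, forecast each level with an autoregressive model, and sum the forecasts. The slides do not specify the wavelet or the autoregressive order, so the causal Haar "a trous" decomposition and the AR(1) fit below are assumptions, as are all function names.

```python
def haar_a_trous(x, levels=2):
    """Additive decomposition: x[t] == smooth[t] + sum of details[j][t]."""
    smooth = [float(v) for v in x]
    details = []
    for j in range(levels):
        lag = 2 ** j
        # Average each point with the point `lag` steps back (causal filter).
        coarser = [0.5 * (smooth[t] + smooth[max(0, t - lag)])
                   for t in range(len(smooth))]
        details.append([s - c for s, c in zip(smooth, coarser)])
        smooth = coarser
    return details, smooth

def ar1_forecast(y):
    """One-step forecast from a mean-centred AR(1) fitted by least squares."""
    m = sum(y) / len(y)
    num = sum((y[t] - m) * (y[t - 1] - m) for t in range(1, len(y)))
    den = sum((y[t - 1] - m) ** 2 for t in range(1, len(y)))
    phi = num / den if den > 0 else 0.0
    return m + phi * (y[-1] - m)

def forecast_next(x, levels=2):
    """Forecast the next point as the sum of per-resolution AR(1) forecasts."""
    details, smooth = haar_a_trous(x, levels)
    return sum(ar1_forecast(d) for d in details) + ar1_forecast(smooth)
```

Because the filter looks only backward in time, the decomposition at day t does not change when later days arrive, which suits a roll-forward detector.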
Detection System (cont.)
n   Threshold for next-day forecasts
    – Control chart type argument to determine anthrax-
      related variability
    – Alarm if actual sales are more than 3 standard deviations
      above the prediction from the de-noised series
n   Basket-level (50 products, 200k-500k/week)
    – Method of association rules: Pairs and triplets
    – Threshold: Most unexpected combinations
n   Evaluation
    – Simulate the anthrax footprint as a 3-day spike with a
      linearly increasing pattern
    – Study different configurations of the system
    – If the footprint increases cough sales by a factor
      of 1.36 or more, 100% of footprints are detected
    – Outbreaks coinciding with holidays problematic
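
The 3-standard-deviation alarm rule above can be sketched as a one-sided control-chart check. How the slides' system estimates the standard deviation is not stated; using past forecast residuals, and the names `alarm` and `history_residuals`, are assumptions.

```python
from statistics import stdev

def alarm(actual, predicted, history_residuals, k=3.0):
    """Return True when actual sales exceed the forecast by more than k SDs.

    `history_residuals` holds past (actual - predicted) values from
    outbreak-free days and is used to estimate baseline variability.
    """
    sigma = stdev(history_residuals)
    return actual > predicted + k * sigma
```

The check is one-sided because only unusually high sales are of interest. Raising `k` lowers the false alarm rate at the cost of slower or missed detection.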
Combining Data Sources: Benefits
and Challenges

Data Linkage
n   Linking data from multiple sources
    requires system-wide unique identifiers or
    variables for record linkage

n   Linkage methods use match features or
    string distances
    – Need extensions that link multiple lists and allow
      for missing identifiers
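
As an illustration of linkage by string distance, here is a minimal sketch using Levenshtein edit distance. The slides do not name a specific distance; the 0.8 threshold and the function names are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def is_match(a, b, threshold=0.8):
    """Treat two field values as the same entity if similar enough."""
    return similarity(a.lower(), b.lower()) >= threshold
```

A pairwise rule like this does not handle missing identifiers or more than two lists, which is the extension the slide calls for.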
Approaches to Using Multiple
Data Sources

n   Independently and simultaneously monitor
    separate sources
    – Multiple testing inflates false alarm rate
n   Track different series intensively but sequentially
    – Alarms trigger further data collection and analyses
      of other series (Univ. of Utah – flu)
    – Hierarchical signaling
n   Multivariate modeling
    – Use merged records for individuals or families
    – Measurement error from record linkage
    – Privacy and confidentiality concerns
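
The inflation of the false alarm rate under independent, simultaneous monitoring is simple arithmetic. The sketch below assumes the monitored series are statistically independent and share one per-series false alarm probability; the function name is hypothetical.

```python
def family_false_alarm_rate(alpha, n_series):
    """Probability that at least one of `n_series` independent monitors
    raises a false alarm, when each does so with probability `alpha`."""
    return 1.0 - (1.0 - alpha) ** n_series
```

For example, under these assumptions ten series that each raise a false alarm 5% of the time give roughly a 40% chance of at least one false alarm per monitoring period.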
Privacy and Confidentiality Issues
n   Health Insurance Portability and Accountability
    Act (HIPAA) restrictions
    – Permits de-identified data for research
    – Medical and public health org. may be exempt

n   Private commercial interests
    – Concern over information in grocery and OTC
      sales data

n   Integrated data concerns
    – Linking across databases may pose more risks in
      exposing confidential information
Summary: Questions for
n   Which non-traditional data carry signals of an
    outbreak, and how?
n   How can we efficiently and accurately integrate
    and analyze data from multiple sources?
n   How can we effectively temporally or spatially
    aggregate data?
n   How can we use geographic detail to control
    excessive false alarms?
n   Can merged files remain useful for detection without
    allowing re-identification and linkage to the source?
n   Is a risk-utility trade-off tolerable?
n   Can a trusted third-party update files in real-time,
    separately from the detection system?
