Minimizing False Alarms in Syndromic Surveillance

2007 International Society for Disease Surveillance 6th Annual Conference
Track 2: Analytical Methods: Multivariate Detection Methods
11 October 2007

William Peter, PhD
Amir Najmi, DPhil
Howard Burkom, PhD

The Johns Hopkins University Applied Physics Laboratory




This presentation was supported by Grant Number R01-PH000024 from the Centers for Disease Control and Prevention. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of CDC.
Should counts or proportions be used for biosurveillance anomaly detection?

Can Mutual Information (M.I.) methods be used to decide this?

[Figure: the “target”, daily emergency room respiratory visit counts (left axis, syndrome counts, 0 to 40), plotted with a “context” example, total daily emergency room visit counts (right axis, all diagnostic counts, 0 to 600), by date, 02/28/94 to 04/03/95.]

• Should we monitor target syndrome visit counts, or target/context ratios, for better signal-to-noise properties?

• What kind of ratios (i.e., which “contexts”) should be used?

                                      Introduction: False Alerts
                       Identification of public health threats or disease outbreaks
                       in the Public Health Information Network relies on
                       opportunistic data streams that vary across geographic
                       regions and time periods and can be biased by data
                       collection, seasonality, etc.

Is this increase in respiratory counts of epidemiological interest? Maybe not.

[Figure: two panels by date, 02/28/94 to 04/03/95. Left: syndrome counts (5 to 35). Right: the same syndrome counts (left axis, 5 to 35) overlaid with all diagnostic counts (right axis, 100 to 600).]

                    Proportions Can Clarify Outbreaks and Highlight
                              Alerts That Might Be Lost.




[Figure: two panels over the same dates (4/16/94 to 12/10/95). Top: syndrome counts (left axis, 0 to 40) with all diagnostic counts (right axis, 0 to 600). Bottom: syndrome proportions (0.04 to 0.1), with one excursion marked “Alert”.]

When a denominator variable such as total syndrome counts is available, a proportion can clarify an outbreak and prevent false alerts.

Example #2: Chief Complaint Gastrointestinal Syndrome Data

[Figure: “Chief Complaint Syndrome Data”, ER gastrointestinal counts (0 to 140) by date, 10/02/01 to 12/17/06.]

Disease tracking systems in a health information network that rely on detecting sudden increases in a syndromic count are not robust to changes in health care utilization, seasonality, data collection, etc. (cf. Reis et al., this meeting).

Has an alert of epidemiological interest been detected here?

                             False Alerts Can Arise from changes in
                                baseline surveillance monitoring.

[Figure: “individual GI and total syndrome data”, total ER counts and GI ER counts (0 to 1200) by date, 03/01/02 to 12/17/06. Annotation: total syndrome visit counts increased due to another county data stream coming online.]

False alerts can arise when a surveillance system fails to adjust to major shifts in monitored health data streams or baseline healthcare utilization.

Reis BY, Kohane IS, Mandl KD. An epidemiological network model for disease outbreak detection. PLoS Med 2007, 4:e210.

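The baseline-shift effect can be illustrated with a small simulation. This is only a sketch under assumed numbers: the daily totals, the 10% GI share, the 28-day baseline, and the 3-sigma rule are all hypothetical choices, not the detector or data used in the presentation.

```python
import random
from statistics import mean, stdev

random.seed(0)

# Hypothetical data: a second data stream comes online at day 100 and
# doubles the total visit count. GI visits stay near 10% of the total.
days = 200
total = [500 + random.randint(-20, 20) + (500 if d >= 100 else 0)
         for d in range(days)]
gi = [round(0.10 * t) + random.randint(-3, 3) for t in total]
prop = [g / t for g, t in zip(gi, total)]

def z_alerts(series, baseline=28, threshold=3.0):
    """Flag day d when series[d] exceeds the mean of the previous
    `baseline` days by more than `threshold` standard deviations."""
    alerts = []
    for d in range(baseline, len(series)):
        window = series[d - baseline:d]
        s = stdev(window)
        if s > 0 and (series[d] - mean(window)) / s > threshold:
            alerts.append(d)
    return alerts

count_alerts = z_alerts(gi)      # monitors raw GI counts
prop_alerts = z_alerts(prop)     # monitors GI / total
print("alerts on GI counts:     ", count_alerts)
print("alerts on GI proportions:", prop_alerts)
```

In this toy setting the count-based monitor alerts on the day the new stream arrives, while the proportion-based monitor does not, because numerator and denominator shift together.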
Reis et al.’s Network Model:
Use all possible ratios (“relationships”) of health data streams instead of the streams themselves.

[Figure: an epidemiological network containing 15 data streams from 5 hospitals, shown responding to a simulated outbreak in data stream T-5.]

Each data stream appears twice in the network: the context nodes on the left are used for interpreting the activity of the target nodes on the right. Each edge represents the ratio of the target node divided by the context node, with a thicker edge indicating that the ratio is anomalously high.

A consensus view is constructed for each node by combining all perspectives.

For N health data series, the computational load is high: N(N-1) ratios.

Reis BY, Kohane IS, Mandl KD. PLoS Med 2007, 4:e210.

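To make the N(N-1) count concrete, the sketch below enumerates every ordered (target, context) pair for 15 streams. The labels T-1 to T-15 are placeholders chosen for illustration; only T-5 is named in the slide.

```python
from itertools import permutations

def ratio_pairs(stream_names):
    """All ordered (target, context) pairs with target != context."""
    return list(permutations(stream_names, 2))

streams = [f"T-{i}" for i in range(1, 16)]   # 15 hypothetical stream labels
pairs = ratio_pairs(streams)
print(len(pairs))   # N(N-1) = 15 * 14 = 210
```

Each pair corresponds to one target/context ratio that the full network would have to monitor.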
                                                                   Immediate Questions

                       • When should individual counts be used? When is it a
                         benefit to use ratios? What kind of ratios between the
                         health streams should be formed?
                       • Can the Reis model be computationally simplified such
                         that only “important ratios” are considered instead of
                         every possible ratio?

‘match.com’ Hypothesis:

Choose a context “that has something in common” with a given target. Quantitatively, the target and context should have sufficient mutual information. Otherwise, they should not be paired.

[Figure: syndrome counts (left axis, 0 to 40) and all diagnostic counts (right axis, 0 to 600) by date, 02/28/94 to 04/03/95.]

    Mutual Information measures the information that
               two random variables share.

Given two time series X = {x1, x2, …, xn} and Y = {y1, y2, …, yn},
their mutual information (M.I.) is how well we can
predict X given that we have measured Y (and vice versa).

I(x;y) measures how much knowing one of these variables
reduces our uncertainty about the other.

$$ I(x;y) = \sum_{x} \sum_{y} p(x,y) \log \left( \frac{p(x,y)}{p(x)\,p(y)} \right) $$

p(x) is the probability density function (pdf) of X = {x1, x2, …, xn},
p(y) is the pdf of the time series Y = {y1, y2, …, yn}, and
p(x,y) is the joint probability density function of X and Y.

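As a minimal sketch of the definition, the function below estimates I(x;y) for two short discrete series by replacing each probability with an observed frequency. The base-2 logarithm (bits) and the toy data are assumptions; the slides do not state which units or estimator were used.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # (c/n) * log2( (c/n) / ((px/n) * (py/n)) ), summed over observed pairs
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 0]   # agrees with x on 6 of 8 days
mi = mutual_information(x, y)
print(round(mi, 3))   # 0.189
```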
                Mutual Information: Examples

I(x;y) measures how much knowing one random variable
reduces our uncertainty about the other variable.

-- X = [0 0 0 1 1 0] and Y = [1 1 1 0 0 1] ⇒ I(x;y) = 1 (normalized).

-- If X and Y are independent, then p(x, y) = p(x)p(y), so

$$ I(x;y) = \sum_{x} \sum_{y} p(x,y) \log \left( \frac{p(x,y)}{p(x)\,p(y)} \right) = \sum_{x} \sum_{y} p(x,y) \log \left( \frac{p(x)\,p(y)}{p(x)\,p(y)} \right) = 0 $$

Knowing X does not give any information about Y and vice versa. Their mutual
information is zero.

-- If X and Y are identical, then knowing X determines Y and vice versa. As a
result, the mutual information is the same as the uncertainty contained in Y (or
X) alone, namely the entropy of Y, H(Y).

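The three cases above can be checked numerically. The sketch below uses a frequency-based (plug-in) estimate in bits and normalizes by H(Y); both choices are assumptions, since the slide does not say which normalization gives the value 1.

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

X = [0, 0, 0, 1, 1, 0]
Y = [1, 1, 1, 0, 0, 1]                        # Y is the complement of X
ratio = mutual_information(X, Y) / entropy(Y)
print(ratio)                                   # 1 up to rounding

A = [0, 0, 1, 1]
B = [0, 1, 0, 1]                               # all four pairs equally frequent
independent_mi = mutual_information(A, B)
print(independent_mi)                          # 0.0

identical_mi = mutual_information(X, X)
print(identical_mi, entropy(X))                # equal: I(X;X) = H(X)
```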
              Mutual Information: Relation to Entropy

The entropy (‘uncertainty’ or ‘surprisability’) of a random variable X with
probability density function p(x) is

$$ H(X) = -\sum_{x} p(x) \log p(x) $$

I(X;Y) is the uncertainty in X removed by knowing Y (or vice versa):

$$ I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) $$

Here H(X) is the uncertainty in X, and H(X|Y) is the uncertainty in X after Y is known.

[Figure: Venn diagram of the joint entropy H(X,Y), with regions H(X|Y), I(X;Y), and H(Y|X).]

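A short numerical check of the entropy relation, as a sketch on toy data: the conditional entropy is computed directly as the weighted average of the entropy of X within each value of Y, and both forms of the identity are compared. The series and the use of bits are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X|Y) = sum over y of p(y) * H(X | Y = y)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n = len(xs)
    return sum(len(g) / n * entropy(g) for g in groups.values())

x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 0]
i_from_x = entropy(x) - conditional_entropy(x, y)   # H(X) - H(X|Y)
i_from_y = entropy(y) - conditional_entropy(y, x)   # H(Y) - H(Y|X)
print(round(i_from_x, 4), round(i_from_y, 4))       # 0.1887 0.1887
```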
   Mutual Information Techniques are now
   being applied to a wide variety of fields.


• Bioinformatics

• Financial/Stock Market Time Series Data

• Medical Device Time Series (EEG, EKG, etc.)

• Feature Selection and Image Processing

• Beat Detection In Music



                                         Correlation and Mutual Information can both be
                                                used to compare two time series.
Mutual information is similar to, but a generalization of, linear correlation.

[Figure: top, syndrome counts (left axis, 0 to 40) and all diagnostic counts (right axis, 0 to 600) by date, 02/28/94 to 04/03/95. Bottom, normalized mutual information and correlation between the two series (0 to 1), computed over a 100-day moving window, for the same dates.]

The advantage of using correlation over mutual information is faster and easier computation.

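A moving-window comparison of the two statistics might be sketched as follows. The synthetic series, the five equal-width bins, and the normalization I / sqrt(H(X) H(Y)) are all assumptions made for illustration; the slide does not specify how its curves were computed.

```python
import random
from collections import Counter
from math import log2, sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def equal_width_labels(values, bins=5):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def normalized_mi(xs, ys, bins=5):
    bx, by = equal_width_labels(xs, bins), equal_width_labels(ys, bins)
    hx, hy = entropy(bx), entropy(by)
    mi = hx + hy - entropy(list(zip(bx, by)))   # I = H(X) + H(Y) - H(X,Y)
    return mi / sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

def rolling(xs, ys, stat, window=100):
    """Apply `stat` to each consecutive block of `window` paired points."""
    return [stat(xs[i - window:i], ys[i - window:i])
            for i in range(window, len(xs) + 1)]

random.seed(1)
context = [random.gauss(400, 40) for _ in range(400)]        # synthetic totals
target = [0.05 * c + random.gauss(0, 2) for c in context]    # synthetic syndrome
corr = rolling(target, context, pearson)
nmi = rolling(target, context, normalized_mi)
print(len(corr), round(corr[0], 2), round(nmi[0], 2))
```

Even in pure Python the correlation needs a single pass per window, while the mutual information also needs a binning step and three entropy estimates, which is the computational trade-off the slide refers to.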
                 Why use M.I. instead of correlation?


The Pearson correlation coefficient is linear. It is not always relevant as a summary statistic!

Mutual information is more general. It includes nonlinear relationships.

[Figure: Anscombe’s Quartet, four scatter plots labeled, in order of appearance, I(x;y) = 0.51, I(x;y) = 0.57, I(x;y) = 0.65, and I(x;y) = 0.49.]

Anscombe’s Quartet: The four y variables shown have the same mean (m = 7.5), variance (s² = 4.12), correlation (r = 0.81) and regression line (y = 3 + 0.5x).

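The shared summary statistics of Anscombe’s Quartet can be verified directly from the published data. The sketch below computes the mean and sample variance of y and the Pearson correlation for each of the four data sets; it does not attempt to reproduce the mutual information values on the slide, which depend on an estimator the slides do not describe.

```python
from math import sqrt
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

summary = {name: (mean(y), variance(y), pearson(x, y))
           for name, (x, y) in quartet.items()}
for name, (m, v, r) in summary.items():
    print(f"{name:>3}: mean(y) = {m:.2f}  var(y) = {v:.2f}  r = {r:.3f}")
```

All four rows print nearly identical numbers even though the scatter plots look nothing alike, which is the point of the example.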
           Mutual Information: Numerical Calculation Issues
I. Histogram problem: How many bins should be used to construct a histogram of X and Y to calculate p(X), p(Y), and p(X,Y)?

     Use Adaptive Binning or Kernel Density Estimation (i.e., Parzen windows).

[Figure: two histogram panels titled “Pseq1” (horizontal axis 0 to 0.018, vertical axis 0 to 600) and “Pseq2” (horizontal axis 0 to 0.12, vertical axis 0 to 4).]
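One simple form of adaptive binning is equal-frequency (quantile) binning, sketched below next to fixed equal-width bins. This is an illustrative choice only; the slides do not say which adaptive scheme or kernel the authors used, and the skewed sample values are made up.

```python
from collections import Counter

def equal_width_labels(values, bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

def equal_frequency_labels(values, bins):
    """Rank-based bins holding roughly equal numbers of points.
    Tied values may be split across neighbouring bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels

skewed = [1, 2, 3, 4, 5, 6, 8, 50, 200, 1000, 5000, 20000]
width_counts = Counter(equal_width_labels(skewed, 4))
freq_counts = Counter(equal_frequency_labels(skewed, 4))
print(sorted(width_counts.items()))   # [(0, 11), (3, 1)]
print(sorted(freq_counts.items()))    # [(0, 3), (1, 3), (2, 3), (3, 3)]
```

With equal-width bins almost every point falls in the first bin, so the estimated p(X) carries little information; the equal-frequency bins spread the same points evenly.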




II. Because I(x;y) varies from zero to infinity, it should be normalized to lie
between zero and unity:

$$ \hat{I}(X;Y) = \frac{I(X;Y)}{\left( H(X) + H(Y) \right) / 2} $$      Witten & Frank (2005)

$$ \hat{I}(X;Y) = \frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}} $$      Strehl & Ghosh (2002)

$$ \hat{I}(X;Y) = \sqrt{1 - \exp\left( -2\,I(X;Y) \right)} $$      Dionísio et al. (2003)

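The three normalizations can be written as small functions, as in the sketch below. The function names are hypothetical. The first two are ratios, so any logarithm base works as long as I and H use the same one; for the third, the input is assumed here to be measured in nats.

```python
from math import exp, sqrt

def nmi_witten_frank(i, hx, hy):
    """I divided by the mean of the two entropies."""
    return i / ((hx + hy) / 2)

def nmi_strehl_ghosh(i, hx, hy):
    """I divided by the geometric mean of the two entropies."""
    return i / sqrt(hx * hy)

def nmi_dionisio(i):
    """sqrt(1 - exp(-2 I)); maps [0, infinity) onto [0, 1)."""
    return sqrt(1 - exp(-2 * i))

# Identical series with H(X) = H(Y) = I(X;Y) = 1: the ratio forms give 1.
print(nmi_witten_frank(1.0, 1.0, 1.0), nmi_strehl_ghosh(1.0, 1.0, 1.0))
# Independent series, I = 0: every form gives 0.
print(nmi_witten_frank(0.0, 1.0, 1.0), nmi_strehl_ghosh(0.0, 1.0, 1.0), nmi_dionisio(0.0))
# The exponential form approaches 1 only as I grows without bound.
print(round(nmi_dionisio(1.0), 3), round(nmi_dionisio(5.0), 5))
```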
                                                                       Construct Monte Carlo Simulations

[Figure: left, a Poisson distribution (probability 0 to 0.06 against x from 0 to 100). Right, three time-series panels by date, 02/28/94 to 12/30/97:
 - Target: rash syndrome counts (“Rash Counts”, 0 to 10).
 - Context: Poisson process with m = 50 (“Poisson Counts”, 40 to 70); I(x;y) = 0.019.
 - Context: mix of rash counts and Poisson (“Poisson + 5*Rash”, 40 to 120); I(x;y) = 0.336.]

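The construction in this figure can be imitated with synthetic data. In the sketch below the rash series is itself simulated as Poisson with mean 2 (an assumption; the presentation used real rash counts), the mutual information is estimated from five equal-width bins, and it is normalized by sqrt(H(X) H(Y)). The resulting values are not expected to match the 0.019 and 0.336 shown on the slide.

```python
import random
from collections import Counter
from math import exp, log2, sqrt

def poisson(rng, mean):
    """Draw one Poisson variate (Knuth's method; fine for moderate means)."""
    limit, k, p = exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def equal_width_labels(values, bins=5):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def normalized_mi(xs, ys, bins=5):
    bx, by = equal_width_labels(xs, bins), equal_width_labels(ys, bins)
    hx, hy = entropy(bx), entropy(by)
    mi = hx + hy - entropy(list(zip(bx, by)))
    return mi / sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

rng = random.Random(0)
rash = [poisson(rng, 2) for _ in range(1400)]        # stand-in for the target
noise = [poisson(rng, 50) for _ in rash]             # context: Poisson, m = 50
mixed = [p + 5 * r for p, r in zip(noise, rash)]     # context: Poisson + 5*Rash

nmi_noise = normalized_mi(rash, noise)
nmi_mixed = normalized_mi(rash, mixed)
print(f"Poisson context:          {nmi_noise:.3f}")
print(f"Poisson + 5*Rash context: {nmi_mixed:.3f}")
```

The pure-noise context shares almost no information with the target, while the mixed context shares a substantial amount, mirroring the ordering in the figure.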
              Monte Carlo Simulations Demonstrate Significance
                  of Choosing Appropriate Context Series.
•   Target = Rash syndrome counts

•   Context = Mix Poisson noise with increasing target

•   Inject artificial outbreak signal into the target.

•   Calculate number of detections for constant false alarm rate.

[Figure: alert detection probability (0 to 0.35) against target-context mutual information (0 to 0.3), with curves labeled signal+1, signal+2, signal+3, and signal+4.]

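The procedure in the bullets above can be sketched as a small Monte Carlo experiment. Everything specific here is an assumption rather than the authors' setup: the target is simulated as Poisson with mean 2, the detector is a plain threshold on the target/context ratio, the threshold is the empirical quantile that gives roughly a 1% false alarm rate, and the context is built from the target before the outbreak signal is injected.

```python
import random
from math import exp

def poisson(rng, mean):
    """Draw one Poisson variate (Knuth's method; fine for moderate means)."""
    limit, k, p = exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def detection_probability(k, signal, rng, days=5000, far=0.01):
    """Empirical detection probability of a threshold test on target/context,
    with the threshold set from outbreak-free days at false alarm rate `far`."""
    baseline, outbreak = [], []
    for _ in range(days):
        rash = poisson(rng, 2)                          # baseline target
        context = max(poisson(rng, 50) + k * rash, 1)   # Poisson noise + k * target
        baseline.append(rash / context)                 # outbreak-free day
        outbreak.append((rash + signal) / context)      # same day plus injected cases
    threshold = sorted(baseline)[min(int((1 - far) * days), days - 1)]
    return sum(score > threshold for score in outbreak) / days

rng = random.Random(0)
pd_by_mix = {k: detection_probability(k, signal=3, rng=rng) for k in (0, 1, 2, 5)}
for k, pd in pd_by_mix.items():
    print(f"context = Poisson + {k}*target: pd = {pd:.2f}")
```

In this toy model a context that contains more of the target (and so shares more information with it) yields a higher detection probability at the same false alarm rate, which is the qualitative trend the figure reports.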
                  ROC Curves from Monte Carlo Simulations:
                              strong outbreak



•   Target = Rash syndrome counts.

•   Context = Mix Poisson noise with increasing target signal

•   Inject artificial signal into the target.

•   Calculate ROC curve.

[Figure: ROC curves, probability of detection (pd, 0 to 1) against probability of false alarm (pfa, 0 to 0.10), for the target only and for contexts with M.I. = 0.019, 0.047, 0.114, 0.195, and 0.27. Annotations: “Correlation = 0.816” and “Correlation = - 0.009”.]

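An ROC curve is obtained by sweeping the alert threshold and recording the false alarm and detection probabilities at each setting. The sketch below does this for two short lists of made-up detector scores; in a simulation like the one described above, the lists would hold the monitored statistic on outbreak-free days and on days with an injected signal.

```python
def roc_points(baseline_scores, outbreak_scores):
    """Return (pfa, pd) pairs, one per candidate threshold, sorted by pfa.
    A day alerts when its score is strictly greater than the threshold."""
    thresholds = sorted(set(baseline_scores) | set(outbreak_scores))
    thresholds = [thresholds[0] - 1] + thresholds   # one threshold below every score
    points = []
    for t in thresholds:
        pfa = sum(s > t for s in baseline_scores) / len(baseline_scores)
        pd = sum(s > t for s in outbreak_scores) / len(outbreak_scores)
        points.append((pfa, pd))
    return sorted(points)

baseline = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]   # hypothetical scores, no outbreak
outbreak = [3, 4, 4, 5, 5, 6, 6, 7, 8, 9]   # hypothetical scores, outbreak injected
points = roc_points(baseline, outbreak)
for pfa, pd in points:
    print(f"pfa = {pfa:.1f}   pd = {pd:.1f}")
```

For these scores the curve passes through (pfa, pd) = (0.1, 0.7) and (0.2, 0.9) before reaching (1, 1).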
              Monte Carlo Simulations for actual neurological
            syndromic complaint reports with different contexts
•   Target = Neuro syndrome counts.

•   Context = Use different syndromic complaint time series

•   Inject artificial signal into the target.

•   Calculate ROC curve.

[Figure: ROC curves, probability of detection (pd, 0 to 0.40) against probability of false alarms (pfa, 0 to 0.20), for the target only and for contexts rash_2 (M.I.=0.323), GI_1 (M.I.=0.324), fever_1 (M.I.=0.21), lesion_2 (M.I.=0.26), and resp_2 (M.I.=0.348). Inset: neuro complaints (0 to 90) by date, 02/28/94 to 12/30/97.]

                                           An Example of a Time Series Mismatch
                                            for Gastrointestinal Syndromic Data

[Figure, left panel: Time Series #2 (left axis, 400 to 1600) and GI syndrome counts (right axis, 40 to 120; labeled "ER/GI counts") vs. Dates, 03/01/02 to 12/17/06.
 Annotation: This is the S&P 500 Stock Index time series from 3/1/2002 - 12/17/06]

[Figure, upper right panel: Rate/Proportion (0 to 0.2) vs. date, 03/01/02 to 12/17/06.
 Annotation: Probably not of epidemiological significance.]

[Figure, lower right panel: Correlation between the two series: 100-day moving window. Correlation (-0.4 to 1) vs. date, 10/02/01 to 12/17/06]
                                                                                                                                                  ISDS_20
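A moving-window correlation like the one plotted in the lower right panel can be sketched as below. The data here are synthetic stand-ins (a random walk in place of the stock index and noisy counts in place of the GI series), and the function names are hypothetical; only the 100-day window comes from the slide.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def rolling_correlation(x, y, window=100):
    """Correlation over each trailing window; one value per complete window."""
    return [
        pearson(x[i - window:i], y[i - window:i])
        for i in range(window, len(x) + 1)
    ]


random.seed(1)
n_days = 500
# Stand-in for daily GI syndrome counts.
gi_counts = [80 + random.gauss(0, 10) for _ in range(n_days)]
# Stand-in for an unrelated series: a random walk.
unrelated = [1000.0]
for _ in range(n_days - 1):
    unrelated.append(unrelated[-1] + random.gauss(0, 8))

corr = rolling_correlation(unrelated, gi_counts, window=100)
print("windows:", len(corr), "min:", min(corr), "max:", max(corr))
```

Inspecting how far the windowed correlation wanders for two series that have no real relationship is one way to see why a proportion built from a mismatched context is hard to interpret.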
                                                                     Loss of mutual information (-ΔI/Δt)/(1-I) can
                                                                       act as a robust outbreak alert detector

[Figure, top panel: syndrome counts (left axis, 0 to 40) and all diagnostic counts (right axis, 0 to 600) vs. date, 02/28/94 to 04/03/95]

Loss of mutual info may act as a better alert detector than simply using ratios.

[Figure, lower left panel: mutual information and correlation (0 to 1) vs. date, 02/28/94 to 04/03/95]

[Figure, right panels: (dI/dt)/(I-1) (0 to 0.1) and target/context (0.05 to 0.09) vs. date, 09/15/94 to 04/03/95]
                                                                                                                                                  ISDS_21
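One way to compute a statistic of the form (-ΔI/Δt)/(1-I) is sketched below. The slides do not say how I is estimated or normalized, so everything specific here is an assumption: mutual information is estimated from equal-width histogram bins over a trailing window, it is normalized to [0, 1] by dividing by the smaller of the two marginal entropies, and the window length, bin count, and function names are hypothetical.

```python
import math
import random
from collections import Counter


def discretize(values, n_bins):
    """Map values to equal-width bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (nats) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mi(x, y, n_bins=4):
    """Histogram estimate of I(x;y), scaled to [0, 1] (assumed normalization)."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    hx, hy = entropy(bx), entropy(by)
    if hx == 0 or hy == 0:
        return 0.0
    mi = hx + hy - entropy(list(zip(bx, by)))
    return max(0.0, min(1.0, mi / min(hx, hy)))


def mi_loss_statistic(target, context, window=60, step=1, eps=1e-6):
    """Return (I per window, (-dI/dt)/(1-I) per window) with dt = step days."""
    mi = [
        normalized_mi(target[i - window:i], context[i - window:i])
        for i in range(window, len(target) + 1)
    ]
    stat = []
    for k in range(step, len(mi)):
        d_i = (mi[k] - mi[k - step]) / step
        # eps guards against division by zero when I is exactly 1.
        stat.append(-d_i / max(1.0 - mi[k], eps))
    return mi, stat


random.seed(2)
n_days = 300
# Synthetic context with a slow seasonal swing.
context = [
    400 + 60 * math.sin(2 * math.pi * d / 90) + random.gauss(0, 10)
    for d in range(n_days)
]
# Synthetic target that normally tracks the context.
target = [0.05 * c + random.gauss(0, 0.5) for c in context]
# Artificial outbreak: the target departs from the context for 15 days.
for d in range(200, 215):
    target[d] += 6.0

mi, stat = mi_loss_statistic(target, context, window=60)
print("largest loss statistic:", max(stat))
```

An alert rule would then threshold stat; positive values correspond to windows in which the shared information between target and context is dropping.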
        Conclusions and Future Work


• Proportions or ratios for syndromic surveillance are best
  applied when the context series and the target series
  share sufficient mutual information (I(x;y) > 0.25).

• Streamline proportional modeling networks with mutual
  information criteria to reduce the computational load.

• Investigating a novel robust alert detector obtained by
  thresholding the mutual information loss
  (-ΔI/Δt) / (1-I) between the target and context, instead of
  using the target-to-context ratio.
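As a small worked example of the first two conclusions, the sketch below screens candidate context series with the I(x;y) > 0.25 criterion. The M.I. values are the ones shown in the ROC legend on slide ISDS_19; treating them as the mutual information between each context series and the target, on the same scale as the 0.25 threshold, is an assumption, and select_contexts is a hypothetical helper.

```python
MI_THRESHOLD = 0.25  # criterion quoted in the conclusions

# M.I. values as listed in the ROC legend (slide ISDS_19).
mi_with_target = {
    "rash_2": 0.323,
    "GI_1": 0.324,
    "fever_1": 0.21,
    "lesion_2": 0.26,
    "resp_2": 0.348,
}


def select_contexts(mi_scores, threshold=MI_THRESHOLD):
    """Keep context series whose M.I. exceeds the threshold, highest first."""
    keep = [(name, mi) for name, mi in mi_scores.items() if mi > threshold]
    return [name for name, _ in sorted(keep, key=lambda p: p[1], reverse=True)]


selected = select_contexts(mi_with_target)
print(selected)  # fever_1 (0.21) is the only series screened out
```

Pruning low-information contexts this way is one reading of how mutual information criteria could reduce the number of target-to-context pairs that have to be modeled.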


                                                           ISDS_22

				