Minimizing False Alarms in Syndromic Surveillance

2007 International Society for Disease Surveillance 6th Annual Conference
Track 2: Analytical Methods: Multivariate Detection Methods
11 October 2007

William Peter, PhD; Amir Najmi, DPhil; Howard Burkom, PhD
The Johns Hopkins University Applied Physics Laboratory

This presentation was supported by Grant Number R01-PH000024 from the Centers for Disease Control and Prevention. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of CDC.

Should counts or proportions be used for biosurveillance anomaly detection? Can Mutual Information (M.I.) methods be used to decide this?

[Figure: daily emergency room visit counts vs. date, 02/28/94 to 04/03/95. “Context” example: total counts over all diagnostic syndromes (axis 0 to 600). “Target”: daily respiratory (Resp) syndrome counts (axis 0 to 40).]

• Monitor target syndrome visit counts, or target/context ratios, for better signal-to-noise properties?
• What kind of ratios (i.e., “contexts”) should be used?
ISDS_2

Introduction: False Alerts

Identification of public health threats or disease outbreaks in the Public Health Information Network relies on opportunistic data streams that vary across geographic regions and time periods and can be biased by data collection, seasonality, etc.

[Figure: syndrome counts and all diagnostic counts vs. date, 02/28/94 to 04/03/95. Annotation: "Is this increase in respiratory counts of epidemiological interest? Maybe not."]
ISDS_3

Proportions Can Clarify Outbreaks and Highlight Alerts That Might Be Lost.
[Figure: top panel, all diagnostic counts (axis 0 to 600) and syndrome counts (axis 0 to 40) vs. date, 4/16/94 to 12/10/95; bottom panel, syndrome proportions (0.04 to 0.1) over the same dates, with one excursion marked "Alert".]

When a denominator variable such as total syndrome counts is available, a proportion can clarify an outbreak and prevent false alerts.
ISDS_4

Example #2: Chief Complaint Gastrointestinal Syndrome Data

Disease tracking systems in a health information network that rely on detecting sudden increases in a syndromic count are not robust to changes in health care utilization, seasonality, data collection, etc. (cf. Reis et al., this meeting).

[Figure: Chief Complaint Syndrome Data. ER gastrointestinal counts (0 to 140) vs. date, 10/02/01 to 12/17/06.]

Has an alert of epidemiological interest been detected here?
ISDS_5

False Alerts Can Arise from Changes in Baseline Surveillance Monitoring.

Total syndrome visit counts increased due to another county data stream coming online. False alerts can arise when a surveillance system fails to adjust to major shifts in monitored health data streams or baseline healthcare utilization.

[Figure: individual GI and total syndrome data. Total ER counts and GI ER counts (axis 0 to 1200) vs. date, 03/01/02 to 12/17/06.]

Reis BY, Kohane IS, Mandl KD. An epidemiological network model for disease outbreak detection. PLoS Med 2007, 4:e210.
ISDS_6

Reis et al.’s Network Model: Use all possible ratios (“relationships”) of health data streams instead of the streams themselves.

An epidemiological network containing 15 data streams from 5 hospitals is shown responding to a simulated outbreak in data stream T-5. Each data stream appears twice in the network: the context nodes on the left are used for interpreting the activity of the target nodes on the right. Each edge represents the ratio of the target node divided by the context node, with a thicker edge indicating that the ratio is anomalously high. A consensus view is constructed for each
node by combining all perspectives. For N health data series, the computational load is high: N(N-1) ratios.

Reis BY, Kohane IS, Mandl KD. PLoS Med 2007, 4:e210.
ISDS_7

Immediate Questions

• When should individual counts be used? When is it a benefit to use ratios? What kind of ratios between the health streams should be formed?
• Can the Reis model be computationally simplified such that only “important ratios” are considered instead of every possible ratio?

“match.com” Hypothesis: Choose a context “that has something in common” with a given target. Quantitatively, the target and context should have sufficient mutual information. Otherwise, they should not be paired.

[Figure: all diagnostic counts (axis 0 to 600) and syndrome counts (axis 0 to 40) vs. date, 02/28/94 to 04/03/95.]
ISDS_8

Mutual Information measures the information that two random variables share.

Given two time series X = {x1, x2, …, xn} and Y = {y1, y2, …, yn}, their mutual information (M.I.) is how well we can predict X given that we have measured Y (and vice versa). I(x;y) measures how much knowing one of these variables reduces our uncertainty about the other.

\[ I(x;y) = \sum_{x} \sum_{y} p(x,y) \, \log \frac{p(x,y)}{p(x)\,p(y)} \]

p(x) is the probability density function (pdf) of X = {x1, x2, …, xn}, p(y) is the pdf of the time series Y = {y1, y2, …, yn}, and p(x,y) is the joint probability density function of X and Y.
ISDS_9

Mutual Information: Examples

I(x;y) measures how much knowing one random variable reduces our uncertainty about the other variable.

-- X = [0 0 0 1 1 0] and Y = [1 1 1 0 0 1]: I(x;y) = 1 (normalized).
-- If X and Y are independent, p(x,y) = p(x)p(y), then

\[ I(x;y) = \sum_{x} \sum_{y} p(x,y) \, \log \frac{p(x,y)}{p(x)\,p(y)} = \sum_{x} \sum_{y} p(x,y) \, \log \frac{p(x)\,p(y)}{p(x)\,p(y)} = 0 \]

   Knowing X does not give any information about Y and vice versa. Their mutual information is zero.
-- If X and Y are identical, then knowing X determines Y and vice versa.
As a result, the mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y, H(Y).
ISDS_10

Mutual Information: Relation to Entropy

The entropy ("uncertainty" or "surprisability") of a random variable X with probability density function p(x) is

\[ H(X) = -\sum_{x} p(x) \, \log p(x) \]

I(X;Y) is the uncertainty in X removed by knowing Y (or vice versa):

\[ I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) \]

Here H(X) is the uncertainty in X, and H(X|Y) is the uncertainty in X after Y is known.

[Figure: Venn diagram of the joint entropy H(X,Y), with regions H(X|Y), I(X;Y), and H(Y|X).]
ISDS_11

Mutual Information Techniques are now being applied to a wide variety of fields.

• Bioinformatics
• Financial/Stock Market Time Series Data
• Medical Device Time Series (EEG, EKG, etc.)
• Feature Selection and Image Processing
• Beat Detection in Music
ISDS_12

Correlation and Mutual Information can both be used to compare two time series.

Mutual information is similar to linear correlation but generalizes it. The advantage of using correlation over mutual information is faster and easier computation.

[Figure: top panel, all diagnostic counts (axis 0 to 600) and syndrome counts (axis 0 to 40) vs. date, 02/28/94 to 04/03/95; bottom panel, normalized mutual information and correlation (0 to 1) computed in a 100-day moving window over the same dates.]
ISDS_13

Why use M.I. instead of correlation?

The Pearson correlation coefficient is linear. It is not always relevant as a summary statistic! Mutual information is more general: it includes nonlinear relationships.

[Figure: Anscombe’s Quartet, four scatter panels labeled I(x;y) = 0.51, I(x;y) = 0.57, I(x;y) = 0.65, and I(x;y) = 0.49.]

Anscombe’s Quartet: The four y variables shown have the same mean (μ = 7.5), variance (σ² = 4.12), correlation (r = 0.81), and regression line (y = 3 + 0.5x).
ISDS_14

Mutual Information: Numerical Calculation Issues

[Figure: histogram of the example series Pseq1.]

I. Histogram problem: How many bins should be used to construct a histogram of X and Y to calculate p(X), p(Y), and p(X,Y)?
Use adaptive binning or kernel density estimation (i.e., Parzen windows).

[Figure: histogram of the example series Pseq2.]

II. Because I(x;y) varies from zero to infinity, it should be normalized to lie between zero and unity:

\[ \hat{I}(X;Y) = \frac{I(X;Y)}{\left[ H(X) + H(Y) \right] / 2} \]   Witten & Frank (2005)

\[ \hat{I}(X;Y) = \frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}} \]   Strehl & Ghosh (2002)

\[ \hat{I}(X;Y) = \sqrt{1 - \exp\left[ -2\,I(X;Y) \right]} \]   Dionísio et al. (2003)
ISDS_15

Construct Monte Carlo Simulations

Target: Rash syndrome counts.
Context: Poisson process (μ = 50). I(x;y) = 0.019.
Context: Poisson + 5*Rash (mix of rash counts and Poisson counts). I(x;y) = 0.336.

[Figure: rash counts (0 to 10), Poisson counts (40 to 70) with an inset of the Poisson distribution (x from 0 to 100), and Poisson + 5*Rash counts (40 to 120) vs. date, 02/28/94 to 12/30/97.]
ISDS_16

Monte Carlo Simulations Demonstrate Significance of Choosing Appropriate Context Series.

• Target = Rash syndrome counts.
• Context = Mix Poisson noise with increasing target.
• Inject artificial outbreak signal into the target.
• Calculate number of detections for constant false alarm rate.

[Figure: Alert Detection Probability (0 to 0.35) vs. Target-Context Mutual Information (0 to 0.3), with curves labeled signal+1, signal+2, signal+3, and signal+4.]
ISDS_17

ROC Curves from Monte Carlo Simulations: strong outbreak

• Target = Rash syndrome counts.
• Context = Mix Poisson noise with increasing target signal.
• Inject artificial signal into the target.
• Calculate ROC curve.

[Figure: probability detection (pd, 0 to 1) vs. probability false alarm (pfa, 0 to 0.10). Curves: Target Only, M.I.=0.019, M.I.=0.047, M.I.=0.114, M.I.=0.195, M.I.=0.27. Annotations: Correlation = 0.816 and Correlation = -0.009.]
ISDS_18

Monte Carlo Simulations for actual neurological syndromic complaint reports with different contexts

• Target = Neuro syndrome counts.
• Context = Use different syndromic complaint time series.
• Inject artificial signal into the target.
• Calculate ROC curve.

[Figure: inset of neuro complaint counts (0 to 90) vs. date, 02/28/94 to 12/30/97; ROC curves of Probability Detection (pd, 0 to 0.40) vs. Probability False Alarms (pfa, 0 to 0.20) for target only and for the contexts rash_2 (M.I.=0.323), GI_1 (M.I.=0.324), fever_1 (M.I.=0.21), lesion_2 (M.I.=0.26), and resp_2 (M.I.=0.348).]
ISDS_19

An Example of a Time Series Mismatch for Gastrointestinal Syndromic Data

[Figure: GI syndrome counts (axis 40 to 120) and Time Series #2 (axis 400 to 1600) vs. date, 03/01/02 to 12/17/06; rate/proportion of ER/GI counts to series #2 (0 to 0.2), annotated "Probably not of epidemiological significance."; correlation between the two series in a 100-day moving window (-0.4 to 1).]

Time Series #2 is the S&P 500 Stock Index time series from 3/1/2002 to 12/17/06.
ISDS_20

Loss of mutual information (-ΔI/Δt)/(1-I) can act as a robust outbreak alert detector

Loss of mutual info (dI/dt)/(I-1) may act as a better alert detector than simply using ratios.

[Figure: all diagnostic counts (axis 0 to 600) and syndrome counts (axis 0 to 40); loss of mutual info (0 to 0.1); mutual information and correlation (0 to 1); and the target/context ratio (0.05 to 0.09), all vs. date, 02/28/94 to 04/03/95.]
ISDS_21

Conclusions and Future Work

• Proportions or ratios for syndromic surveillance are best applied when the context series and the target series share sufficient mutual information (I(x;y) > 0.25).
• Streamline proportional modeling networks with mutual information criteria to reduce the computational load.
• Investigate a novel robust alert detector obtained by thresholding the mutual information loss (ΔI/Δt)/(1-I) between the target and context, instead of using the target-to-context ratio.
ISDS_22
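
The mutual information definition, the Witten & Frank normalization, and the two context constructions from the Monte Carlo slides can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`entropy`, `normalized_mi`), the fixed bin count, the series length, and the synthetic Poisson stand-in for the rash series are all hypothetical, since the slides do not give the binning or the data.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability array; empty cells are skipped."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def normalized_mi(x, y, bins=8):
    """Histogram estimate of I(X;Y) divided by (H(X) + H(Y)) / 2.

    `bins` is an assumed fixed bin count; the slides suggest adaptive
    binning or kernel density estimation as better choices.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()           # joint probability table p(x, y)
    hx = entropy(pxy.sum(axis=1))       # H(X) from the marginal p(x)
    hy = entropy(pxy.sum(axis=0))       # H(Y) from the marginal p(y)
    hxy = entropy(pxy.ravel())          # joint entropy H(X, Y)
    mi = hx + hy - hxy                  # equivalent to H(X) - H(X|Y)
    denom = (hx + hy) / 2.0
    return mi / denom if denom > 0 else 0.0


# Binary pair from the "Examples" slide, where Y is the complement of X.
example = normalized_mi([0, 0, 0, 1, 1, 0], [1, 1, 1, 0, 0, 1], bins=2)

# Assumed synthetic data standing in for the real syndromic series.
rng = np.random.default_rng(0)
n_days = 1400                                    # hypothetical series length
rash = rng.poisson(2.0, size=n_days)             # stand-in target counts
noise_context = rng.poisson(50, size=n_days)     # context: Poisson, mean 50
mixed_context = rng.poisson(50, size=n_days) + 5 * rash  # Poisson + 5*Rash

mi_noise = normalized_mi(rash, noise_context)
mi_mixed = normalized_mi(rash, mixed_context)
print(f"example pair: {example:.3f}")
print(f"Poisson context: {mi_noise:.3f}, mixed context: {mi_mixed:.3f}")
```

Under these assumptions the mixed context should score a clearly higher normalized M.I. against the target than the pure Poisson context, which matches the ordering reported on the Monte Carlo slide (0.336 vs. 0.019). The numerical values themselves depend on the binning and on the stand-in data, so they are not expected to reproduce the slide.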