Protecting Statistical Databases Against Snoopers
Comparison of two methods
Disclosure vs. Anonymity
Information disclosure necessary for planning and numerical measurements
Anonymity necessary for protection of the individual and the public’s trust in systems
Medical Data
Necessary for:
Measuring effectiveness of current treatments
Finding sources of common medical mistakes
Tracking contagious disease
Government spending planning
Health insurance companies
Anonymity: Not as Easy as it Looks
Race
Birth date
Profession
Zip code
Sex
Complete Identification Without Uniquely Identifying Information
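As a rough illustration of the point above, the sketch below counts how many records are singled out by a set of attributes. The table is made up for this example (hypothetical values, not real data), and `unique_fraction` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical toy records: (race, birth date, profession, zip code, sex).
# No single field is a unique identifier on its own.
RECORDS = [
    ("white", "1970-03-02", "teacher", "22903", "F"),
    ("white", "1970-03-02", "nurse", "22901", "M"),
    ("black", "1981-11-15", "teacher", "22901", "F"),
    ("black", "1981-11-15", "nurse", "22903", "M"),
]


def unique_fraction(records, columns):
    """Return the fraction of records whose values on `columns` occur exactly once."""
    keys = [tuple(record[c] for c in columns) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


# Each attribute alone is shared by two people in this toy table...
for column in range(5):
    print(column, unique_fraction(RECORDS, [column]))

# ...but the combination of all five is different for every record.
print("all", unique_fraction(RECORDS, [0, 1, 2, 3, 4]))
```

In this toy table no single column isolates anyone, yet the five columns together isolate everyone, which is the effect the slide describes.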
Outside Factors Affecting Privacy
Snooper’s supplementary knowledge
Public data sources
Rarity
Comparing Two Methods of Protection
What are the privacy guarantees?
Can useful information be gained?
Sensitivity-based Noise-adding Algorithm
Proposed by Dwork, McSherry, Nissim and Smith
Adds noise to each answer based on the sensitivity of the series of queries
Amount of privacy based on ε, a coefficient in the noise-generating formula
Sensitivity
How much could changing one row change an answer?
MEAN
COUNT
HISTOGRAMS
The sensitivity of a series of queries is the sum of the sensitivities of the queries
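A minimal sketch of how the summed sensitivity could drive the noise, assuming Laplace noise with scale S/ε as in the cited Dwork et al. paper. The function names and the per-histogram sensitivity of 2 (changing one row removes one unit from one bin and adds it to another) are assumptions made for this illustration.

```python
import random


def total_sensitivity(sensitivities):
    """Sensitivity of a series of queries: the sum of the per-query sensitivities."""
    return sum(sensitivities)


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_answers(true_answers, sensitivities, epsilon, rng=None):
    """Add Laplace noise of scale (total sensitivity / epsilon) to every answer."""
    if rng is None:
        rng = random.Random()
    scale = total_sensitivity(sensitivities) / epsilon
    return [answer + laplace_noise(scale, rng) for answer in true_answers]


# Three histogram queries, each assumed to have sensitivity 2.
sensitivities = [2, 2, 2]
print(total_sensitivity(sensitivities))
print(noisy_answers([120, 45, 310], sensitivities, epsilon=1.0, rng=random.Random(0)))
```

Because the scale grows with the total sensitivity, every additional query makes all of the answers noisier under this scheme.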
Coin-flip Algorithm
Proposed by Mishra and Sandler
A way for individuals to publish their own personal data
Amount of privacy based on ε, the bias in the coin-flip
Implementing the Coin-flip Algorithm
The k possible answers to a query are ordered and numbered
If an individual’s answer to the query is the ith answer, the profile is a string of k bits where the ith bit is a one and the others are zero
To sanitize, each bit is flipped with probability ½ + ε/2
All sanitized profiles resemble a random string of ones and zeros
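The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, answers are indexed from 0 instead of 1, and the flip probability is taken directly from the formula on this slide.

```python
import random


def make_profile(answer_index, k):
    """Build a k-bit profile with a single one at `answer_index` (0-based)."""
    if not 0 <= answer_index < k:
        raise ValueError("answer_index must be in range(k)")
    return [1 if i == answer_index else 0 for i in range(k)]


def flip_probability(epsilon):
    """Per-bit flip probability, using the formula as stated on the slide."""
    return 0.5 + epsilon / 2


def sanitize(profile, epsilon, rng=None):
    """Flip each bit independently with probability flip_probability(epsilon)."""
    if rng is None:
        rng = random.Random()
    p = flip_probability(epsilon)
    return [bit ^ 1 if rng.random() < p else bit for bit in profile]


profile = make_profile(0, 3)
print(profile)
print(sanitize(profile, epsilon=0.1, rng=random.Random(42)))
```

Each bit gets its own independent coin, so a sanitized profile can contain any number of ones, unlike the original profile, which contains exactly one.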
Example: HIV status
Ordered possible responses: “POSITIVE, NEGATIVE, UNKNOWN”
The original profile of an HIV+ individual: “1, 0, 0”
Results of coin-flips: “STAY, FLIP, STAY”
Resulting sanitized profile: “1, 1, 0”
What do we know about the individual from the sanitized profile?
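One way to approach the closing question is to replay the example and compare how likely the sanitized profile is under each possible true answer. The sketch below assumes independent per-bit flips and a hypothetical flip probability of 0.55 (that is, ½ + ε/2 with ε = 0.1); the helper names are also hypothetical.

```python
def apply_flips(profile, flips):
    """Flip the bits where `flips` is True and leave the others unchanged."""
    return [bit ^ 1 if flip else bit for bit, flip in zip(profile, flips)]


def likelihood(original, sanitized, flip_prob):
    """Probability of seeing `sanitized` given `original` under independent flips."""
    result = 1.0
    for o, s in zip(original, sanitized):
        result *= flip_prob if o != s else 1.0 - flip_prob
    return result


RESPONSES = ["POSITIVE", "NEGATIVE", "UNKNOWN"]

# Replay the slide: STAY, FLIP, STAY applied to the HIV+ profile "1, 0, 0".
sanitized = apply_flips([1, 0, 0], [False, True, False])
print(sanitized)

flip_prob = 0.55  # hypothetical value chosen for illustration
for i, name in enumerate(RESPONSES):
    original = [1 if j == i else 0 for j in range(len(RESPONSES))]
    print(name, likelihood(original, sanitized, flip_prob))
```

Under these assumptions, “1, 1, 0” requires exactly one flip whether the true answer is POSITIVE or NEGATIVE, so the two are equally likely and the sanitized profile alone does not separate them.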
My Research
Compare the total amount of error generated by histogram / frequency queries
Hypothesis: The noise-adding algorithm will generate less error for few queries and the coin-flip algorithm will generate less error for many queries
Research question: Where is the “sweet spot” where the error lines cross on a graph?
[Figure: Sum of Error. Y-axis: sum of error as percent of n (0% to 4000%). X-axis: number of frequency queries (1 to 381). Series: Coin-flip, Noise Addition.]
The “sweet spot” first occurs at 101 queries.
[Figure: Sum of Error. Y-axis: sum of error as percent of n (0% to 4000%). X-axis: number of frequency queries (1 to 381). Series: Coin-flip, Noise Addition.]
With the smallest histograms first, the first “sweet spot” occurs at 32 queries.
[Figure: Sum of Error. Y-axis: sum of error as percent of n (0% to 4000%). X-axis: number of frequency queries (1 to 381). Series: Coin-flip, Noise Addition.]
With the largest histograms first, the first “sweet spot” occurs at 189 queries.
A Second Look
Range of sensitivity: 2 to 136
[Figure: Sum of Error. Y-axis: sum of error as percent of n (0% to 4000%). X-axis: number of frequency queries (1 to 381). Series: Coin-flip, Noise Addition.]
Unordered histograms: At first “sweet spot”, sensitivity = 30.
[Figure: Sum of Error. Y-axis: sum of error as percent of n (0% to 4000%). X-axis: number of frequency queries (1 to 381). Series: Coin-flip, Noise Addition.]
Smallest histograms first: At first “sweet spot”, sensitivity = 32.
[Figure: Sum of Error. Y-axis: sum of error as percent of n (0% to 4000%). X-axis: number of frequency queries (1 to 381). Series: Coin-flip, Noise Addition.]
Largest histograms first: At first “sweet spot”, sensitivity = 34.
[Figure: Difference in Error. Y-axis: difference in percent error (-200% to 1600%). X-axis: sensitivity (2 to 92).]
Conclusions
For histogram / frequency queries, “sweet spots” occur between sensitivity = 30 and sensitivity = 40, so for least error:
If sensitivity < 30, use the NOISE-ADDING algorithm
If sensitivity > 40, use the COIN-FLIP algorithm
Quick Bibliography
Survey: N R Adam and J C Wortmann. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys, 21(4), December 1989.
Noise-adding algorithm: C Dwork, F McSherry, K Nissim, A Smith. Calibrating noise to sensitivity in private data analysis. 3rd Theory of Cryptography Conference, 2006.
Coin-flip algorithm: N Mishra, M Sandler. Symposium on Principles of Database Systems, 2006.
Professor Nina Mishra, PhD
Professor Alf Weaver, PhD
REU program at UVa, sponsored by the National Science Foundation