Security Methods for Statistical Databases
   by Karen Goodwin
                Introduction

   Statistical databases containing medical
    information are often used for research
   Some of the data is protected by law to
    safeguard the privacy of patients
   Proper security precautions must be
    implemented to comply with laws and
    respect the sensitivity of the data
   Accuracy vs. Confidentiality
   Accuracy – researchers want to extract accurate and
    meaningful data
   Confidentiality – patients, laws, and database
    administrators want to maintain the privacy of patients
    and the confidentiality of their information
                            Laws
   Health Insurance Portability and Accountability Act (HIPAA)
    Privacy Rule
   Covered organizations must comply by April 14, 2003
   Designed to improve the efficiency of the healthcare system
    through electronic exchange of data while maintaining security
   Covered entities (health plans, healthcare clearinghouses,
    healthcare providers) may not use or disclose protected
    information except as permitted or required
   Privacy Rule establishes a “minimum necessary standard”: covered
    entities must limit uses and disclosures of protected information
    to the minimum necessary, which means evaluating their current
    practices and security precautions
           HIPAA Compliance
   Some companies offer third-party certification of
    covered entities
   Such companies check your company and its
    associated companies for compliance with HIPAA
   Can help with rapid implementation of, and
    compliance with, HIPAA regulations
    Types of Statistical Databases

   Static – a static database is made once and never
    changes
      Example: U.S. Census
   Dynamic – changes continuously to reflect real-time
    data
      Example: most online research databases
Security Methods
   Access Restriction
   Query Set Restriction
   Microaggregation
   Data Perturbation
   Output Perturbation
   Auditing
   Random Sampling
               Access Restriction
   Databases normally have different access levels
    for different types of users
   User IDs and passwords are the most common
    method for restricting access
      In a medical database:
          Doctors/healthcare representatives – full access to
           information
          Researchers – access only to partial information
           (e.g. aggregate information)
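
The two access levels described above can be sketched as follows. This is a minimal illustration only: the record layout, the role names, and the `query` function are assumptions made for the example, not part of any real medical system.

```python
from statistics import mean

# Hypothetical patient records, invented for illustration.
RECORDS = [
    {"patient": "A", "age": 34, "diagnosis": "asthma"},
    {"patient": "B", "age": 51, "diagnosis": "diabetes"},
    {"patient": "C", "age": 47, "diagnosis": "asthma"},
]


def query(role):
    """Return only what the given (hypothetical) role may see."""
    if role == "doctor":
        # Full access: individual records are returned.
        return [dict(r) for r in RECORDS]
    if role == "researcher":
        # Partial access: aggregate information only.
        return {
            "count": len(RECORDS),
            "mean_age": mean(r["age"] for r in RECORDS),
        }
    raise PermissionError(f"role {role!r} has no access")


print(query("researcher"))
```

In practice the role would come from an authenticated login (user ID and password) rather than from a function argument.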
         Query Set Restriction
   A query-set size control sets the minimum
    number of records that must be in the
    result set
   Allows the query results to be displayed
    only if the size of the query set satisfies
    the condition
   Setting a minimum query-set size can help
    protect against the disclosure of individual
    data
          Query Set Restriction

   Let K represent the minimum number of
    records that must be present in the query set
   Let R represent the size of the query set
   The query results can only be displayed if
                        K ≤ R
           Query Set Restriction
   [Figure: Query 1 and Query 2 are each run against the original database; the size of each result set is compared with K before the query results are released]
             Microaggregation
   Raw (individual) data is grouped into small
    aggregates before publication
   The average value of the group replaces each
    value of the individual
   Data with the most similarities are grouped
    together to maintain data accuracy
   Helps to prevent disclosure of individual data
              Microaggregation

   National Agricultural Statistics Service (NASS)
    publishes data about farms
   To protect against data disclosure, data is only
    released at the county level
   Data for the farms in each county is averaged
    together to keep the data as accurate as possible
    while still protecting against disclosure
      Microaggregation
   Age    Microaggregated Age (group average)
   10     11.67
   12     11.67
   13     11.67

   57     56.67
   54     56.67
   59     56.67
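
The age example above can be reproduced with a short sketch. It assumes the simplest possible grouping rule (sort the values and cut them into groups of k); the function name `microaggregate` and that rule are choices made for illustration, and real microaggregation methods may pick groups differently.

```python
def microaggregate(values, k):
    """Replace each value with the mean of its group of (at least) k
    similar values. Groups are formed by sorting, an assumed rule."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Fold a short trailing group into the previous one so that no
    # group ends up with fewer than k members.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    result = [0.0] * len(values)
    for group in groups:
        average = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = average
    return result


ages = [10, 12, 13, 57, 54, 59]
print([round(v, 2) for v in microaggregate(ages, k=3)])
# → [11.67, 11.67, 11.67, 56.67, 56.67, 56.67]
```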
           Microaggregation
   [Figure: the original data is averaged to produce the microaggregated data; the user sends queries to, and receives results from, the microaggregated data]
            Data Perturbation
   Perturbed data is raw data with noise
    added
   Pro: With perturbed databases, if
    unauthorized data is accessed, the true
    value is not disclosed
   Con: Data perturbation runs the risk of
    presenting biased data
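
A minimal sketch of the idea, assuming zero-mean Gaussian noise added once to every stored value. The function name `perturb_database`, the noise distribution, and the `scale` parameter are assumptions for illustration; the slides do not specify how the noise is generated.

```python
import random


def perturb_database(values, scale, seed=0):
    """Build the perturbed copy that users will query: each raw value
    has zero-mean Gaussian noise (standard deviation `scale`) added."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [v + rng.gauss(0, scale) for v in values]


ages = [10, 12, 13, 57, 54, 59]
perturbed = perturb_database(ages, scale=2.0)

# Users only ever see statistics computed from the perturbed copy.
print(sum(ages) / len(ages), sum(perturbed) / len(perturbed))
```

Even with zero-mean noise, statistics computed on the perturbed copy differ from the true ones, and some (such as the variance) are systematically inflated, which is one way the bias mentioned above can arise.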
            Data Perturbation
   [Figure: noise is added to the original database to produce a perturbed database; User 1 and User 2 send queries to, and receive results from, the perturbed database]
           Output Perturbation

   Instead of the raw data being transformed
    as in Data Perturbation, only the output or
    query results are perturbed
   The bias problem is less severe than with
    data perturbation
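
For contrast with data perturbation, here is a minimal sketch in which the stored data stays untouched and noise is added only to the answer. The function name `noisy_mean`, the uniform noise, and the `scale` parameter are assumptions made for the example.

```python
import random


def noisy_mean(values, scale, rng):
    """Compute the true mean from the raw data, then perturb only the
    result by uniform noise in [-scale, +scale] before returning it."""
    true_mean = sum(values) / len(values)
    return true_mean + rng.uniform(-scale, scale)


ages = [10, 12, 13, 57, 54, 59]
rng = random.Random(1)
print(noisy_mean(ages, scale=1.0, rng=rng))
```

Because the statistic is computed on the real data first, the returned answer is never further than `scale` from the true value in this sketch.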
            Output Perturbation
   [Figure: User 1 and User 2 send queries to the original database; noise is added to the results before they are returned to the users]
                   Auditing

   Auditing is the process of keeping track of
    all queries made by each user
   Usually done with up-to-date logs
   Each time a user issues a query, the log is
    checked to see if the user is querying the
    database maliciously
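
One way such a log check might look is sketched below. The policy used here (refuse a query whose record set differs from one of the user's earlier query sets by fewer than k records, since comparing the two answers could isolate individuals) is an assumption chosen for illustration, as are the `Auditor` class and its method names.

```python
from collections import defaultdict


class Auditor:
    """Keeps a per-user log of answered query sets (hypothetical)."""

    def __init__(self, k):
        self.k = k
        self.log = defaultdict(list)  # user -> list of past query sets

    def allow(self, user, query_set):
        """Check the new query against the user's log, then record it."""
        new = frozenset(query_set)
        for old in self.log[user]:
            difference = new ^ old  # records in exactly one of the sets
            if 0 < len(difference) < self.k:
                return False        # too close to an earlier query
        self.log[user].append(new)
        return True


auditor = Auditor(k=2)
print(auditor.allow("alice", {1, 2, 3, 4}))  # first query by this user
print(auditor.allow("alice", {1, 2, 3}))     # differs by one record
```

The log grows with every query and must be scanned each time, which reflects the processing overhead associated with auditing.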
           Random Sampling
   Only a sample of the records meeting the
    requirements of the query is shown
   Must maintain consistency by giving exactly the
    same results for the same query
   Weakness – logically equivalent queries
    can result in different query sets
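
A minimal sketch of consistent sampling, assuming the sample is chosen by hashing the query text together with each record ID. The function names, the `id` field, and the hashing scheme are assumptions for illustration. Repeating the same query text always selects the same records; a logically equivalent query written differently hashes differently, so it may select a different sample.

```python
import hashlib


def in_sample(query_text, record_id, rate):
    """Deterministically decide whether a record is sampled for a query."""
    digest = hashlib.sha256(f"{query_text}|{record_id}".encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
    return fraction < rate


def sampled_query(records, predicate, query_text, rate=0.8):
    """Return only a sample of the records that satisfy the query."""
    return [
        r for r in records
        if predicate(r) and in_sample(query_text, r["id"], rate)
    ]


# Hypothetical records, invented for illustration.
records = [{"id": i, "age": 20 + i} for i in range(1, 21)]

first = sampled_query(records, lambda r: r["age"] > 30, "age > 30")
again = sampled_query(records, lambda r: r["age"] > 30, "age > 30")
print(first == again)  # same query text, same sample
```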
           Comparison Methods
The following criteria are used to determine the most effective
methods of statistical database security:
   Security – possibility of exact disclosure, partial
    disclosure, robustness
   Richness of Information – amount of non-
    confidential information eliminated, bias,
    precision, consistency
   Costs – initial implementation cost, processing
    overhead per query, user education
      A Comparison of Methods
Method                   Security        Richness of Information   Costs
Query-set Restriction    Low             Low [1]                   Low
Microaggregation         Moderate        Moderate                  Moderate
Data Perturbation        High            High-Moderate             Low
Output Perturbation      Moderate        Moderate-Low              Low
Auditing                 Moderate-Low    Moderate                  High
Sampling                 Moderate        Moderate-Low              Moderate
[1] Quality is low because a lot of information can be eliminated if the query does not meet the requirements
                                          Sources
   This presentation is posted on
    http://www.cs.jmu.edu/users/aboutams
   Adam, Nabil R.; Wortmann, John C.; Security-Control
    Methods for Statistical Databases: A Comparative Study;
    ACM Computing Surveys, Vol. 21, No. 4, December 1989
    (http://delivery.acm.org/10.1145/80000/76895/p515-adam.pdf?key1=76895&key2=1947043301&coll=portal&dl=ACM&CFID=4702747&CFTOKEN=83773110)
   Official HIPAA – (http://cms.hhs.gov/hipaa/)
   Bernstein, Stephen W.; Impact of HIPAA on
    BioTech/Pharma Research: Rules of the Road
    (http://www.privacyassociation.org/docs/3-02bernstein.pdf)
   Service Bureau; 3rd Party Testing (http://hipaatesting.com/service_bureau.html)

				