

									Sensitive Data In a Wired World
 Negative Representations of Data

         Stephanie Forrest
      Dept. of Computer Science
        Univ. of New Mexico
          Albuquerque, NM

• Goal: Develop new approaches to data security and privacy that
  incorporate design principles from living systems:
   –   Survivability and evolvability
   –   Autonomy
   –   Robustness, adaptation and self repair
   –   Diversity
• Extends earlier work on computational properties of the immune system:
   – Intrusion detection
   – Automated response
   – Collaborative information filtering
                      Project Overview

• Immunology and data:
    – Negative representations of information
• Epidemiology and the Internet:
    – Social networks matter
    – The real world is not always scale free
• The social utility of privacy:
    – Why is privacy an important value in democratic societies?
    – Evolutionary perspective

•   Paul Helman and Cris Moore (UNM)
•   Robert Axelrod and Mark Newman (Univ. Michigan)
•   Matthew Williamson (Sana Security)
•   Rebecca Wright and Michael de Mare (Stevens)
•   Joan Feigenbaum and Avi Silberschatz (Yale)
    – Fernando Esponda’s post-doc next year.
How the Immune System Distributes Detection
•   Many small detectors matching nonself (negative detection).
•   Each detector matches multiple patterns (generalization).

•   Advantages of distributed negative detection:
     –   Localized (no communication costs)
     –   Scalable and tunable
     –   Robust (no single point of failure)
     –   Private
                 Applications to Computing

•   Anomaly detectors                earlier work
•   Information filters              earlier work
•   Adaptive queries                  future
•   Negative representations         in progress
    – A positive set DB is a set of fixed length strings.
    – A negative set NDB represents all the strings not in DB.
    – Intuition: If an adversary obtains a string from NDB, little
      information is revealed.

    –   U= All possible four character strings
    –   DB={juan, eric, dave}
    –   U-DB={aaaa, aaab, cris, john, luca, raul, tehj, tosh,.…}
    –   There are 26^4 − 3 = 456,973 strings in U-DB.

•   Can U-DB be represented efficiently, given |U-DB| >> |DB| ?
     – YES: There is an algorithm that creates an NDB of size polynomial in DB.
     – Strategy: Compress information using a don’t-care symbol (*).
                                 DB        U-DB    NDB

                                 000       001     01*
                                 101       010     0*1
                                 111       011     1*0
                                           100
                                           110
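
The table can be checked mechanically. The sketch below is a minimal illustration, not the prototype mentioned later in the talk: `matches`, `in_ndb`, `prefix_ndb`, and `represented` are hypothetical names, and the prefix-based construction is one simple way to get an NDB whose size is polynomial in |DB|, not necessarily the algorithm the slide refers to.

```python
from itertools import product

WILDCARD = "*"


def matches(pattern: str, s: str) -> bool:
    """True if s agrees with pattern at every position; '*' matches any symbol."""
    return len(pattern) == len(s) and all(
        p == WILDCARD or p == c for p, c in zip(pattern, s)
    )


def in_ndb(ndb, s: str) -> bool:
    """Membership query: one linear scan over the NDB records, no index needed."""
    return any(matches(p, s) for p in ndb)


def prefix_ndb(db, length: int, alphabet: str = "01"):
    """Build an NDB with at most length * |DB| * (|alphabet| - 1) records."""
    ndb = []
    prefixes = {""}  # prefixes of DB strings at the previous length
    for i in range(1, length + 1):
        present = {s[:i] for s in db}
        for p in sorted(prefixes):
            for symbol in alphabet:
                candidate = p + symbol
                if candidate not in present:
                    # No DB string starts with candidate, so every
                    # completion of it belongs to U-DB.
                    ndb.append(candidate + WILDCARD * (length - i))
        prefixes = present
    return ndb


def represented(ndb, length: int, alphabet: str = "01"):
    """Expand an NDB into the explicit set it covers (exponential; checking only)."""
    universe = ("".join(t) for t in product(alphabet, repeat=length))
    return {s for s in universe if in_ndb(ndb, s)}


db = {"000", "101", "111"}
slide_ndb = ["01*", "0*1", "1*0"]          # NDB column of the table
built_ndb = prefix_ndb(db, 3)              # ["01*", "001", "100", "110"]
u_minus_db = {"001", "010", "011", "100", "110"}
```

Both `slide_ndb` and `built_ndb` expand to the same five strings, which shows that a given U-DB can have more than one negative representation.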

•   What properties does the representation have?
     – Membership queries are tractable (linear time even without indexing).
          • Other queries, information leakage are future work.
     – Inferring information from a subset of NDB (next slide).
     – Inferring DB from NDB is NP-Hard (note: not doing crypto):
          •   Currently investigating instance difficulty.
          •   Algorithms for increasing instance difficulty.
          •   On-line insert/delete algorithms preserve problem difficulty.
          •   Collaborations with R. Wright, M. de Mare, and C. Moore.
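
The link to satisfiability can be made concrete. In the sketch below (hypothetical helper names, binary alphabet assumed), each NDB record becomes one CNF clause, so the strings of DB are exactly the satisfying assignments of the resulting formula. The brute-force `recover_db` enumerates all 2^l strings and is only meant to show why reversal is expensive in general.

```python
from itertools import product


def record_to_clause(record: str):
    """Map an NDB record to a CNF clause over variables 1..len(record).

    A string x is excluded by the record when it agrees with every specified
    bit, so x avoids the record when at least one specified bit differs:
    a '0' at position i gives literal +i (x_i = 1), a '1' gives -i (x_i = 0).
    """
    clause = []
    for i, bit in enumerate(record, start=1):
        if bit == "0":
            clause.append(i)
        elif bit == "1":
            clause.append(-i)
    return clause


def satisfies(assignment: str, clause) -> bool:
    """True if the bit string makes at least one literal of the clause true."""
    return any((assignment[abs(lit) - 1] == "1") == (lit > 0) for lit in clause)


def recover_db(ndb, length: int):
    """Brute-force reversal: test all 2**length strings against every clause."""
    clauses = [record_to_clause(r) for r in ndb]
    return {
        s
        for s in ("".join(bits) for bits in product("01", repeat=length))
        if all(satisfies(s, c) for c in clauses)
    }


ndb = ["01*", "0*1", "1*0"]
clauses = [record_to_clause(r) for r in ndb]   # [[1, -2], [1, -3], [-1, 3]]
db = recover_db(ndb, 3)                        # {"000", "101", "111"}
```

A record with k specified bits yields a clause with k literals, which is why record width maps onto k-SAT.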
         What information is revealed by queries?
            (without assuming irreversibility)
•   Having access to a subset of NDB (or DB) yields some information about strings
    outside that subset:
     –   Assume NDB (or DB) is partitioned into n subsets.
•   To the query “Is x in DB,” what do I learn about x if x is not in my subset?
     –   Must consult n subsets of NDB to conclude that x is in DB.
     –   Must consult the subsets only until x is found (on average n/2).
     –   Assumes that we care more about DB than U-DB.
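
A small sketch of the asymmetry described above, assuming the three-record NDB from the earlier example is split into n = 3 subsets of one record each (the partition and the function names are illustrative assumptions):

```python
def matches(pattern: str, s: str) -> bool:
    """True if s agrees with pattern at every position; '*' matches any symbol."""
    return len(pattern) == len(s) and all(
        p == "*" or p == c for p, c in zip(pattern, s)
    )


def subsets_consulted(partition, x: str):
    """Return (found, k): whether x matched a record, and how many subsets were read."""
    for k, subset in enumerate(partition, start=1):
        if any(matches(p, x) for p in subset):
            return True, k          # x is in U-DB; stop at the first hit
    return False, len(partition)    # no hit anywhere: x is in DB


partition = [["01*"], ["0*1"], ["1*0"]]
results = {x: subsets_consulted(partition, x) for x in ["000", "010", "001", "100"]}
```

Here "000" (a DB member) forces all three subsets to be read, while "010", "001", and "100" are settled after one, two, and three subsets respectively.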

         Probability and information content as the membership of strings is
         revealed. DB contains 10% of all possible L-length strings (formulas).
              Private Set Intersection

• Determine which records are in the intersection of
  several databases, i.e.,
   – DB1 ∩ DB2 ∩ … ∩ DBn
   – Equivalently, U − (NDB1 ∪ NDB2 ∪ … ∪ NDBn)
• Each party may compute the intersection
   – DBi − (NDB1 ∪ NDB2 ∪ … ∪ NDBn)
• Party i learns only the intersection of all the sets,
• And not the cardinality of the other sets.
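
The set algebra can be sketched as follows. This only illustrates the computation each party performs; it says nothing about how hard the NDBs are to reverse. `naive_ndb` is a hypothetical, uncompressed stand-in (it lists every string outside the DB) used here for the second and third parties.

```python
from itertools import product


def matches(pattern: str, s: str) -> bool:
    """True if s agrees with pattern at every position; '*' matches any symbol."""
    return len(pattern) == len(s) and all(
        p == "*" or p == c for p, c in zip(pattern, s)
    )


def naive_ndb(db, length: int):
    """Uncompressed NDB: every binary string of the given length not in db."""
    universe = ("".join(bits) for bits in product("01", repeat=length))
    return sorted(s for s in universe if s not in db)


def intersection_for_party(own_db, all_ndbs):
    """Keep the party's own records that appear in none of the NDBs."""
    return {
        x
        for x in own_db
        if not any(matches(p, x) for ndb in all_ndbs for p in ndb)
    }


db1 = {"000", "101", "111"}
db2 = {"101", "110", "111"}
db3 = {"001", "101", "111"}
ndbs = [["01*", "0*1", "1*0"], naive_ndb(db2, 3), naive_ndb(db3, 3)]
result = intersection_for_party(db1, ndbs)   # {"101", "111"}
```

Party 1 only ever reads its own positive records and the other parties' negative ones.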
                          Results cont.

• How might these properties be useful?
   –   Protect data from insider attacks
   –   Computing set intersections
   –   Surveys involving sensitive information
   –   Anonymous digital credentials
   –   Fingerprint databases
   –   Other ideas?
• Prototype implementations:
   – Perl, C
   – See demo
                         Computer Epidemiology
         Justin Balthrop, Mark Newman, Matt Williamson

         [Figure: two panels plotted against degree k. Left panel: IP network and administrator network. Right panel: email traffic and address books. Source: Science 304:527-529 (2004).]

•   Information spreads over networks of social contacts between computers:
     –   Email address books.
     –   URL links.
•   Network topology affects the rate and extent of spreading:
     –   Epidemiological models, and the epidemic threshold.
•   Controlling spread on scale-free networks:
     –   Random vaccination is ineffective (e.g., anti-virus software).
     –   Targeted vaccination of high-connectivity nodes.
     –   Control degree distribution in time rather than space.
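
A toy illustration of why targeting high-connectivity nodes matters. The graph below (three linked hubs with five leaves each) is an invented stand-in, not data from the study, and picking three leaves stands in for a typical random choice, since most nodes are leaves. The size of the largest connected group of unvaccinated nodes is used as a rough proxy for how far an infection could spread.

```python
from collections import deque


def build_graph():
    """Three linked hubs with five leaves each (toy heavy-tailed network)."""
    edges = [("h0", "h1"), ("h1", "h2")]
    for h in range(3):
        for leaf in range(5):
            edges.append((f"h{h}", f"h{h}_leaf{leaf}"))
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def largest_component(graph, removed):
    """Size of the largest connected component after removing vaccinated nodes."""
    seen = set(removed)
    best = 0
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over unvaccinated nodes
            node = queue.popleft()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


graph = build_graph()
hubs = sorted(graph, key=lambda n: len(graph[n]), reverse=True)[:3]
leaves = [n for n in graph if len(graph[n]) == 1][:3]

baseline = largest_component(graph, set())      # 18 nodes reachable
after_leaves = largest_component(graph, leaves)  # 15 nodes still connected
after_hubs = largest_component(graph, hubs)      # 1: only isolated leaves remain
```

With the same budget of three vaccinations, removing leaves barely changes connectivity, while removing the hubs fragments the network completely.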
                 The Social Utility of Privacy
                     Robert Axelrod and Ryan Gerety

•   Typical framing:
     – Privacy values should remain as is (e.g., Lessig).
     – Individual rights vs. state (i.e., civil liberties vs. community safety / crime).
•   A community may have its own interest in defending individual privacy
    (or in not defending it), independent of the civil liberties argument:
     – To promote innovation in changing environments.
     – To cope with distortions (e.g., overconfidence of middle managers).
     – To compensate for overgeneralized norms.
•   Not necessarily advocating more privacy:
     – From a societal/informational point of view, how should appropriate bounds
       on privacy be determined?
•   Current status:
     – Exploratory modeling based on simple games.
      Next Steps: Negative Representations

•   Distributed negative representations
•   Leaking partial information
•   Relational algebra operators on the negative database:
     – Select, join, etc.
•   Instance difficulty:
     – Hiding given satisfying assignments in a SAT formula
     – Approximate representations
     – Other representations?
•   More realistic implementations
•   Negative data mining:
     – Is it easier/harder to find certain instances in NDB?
•   Imprecise representations:
     – Partial matching and queries
     – Learning algorithms

Stephanie Forrest            Fernando Esponda

   Paul Helman                  Elena Ackley

•   F. Esponda, S. Forrest, and P. Helman. "Negative representations of information."
    International Journal of Information Security (submitted March 2005).
•   F. Esponda, E. S. Ackley, S. Forrest, and P. Helman. "On-line negative databases." Journal
    of Unconventional Computing (in press).
•   F. Esponda, S. Forrest, and P. Helman. "A formal framework for positive and negative
    detection." IEEE Transactions on Systems, Man, and Cybernetics 34:1 pp. 357-373 (2004).
•   J. Balthrop, S. Forrest, M. Newman, and M. Williamson. "Technological networks and the
    spread of computer viruses." Science 304:527-529 (2004).
•   H. Inoue and S. Forrest. "Inferring Java security policies through dynamic sandboxing."
    2005 International Conference on Programming Languages and Compilers (PLC'05) (in
•   F. Esponda, E. Ackley, S. Forrest, and P. Helman. "On-line negative databases." Third
    International Conference on Artificial Immune Systems (ICARIS) Best paper award (2004).

\[
F_1 = P(x \in DB \mid x \notin NDB_{f_j}) = \frac{|DB|}{|U| - |NDB_{f_j}|}
\]
\[
F_2 = P(x \in DB \mid x \notin DB_{f_j}) = \frac{|DB| - |DB_{f_j}|}{|U| - |DB_{f_j}|}
\]
\[
H_N(x) = -F_1 \log_2 F_1 - (1 - F_1) \log_2 (1 - F_1)
\]
\[
H_P(x) = -F_2 \log_2 F_2 - (1 - F_2) \log_2 (1 - F_2)
\]
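
A numerical sketch of F1, F2, H_N, and H_P, using the talk's setting in which DB holds 10% of all strings. The sizes |U| = 1000 and |DB| = 100 are illustrative assumptions, and the "seen" arguments are read here as the number of strings covered by the part of NDB (or DB) already consulted.

```python
import math


def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def f1(db_size: int, u_size: int, ndb_seen: int) -> float:
    """P(x in DB | x is not in the consulted part of NDB)."""
    return db_size / (u_size - ndb_seen)


def f2(db_size: int, u_size: int, db_seen: int) -> float:
    """P(x in DB | x is not in the consulted part of DB)."""
    return (db_size - db_seen) / (u_size - db_seen)


U_SIZE = 1000   # assumed universe size
DB_SIZE = 100   # DB contains 10% of all strings

for fraction in (0.0, 0.5, 1.0):
    ndb_seen = int(fraction * (U_SIZE - DB_SIZE))
    db_seen = int(fraction * DB_SIZE)
    p_neg = f1(DB_SIZE, U_SIZE, ndb_seen)
    p_pos = f2(DB_SIZE, U_SIZE, db_seen)
    print(
        f"revealed={fraction:.0%}  "
        f"F1={p_neg:.3f} H_N={binary_entropy(p_neg):.3f}  "
        f"F2={p_pos:.3f} H_P={binary_entropy(p_pos):.3f}"
    )
```

Both entropies start at the same value when nothing has been revealed and fall to zero once the whole of NDB (or DB) has been consulted.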

    Generating Hard-to-Reverse Negative Databases

•   The randomized algorithm can be used to create a negative database.
•   Insert/Delete operations turn known hard formulas into negative databases.
•   The Morph operator may be used to search for hard instances.

    [Figure: Instance Difficulty (l=64). x-axis: specified bits per record (k-SAT); y-axis: decisions (zchaff).]
    [Figure: Instance Difficulty (Glassy8 formula, l=64). Series: Original NDB, Updated NDB. x-axis: specified bits per record (k-SAT); y-axis: decisions (zchaff).]

    H. Jia, C. Moore and B. Selman, "From spin glasses to hard satisfiable formulas," SAT 2004.
              Effect of the Morph operation

•   The Morph operation takes as input
    a negative database NDB and
    outputs NDB’ that represents the
    same set U-DB.
•   The plot shows how the complexity
    of a database changes after
    applying the morph operator.
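
The slides do not give the details of the Morph operator, so the sketch below is not that operator. It only illustrates the property stated above with a much simpler rewrite: splitting one record on a don't-care position yields a different NDB' that represents the same set U-DB. All function names are hypothetical.

```python
from itertools import product


def matches(pattern: str, s: str) -> bool:
    """True if s agrees with pattern at every position; '*' matches any symbol."""
    return len(pattern) == len(s) and all(
        p == "*" or p == c for p, c in zip(pattern, s)
    )


def represented(ndb, length: int):
    """Explicit set of binary strings covered by the NDB (exponential; checking only)."""
    universe = ("".join(bits) for bits in product("01", repeat=length))
    return {s for s in universe if any(matches(p, s) for p in ndb)}


def expand_wildcard(ndb, record_index: int, position: int):
    """Replace one record by two records that fix a '*' position to 0 and to 1."""
    record = ndb[record_index]
    if record[position] != "*":
        raise ValueError("position is already specified")
    specialised = [record[:position] + b + record[position + 1:] for b in "01"]
    return ndb[:record_index] + specialised + ndb[record_index + 1:]


ndb = ["01*", "0*1", "1*0"]
ndb_prime = expand_wildcard(ndb, 0, 2)   # ["010", "011", "0*1", "1*0"]
same_set = represented(ndb, 3) == represented(ndb_prime, 3)
```

The rewritten database has more specified bits per record, which is the quantity on the x-axis of the instance-difficulty plots.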
