SCAR Scatter Conceal and Recover

					                  SCAR
       Scatter Conceal and Recover
                                  How to throw stuff.
                                  How to hide stuff.
                                    How to find it.
                                                              Bryan Mills
Masters Defense - Feb. 12, 2007                  University of Pittsburgh
                  Agenda
• Introduce architecture of SCAR
   Describe Goals
   Design Details
• Describe analysis of SCAR
   Reliability
   Security
   Tradeoff
• Implementation
             The Question

• How can I securely and reliably store data
  using publicly available storage?
• Technologies promise to provide globally
  available public storage
   Peer-to-peer (P2P)
   Commercial offerings (Amazon S3, box.net)
• How can P2P technologies provide secure
 and reliable storage?
              P2P Technologies
• Basic P2P infrastructure types
   Un-structured (Gnutella)
     • Peers join network to form a random graph
     • Routing is done by flooding packets
   Hybrid (KaZaA)
     • Use servers or “super-nodes” to reduce latency
     • Some structure is in place
   Structured (Chord, Pastry, CAN)
     •   Peers join network to form a “structure”
     •   Each node is assigned ownership of a specific section
     •   Typically a section is a range of possible hash values
     •   On top of this a Distributed Hash Table (DHT) is built
                         A DHT
 [Figure: nodes N1 to N6 arranged in a ring]
• Each node is responsible for some range of values
   Say 200-500
• The DHT

   Hash Value    Data
   230           My Data
   562           Photos
   942           Stolen mp3
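As a rough illustration of how a hash value is mapped to the node that owns its range, here is a minimal Python sketch. The range boundaries, the toy hash space of 1000 values, and the names `hash_value` and `responsible_node` are assumptions made for this example; real DHTs such as Chord or Pastry use far larger hash spaces and their own routing.

```python
import bisect
import hashlib

# Hypothetical layout: node i owns every hash value above the previous
# boundary and up to (and including) its own boundary.
NODE_BOUNDARIES = [200, 500, 600, 800, 900, 1000]
NODE_NAMES = ["N1", "N2", "N3", "N4", "N5", "N6"]
HASH_SPACE = 1000  # toy hash space so the numbers match the slide


def hash_value(name):
    """Map a name (e.g. a filename) onto the toy hash space [0, HASH_SPACE)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % HASH_SPACE


def responsible_node(value):
    """Return the name of the node whose range contains the hash value."""
    index = bisect.bisect_left(NODE_BOUNDARIES, value)
    return NODE_NAMES[index]


if __name__ == "__main__":
    for value in (230, 562, 942):
        print(value, "->", responsible_node(value))
```

With these made-up boundaries, the hash value 230 falls in the range owned by N2.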
         Current Process for DHT
 [Flow diagram]
• Secret Data + Password -> Symmetrical Encryption -> Encrypted Secret
• Encrypted Secret + Name (filename) -> Insert into DHT -> Public DHT
        Insert into DHT
 [Flow diagram]
• Name (filename) -> Hash Function -> Determine Responsible Node
• Send Data (Encrypted Secret) to Node
             What’s Wrong?

• Encryption might not be enough
   Breaking the encryption might become computationally feasible
   Cryptanalysis
• Easy to get Data
   Constant hashing function (just need filename)
• Reliability of Data
   Single storage location
     • Replication impacts security
What should we do?
 [Figure: "My Secret Data" broken apart into scattered pieces]
                      Goals
• Tracking the un-trackable
   Predictable way of determining storage locations
   BUT hard to determine these locations without proper
    credentials
• Provide reliable recovery of data
   Nodes can fail so how do we ensure we can recover
    the data
   Higher chance of failure because multiple nodes are
    required for recovery
• Work with any DHT
                Achieving Goals
• Use hash chains to produce storage locations
    Seed chain with password
    Trackable only by users with password
• Reliability
    Most DHTs use replication
      • Security suffers (attacker has multiple targets)
    SCAR uses Erasure Codes
      • Encode data into n pieces such that only k pieces are
        required for data recovery (n ≥ k)
           Reed-Solomon
           IDA (Information Dispersal Algorithm)
          SCAR’s Process Flow
 [Flow diagram]
• Big enough?
   No: Pad Data (Using Secret Sharing)
   Yes: Add Header (File Header)
• Produce n Blocks (IDA Encoded (k,n))
• Storage Locations (Using Hash Chains)
• Pack Storage Bins (Header Information)
• Insert Bins (Using DHT) -> DHT
               Pre-Processing
• To use IDA we need to make sure the data is
  large enough
   Pad the data such that the entire file is needed to
    recover (secret sharing)
     • SCAR uses simple XOR to achieve this property
• Header information
   Checksum
   Original file size
   Padding size
 [Figure: header layout: File Size, SHA-1 Hash, Padding Size, Encrypted Data]
• Encrypt the data?
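The slides do not spell out SCAR's exact XOR scheme, so the following is only a sketch of one common XOR construction with the stated property: the data is expanded into several equal-size blocks, and every block is needed to get the data back. The function names `xor_pad` and `xor_unpad` and the use of the `secrets` module are assumptions for illustration.

```python
import secrets


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def xor_pad(data, blocks):
    """Expand data into `blocks` blocks; all of them are needed to recover it."""
    if blocks < 2:
        raise ValueError("need at least two blocks")
    # blocks - 1 random pads, each as long as the data
    pads = [secrets.token_bytes(len(data)) for _ in range(blocks - 1)]
    last = data
    for pad in pads:
        last = xor_bytes(last, pad)  # fold every pad into the final block
    return pads + [last]


def xor_unpad(blocks):
    """XOR all blocks together to recover the original data."""
    result = bytes(len(blocks[0]))
    for block in blocks:
        result = xor_bytes(result, block)
    return result


if __name__ == "__main__":
    padded = xor_pad(b"tiny secret", 4)
    print(xor_unpad(padded))
```

Because each pad is random, any subset that is missing even one block reveals nothing about the data, which is the "entire file is needed" property the slide asks for.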
                                How IDA Works
     Map from k pieces to n pieces
     Use matrix multiplication producing n pieces of
      length s/k (s is the file size in bytes)

\[
\begin{pmatrix}
a_{11} & a_{21} & \cdots & a_{k1} \\
a_{12} & a_{22} & \cdots & a_{k2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{kn}
\end{pmatrix}
\cdot
\begin{pmatrix}
b_{1} & b_{2} & \cdots & b_{s/k} \\
b_{(s/k)+1} & b_{(s/k)+2} & \cdots & b_{2(s/k)} \\
\vdots & \vdots & \ddots & \vdots \\
b_{(k-1)(s/k)+1} & b_{(k-1)(s/k)+2} & \cdots & b_{k(s/k)}
\end{pmatrix}
=
\begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1,s/k} \\
p_{21} & p_{22} & \cdots & p_{2,s/k} \\
\vdots & \vdots & \ddots & \vdots \\
p_{n1} & p_{n2} & \cdots & p_{n,s/k}
\end{pmatrix}
\]

     Key Vectors (n x k)  ·  The original data (k x s/k)  =  Encoded Data (n x s/k)
                       Data Matrix
Given some file f of size s bytes
  f = (b1,b2,b3,…,bs)
Divide f into k pieces and put them in a matrix (D)
  of k rows and s/k columns
\[
D =
\begin{pmatrix}
b_{1} & b_{2} & \cdots & b_{s/k} \\
b_{(s/k)+1} & b_{(s/k)+2} & \cdots & b_{2(s/k)} \\
\vdots & \vdots & \ddots & \vdots \\
b_{(k-1)(s/k)+1} & b_{(k-1)(s/k)+2} & \cdots & b_{k(s/k)}
\end{pmatrix}
\]
               Key Vectors
Create n vectors of k elements each, chosen so that any k
  of them are linearly independent.
Put these n vectors in a matrix (A) of n rows
  and k columns

\[
A =
\begin{pmatrix}
a_{11} & a_{21} & \cdots & a_{k1} \\
a_{12} & a_{22} & \cdots & a_{k2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{kn}
\end{pmatrix}
\]
                                How IDA Works
     Putting these together we have a mapping from
       k original data pieces to n encoded pieces

\[
A \cdot D = P, \qquad
P =
\begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1,s/k} \\
p_{21} & p_{22} & \cdots & p_{2,s/k} \\
\vdots & \vdots & \ddots & \vdots \\
p_{n1} & p_{n2} & \cdots & p_{n,s/k}
\end{pmatrix}
\]

     Key Vectors A (n x k)  ·  The original data D (k x s/k)  =  Encoded Data P (n x s/k)
             Decoding Data
We have n pieces of data and we can decode the
  original data using any k pieces
Given any k pieces we then construct a new matrix
  with the corresponding key vectors
We take the inverse of this new matrix (T), guaranteed
  to exist because the key vectors were linearly
  independent

\[
T =
\begin{pmatrix}
a_{11} & a_{21} & \cdots & a_{k1} \\
a_{12} & a_{22} & \cdots & a_{k2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{1k} & a_{2k} & \cdots & a_{kk}
\end{pmatrix}
\qquad
T^{-1} =
\begin{pmatrix}
t_{11} & t_{21} & \cdots & t_{k1} \\
t_{12} & t_{22} & \cdots & t_{k2} \\
\vdots & \vdots & \ddots & \vdots \\
t_{1k} & t_{2k} & \cdots & t_{kk}
\end{pmatrix}
\]
              Decoding Data
This gives us the mapping from our k encoded
  pieces into our original k data pieces

\[
T^{-1} \cdot [\,k \text{ pieces of data}\,] = [\,\text{original data pieces}\,]
\]
      How IDA Works (bit of detail)
All math is done within a Galois field (or finite
  field), specifically one that is a prime field
  denoted GF(p)
All math is done modulo p
Values within the matrices must be less than
  the value p (this includes D and A)
Since we are dealing with bytes (256 possible values) we can
  use p = 257, the smallest prime above 256 (convenient, eh?)
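The encode and decode steps can be sketched in Python as follows. This is a minimal illustration and not SCAR's actual code: it assumes Vandermonde rows as the key vectors (one standard choice for which any k rows are linearly independent modulo a prime), and the names `key_vectors`, `encode`, `invert` and `decode` are made up for this example.

```python
P = 257  # prime modulus from the slides; every byte value 0..255 is below P


def key_vectors(n, k):
    """Assumed key vectors: Vandermonde rows [1, x, x^2, ...] for x = 1..n."""
    return [[pow(x, j, P) for j in range(k)] for x in range(1, n + 1)]


def encode(data, n, k):
    """Split data into the k rows of D and return the n rows of A * D mod P."""
    if len(data) % k:
        raise ValueError("data length must be a multiple of k (pad first)")
    width = len(data) // k
    d = [list(data[r * width:(r + 1) * width]) for r in range(k)]
    a = key_vectors(n, k)
    return [[sum(a[i][j] * d[j][c] for j in range(k)) % P
             for c in range(width)] for i in range(n)]


def invert(matrix):
    """Invert a square matrix modulo P with Gauss-Jordan elimination."""
    size = len(matrix)
    aug = [row[:] + [int(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] % P)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], P - 2, P)  # Fermat inverse, P is prime
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(v - factor * w) % P
                          for v, w in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def decode(pieces, n, k):
    """Recover the data from any k pieces given as {piece_index: row}."""
    indices = sorted(pieces)[:k]
    a = key_vectors(n, k)
    t_inv = invert([a[i] for i in indices])  # T^-1 for the chosen pieces
    width = len(pieces[indices[0]])
    out = []
    for r in range(k):
        for c in range(width):
            out.append(sum(t_inv[r][j] * pieces[indices[j]][c]
                           for j in range(k)) % P)
    return bytes(out)


if __name__ == "__main__":
    secret = b"SCAR hides data!"  # 16 bytes, so k = 4 gives rows of width 4
    encoded = encode(secret, n=5, k=4)
    subset = {i: encoded[i] for i in (0, 2, 3, 4)}  # any 4 of the 5 pieces
    print(decode(subset, n=5, k=4))
```

One detail worth noting: because the arithmetic is modulo 257, an encoded value can be 256, which does not fit in a single byte. The sketch sidesteps this by keeping encoded rows as lists of integers.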
            IDA and SCAR

• Algorithm to generate key vector matrix is
 the same for any values of n and k
   This means we don’t need to store this matrix
• We know the order of the pieces because of the hash
 chaining process
     Tracking the un-trackable
• Hash chains to produce storage locations
• Use password in the location generation process
  L0 = hash(filename+password+username)     (not used as location)
  L1 = hash(filename+password+username+L0)
  L2 = hash(filename+password+username+L1)
  …
  Ln = hash(filename+password+username+Ln-1)
• If collision occurs then skip to next location
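The chain above can be sketched in a few lines of Python. The choice of SHA-1, the plain string concatenation, the hex encoding of each link, and the function name `storage_locations` are assumptions for illustration. The sketch also assumes L0 only seeds the chain (the slide's "not used as location" note appears to refer to it) and does not handle collision skipping.

```python
import hashlib


def storage_locations(filename, password, username, n):
    """Return L1..Ln, each derived from the credentials and the previous link."""
    seed = (filename + password + username).encode("utf-8")
    link = hashlib.sha1(seed).hexdigest()  # L0, used only to start the chain
    locations = []
    for _ in range(n):
        # L(i) = hash(filename + password + username + L(i-1))
        link = hashlib.sha1(seed + link.encode("ascii")).hexdigest()
        locations.append(link)
    return locations


if __name__ == "__main__":
    for location in storage_locations("notes.txt", "hunter2", "bryan", 3):
        print(location)
```

Anyone with the same filename, password and username regenerates the same locations in the same order, while a different password produces an unrelated chain.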
              Storage Bins

• All n pieces receive a header
   Used for error detection during recovery
   Detection of hash collisions
• Header
   Checksum of data: SHA-1 of contents
   Hash of next and previous storage locations:
    SHA-1(L[i-1] + L[i+1] + password + k + n) for the piece stored at L[i]
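A storage bin with such a header might be packed and checked as in the sketch below. The byte layout (20-byte content checksum, then 20-byte link hash, then the piece), the plain concatenation inside the link hash, and the names `pack_bin` and `unpack_bin` are assumptions; the slides only list which fields the header holds.

```python
import hashlib

DIGEST_SIZE = 20  # length of a SHA-1 digest in bytes


def link_hash(prev_loc, next_loc, password, k, n):
    """Hash tying a bin to its neighbours in the chain and to the password."""
    material = f"{prev_loc}{next_loc}{password}{k}{n}".encode("utf-8")
    return hashlib.sha1(material).digest()


def pack_bin(piece, prev_loc, next_loc, password, k, n):
    """Prefix an encoded piece with a content checksum and a link hash."""
    checksum = hashlib.sha1(piece).digest()
    return checksum + link_hash(prev_loc, next_loc, password, k, n) + piece


def unpack_bin(bin_data, prev_loc, next_loc, password, k, n):
    """Verify both header fields and return the piece, or raise ValueError."""
    checksum = bin_data[:DIGEST_SIZE]
    link = bin_data[DIGEST_SIZE:2 * DIGEST_SIZE]
    piece = bin_data[2 * DIGEST_SIZE:]
    if link != link_hash(prev_loc, next_loc, password, k, n):
        raise ValueError("bin is not part of this chain (possible collision)")
    if checksum != hashlib.sha1(piece).digest():
        raise ValueError("piece is corrupted")
    return piece


if __name__ == "__main__":
    packed = pack_bin(b"encoded piece", "loc1", "loc3", "hunter2", 8, 11)
    print(unpack_bin(packed, "loc1", "loc3", "hunter2", 8, 11))
```

During recovery, a mismatched link hash signals that whatever sits at the location belongs to someone else (a hash collision), while a mismatched checksum signals a damaged piece.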
          Inserting Into DHT

• Know storage locations and have storage
 bins
   Insert each bin at its specified location
• SCAR runs outside the DHT
   SCAR can use any DHT
   SCAR can operate across DHTs
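Because SCAR only needs to put and get values, the insertion step can be sketched against a tiny stand-in interface. `DictDHT` and `scatter` are hypothetical names, and the in-memory dictionary merely stands in for a real DHT client; the loop also shows one way to read the "skip to next location" rule for collisions.

```python
class DictDHT:
    """In-memory stand-in for a DHT that exposes only put and get."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        """Store value under key; return False if the key is already taken."""
        if key in self._store:
            return False
        self._store[key] = value
        return True

    def get(self, key):
        """Return the stored value, or None if nothing is stored there."""
        return self._store.get(key)


def scatter(dht, locations, bins):
    """Insert each bin at the next free candidate location, in chain order."""
    candidates = iter(locations)
    used = []
    for bin_data in bins:
        for location in candidates:
            if dht.put(location, bin_data):
                used.append(location)
                break  # this bin is stored, move on to the next bin
        else:
            raise RuntimeError("ran out of candidate locations")
    return used


if __name__ == "__main__":
    dht = DictDHT()
    dht.put("b", b"someone else's data")  # simulate an occupied location
    print(scatter(dht, ["a", "b", "c", "d"], [b"bin 1", b"bin 2"]))
```

Any object with the same two methods could be passed as `dht`, which is the sense in which SCAR can sit on top of any DHT.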
               Overview
 [Figure: Secret Data goes through IDA and Bin Packing, then the data is scattered across the DHT using hash chains]
                 Analysis
• How do we choose the number…
   of pieces generated? (n)
   of pieces required? (k)
• Model node availability using on/off model
• Model security provided by SCAR
• Explore tradeoff between security and
 reliability
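                      Node Availability

 • Nodes cycle between being on and off
     (available and unavailable)
      Use random variable Si(t)

\[
S_i(t) =
\begin{cases}
1 & \text{if node } i \text{ is available at time } t \\
0 & \text{otherwise}
\end{cases}
\]

 [Figure: timeline of Si(t) alternating between 1 during on periods (on_p, on_p+1) and 0 during off periods (off_p, off_p+1), up to the current time t]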
           Node Availability

• Using this model we can define the node's
 average availability using the expected
 values of the node's on and off times

\[
A_i = \lim_{t \to \infty} P[S_i(t) = 1] = \frac{E[on_i]}{E[on_i] + E[off_i]}
\]

• How are we going to calculate the expected
 on and off periods?

           Node Availability

• We use a probability model
   This gives us the expected on and off times

 [Figure: two-state transition diagram between "on" and "off"]
   Probability of going from on to off (turning off)
   Probability of going from off to on (turning on)
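To make the link between the two transition probabilities and average availability concrete, here is a small sketch. It assumes a discrete-time two-state chain in which a node turns off with probability `p_off` and turns on with probability `p_on` at each step, so on and off periods are geometrically distributed. Those parameter names, the time-step assumption and the function names are illustrative and not taken from the slides.

```python
import random


def availability(p_off, p_on):
    """A = E[on] / (E[on] + E[off]) with geometric on and off periods."""
    expected_on = 1.0 / p_off   # mean number of steps spent on
    expected_off = 1.0 / p_on   # mean number of steps spent off
    return expected_on / (expected_on + expected_off)


def simulate(p_off, p_on, steps, seed=0):
    """Fraction of time steps a simulated node spends in the on state."""
    rng = random.Random(seed)
    on = True
    on_steps = 0
    for _ in range(steps):
        on_steps += on
        if on:
            on = rng.random() >= p_off  # stay on unless we turn off
        else:
            on = rng.random() < p_on    # turn on with probability p_on
    return on_steps / steps


if __name__ == "__main__":
    print("model     :", availability(0.1, 0.4))
    print("simulation:", simulate(0.1, 0.4, 200_000))
```

With `p_off = 0.1` and `p_on = 0.4` the expected on and off periods are 10 and 2.5 steps, giving an availability of 0.8, and the simulated fraction should land close to that value.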
                 Node Availability

• We define 3 classes of users
   Infrastructure
   Power User
   Peepers
   Type                                  Node Availability*

   Infrastructure          1.0%    95.0%   98.0%

   Power User              20.0%   40.0%   65.0%

   Peepers                 80.0%   10.0%   15.0%

  * Experimental results
                Data Availability

• We can now describe a single node's
  availability
• Data availability using SCAR is:

\[
\sum_{i=k}^{n} \binom{n}{i} (1-u)^{i} \, u^{n-i}
\]

  u - mean probability that a given node is unavailable (1-A)
  k - number of required pieces
  n - total number of pieces
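The data availability sum (the probability that at least k of the n pieces sit on available nodes) is easy to evaluate directly. A minimal sketch, where the function name `data_availability` is made up for this example:

```python
from math import comb


def data_availability(n, k, u):
    """Probability that at least k of n pieces are on available nodes.

    u is the probability that a given node is unavailable, so each piece
    is reachable with probability 1 - u, independently of the others.
    """
    return sum(comb(n, i) * (1 - u) ** i * u ** (n - i)
               for i in range(k, n + 1))


if __name__ == "__main__":
    for k in (5, 8, 10):
        print(f"n=10, k={k}, u=0.1 -> {data_availability(10, k, 0.1):.4f}")
```

Lowering k, or raising n for a fixed k, increases availability, since more combinations of surviving pieces are enough for recovery.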
             Validation of Model

• Simulate recovery of data from nodes
 [Figure: two plots, "Simulation" and "Model", for N=1000; u=10%; n=10]
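A validation of this kind can be approximated with a short Monte Carlo run: mark each of the n pieces as unavailable with probability u, and count how often at least k remain. This is only a sketch of the idea and not the simulation behind the slides; the function name, the trial count and the fixed seed are assumptions.

```python
import random


def simulate_recovery(n, k, u, trials, seed=0):
    """Estimate the chance that at least k of n pieces can be fetched."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        # each piece's node is independently unavailable with probability u
        available = sum(rng.random() >= u for _ in range(n))
        successes += available >= k
    return successes / trials


if __name__ == "__main__":
    for k in (6, 8, 10):
        estimate = simulate_recovery(n=10, k=k, u=0.1, trials=20_000)
        print(f"k={k}: simulated availability {estimate:.3f}")
```

For n=10, k=8 and u=10% the binomial model gives roughly 0.93, so the simulated estimate should fall close to that figure.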
           Validation of Model

• Let's look a bit closer




 Erasure coding effectiveness drops significantly once
 node availability falls below 80%
               Security Model

• Measure the level of difficulty for an
  attacker

\[
\frac{1}{\binom{c}{k} \, k!}
\]

  c - perceived capacity
  k - number of required pieces

• Increasing perceived capacity increases
  security
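Reading the security measure as 1 / (C(c, k) · k!), that is, the chance of picking the k required pieces in the right order out of c candidates in a single guess, it can be evaluated as below. Both that reading of the extracted formula and the function name `guess_probability` are assumptions.

```python
from math import comb, factorial


def guess_probability(c, k):
    """Chance that one guess names the k required pieces, in order, out of c."""
    return 1 / (comb(c, k) * factorial(k))


if __name__ == "__main__":
    for capacity in (50, 100, 1000):
        print(f"c={capacity}, k=8 -> {guess_probability(capacity, 8):.3e}")
```

The value shrinks quickly as c grows, which matches the point that increasing perceived capacity increases security.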
             Security Model

• Want to increase the perceived capacity
• Multiple ways of doing this:
   Not allow attacker to see transactions
   Introduce “fake” data
   Batch transactions
             Security Model

• This measures “hardness”, not a probability
  of breaking, and we want to compare
  availability to security
• Let's represent “hardness” in terms of the
  probability of breaking during a specified
  period of time
   Probability of cracking data in 5 years
                   Tradeoff Model

• This allows us to combine the two models
     We can determine better values for k and n

TradeOff =
    W1 * P[Data availability] + W2 * P[Breaking data in time t]




  Set W1 and W2 to 0.5
              Tradeoff Model

• Start by fixing n=10
   Vary k and see effect on tradeoff metric
   Notice peak at k=8
• Fix k=8 and vary n
   Notice that n=11 is a big improvement over n=10
• Gives us n=11 and k=8
            Tradeoff Model

• Depends heavily upon the availability of
 the network's nodes
               Simulation

• SCAR’s sensitivity to node availability
            Implementation
• Implemented SCAR with the ability to use
 any DHT implementation
   Just need put and get operations
• Currently uses OpenDHT’s web services
  API
• Implemented in Python
• Provides a command line utility to store
  and retrieve data
                   Future Work
• Multiple entry points
• Scatter data across systems
• Same technique to secure network traffic
   Data travels along different paths
• Effect on Latency
• Experiment with other distributions
   On/Off model not symmetrical
• Firefox plugin
   Store private data using SCAR
 Questions?




   Bryan Mills
bmills@cs.pitt.edu

				