									         Succinct Data Structures

                Ian Munro

May 09
               General Motivation

    In Many Computations ...
      Storage Costs of Pointers and Other
      Structures Dominate that of Real Data
    Often this information is not “just random”

    How do we encode a combinatorial object
      (e.g. a tree) of specialized information …
      even a static one
      in a small amount of space & still perform
      queries in constant time ???

         Succinct Data Structure

Representation of a combinatorial object:

 Space requirement “close to” information
  theoretic lower bound
Time for operations required of the data
  type comparable to that of representation
  without such space constraints (O(1))

    Example : Static Bounded Subset

Given: Universe of n elements [0, ..., n-1]
and m arbitrary elements from this universe
Create: a static structure to support search
  in constant time (lg n bit word & usual ops)
Using: Essentially minimum possible # bits
          … $\lg \binom{n}{m}$
Operation: Member query in O(1) time
(Brodnik & M.)
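
To get a feel for "essentially minimum possible # bits", the sketch below compares the information-theoretic bound with two obvious encodings. The function name `subset_space_bits` and the sample sizes are illustrative assumptions, not part of the talk.

```python
from math import ceil, comb, log2

def subset_space_bits(n, m):
    """Bits used by three encodings of an m-subset of {0, ..., n-1}."""
    lower_bound = ceil(log2(comb(n, m)))  # lg C(n, m), the information-theoretic minimum
    bit_vector = n                        # one bit per universe element
    sorted_list = m * ceil(log2(n))       # m explicit keys of lg n bits each
    return lower_bound, bit_vector, sorted_list

# Hypothetical sizes: a sparse subset of a small universe.
print(subset_space_bits(1024, 16))
```

Neither obvious encoding is close to the bound across all ranges of m, which is the gap the Brodnik & M. structure closes while keeping O(1) membership.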
         Careful .. Lower Bounds

Beame-Fich: Finding the largest element less than i is tough
 in some ranges of m (e.g. $m \approx 2^{\sqrt{\lg n}}$)

But OK if i is present; this can be added
 (Raman, Raman, Rao)

              Focus on Trees

.. Because Computer Science is .. Arbophilic

Directories (Unix, all the rest)
Search trees (B-trees, binary search trees,
  digital trees or tries)
Graph structures (we do a tree based search)

and a key application
Search indices for text   (including DNA)
                Preprocess Text for Search
               A Big Patricia Trie/Suffix Trie
         Given a large text file; treat it as bit vector
         Construct a trie with leaves pointing to unique
            locations in text that “match” path in trie (paths
            must start at character boundaries)
         Skip the nodes where there is no branching (n-1
            internal nodes)
              Space for Trees

Abstract data type: binary tree
Size: n-1 internal nodes, n leaves
Operations: child, parent, subtree size, leaf data
Motivation: “Obvious” representation of an n
   node tree takes about 6n words of lg n bits each
   (up, left, right, size, memory manager,
   leaf reference)
i.e. full suffix tree takes about 5 or 6 times
   the space of suffix array (i.e. leaf
   references only)
         Succinct Representations of Trees
Start with Jacobson, then others:

Catalan number
= # ordered rooted trees,
  same number of binary trees
≈ $4^n / (\sqrt{\pi}\, n^{3/2})$
Lower bound on specifying is
about 2n bits
What are natural representations?
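
The "about 2n bits" bound can be checked numerically. This is a small sketch, assuming the standard closed form C_n = C(2n, n)/(n+1) for the Catalan number; the helper name `catalan` is ours.

```python
from math import comb, log2

def catalan(n):
    """n-th Catalan number: the number of binary trees on n nodes."""
    return comb(2 * n, n) // (n + 1)

# lg(C_n) is the minimum number of bits needed to tell the trees apart.
for n in (10, 100, 1000):
    print(n, round(log2(catalan(n)), 1), 2 * n)
```

The middle column stays within O(lg n) of 2n, so a 2n-bit encoding wastes only lower-order terms.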
          Arbitrary Ordered Trees

Use parenthesis notation
Represent the tree

As the binary string (((())())((())()())):
  traverse tree as “(” for node, then
  subtrees, then “)”
Each node takes 2 bits
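
A minimal sketch of the traversal, assuming the tree is given as nested Python lists (each node is the list of its children). The tree literal below is our reconstruction from the slide's string, since the figure itself is not available.

```python
def encode(node):
    """Balanced-parentheses encoding: '(' for the node, then its subtrees, then ')'."""
    return "(" + "".join(encode(child) for child in node) + ")"

# Reconstructed 10-node tree: the root has two children;
# the first has 2 children and the second has 3.
tree = [
    [[[]], []],
    [[[]], [], []],
]

print(encode(tree))  # 2 characters (bits) per node
```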
                  About Heaps

Only 1 heap (shape) on n nodes
Balanced tree,
bottom level pushed left
number nodes row by row;
lchild(i)=2i; rchild(i)=2i+1

[Figure: heap shape on 10 nodes, numbered row by row: 1 / 2 3 / 4 5 6 7 / 8 9 10]

[Figure: the same 10-node heap shape with data values 18, 12, 16, 6, 10, 15, 4, 1, 5, 9 at nodes 1 through 10]

Data: Parent value > child
This gives an implicit data structure for
  priority queue
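
The index arithmetic can be sketched in a few lines. The values are the ones readable from the slide's figure (nodes 1 through 10); the 1-indexed list with a dummy slot 0 and the helper names are our assumptions.

```python
# Slot 0 is unused so that node i lives at index i.
heap = [None, 18, 12, 16, 6, 10, 15, 4, 1, 5, 9]

def lchild(i):
    return 2 * i

def rchild(i):
    return 2 * i + 1

def parent(i):
    return i // 2

def is_heap(a):
    """Check 'parent value > child' for every non-root node."""
    n = len(a) - 1
    return all(a[parent(i)] > a[i] for i in range(2, n + 1))

print(is_heap(heap))
```

No pointers are stored at all: the shape is implied by n, which is what makes the structure implicit.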
Generalizing: Heap-like Notation for any Binary Tree

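Add external nodes
Enumerate level by level

[Figure: binary tree with 8 internal nodes (1) and 9 external nodes (0); level by level the labels read 1 / 1 1 / 1 0 1 1 / 1 0 0 1 0 0 / 0 0 0 0]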

Store vector 11110111001000000 length 2n+1
(Here we don’t know size of subtrees; can be overcome. Could
  use isomorphism to flip between notations)
         How do we Navigate?
Jacobson’s key suggestion:
  Operations on a bit vector
rank(x) = # 1’s up to & including x
select(x) = position of xth 1

So in the binary tree

leftchild(x) = 2 rank(x)
rightchild(x) = 2 rank(x) + 1
parent(x) = select(x/2)

          Heap-like Notation for a Binary Tree

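Add external nodes
Enumerate level by level

[Figure: the same tree with labels “Rank 5” and “Node 11”; e.g. the internal node at position 6 has rank 5, so its right child is node 2·5 + 1 = 11]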

Store vector 11110111001000000 length 2n+1
(Here don’t know size of subtrees; can be overcome. Can also
  use isomorphism to flip between notations)
             Rank & Select
Rank: Auxiliary storage ~ 2n lg lg n / lg n bits

#1’s up to each $(\lg n)^2$-th bit
#1’s within these to each (lg n)-th bit
Table lookup after that

Select: More complicated (especially to get
 this lower order term) but similar notions
Key issue: Rank & Select take O(1) time
 with lg n bit word (M. et al)
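
The two-level counting idea for rank can be sketched as below. This is an illustration under simplifying assumptions, not the structure from the talk: block sizes are fixed small constants instead of (lg n)^2 and lg n, and the final table lookup is replaced by a direct count over at most one block. The class name `RankDirectory` is ours.

```python
class RankDirectory:
    """Two-level rank directory over a list of 0/1 values."""

    def __init__(self, bits, block=4, blocks_per_super=4):
        self.bits = bits
        self.block = block
        self.super = block * blocks_per_super
        self.super_counts = []  # number of 1s before each superblock
        self.block_counts = []  # number of 1s from superblock start to each block start
        total = 0
        within = 0
        for start in range(0, len(bits), block):
            if start % self.super == 0:
                self.super_counts.append(total)
                within = 0
            self.block_counts.append(within)
            ones = sum(bits[start:start + block])
            total += ones
            within += ones

    def rank(self, x):
        """Number of 1s in positions 1..x (1-indexed, inclusive)."""
        if x == 0:
            return 0
        p = x - 1                      # 0-indexed position
        b = p // self.block            # block containing p
        s = p // self.super            # superblock containing p
        # Stand-in for the table lookup: count inside a single block.
        tail = sum(self.bits[b * self.block : p + 1])
        return self.super_counts[s] + self.block_counts[b] + tail

bits = [int(c) for c in "11110111001000000" * 3]
directory = RankDirectory(bits)
print(directory.rank(17), directory.rank(len(bits)))
```

Superblock counts need full-width counters, but block counts only need to count up to the superblock size, which is where the lower-order space bound comes from.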
         Lower Bound: for Rank & for Select
Theorem (Golynski): Given a bit vector of length n
  and an “index” (extra data) of size r bits, let t be
  the number of bits probed to perform rank (or
  select) then: r=Ω(n (lg t)/t).
Proof idea: Argue that otherwise the entire string could be
  reconstructed with too few rank queries (similarly for select)

Corollary (Golynski): Under the lg n bit RAM model,
  an index of size Θ(n lg lg n / lg n) is necessary and
  sufficient to perform rank and select
            Other Combinatorial Objects

    Planar Graphs (Lu et al; Barbay et al)
    Permutations [n]→ [n]
    Or more generally
    Functions [n] → [n]                But what operations?
          Clearly π(i), but also $\pi^{-1}(i)$
          And then $\pi^k(i)$ and $\pi^{-k}(i)$
    Suffix Arrays           (special permutations; references to
         positions in text sorted lexicographically) in linear space

    Arbitrary Graphs (Farzan & M)
         Permutations: a Shortcut Notation
Let P be a simple array giving π; P[i] = π(i)
Also have B[i] be a pointer t positions back
  in (the cycle of) the permutation;
  B[i] = $\pi^{-t}(i)$ .. But only define B for every
  t-th position in cycle. (t is a constant;
  ignore cycle length “round-off”)

Cycle: 2 → 4 → 5 → 13 → 1 → 8 → 3 → 12 → 10 (→ 2)

So array representation
  i    =  1  2  3  4  5  6  7  8  9 10 11 12 13
  P[i] =  8  4 12  5 13  x  x  3  x  2  x 10  1
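
A sketch of how the back pointers answer inverse queries, using the cycle from the slide. The function names, the dict-based storage, and the choice t = 3 are illustrative assumptions; only the entries of P that the slide specifies are included.

```python
def build_back_pointers(P, t):
    """For every t-th element x on each cycle, store B[x] = the element t steps back."""
    B = {}
    seen = set()
    for start in P:
        if start in seen:
            continue
        cycle = []
        x = start
        while x not in seen:          # walk the cycle once
            seen.add(x)
            cycle.append(x)
            x = P[x]
        length = len(cycle)
        for p in range(0, length, t):
            B[cycle[p]] = cycle[(p - t) % length]
    return B

def inverse(P, B, i):
    """Return j with P[j] == i, using O(t) steps of P plus one back pointer."""
    x = i
    while x not in B:                 # walk forward to the next back pointer
        if P[x] == i:
            return x
        x = P[x]
    x = B[x]                          # jump t positions back, landing behind i
    while P[x] != i:                  # walk forward to the predecessor of i
        x = P[x]
    return x

# The 9-element cycle shown on the slide (positions marked x are omitted).
P = {1: 8, 2: 4, 3: 12, 4: 5, 5: 13, 8: 3, 10: 2, 12: 10, 13: 1}
B = build_back_pointers(P, t=3)
print(inverse(P, B, 2))
```

With t = 1/ε, only an ε fraction of the elements carry a back pointer, which is where the (1+ε) factor in the space bound comes from.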
             Representing Shortcuts

In a cycle there is a B every t positions …
But these positions can be in arbitrary order
  Which i’s have a B, and how do we store it?
Keep a vector of all positions: 0 = no B 1 = B
Rank gives the position of B[“i”] in B array
So: π(i) & $\pi^{-1}(i)$ in O(1) time & (1+ε)n lg n bits

Theorem: Under a pointer machine model with
  space (1+ ε) n references, we need time 1/ε
  to answer π and $\pi^{-1}$ queries; i.e. this is as
  good as it gets … in the pointer model.
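
The "which i's have a B" bookkeeping can be sketched as a marker bit vector plus a packed array indexed by rank. The names `pack_back_pointers` and `lookup`, and the sample back pointers, are hypothetical; rank is computed naively here, where a real structure would use an O(1) rank directory.

```python
def pack_back_pointers(n, B):
    """Return (has_B, packed): a 0/1 marker per position 1..n and the B values in position order."""
    has_B = [1 if i in B else 0 for i in range(1, n + 1)]
    packed = [B[i] for i in range(1, n + 1) if i in B]
    return has_B, packed

def lookup(has_B, packed, i):
    """Back pointer stored at position i, or None if position i has no B."""
    if not has_B[i - 1]:
        return None
    r = sum(has_B[:i])  # rank(i): how many marked positions are <= i
    return packed[r - 1]

# Hypothetical back pointers for a 13-element permutation with t = 3.
B = {1: 4, 4: 12, 12: 1}
has_B, packed = pack_back_pointers(13, B)
print(has_B, packed, lookup(has_B, packed, 12))
```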
                Getting n lg n Bits
This is the best we can do for O(1) operations
But using Benes networks:
1-Benes network is a 2 input/2 output switch
(r+1)-Benes network … join tops to tops
#bits(n) = 2 #bits(n/2) + n − 1 = n lg n − n + 1 = min + Θ(n)
[Figure: an (r+1)-Benes network on 8 inputs built from two r-Benes networks; inputs 1..8 on the left, outputs 3 5 7 8 1 6 4 2 on the right]

                 A Benes Network
Realizing the permutation (std π(i) notation)
π = (5 8 1 7 2 6 3 4) ; $\pi^{-1}$ = (3 5 7 8 1 6 4 2)
Note: Θ(n) bits more than “necessary”
[Figure: the 8-input Benes network with switches set to route inputs 1..8 to outputs 3 5 7 8 1 6 4 2]
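
As a quick check of the example, the sketch below inverts the slide's permutation and compares the switch count n lg n − n + 1 quoted on the previous slide with the information-theoretic minimum ⌈lg n!⌉. The helper names are ours, and `benes_bits` assumes n is a power of 2.

```python
from math import ceil, factorial, log2

pi = [5, 8, 1, 7, 2, 6, 3, 4]  # pi(1), ..., pi(8) from the slide

def invert(perm):
    """Inverse of a permutation given as a list of values for 1..n."""
    inv = [0] * len(perm)
    for i, v in enumerate(perm, start=1):
        inv[v - 1] = i
    return inv

def benes_bits(n):
    """One bit per switch: n lg n - n + 1, for n a power of 2."""
    lg_n = n.bit_length() - 1
    return n * lg_n - n + 1

def min_bits(n):
    """Information-theoretic minimum to name one of n! permutations."""
    return ceil(log2(factorial(n)))

print(invert(pi), benes_bits(8), min_bits(8))
```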

         What can we do with it?
Divide into blocks of lg lg n gates … encode
  their actions in a word. Taking advantage
  of regularity of address mechanism
and also
Modify approach to avoid power of 2 issue
Can trace a path in time O(lg n / lg lg n)
This is the best time we are able to get for π
  and $\pi^{-1}$ in nearly minimum space.

              Both are Best
Observe: This method “violates” the pointer
  machine lower bound by using
But …
More general Lower Bound (Golynski): Both
  methods are optimal for their respective
  extra space constraints
(Note: backpointer approach took εn lg n
  extra bits)

         Back to the main track: Powers of π
Consider the cycles of π
( 2 6 8)( 3 5 9 10)( 4 1 7)
Bit vector indicates start of each cycle
( 2 6 8 3 5 9 10 4 1 7)
Ignore parens, view as new permutation, ψ.
Note: $\psi^{-1}(i)$ is position containing i …
So we have ψ and $\psi^{-1}$ as before
Use $\psi^{-1}(i)$ to find i, then bit vector (rank,
  select) to find $\pi^k$ or $\pi^{-k}$
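
A sketch of the whole pipeline on the slide's cycles. Positions are 0-indexed here, ψ is stored as a plain list with a dict for its inverse, and rank/select on the cycle-start bit vector are done naively; all of these are simplifying assumptions for illustration.

```python
cycles = [[2, 6, 8], [3, 5, 9, 10], [4, 1, 7]]

# psi: the cycles written out with the parentheses ignored.
psi = [x for cycle in cycles for x in cycle]
psi_inv = {x: p for p, x in enumerate(psi)}   # psi_inv[i] = position containing i

# Bit vector marking the first position of each cycle.
start_bit = [0] * len(psi)
p = 0
for cycle in cycles:
    start_bit[p] = 1
    p += len(cycle)

def power(i, k):
    """pi^k(i) for any integer k, including negative k."""
    pos = psi_inv[i]
    c = sum(start_bit[: pos + 1])                        # rank: which cycle i is in
    starts = [q for q, b in enumerate(start_bit) if b]   # select: where cycles begin
    begin = starts[c - 1]
    end = starts[c] if c < len(starts) else len(psi)
    return psi[begin + (pos - begin + k) % (end - begin)]

print(power(2, 1), power(3, 5), power(3, -1))
```

The exponent k only enters through one modular addition, so the cost does not depend on k.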
Now consider arbitrary functions [n]→[n]
“A function is just a hairy permutation”
All tree edges lead to a cycle

             Challenges here
Essentially write down the components in a
  convenient order and use the n lg n bits to
  describe the mapping (as per permutations)
To get $f^k(i)$:
Find the level ancestor (k levels up) in a tree
Or go up to root and apply f the remaining
  number of steps around a cycle
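
The two cases can be sketched as follows. This is an illustration only: the level ancestor is found by walking k steps rather than in O(1), the tables are plain lists rather than a succinct encoding, and the names `preprocess` and `power` as well as the sample function are hypothetical.

```python
def preprocess(f):
    """For each i: depth to its cycle, the cycle node (root) it reaches, and cycle tables."""
    n = len(f)
    state = [0] * n            # 0 = unvisited, 1 = on current path, 2 = done
    depth = [0] * n
    root = list(range(n))
    cycle = [None] * n         # for cycle nodes: the list of nodes on that cycle
    pos = [0] * n              # for cycle nodes: index within that list
    for s in range(n):
        path = []
        x = s
        while state[x] == 0:
            state[x] = 1
            path.append(x)
            x = f[x]
        tail = path
        if state[x] == 1:      # the walk closed a new cycle starting at x
            idx = path.index(x)
            nodes = path[idx:]
            for p, v in enumerate(nodes):
                cycle[v], pos[v] = nodes, p
            tail = path[:idx]
        for v in reversed(tail):          # tree nodes hanging off the cycle
            depth[v] = depth[f[v]] + 1
            root[v] = root[f[v]]
        for v in path:
            state[v] = 2
    return depth, root, cycle, pos

def power(f, tables, i, k):
    """f^k(i) for k >= 0."""
    depth, root, cycle, pos = tables
    if k <= depth[i]:
        for _ in range(k):     # level ancestor, done naively here
            i = f[i]
        return i
    r = root[i]                # go up to the root, then move around the cycle
    c = cycle[r]
    return c[(pos[r] + k - depth[i]) % len(c)]

f = [1, 2, 0, 0, 3, 4, 6]      # 0→1→2→0 is a cycle; 5→4→3→0 is a tree path; 6 is a fixed point
tables = preprocess(f)
print([power(f, tables, 5, k) for k in range(7)])
```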
            Level Ancestors
There are several level ancestor techniques
O(1) time and O(n) WORDS.
Adapt Bender & Farach-Colton to work in
  O(n) bits

But going the other way …

              Level Ancestors
Moving Down the tree requires
$f^{-3}$( ) = ( )
The trick:
Report all nodes on a given
   level of a tree in time
   proportional to the number of
   nodes, and
Don’t waste time on trees with
   no answers
            Final Function Result

Given an arbitrary function f: [n]→[n]
With an n lg n + O(n) bit representation we
  can compute $f^k(i)$ in O(1) time and $f^{-k}(i)$ in
  time O(1 + size of answer).

f & $f^{-1}$ are very useful in several applications
… then on to binary relations (HTML markup)

Interesting, and useful, combinatorial objects can be
stored succinctly … O(lower bound) + o()
So that
Natural queries are performed in O(1) time (or at
  least very close)
Programs: http://pizzachili.dcc.uchile.cl/index.html

This can make the difference between
 using the data type and not …
