A Lightning-Fast INDEX Drives Massive DATA Analysis

FastBit is an extremely efficient indexing technology for accelerating database queries on massive datasets. FastBit enhances conventional bitmap indexing technology by employing advanced compression, encoding, and binning methods. Previously, bitmap indexes were only considered useful for categorical data with a small number of possible values. Now, many applications (especially scientific applications) require indexing over data with a very large number of possible values. Because of specialized enhancements, FastBit is fast enough to support real-time queries for scientific data exploration applications, such as visual analytics. In many applications, FastBit can search data 10–100 times faster than other products.

As computers become ever more powerful, they collect and produce more bytes of data. Making sense of all the bytes is becoming a central challenge of many scientific endeavors. Often, a small fraction of the data records holds the key to insight; therefore, an efficient tool to locate and retrieve key data records is essential. In the past 30 years, database management systems have emerged in industry as the most prevalent tools for such tasks.

A database management system imposes certain structures on the data records, typically as tables with rows and columns, and requires all questions (queries) to be in a machine processable language, such as the Structured Query Language (SQL). The system optimizes the query-answering process by implementing auxiliary data structures to accelerate common types of questions. These acceleration techniques are known as database indexes. FastBit is one such database indexing system, designed primarily for answering queries efficiently.

Figure 1. [Plot of query response time (sec) against fraction of hits for three indexes: B*-tree, DBMS bitmap index, and FastBit index.] The average query response time measured on the client to answer the same set of queries on a set of high-energy physics data. Both B*-tree and the database management system bitmap index are from well-regarded commercial database management systems. Bitmap indexes from both the database management system and FastBit are compressed basic bitmap indexes; the only difference is the compression method used. Across the whole range of queries, FastBit is on average 14 times faster than the commercial bitmap index. The FastBit index is 4–30 times faster than the B*-tree. (Illustration: A. Tovey. Source: Wu, Otto, and Shoshani 2004.)

FastBit can search huge databases much more quickly than the fastest commercially available database management system (figure 1). For example, researchers from the University of Hamburg, Germany used FastBit to accelerate their drug discovery software by 140–250 times. Engineers at a major Internet company found that FastBit can

32                                                                                                                  SCIDAC REVIEW    FALL 2009   WWW.SCIDACREVIEW.ORG
Bitmap Index: An Example
Typically, a bitmap index is built for each variable of a dataset. In the illustration shown in figure 2, we take a one-variable dataset as an example. The variable is named X and its values are listed in the second column of the table. The first column under the heading of RID contains row identifiers used internally in most database management systems. The variable X contains integers with four possible values, 0–3. Corresponding to these four values, there are four bitmaps, shown as four columns in the table on the right under the heading B0–B3, each bitmap corresponding to one possible value of X. A bit in a bitmap is set to 1 if the value is equal to the value represented by the bitmap and 0 otherwise. For example, B2 represents the value 2, and the first bit of B2 is set to 1 because the value of X is 2 in the first row (with RID 0). To answer a query involving X > 1, one can produce a bitmap representing the rows that satisfy the condition by ORing B2 and B3. Similarly, other conditions involving X alone can be answered with bitwise logical operations.

Figure 2 shows the simplest form of the bitmap index known as the basic bitmap index. The first commercial implementation of a bitmap index in a database system, called Model 204, is a hybrid version of this basic bitmap index and the B-tree index. Popular database management systems, such as Oracle, use a compressed version of this basic bitmap index. The bitmaps in such an index are easy to compress.

    Base Data      Bitmap Index
    RID    X       B0   B1   B2   B3
      0    2        0    0    1    0
      1    1        0    1    0    0
      2    3        0    0    0    1
      3    0        1    0    0    0
      4    3        0    0    0    1
      5    1        0    1    0    0
      6    0        1    0    0    0
      7    0        1    0    0    0
      8    2        0    0    1    0

Figure 2. A logical view of the basic bitmap index for a variable named X. Columns B0–B3 are bitmaps. (Illustration: A. Tovey.)
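
The example in this sidebar can be sketched in a few lines of Python. This is a minimal illustration, not FastBit's implementation: bitmaps are plain Python lists of 0s and 1s, and the helper names `build_bitmap_index` and `bitwise_or` are invented for this sketch.

```python
def build_bitmap_index(values):
    """Build a basic bitmap index: one bitmap (list of 0/1) per distinct value."""
    index = {value: [0] * len(values) for value in sorted(set(values))}
    for rid, value in enumerate(values):
        index[value][rid] = 1          # set the bit for this row in the matching bitmap
    return index


def bitwise_or(a, b):
    """OR two uncompressed bitmaps of equal length."""
    return [x | y for x, y in zip(a, b)]


# Values of X from figure 2, listed in RID order 0..8.
X = [2, 1, 3, 0, 3, 1, 0, 0, 2]
index = build_bitmap_index(X)

# X > 1 is answered by ORing the bitmaps for the values 2 and 3 (B2 and B3).
hits = bitwise_or(index[2], index[3])
rids = [rid for rid, bit in enumerate(hits) if bit]
print(rids)
```

The resulting RIDs are the rows of figure 2 in which X is 2 or 3. The base data are never consulted when answering the query; only the two bitmaps are read.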

match Web pages with advertisements at least 10 times, and in many cases 100 times, faster than commercial database technologies. And visualization experts at Lawrence Berkeley National Laboratory (LBNL) showed that FastBit can identify and track particles in a laser wakefield particle accelerator simulation 1,000 times faster than previous tools. In addition, academic groups and commercial companies use FastBit for activities such as image analysis, computer network security, national security, and routing of Voice over Internet Protocol (VOIP). FastBit’s efficient search capability is critical to these applications.

FastBit was originally designed to help particle physicists sort through billions of data records to find a few key pieces of information. For example, in a high-energy physics experiment called STAR, colliding particles generate billions of collision events, but only a few hundred of these might have the most distinctive signatures of a new state of matter, called quark–gluon plasma, and finding evidence of this plasma is a major objective of the STAR experiment.

In such an application, the dataset includes hundreds of searchable attributes, and the base data records are not updated once they are generated. In contrast, the common transactional applications for which the database systems were designed typically contain only a small number of searchable attributes. For example, a banking application might only be able to search for accounts based on account number and customer name. In addition, a database management system is designed to locate an individual record (or a very small number of records) efficiently, while most scientific data analysis tasks require a much larger number of records. Furthermore, scientific data analyses are much more ad hoc, where users usually examine many combinations of conditions on different attributes. All of these present tremendous challenges for existing indexing methods.

Of the many indexing methods intended to accelerate search operations, the most popular is the B-tree index. However, B-tree indexing methods fail to meet the stringent demands of modern data analysis, such as interactive visual analysis over terabytes of data, or locating the mastermind behind distributed denial-of-service attacks while the attacks are in progress. These queries return thousands of records that require a large number of tree-branching operations that translate into slow pointer chases in memory and random accesses on disk, thus taking a long time.

One reason for this behavior is that these tree-based indexing structures are designed for data that change frequently over time. They are optimized for finding a very small number of records, such as finding a bank account and updating the balance. They are not well-suited for locating thousands of records, as is necessary in a typical analysis of large datasets. Many popular indexing techniques, such as hash indexes, have similar shortcomings. Furthermore, B-tree indexes inflate the volume of the data stored by a factor of three to four, which is not generally acceptable when dealing with terabyte-sized datasets.

B-tree and variants are primarily designed for searching one variable at a time. For searching

multiple variables simultaneously, other tree-based indexing techniques, such as Oct-tree and KD-tree indexes, are more appropriate. Such indexes are designed for queries that involve all variables indexed. In many applications, the user only searches on a handful of variables out of hundreds. In these cases, these tree-based indexes are very ineffective. In scientific applications where users often query on an arbitrary combination of variables, limiting searches to any subset of combinations may limit the possibility of discovery. To meet all these demands, another type of index must be used.

For tasks that demand the fastest possible query processing speed, bitmap indexes have shown promise after a number of years of active use in some commercial database management systems (sidebar “Bitmap Index: An Example”). However, the effectiveness of these bitmap indexes is limited to certain types of data, primarily those variables with a relatively small range of possible values, such as the gender of a customer or the state in which the customer lives. These variables with a relatively small number of possible values are known as categorical values, or low-cardinality variables. However, most scientific data contain variables (such as velocity, temperature, and pressure) that have a very large number of possible values. FastBit significantly improves the speed of a searching operation on both high- and low-cardinality values with a number of techniques, including a vertical data organization, an innovative bitmap compression technique, and several new bitmap encoding methods.

Figure 3. An illustration of how WAH compression works. (Illustration: A. Tovey.)
  1: Start with 186 bits
     1001000111010000010110000010000, followed by 62 0s and 93 1s
  2: Parse into 31-bit groups
     1001000111010000010110000010000
     0000000000000000000000000000000   (2 groups)
     1111111111111111111111111111111   (3 groups)
  3: Encode the groups
     1001000111010000010110000010000  ->  01001000111010000010110000010000
     2 groups of 0s                   ->  10000000000000000000000000000010
     3 groups of 1s                   ->  11000000000000000000000000000011
  4: WAH compressed words
     01001000111010000010110000010000 10000000000000000000000000000010 11000000000000000000000000000011

Vertical Data Organization
The first technique used in FastBit is vertical data organization. The use of this technique grew out of the need to search on a relatively small number of variables from a massive dataset with a large number of searchable attributes.

For example, the first application we tackled was a high-energy experiment that produced billions of data records, or “events,” each associated with hundreds of searchable variables. However, a typical search only involves a handful of variables, making it highly desirable to partition the data by variable and to store their values in separate files to avoid reading unnecessary data from disk. This way of organizing data, known in the database community as “vertical partitioning,” is well-suited for scientific applications and data warehousing applications where existing records are not modified.

In contrast, the traditional horizontal data organization makes it necessary to retrieve all variable values of a record if any values are needed. The horizontal data organization is well-suited for transactions, such as recording changes in a department store’s inventory. However, a scientific data analysis is very rarely such a transaction. In most cases, a scientific dataset, like a commercial data warehouse, is modified through append operations only, and the majority of queries only touch a subset of the variables. The vertical data organization is well-suited for such data access patterns.

Word-Aligned Hybrid Compression
The second key technology in FastBit is an innovative compression method called Word-Aligned Hybrid (WAH) compression. It reduces index sizes by compressing each individual bitmap separately, while at the same time allowing operations on the compressed bitmaps to proceed efficiently.

Typically, a bitmap index resides on a disk file system or another secondary storage system. To answer a query, the relevant part of the index is read into the computer’s memory before computation. The conventional wisdom is that the time needed to read parts of the index into memory is much greater than the time needed to perform the computation after the data are in memory.

However, with existing compression techniques, answering a query using compressed bitmap indexes in fact requires more time to perform computations on the bitmaps than the time to read these bitmaps into memory. Therefore, the most direct way to improve the efficiency of compressed bitmap indexes is to make the computations on the compressed bitmaps faster. WAH was invented and patented in 2004 specifically for this purpose. See the sidebar “FastBit Compression” for more information about this specialized compute-efficient compression method.
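
The vertical data organization described above lends itself to a small sketch. The Python below is only an illustration under stated assumptions: the variable names (`energy`, `pressure`, `temperature`), the `.bin` file naming, and the helpers `write_columns` and `read_column` are hypothetical and do not reflect FastBit's actual file layout. It shows each variable stored in its own file, so that a query on one variable reads only that file.

```python
import os
import tempfile
from array import array


def write_columns(directory, columns):
    """Store each variable in its own binary file (vertical partitioning)."""
    for name, values in columns.items():
        with open(os.path.join(directory, name + ".bin"), "wb") as f:
            array("d", values).tofile(f)


def read_column(directory, name, n_rows):
    """Read one variable back without touching the other variables' files."""
    column = array("d")
    with open(os.path.join(directory, name + ".bin"), "rb") as f:
        column.fromfile(f, n_rows)
    return column


# Hypothetical events with three variables; real datasets may have hundreds.
events = {
    "energy": [12.0, 75.5, 51.0, 3.2],
    "pressure": [1.1, 0.9, 1.3, 1.0],
    "temperature": [290.0, 310.5, 305.2, 288.8],
}

with tempfile.TemporaryDirectory() as directory:
    write_columns(directory, events)
    # A query on energy alone reads only energy.bin from disk.
    energy = read_column(directory, "energy", n_rows=4)
    hits = [rid for rid, value in enumerate(energy) if value > 50.0]

print(hits)
```

With a horizontal (row-wise) layout, the same query would have to read every variable of every record, which is the cost the vertical organization avoids. New records are handled by appending to each file, consistent with the append-only access pattern described in the text.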

FastBit Compression
FastBit uses a special compression method called Word-Aligned Hybrid (WAH) code. WAH uses a version of run-length encoding for storing long sequences of 0s (or 1s) rather than storing them literally. However, when alternating 0s and 1s are too close together in the bitmap, using counts can take more space. WAH stores such bits literally. Therefore, there are two types of contents under WAH compression, and it reserves one bit to distinguish between them.

A key idea in WAH is to count bits in 31-bit groups on computers with 32-bit words (rather than the usual 32 bits or 8 bits, as other compression methods do). This allows each 31-bit group to fit in a 32-bit word. Because WAH counts all bits in 31-bit groups, bitwise operations are always aligned neatly without requiring additional adjustments. Therefore, the logical operations (such as AND, OR, NOT) on the bitmaps are performed efficiently on the compressed bitmaps and result in enormous performance gains.

More specifically, when a computer word has 32 bits, WAH logically divides incoming bitmaps into 31-bit groups. A group with a mixture of 0s and 1s is stored literally in a 32-bit word, with the remaining bit indicating that it is storing literal bit values. Such a word is called a literal word. Adjacent groups with only 0s or 1s are called a fill. A fill can be a 0-fill or 1-fill depending on the bit value. A fill is represented by its length and the bit value. WAH uses one word to represent a fill, and this word includes a bit to indicate that it is a fill, a bit representing the bit value of the fill, and the remaining 30 bits represent the length in number of 31-bit groups. It is possible that a dataset contains more rows than can be expressed in the 30 bits allocated for storing a fill length. In this case, the bitmap will be fragmented so that the maximum possible fill length of each fragment can be represented by a 32-bit word.

Figure 3 is an illustration of how WAH is able to compress 186 bits into three compressed words.
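
The word layout described in this sidebar can be sketched as follows. This is a simplified illustration, not FastBit's code: it assumes the flag bit is the most significant bit of the word and the fill value is the next bit (consistent with the words shown in figure 3), it accepts only inputs whose length is a multiple of 31, and the function names `wah_encode` and `wah_count_ones` are invented here. The second function illustrates working directly on compressed words, by counting 1 bits without decompressing.

```python
GROUP_BITS = 31                     # bits per group on a machine with 32-bit words
FILL_FLAG = 1 << 31                 # most significant bit: 1 = fill word, 0 = literal word
FILL_VALUE = 1 << 30                # in a fill word, the bit value being repeated
MAX_FILL_LENGTH = (1 << 30) - 1     # largest group count that fits in the low 30 bits


def wah_encode(bits):
    """Compress a string of '0'/'1' characters into a list of 32-bit WAH words.

    Simplification for this sketch: the input must consist of whole 31-bit groups.
    """
    if len(bits) % GROUP_BITS != 0:
        raise ValueError("this sketch only handles whole 31-bit groups")
    words = []
    for start in range(0, len(bits), GROUP_BITS):
        group = bits[start:start + GROUP_BITS]
        if group in ("0" * GROUP_BITS, "1" * GROUP_BITS):
            header = FILL_FLAG | (FILL_VALUE if group[0] == "1" else 0)
            previous = words[-1] if words else 0
            same_fill = (previous & (FILL_FLAG | FILL_VALUE)) == header
            if same_fill and (previous & MAX_FILL_LENGTH) < MAX_FILL_LENGTH:
                words[-1] = previous + 1       # extend the current fill by one group
            else:
                words.append(header | 1)       # start a new fill of length 1
        else:
            words.append(int(group, 2))        # literal word: the flag bit stays 0
    return words


def wah_count_ones(words):
    """Count the 1 bits directly on the compressed words."""
    total = 0
    for word in words:
        if word & FILL_FLAG:
            if word & FILL_VALUE:              # a 1-fill contributes 31 bits per group
                total += GROUP_BITS * (word & MAX_FILL_LENGTH)
        else:
            total += bin(word).count("1")      # literal word: count its set bits
    return total


# The 186-bit example of figure 3: one mixed group, two groups of 0s, three groups of 1s.
bits = "1001000111010000010110000010000" + "0" * 62 + "1" * 93
words = wah_encode(bits)
for word in words:
    print(format(word, "032b"))
```

For the figure 3 input, the sketch produces one literal word, one 0-fill of length 2, and one 1-fill of length 3. A complete implementation would also need to handle a trailing partial group and the logical operations (AND, OR, NOT) on compressed words, which are omitted here.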

In many tests, WAH-compressed indexes have been shown to be 10 times faster than the best-known bitmap indexes. In addition to being compute-efficient, WAH is also effective in reducing index size. The size of a FastBit index is on average one-third the size of the original data, or about one-tenth the size of the widely used B-tree index.

Encoding Methods for Bitmap Indexes
The third key technology in FastBit is a new group of bitmap encoding methods called multi-level encodings. A multi-level encoding method employs a nested set of bins to reduce the amount of work required for answering a query. In this case, higher levels of a multi-level index can give an approximate solution and lower-level indexes are accessed as needed to refine the solution.

For a number of years, various attempts have been made to construct multi-level encodings; however, none of them has been shown to be definitively more efficient than simpler encoding methods. Recently, through extensive analyses, FastBit developers found that optimal performance can be achieved with only two levels, and the best of the two-level encodings can perform, on average, three to five times faster than the best one-level indexes with WAH compression.

Basic Encoding Methods: Equality, Range, and Interval Encoding
In general, a bitmap index can be constructed in three steps: binning, encoding, and compression. In the binning step, each variable of the base data is scanned and grouped into bins. For example, for a string-valued variable, strings starting with the same character may be grouped into a single bin; for a floating-point valued variable, such as pressure, all the values with the same exponent may be grouped into a single bin. The encoding step takes the bin identifiers and translates them into bitmaps. For example, the integer variable X shown in figure 2 may be viewed as bin numbers. The encoding method used by the basic bitmap index is known as the equality encoding because each bitmap represents exactly one bin. If a value for a particular row falls in a bin, a 1 is placed in the corresponding position of the bitmap.

When each bin contains exactly one value, the equality encoding can be thought of as storing pre-computed answers to equality queries of the form X = a, where a is one of the values that appeared in the dataset for variable X. For a query of the form a ≤ X ≤ b, many equality encoded bitmaps are needed. In general, the wider a query range [a, b] is, the more bitmaps are needed.

Another popular encoding method is range encoding, which can be viewed as storing pre-computed answers to range queries of the form X ≤ a. Its first bitmap is the same as the first bitmap under the equality encoding; its second bitmap is the bitwise OR of the first two bitmaps under the equality encoding; and its third bitmap is the bitwise OR of the first three bitmaps under the equality encoding. In general, the kth bitmap under the range encoding is the bitwise OR of the first k bitmaps under the equality encoding.

A third well-known encoding method is interval encoding, which can be thought of as pre-computing a subset of queries of the form a ≤ X ≤ b. Under interval encoding, each bitmap is the bitwise OR of about half of the bitmaps under the equality encoding. Given 100 bins, the equality encoding produces


FastBit in Visual Analytics
Colliding high-energy particles is a fundamental tool for probing the structure of elementary particles and understanding the building blocks of our Universe. However, accelerators are becoming bigger (many miles long) and more expensive as the energy they must reach increases, and this is a challenge. The Laser Wakefield Particle Accelerator (LWPA) is a new type of accelerator that is capable of accelerating particles to very high energy in a much shorter distance. To better understand these accelerators, simulations are conducted using software such as the VORPAL framework. Such a simulation can produce a very large amount of data. For example, a run of VORPAL may output 100 timesteps of 200 GB each. A volume rendering of the particles in such a simulation is shown in figure 4.
   FastBit has been used to quickly find the particles that have undergone wakefield acceleration and track them throughout the simulation. In addition, FastBit can efficiently compute the histograms needed for the histogram-based parallel coordinate display used for visualizing and selecting particles. As in a normal parallel coordinate display, there is one axis (a vertical line in figure 5) for each variable in a dataset. However, unlike a normal parallel coordinate display, which draws a line for each particle from the simulation, a set of two-dimensional histograms is used to provide the density of lines that would have been displayed. This reduces the amount of work needed to produce the parallel coordinate display and enables us to work with much larger datasets. FastBit can efficiently filter the particles to be shown in the parallel coordinate display. In figure 5 a focused beam is displayed (in red) on top of a gray background of a larger subset of particles. This filtering reduces the number of particles involved in the histogram computation and improves the overall analysis speed.
   FastBit is also very efficient at tracking particles across time. In figure 6, the particles selected in figure 5 are shown at different timesteps of the simulation. In this task, FastBit indexes can dramatically decrease the execution time.
   Altogether, using FastBit has significantly increased the speed of this visual task. For example, in the first test run of the FastBit-enabled tracking program, the tracking task completed in 0.3 seconds, whereas the original tracking program took 5 minutes, a 1,000-fold speedup in this case.
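The histogram-based parallel coordinate display described in this sidebar can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function name, the variable names px and x, and the plain list of booleans standing in for a FastBit selection are all hypothetical, and every value is assumed to lie inside the given axis range.

```python
def histogram_2d(xs, ys, selected, x_range, y_range, bins):
    """Count the selected rows per (x-bin, y-bin) cell for one pair of adjacent axes."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    counts = [[0] * bins for _ in range(bins)]
    for x, y, keep in zip(xs, ys, selected):
        if not keep:
            continue  # filtered-out particles never reach the histogram
        # Map each value to a bin; the upper edge falls into the last bin.
        i = min(int((x - x_lo) / (x_hi - x_lo) * bins), bins - 1)
        j = min(int((y - y_lo) / (y_hi - y_lo) * bins), bins - 1)
        counts[i][j] += 1
    return counts


# Hypothetical particle data: momentum px and position x for four particles.
px = [0.1, 0.9, 0.8, 0.2]
x = [1.0, 9.0, 8.5, 2.0]
selected = [value > 0.5 for value in px]  # stand-in for an index-backed filter
density = histogram_2d(px, x, selected, (0.0, 1.0), (0.0, 10.0), bins=2)
print(density)
```

One such histogram is computed for each pair of neighboring axes, so the drawing cost depends on the number of cells rather than on the number of particles.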

Figure 4. A volume rendering of the density of particles (variable rho_0; min: -3.876e+08, max: 0.000) in a laser wakefield particle accelerator. (O. Rubel et al. 2008)

100 bitmaps, but the interval encoding produces 51 bitmaps. The first bitmap under the interval encoding is the bitwise OR of the first 50 bitmaps under the equality encoding. The second bitmap is the bitwise OR of 50 bitmaps under the equality encoding, starting with the second bitmap. In general, the kth bitmap under the interval encoding is the bitwise OR of half of the bitmaps under the equality encoding starting with the kth bitmap.
   One advantage of interval encoding is that the number of bitmaps is about half that of the equality and the range encoding. Due to its unique construction, it is possible to answer any query of the form a ≤ X ≤ b with two bitmaps. It is possible to answer the same query with two bitmaps under the range encoding as well. However, because the interval encoding produces fewer bitmaps, we would prefer the interval encoding in general.
   In practice, interval encoding is not widely used. Ten years after it was proposed in the research literature, no commercial database system has implemented it because the interval-encoded bitmaps do not compress well. Even with the best possible compression method, the size of an interval-encoded index could be much larger than the base data.
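The interval encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not FastBit's implementation: bitmaps are plain Python integers (bit i stands for row i), the window width is assumed to be half the number of bins as in the 100-bin example, and all function names are hypothetical.

```python
def equality_bitmaps(bins, num_bins):
    """One bitmap per bin; bit i of bitmap k is set when row i falls in bin k."""
    bitmaps = [0] * num_bins
    for row, b in enumerate(bins):
        bitmaps[b] |= 1 << row
    return bitmaps


def interval_bitmaps(eq):
    """Bitmap k is the OR of half of the equality bitmaps, starting at bitmap k."""
    width = len(eq) // 2
    out = []
    for k in range(len(eq) - width + 1):
        acc = 0
        for bitmap in eq[k:k + width]:
            acc |= bitmap
        out.append(acc)
    return out


def range_query(iv, num_bins, a, b, all_rows):
    """Rows with a <= bin <= b, combining at most two interval bitmaps."""
    width = num_bins // 2
    last = num_bins - width          # index of the last interval bitmap
    length = b - a + 1
    if length == num_bins:           # every bin qualifies, no bitmap needed
        return all_rows
    if length > width:               # two overlapping windows cover the range
        return iv[a] | iv[b - width + 1]
    if a <= last and b >= width - 1: # window starting at a AND window ending at b
        return iv[a] & iv[b - width + 1]
    if b < width - 1:                # range near the low end: cut off the tail
        return iv[a] & ~iv[b + 1]
    return iv[b - width + 1] & ~iv[a - width]  # range near the high end


bins = [3, 97, 50, 12, 64]           # bin number of each of five rows
eq = equality_bitmaps(bins, 100)
iv = interval_bitmaps(eq)
print(len(eq), len(iv))              # 100 equality bitmaps, 51 interval bitmaps
print(bin(range_query(iv, 100, 10, 60, (1 << len(bins)) - 1)))
```

Every branch of range_query touches at most two interval bitmaps, which is the property claimed for queries of the form a ≤ X ≤ b.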

Multi-Component Encoding
A strategy to curb the index size is to split a bin num-
ber into multiple components. For example, with
1,000 bins numbered from 0 to 999, one may split
the three-digit number into three components
where each decimal digit is considered a single com-
ponent and encode each of the three components
separately. Under the equality encoding, one needs
10 bitmaps for the first (most-significant) digit, 10
bitmaps for the second digit, and 10 for the third
(least-significant) digit. Altogether, it uses a total of
30 bitmaps under the three-component equality
encoding instead of 1,000 under the simple equality
encoding. The number of bitmaps needed for three-
component range encoding and interval encoding
can be reduced similarly.
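The three-component example above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not FastBit code: bitmaps are Python integers (bit i stands for row i) and the helper names are hypothetical.

```python
def split_digits(bin_number, bases):
    """Split a bin number into components, most significant first."""
    digits = []
    for base in reversed(bases):
        bin_number, digit = divmod(bin_number, base)
        digits.append(digit)
    return digits[::-1]


def build_components(bins, bases):
    """Equality-encode each component separately: one bitmap per digit value."""
    index = [[0] * base for base in bases]
    for row, b in enumerate(bins):
        for comp, digit in enumerate(split_digits(b, bases)):
            index[comp][digit] |= 1 << row
    return index


def equals(index, bases, value, all_rows):
    """Rows whose bin number equals `value`: AND one bitmap per component."""
    result = all_rows
    for comp, digit in enumerate(split_digits(value, bases)):
        result &= index[comp][digit]
    return result


bases = [10, 10, 10]                       # three decimal digits, bins 0..999
bins = [345, 12, 345, 999]                 # bin number of each of four rows
index = build_components(bins, bases)
print(sum(len(component) for component in index))  # 30 bitmaps in total
print(bin(equals(index, bases, 345, 0b1111)))
```

The saving in bitmaps has a cost at query time: an equality query now reads one bitmap per component instead of a single bitmap.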
Figure 5. A parallel coordinate display of a focused beam (red) and its background (gray) from a set of simulation data for a laser wakefield particle accelerator. (O. Rubel et al. 2008)

   The number of bitmaps generated by multi-component encoding decreases monotonically as the number of components increases. With range encoding and the interval encoding, index size decreases proportionally as well. In the extreme case, where the maximum number of components is used, each component includes two possible values, which is equivalent to breaking the bin numbers into individual binary digits. In this case, because the two bitmaps representing a binary digit are complements of each other, it is possible to retain only one of the two bitmaps for each component. Retaining the bitmap corresponding to the binary digit being 1 is equivalent to a popular encoding method called binary encoding; the corresponding index is known as the bit-sliced index. It turns out that binary encoding is better than the multi-component versions of all three basic encoding methods (equality, range, and interval encodings), with the exception of the one-component equality encoding. We have proven these results theoretically and observed them in practice.
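Binary encoding, as described above, keeps one bitmap per binary digit of the bin number. The sketch below is a hedged illustration in Python rather than FastBit's implementation: bitmaps are Python integers (bit i stands for row i), the function names are hypothetical, and the range evaluation follows the usual most-significant-bit-first scan over the slices.

```python
def build_bit_slices(bins, num_bins):
    """Slice j holds bit j of every row's bin number (the bitmap for digit == 1)."""
    num_bits = max(1, (num_bins - 1).bit_length())
    slices = [0] * num_bits
    for row, b in enumerate(bins):
        for j in range(num_bits):
            if (b >> j) & 1:
                slices[j] |= 1 << row
    return slices


def less_or_equal(slices, c, all_rows):
    """Rows with bin number <= c, for 0 <= c < 2 ** len(slices)."""
    lt = 0          # rows already known to be smaller than c
    eq = all_rows   # rows that match c on every bit examined so far
    for j in reversed(range(len(slices))):
        if (c >> j) & 1:
            lt |= eq & ~slices[j]   # row has 0 where c has 1: smaller
            eq &= slices[j]
        else:
            eq &= ~slices[j]        # row has 1 where c has 0: larger, drop it
    return lt | eq


bins = [5, 700, 42, 999, 43]               # bin number of each of five rows
slices = build_bit_slices(bins, 1000)
print(len(slices))                         # 10 bitmaps cover 1,000 bins
print(bin(less_or_equal(slices, 42, 0b11111)))
```

The loop visits every slice regardless of how many rows qualify, so the index is very small but each query reads all of its bitmaps.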
   In short, among all multi-component encoding methods, there are two that perform better than others: one-component equality encoding and binary encoding. This distinction is also reflected in the fact that these are the only encoding methods used in commercial database management systems. FastBit uses these multi-component encodings.

Figure 6. Tracks made by selected particles in a laser wakefield particle accelerator. The gray dots indicate background particles in a timestep in the middle of the simulation.
Multi-Level Encoding
In addition to the encoding methods described so far, FastBit also includes a number of multi-level encodings that outperform the best of these multi-component encoding methods. These multi-level encodings are constructed on a hierarchy of bins with coarser bins on top and finer bins at the bottom. Each level of a multi-level index can be encoded independently. When answering a query, we generally use the coarser level indexes first to get an approximate answer and use finer levels to identify the records that cannot be resolved with the coarse levels. In this process, a fine-level index is used to resolve a range condition with a narrower range and
involving a smaller number of records. Because only equality encoding can use fewer bitmaps to resolve a narrower range condition, a multi-level index can only use the equality encoding at the fine levels.
   An example of multi-level encoding is the interval-equality encoding, which uses a set of very coarse bins with the interval encoding and a set of fine bins with the equality encoding. Because the interval encoding only has a very small number of bins at the coarse level, the total size of the bitmaps is modest. At the same time, interval encoding can answer queries quickly, and we only need to examine a small number of fine-level equality-encoded bitmaps to get the final answer. Therefore, the interval-equality encoding combines the best characteristics of both the interval and equality encodings.
   Similar multi-level encodings can be composed for other combinations. In figure 7, the timing of three such two-level encodings is shown against two of the best multi-component encodings. When a query selects a relatively small number of records, say fewer than 10% of all records (10 million of the 100 million records in the test presented in figure 7), the binary encoding is slower than the four other methods because it requires all the bitmaps to answer a query. In figure 7 the number of records selected by a query is referred to as hits for the query. As the number of hits increases, the basic equality-encoded index needs to work with more and more bitmaps in order to produce the answers, and it takes more time than the others. In all cases, we see that the three two-level indexes take less time than the bit-sliced index and the basic bitmap index, the two best multi-component encodings.

Figure 7. Query response time (seconds) versus number of hits (×10^7) for the five best bitmap encoding methods available from FastBit: binary encoding (BN), one-component equality encoding (E1), equality-equality encoding (EE), range-equality encoding (RE), and interval-equality encoding (IE). Time is measured on a set of 100 million rows of uniform random data. We observe that the two-level encoding methods (EE, RE, and IE) never took more time to answer the same query than the binary encoding and the one-level equality encoding. On average, the interval-equality (IE) encoding is 4.3 times faster than the binary (BN) encoding. (Illustration: A. Tovey. Source: Wu, Shoshani, and Stockinger 2009.)

   The interval-equality encoding (IE in figure 7) is on average four times faster than the binary encoding (BN) and seven times faster than the one-component equality encoding (E1). The performance advantages of these two-level encodings are not only observed in timing measurements, but are also verified with extensive theoretical analyses.
   In summary, the combined use of the two-level interval-equality encoding with the WAH compression yields a 50-fold speedup as compared to the best known bitmap indexes. The WAH compression provides a 10-fold performance gain, and the multi-level encoding provides an additional five-fold improvement. Because bitmap indexes perform well for data that do not change over time (append-only data), they are generally faster on such data than other known indexing methods, such as tree-based indexes. Consequently, FastBit is sufficiently fast for real-time search applications over large datasets. Thus, it is particularly useful for dynamic real-time data exploration and visualization applications.

Unique Design, Impressive Performance
Combining WAH compression and two-level encoding yields an outstanding 30- to 50-fold speedup relative to the widely-used bitmap indexing methods, including those from the most popular commercial database management system products.
   Another unique feature of WAH-compressed indexes is that they are amenable to theoretical analysis. Analyses show that WAH-compressed indexes, including the WAH-compressed basic bitmap index and some of the new multi-level bitmap indexes, have the same theoretical optimality as the best of the B-tree variants. That is, the search time is bounded by a linear function of the number of records that satisfy the search criteria, called hits. In theory, the time required by the best of the B-tree indexes is also bounded by a linear function of the number of hits, but FastBit is 20–100 times faster than B-trees from commercial database management system products. This difference arises because the size of the FastBit index is much smaller and the logical operations are very fast on modern computers (compared with tree-branching operations). Furthermore, FastBit performs extremely well on multi-variable queries because the intersection between the search results on each variable is a simple AND operation over the resulting bitmaps.
   The combination of compression and binning works especially well for scientific data, where the majority of the variables have high cardinalities. This design also works well for those data with lower cardinalities, such as those from commercial data warehouses, medical applications, and network traffic logs. FastBit search time was proven with both theoretical analyses and timing measurements to be proportional to the compressed sizes (bytes) of bitmaps involved in a search. Because the total size of compressed bitmaps is modest for high-cardinality values, the total search time is also modest. The binning strategy is useful for high-cardinality data as well as multi-level encoding. FastBit implements a variety of binning options for extremely high-cardinality values. This ability to index high-cardinality data is unique to FastBit and is not supported by other bitmap indexing methods.

Faster Applications
FastBit has been applied to a variety of applications, including simulation and experimental data from climate modeling, high-energy physics, astrophysics, biological sequences, satellite imaging, and fusion. Recently FastBit was also demonstrated to be helpful in real-time visual exploration, referred to as visual analytics. Using FastBit indexes significantly reduces the time needed for exploring a large dataset. After a user modifies the search parameters, the new results can be displayed in real time, making the exploration process truly interactive. For example, a scientist looking for the features of flame fronts in a combustion simulation can interactively vary the search parameters on temperature and various chemical species to understand how flame fronts behave and progress over time. An example of visual analytics on laser wakefield particle accelerator simulation data is given in the sidebar “FastBit in Visual Analytics” (p36).
   To summarize, FastBit has been proven to be theoretically optimal, and it performs 10–100 times faster than any known indexing methods. The key technologies that enable such impressive performance include the use of a patented compression method, advanced bitmap encoding methods, binning, and vertical data organization. These unique characteristics have made it extremely useful in a variety of applications.
   FastBit software was packaged in 2007 and released under an open-source software license. Since its release, it has attracted interest in many new application areas, including computer network security at the International Computer Science Institute, drug discovery at the University of Hamburg, Germany, Web content delivery, VOIP traffic routing, and web traffic analyses. In 2008, it received an R&D 100 Award, given by R&D Magazine to recognize the 100 most innovative products of the year (sidebar “FastBit Wins R&D 100 Award”). ●

FastBit Wins R&D 100 Award
Four researchers in the Scientific Data Management (SDM) Group in Berkeley Lab’s Computational Research Division (CRD) were awarded a 2008 R&D 100 Award for developing the FastBit indexing technology (figure 8). The award, given by R&D Magazine to the 100 top new technologies of the year, went to Kesheng “John” Wu, the key developer; Arie Shoshani, SDM Group Lead; Ekow Otoo; and former SDM member Kurt Stockinger, now working in Europe.
   “To have technology from our organization recognized is a reflection both of the quality of the scientific work we do as well as the significance of its contribution to society,” said CRD Division Director Horst Simon in congratulating the winners.
   For 45 years, the R&D 100 Awards, sometimes called “The Oscars of Invention,” have recognized the most innovative ideas of the year. Winners of the 2008 awards received their plaques at R&D Magazine’s formal awards banquet in Chicago on October 16, 2008.
   The complete list of winners can be found at: http://www.rdmag.com/Awards/RD-100-Awards/R-D-100-Awards/

Figure 8. Three developers of FastBit at the R&D 100 award ceremony. Left to right, Kesheng Wu, Arie Shoshani, and Ekow Otoo, all from LBNL. (A. Shoshani, LBNL)

Contributors Kesheng Wu, Arie Shoshani, and Ekow Otoo, all at LBNL

Further Reading
O. Rubel et al. 2008. High Performance Multivariate Visual Data Exploration for Extremely Large Data. SC08.

K. Wu, E. Otoo, and A. Shoshani. 2006. Optimizing bitmap indices with efficient compression. ACM T. Database Syst. 31: 1–38.

K. Wu, E. J. Otoo, and A. Shoshani. 2004. On the Performance of Bitmap Indices for High Cardinality Attributes. VLDB 2004.

