Comparing Scalable NOSQL Databases by learnguy

									                                Motivation
                 Overview of the databases
                              Methodology
                                   Results
                  Summary and conclusion

Clarications
  As a lot of people who read those slides did not get the oral
  explanations that MUST go with it, here are a few words of
  warning :
       All the databases were used with default congurations, I will
       post them soon on nosqlbenchmarking.com
       No index was set manually, doing so could have a big impact
       on performances
       Don't jump too fast on the conclusions, it would be WRONG
       to say that Cassandra is very good and that HBase sucks.
       The Cassandra implementation of MapReduce seems to be
       buggy and do not scale. There must be something wrong with
       my HBase conguration, HBase is known to run gigantic
       cluster without problems.

Clarications
  Also keep in mind that a benchmark is always biased by the chosen
  methodology so :
        The way I store data in each database could have an impact
        on the performances
        The summary about the results should not be taken in an
        absolute way, especially the rst one. When I say Good or
        Bad it is in THIS particular case. Moreover raw results are not
        the most important, scalability is very important too. So good
        performances for Cassandra MapReduce but without
        scalability is NOT good.
        The data set is too small, I'm testing cache performances (but
        it is the same for all of the databases)
  I will add soon a written analysis and a self critic about those
  results on www.nosqlbenchmarking.com

Motivation
  YCSB

  Yahoo! Cloud Serving Benchmark is the best-known NoSQL benchmarking
  application, so why make another one?


         YCSB uses data generated from statistical distributions
         instead of real data

         YCSB only focuses on read/write/update/scan performance

         YCSB results for elasticity are not conclusive

  Idea

         Data and use case inspired by a concrete case : Wikipedia

         Test read/update performance

         Test MapReduce performance by computing an inverted
         search index




Cassandra 0.6.10




  Overview
  Cassandra is a fully distributed, column-oriented data store that
  provides a MapReduce implementation using Hadoop.


      All the nodes in the cluster play the same role
      The data (existing and new) are sharded automatically among
      the nodes
      The developer can choose the consistency level for each
      request








HBase 0.20.6


  Overview
  HBase is a column-oriented database that aims to provide low-latency
  requests on top of Hadoop HDFS

      An HBase cluster uses several kinds of servers:
             HDFS needs at least one namenode and several datanodes
             HBase needs a ZooKeeper cluster, a master, and several
             regionservers
      The requests must be made to the master(s)
      On the HDFS level, existing data are not sharded
      automatically but new data are
      On the HBase level, the data are divided into regions that are
      sharded automatically across regionservers





mongoDB 1.6.5




  Overview

  mongoDB is a document-oriented database that stores JSON dictionaries.
  It provides auto-sharding and a MapReduce implementation.


       A mongoDB cluster is made of several kinds of servers:
             The shard servers that store data
             The configuration servers that store the configuration
             The router servers that receive and route the requests
       Existing and new data are sharded automatically

       MapReduce can only use one thread per server








Riak 0.14



  Overview
  Riak is a fully distributed key/bucket store with an implementation
  of MapReduce.


      Buckets can store the data directly or be a link to another
      bucket
      All the nodes in the cluster play the same role
      The data (existing and new) are sharded automatically
      among the nodes
      The developer can choose the consistency level for each
      request




The data

  Wikipedia export

  20,000 pages downloaded from Wikipedia



       Every document is in XML format

       All documents sum up to 620 MB

       Each document is associated with a single integer ID


  Insertions

  Each document is inserted only once during the whole benchmark





The client

  Overview
      Fully random requests
      Acts as a perfect load balancer
      The proportion of updates can be specified
      Specific parts: read/write/update and MapReduce

  Updates
  The updates simply concatenate the string "1" at the end of the
  article.
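
  As an illustration only, the client behaviour described above could be
  sketched as follows. This is not the benchmark's actual code: the
  function name run_requests, the in-memory dict standing in for a
  database, and the round-robin choice of node (one possible reading of
  "perfect load balancer") are all assumptions.

```python
import random


def run_requests(store, total_requests, read_percentage, nodes, seed=None):
    """Issue random reads/updates against `store` (a dict used as a stand-in DB).

    Hypothetical sketch: requests are spread round-robin over `nodes`, the
    target article is picked at random, and an update appends "1" to it.
    """
    rng = random.Random(seed)
    ids = list(store)
    stats = {"reads": 0, "updates": 0, "per_node": {n: 0 for n in nodes}}
    for i in range(total_requests):
        node = nodes[i % len(nodes)]      # even spread across the node list
        stats["per_node"][node] += 1
        doc_id = rng.choice(ids)          # fully random article
        if rng.random() * 100 < read_percentage:
            _ = store[doc_id]             # read
            stats["reads"] += 1
        else:
            store[doc_id] = store[doc_id] + "1"   # update: concatenate "1"
            stats["updates"] += 1
    return stats
```

  With read_percentage set to 0 every request is an update, and with 100
  every request is a read, which makes the sketch easy to sanity-check.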




MapReduce
 Overview
 MapReduce is used to build a reverse index for a given keyword.
 The reverse index is a list of pairs made of:
      ID: the ID of the article, if Count ≠ 0
      Count: the number of occurrences of the keyword in this
      article
 Justification
 This kind of computation implies that all the documents are crawled
 and takes advantage of the specific features of MapReduce
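
 A minimal single-process sketch of this reverse index, written in plain
 Python, is shown here. It does not use any of the databases' MapReduce
 APIs; the function names and the choice of whole-word, case-insensitive
 matching are assumptions made for the example.

```python
import re


def map_article(article_id, text, keyword):
    """Map step: emit (ID, Count) only when the keyword occurs in the article."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    count = len(re.findall(pattern, text, flags=re.IGNORECASE))
    if count != 0:
        yield (article_id, count)


def reduce_pairs(pairs):
    """Reduce step: gather the emitted pairs into one list, sorted by ID."""
    return sorted(pairs)


def reverse_index(articles, keyword):
    """Run the map step over every article, then reduce the emitted pairs."""
    emitted = []
    for article_id, text in articles.items():
        emitted.extend(map_article(article_id, text, keyword))
    return reduce_pairs(emitted)
```

 Every article has to be scanned by the map step, which is why this job
 exercises the whole data set.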



The methodology
   1   Start up a clean cluster of size 3 and insert all the documents
   2   Choose a total number of requests and a read percentage, then
       start the benchmark
   3   Wait one minute and start the benchmark again
   4   Wait five minutes and start the benchmark again
   5   Start the MapReduce benchmark
   6   Add a new node to the cluster and wait for it to be ready, then
       immediately restart the benchmark with the new node's IP in the
       list
   7   Jump to 3 until there are no more computers to add to the
       cluster
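
   The steps above can be sketched as a function that only lists the
   actions in order, without touching any real cluster. The action names
   are hypothetical labels, and the default of 8 nodes is taken from the
   scalability summary; the real harness may differ.

```python
def benchmark_plan(initial_nodes=3, max_nodes=8):
    """Return the ordered (action, argument) steps of the procedure."""
    plan = [
        ("start_clean_cluster", initial_nodes),   # step 1
        ("insert_all_documents", None),
    ]
    size = initial_nodes
    while True:
        plan.append(("run_benchmark", size))      # step 2, or the restart in step 6
        plan.append(("wait_seconds", 60))         # step 3
        plan.append(("run_benchmark", size))
        plan.append(("wait_seconds", 300))        # step 4
        plan.append(("run_benchmark", size))
        plan.append(("run_mapreduce", size))      # step 5
        if size >= max_nodes:                     # step 7: no computer left to add
            break
        size += 1
        plan.append(("add_node_and_wait_ready", size))   # step 6
    return plan
```

   Under this reading, each cluster size gets three read/update runs
   followed by one MapReduce run.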


Read/update results





Read/update results without HBase





MapReduce performance





The HBase case

  Verications made :
       Checked the logs : nothing seemed problematic
       HDFS level : running the balancer with a very low threshold
       distributed the blocks evenly but without any impact on the
       performances
       HBase level : the regions where always nearly evenly
       distributed across the regionservers
       The number of rows did not change and the content of each
       row was correct




Summary of raw performances



   DB          Read/update performance     MapReduce performance
   Cassandra   Good                        Very good
   HBase       Bad / N.A.                  Average / N.A.
   mongoDB     Good                        Poor but scalable
   Riak        Poor / unstable             Average but scalable





Summary of scalability


  Going from 3 to 8 servers multiplies capacity by 8/3, i.e. to about
  266% of the initial capacity (a 167% increase). Here are the observed
  increases in performance:
   DB                     read/update        MapReduce
   Cassandra                    153%           112%
   HBase                         11%            43%
   mongoDB                      145%           211%
   Riak                          74%           189%
   Riak 7 nodes max             155%           168%
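
  The capacity figure for going from 3 to 8 servers is simple
  arithmetic; a small sketch is given here for reference. The function
  name is an assumption, and it says nothing about how the per-database
  percentages in the table were computed.

```python
def capacity_change(old_nodes, new_nodes):
    """Return (ratio, percent_increase) when growing a cluster."""
    ratio = new_nodes / old_nodes
    return ratio, (ratio - 1) * 100


# 3 -> 8 servers: the ratio is 8/3, the increase is that ratio minus one.
ratio, increase = capacity_change(3, 8)
```

  Here ratio is 8/3 (about 2.67 times the initial capacity) and increase
  is about 167 percent.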



Conclusion and future work
  Conclusion
       The elastic gain seems more apparent than with YCSB, but
       it is not linear either
       It is worth testing MapReduce performance, as the results
       vary a lot between databases for both raw performance and
       scalability

  Future work
  This is still a work in progress :
       Applying this benchmark to other databases (Terrastore,
       Voldemort, Scalaris ...)
       Trying with a growing/bigger data set



Questions and remarks



  Any questions or remarks?



