NoSQL Overview

     Relational databases can be used to solve all kinds of problems.
     But they are maybe not the right solution to all problems.
     New applications (often web-centric) have new requirements:
            Huge amounts of data (terabytes or petabytes)
            Simple data structure (often)
            Must scale well
     NoSQL = “Not only SQL”. A better name would be “Not only
     relational”.
     A mixture of ideas, concepts, tools, products, . . .

Examples, Lots of Data

   Twitter 95 million tweets per day (1,100 per second) must be stored.
           Only simple queries (based on primary key, no joins). Used
           MySQL earlier, now Cassandra (and more).
  Facebook 500 million active users, half of them log in every day. Each
           user has 130 friends (on average). 30 billion pieces of
           content (links, texts, blog posts, photo albums) accessed
           every day. (Cassandra)
  LinkedIn More than 90 million members, one new member every
           second. Two billion people searches per year. (Voldemort)

Per Holm, Database Technology, 2010/11

Buy a Bigger Computer Instead?

     Big computers can store lots of data . . .
     Big computers are expensive.
     And you have to pay big license fees for a big Oracle installation.
     Even big computers can fail.
     Better to use a lot of cheap commodity PCs.
     And replicate data so one or a few failing nodes don’t matter.
     Design the storage system so it can be expanded (during uptime) by
     adding PCs.

Is It New?

     Yes: the term NoSQL is from 2009.
     But NoSQL databases have been around longer than that.
     And before anything NoSQL there were object-oriented databases,
     hierarchical databases, network databases, . . .
Different Types of Data Stores

 Key–Value A distributed hash table. Arbitrary key type; the value is a
           “blob”. The application program must be aware of the
           structure of the value. (Amazon Dynamo)
  Document As key–value, but the value is a document, and the DBMS
           knows that. (MongoDB, CouchDB)
   Columns The value is a set of columns, like in a relational database,
           but they do not necessarily follow a schema. (Google
           BigTable, Cassandra)
     Graph The database is a set of nodes with properties, and a set of
           connections between the nodes (with properties). (Neo4J)

The CAP Theorem

The CAP Theorem says that you cannot have all three of Consistency,
Availability, and Partition tolerance.
     Strong consistency: all clients see the same version of the data, even
     on updates to the dataset, e.g. by means of the two-phase commit
     protocol.
     High availability: all clients can always find at least one copy of the
     requested data, even if some of the machines in a cluster are down.
     Partition tolerance: the total system keeps its characteristics even
     when being deployed on different servers, transparent to the client.
Many NoSQL systems sacrifice consistency and go for BASE (next slide).
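The four data-store types can be caricatured as Python data shapes. This is a purely illustrative sketch (all names and records are invented, not any product’s real API), showing how the same kind of fact looks in each model:

```python
# Sketch: the same fact ("Alice follows Bob") in the four store types.
# All names and structures are invented for illustration.

# Key-value: the value is an opaque blob; the application must parse it.
kv_store = {"user:alice": b'{"follows": ["bob"]}'}

# Document: the DBMS understands the value's structure and can query it.
doc_store = {"user:alice": {"name": "Alice", "follows": ["bob"]}}

# Columns: each key maps to a set of named columns;
# rows need not share a schema.
column_store = {
    "alice": {"name": "Alice", "follows:bob": 1},
    "bob": {"name": "Bob", "city": "Lund"},  # different columns, same table
}

# Graph: nodes with properties, plus connections (edges) with properties.
nodes = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
edges = [(1, 2, {"type": "follows", "since": 2010})]
```

Note that only the key–value store forces the application to know the byte layout of the value; the other three models let the DBMS see inside it.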

No ACID, BASE Instead

Transactions are no longer guaranteed to be ACID (atomic, consistent,
isolated, durable). BASE is almost the opposite: basically available, soft
state, eventually consistent.

BASE is optimistic and accepts that the database consistency is in a state
of flux. “Eventual consistency” (actually more like durability) means that
inaccurate reads are permitted as long as the data is synchronized
“eventually.” (Compare with DNS: it takes time for changes to propagate.)

Amazon Dynamo

Dynamo was developed by Amazon.
     First used for the shopping cart, now also for other applications.
     Goal: always available, writes never fail.
     Key–value store. Records are replicated on several computers.
     Read & write: only single records.
     Operations: get(key) returns a value or a list of several versions of a
     value. The application must solve problems with inconsistencies.
     put(key,value) writes a value. The key is hashed, and the hash code
     determines on which nodes the value should be stored (“consistent
     hashing”).
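The idea that a key’s hash determines its storage nodes can be sketched with consistent hashing: node names and keys are hashed onto a ring, and a key is stored on the first few distinct nodes found clockwise from its hash. This is a minimal sketch under that reading of the slide (invented names, not Dynamo’s real implementation):

```python
# Minimal consistent-hashing sketch (not Dynamo's actual code).
# Nodes and keys are hashed onto a ring; a key is replicated on the
# first `replicas` distinct nodes clockwise from its hash position.
import bisect
import hashlib

def ring_hash(s):
    """Deterministic position on the ring for a string."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def preference_list(self, key):
        """The nodes that should store `key`, in ring order."""
        i = bisect.bisect(self.points, (ring_hash(key), ""))
        chosen = []
        for j in range(len(self.points)):
            node = self.points[(i + j) % len(self.points)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == self.replicas:
                break
        return chosen

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
# put(key, value) would write to every node in preference_list(key);
# get(key) reads from them and may return several versions.
```

A nice property of this scheme: adding or removing one node only moves the keys adjacent to it on the ring, so the cluster can grow during uptime.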
Cassandra

First developed by Facebook, now a top-level Apache project.

     Key–value & replication like in Dynamo.
     But the value has structure: it contains columns (which are stored in
     column families, which may be stored in super columns). A column
     has a name, a value, and a timestamp. Columns may be sorted on
     value or on timestamp.
     Inbox search at Facebook: 50+ TB of data stored on 150 machines.
            Term search: the user id is the key. Words in messages are the
            super columns; message ids become the columns.
            Interaction search: the user id is the key. Recipient ids are the
            super columns; message ids become the columns.

Computing Model

Not only storage should be distributed, but also computing. It is difficult
to write parallel programs . . . MapReduce is a new programming model.

     All data is treated as sets of key–value pairs. The key is a string, the
     value is a blob.
     All programs are sequences of alternating map and reduce functions.
     The map function processes a key–value pair and generates one or
     more intermediate key–value pairs.
     The reduce function merges all intermediate values associated with
     the same intermediate key.
     Map functions run in parallel on many computers, as do reduce
     functions.
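Cassandra’s inbox term-search layout (user id as key, words as super columns, message ids as columns) can be pictured as nested maps. The sketch below is purely illustrative, with invented ids and timestamps; it is not the Cassandra API:

```python
# Sketch of the term-search layout: row key = user id, super column = a
# word occurring in a message, column = message id. Each column carries
# a value and a timestamp. Illustrative only, not Cassandra's real API.
inbox_index = {
    "user:17": {                      # row key: user id
        "database": {                 # super column: a word
            "msg:901": {"value": "", "timestamp": 1290000000},
            "msg:933": {"value": "", "timestamp": 1290100000},
        },
        "nosql": {
            "msg:933": {"value": "", "timestamp": 1290100000},
        },
    },
}

def search(user_id, word):
    """Return the message ids containing `word`, newest first."""
    columns = inbox_index.get(user_id, {}).get(word, {})
    return sorted(columns, key=lambda m: columns[m]["timestamp"],
                  reverse=True)
```

Because the columns for a word are already sorted by timestamp inside one row, a term search is a single-key lookup followed by a scan; no joins are needed.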

MapReduce Example

Compute word counts within a set of documents.

  map(key, value):
      // key: document name
      // value: document text
      for each word w in value:
          emitIntermediate(w, 1)

  reduce(key, values):
      // key: a word
      // values: a list of counts
      result = 0
      for each v in values:
          result += v
      emit(key, result)

MapReduce Data Flow

(Figure omitted.)
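The word-count map/reduce pair can be exercised with a minimal in-memory driver. In a real system the map and reduce tasks run on many machines; here the framework (including the shuffle that groups intermediate values by key) is simulated with two dictionaries. A sketch, not a real MapReduce framework:

```python
# In-memory word-count MapReduce: the "framework" is simulated locally.
from collections import defaultdict

def map_fn(doc_name, doc_text, emit_intermediate):
    # key: document name, value: document text
    for word in doc_text.split():
        emit_intermediate(word, 1)

def reduce_fn(word, counts, emit):
    # key: a word, values: a list of counts
    emit(word, sum(counts))

def run_mapreduce(documents):
    # Map phase: every document produces intermediate (word, 1) pairs.
    intermediate = defaultdict(list)
    for name, text in documents.items():
        map_fn(name, text, lambda k, v: intermediate[k].append(v))
    # Shuffle + reduce phase: reduce is called once per distinct key
    # with all its intermediate values.
    result = {}
    for word, counts in intermediate.items():
        reduce_fn(word, counts, lambda k, v: result.__setitem__(k, v))
    return result

docs = {"d1": "to be or not to be", "d2": "not a problem"}
# run_mapreduce(docs)
#   -> {'to': 2, 'be': 2, 'or': 1, 'not': 2, 'a': 1, 'problem': 1}
```

The same two functions would run unchanged on a distributed framework; only the driver (data splitting, shuffle, fault handling) changes.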
MapReduce Figures (From Google)

Execution on a cluster of 1800 machines, 2 × 2 GHz processors, 4 GB
memory, 320 GB disk, Gigabit Ethernet. The figures are from the original
MapReduce paper, 2004.

     Grep Scan through 10^10 100-byte records, searching for a
          three-character pattern. 150 seconds, including 60 seconds
          startup overhead.
     Sort Sort 10^10 100-byte records. 15 minutes.
   Google Google web search uses an index which is created with
          MapReduce.

MapReduce vs Traditional Databases

     Data has no explicit schema.
            The map and reduce functions must “understand” the data
            format. Users have to write procedural code to interpret and
            process the data. A step backwards?
            Higher-level programming languages for MapReduce: Pig, Hive.
     Data is stored in files in a distributed file system.
     All processing is sort based: this makes the programming easier, but
     may be a performance concern.

More Information
