NOSQL Databases: Topics


NoSQL refers to non-relational databases. With the rise of Web 2.0 sites, traditional relational databases have proven inadequate for large-scale, highly concurrent applications such as SNS-style dynamic websites, exposing many problems that are difficult to overcome, while non-relational databases have developed very rapidly thanks to their own characteristics.

                     NOSQL Databases: Topics
• Introduction
• Rationale
• Key-value stores
• MapReduce
• Implementations

• NOSQL := Not Only SQL
• Acronym introduced in 2009
  ◦ as the name of a meetup about open-source distributed non-relational databases
• Message misunderstood, giving birth to “NoSQL”

                 Rationale (1)
• Performance
• Scalability
• Flexibility
• Kind of Data

                           Rationale (2)
• Brewer’s CAP Theorem
• Cannot guarantee more than two of:
  ◦ Consistency
  ◦ Availability
  ◦ Partition tolerance






[Diagram: taxonomy of NOSQL stores — durable key-value stores, volatile (in-memory) key-value stores, and column stores]





                          Key-Value Stores
• Global collection of Key/Value pairs
• Multiple types
  ◦ In memory (Redis, Memcached)
  ◦ On disk (BerkeleyDB)
  ◦ Eventually consistent (Cassandra, Dynamo, Voldemort)

                       Document Databases
• Similar to a Key/Value database, with whole documents as values.
• Flexible schema
• Documents are Serialized
• Examples: CouchDB, MongoDB

                     Column Family Database
• Similar to a Key/Value database, with multiple attributes (columns) as values.
• Not to be confused with column-oriented DBMS

                         Graph Databases
• Inspired by graph theory
• Gaining popularity as RDF stores
• Examples: Neo4j, InfiniteGraph

• Many others exist:
  ◦ Any database outside the relational model
• Object databases
• File System

                        Key-Value Stores
• Basic Idea
• Mapping Tables to KV pairs
• Consistent Hashing

                              Basic Idea
• Very simple data model
• {key,value} pairs with unique keys
  ◦ {student_id: student_name}
  ◦ {part_id: part_manufacturer}
  ◦ {child_id: parent_id}
• Values have no type constraint

• put(key, value)
• get(key)
  ◦ value = get(key)
• value is usually composite
  ◦ Opaque blob (e.g. TokyoCabinet)
  ◦ Directly supported (e.g. MongoDB)
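As a sketch, the whole interface fits in a few lines; the class and names below are illustrative, with a plain Python dict standing in for the store's real backing structure:

```python
# Minimal sketch of the put/get interface; a plain dict stands in for
# the store's real on-disk structure (B-tree or hash table).
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # The value is opaque to the store: any object can be stored.
        return self._data.get(key)

store = KVStore()
store.put("student_42", "Ada Lovelace")   # {student_id: student_name}
name = store.get("student_42")            # -> "Ada Lovelace"
```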

• Usually B-trees or extensible hash tables
• Well-known structures in the RDBMS world

                     Mapping Tables to KV pairs
CREATE TABLE user (
     id INTEGER,
     username VARCHAR(64),
     password VARCHAR(64)
);
CREATE TABLE follows (
     follower INTEGER REFERENCES user(id),
     followed INTEGER REFERENCES user(id)
);
CREATE TABLE tweets (
     id INTEGER,
     user INTEGER REFERENCES user(id),
     message VARCHAR(140),
     timestamp TIMESTAMP
);

               Mapping Tables to KV pairs — Redis
• Creating a user

               INCR global:nextUserId => 1000
               SET uid:1000:username johnsmith
               SET uid:1000:password sunnyEvening

• Enabling logging-in

               SET username:johnsmith:uid 1000

• Following:

               uid:1000:followers => Set of uids
               uid:1000:following => Set of uids

            Mapping Tables to KV pairs — Redis
• Messages by user:

            uid:1000:posts => a List of post ids

• Adding a new message:

            SET post:10343 "$ownerid|$time|I'm having fun"
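This key layout can be simulated in memory; the `incr`, `set_` and `sadd` helpers below are stand-ins that mimic the Redis commands used (not a real Redis client, and the counter starts at 1 here rather than 1000):

```python
# In-memory sketch of the Redis key layout above; a plain dict stands in
# for the Redis server.
db = {}

def incr(key):                     # mimics INCR: atomic counter
    db[key] = db.get(key, 0) + 1
    return db[key]

def set_(key, value):              # mimics SET
    db[key] = value

def sadd(key, member):             # mimics SADD: add to a set
    db.setdefault(key, set()).add(member)

# Creating a user
uid = incr("global:nextUserId")
set_(f"uid:{uid}:username", "johnsmith")
set_(f"uid:{uid}:password", "sunnyEvening")
# Enabling logging-in (reverse lookup from username to uid)
set_("username:johnsmith:uid", uid)
# Following another (hypothetical) user id
sadd(f"uid:{uid}:following", 1001)
```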

                           Consistent Hashing
• Huge amounts of data
  ◦ Naive approach:

                  server_id = hash(key) % number_of_servers

  ◦ Hash function: anything → int
• Distribution?

                   Consistent Hashing — Circle
• Assume int to be an 8-bit unsigned integer
• We have hash(key) ∈ [0, 255]
• We can represent these values on a circle and:
  ◦ Assign a position to each server
  ◦ Compute the position of each key
  ◦ Assume a key k belongs to the next server on the circle (clockwise)

                   Consistent Hashing — Circle

• Each node (server) is assigned a random value
• The hash of this value gives the position of the server on the circle
• A server is responsible for the arc before its position
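The lookup on the circle can be sketched as follows; the `Ring` class, the md5-based placement and the 8-bit hash space are illustrative choices, not prescribed by the slides:

```python
import bisect
import hashlib

# Sketch of a consistent-hashing ring: each server is placed at the hash
# of its name, and a key belongs to the first server at or after the
# key's own position, wrapping around the circle (clockwise lookup).
def h(value, space=256):
    # Hash anything (here: a string) into the 8-bit circle [0, 255].
    return hashlib.md5(value.encode()).digest()[0] % space

class Ring:
    def __init__(self, servers):
        self._ring = sorted((h(s), s) for s in servers)

    def server_for(self, key):
        points = [p for p, _ in self._ring]
        i = bisect.bisect_left(points, h(key)) % len(self._ring)  # wrap
        return self._ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.server_for("uid:1000:username")  # deterministic for a key
```

Unlike the naive modulo scheme, adding or removing a server only moves the keys on the arc next to it.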

• Adding a node
• Virtual nodes
• Moving nodes

• Coordinator as defined previously
• In charge of replication to other nodes (e.g. N next ones)
• Parameters:
  ◦ Number of replicas (N)
  ◦ Minimal number of successful writes (W)
  ◦ Minimal number of coherent reads (R)
  ◦ Must respect R + W > N (why?)
• Repair-on-read
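Why must R + W > N hold? Because then any R replicas contacted on a read must intersect the W replicas that acknowledged the latest write, so at least one fresh copy is always seen. A brute-force check over a toy cluster:

```python
from itertools import combinations

# Check that every read quorum of size r shares at least one node with
# every write quorum of size w, over a cluster of n replicas.
def quorums_overlap(n, r, w):
    nodes = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(nodes, r)
               for wq in combinations(nodes, w))

assert quorums_overlap(3, 2, 2)      # R + W = 4 > N = 3: always overlap
assert not quorums_overlap(3, 1, 2)  # R + W = 3 = N: a read can miss the write
```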

NOSQL Databases: Topics

      MapReduce

     Parallel processing model
     Introduced to tackle computations over very large datasets
     Based on the well-known divide and conquer approach
          Large problem divided into many small problems
         Each tackled by one “processor” (Map)
         Results are then combined (Reduce)

     References: MapReduce (textbook), Lin and Dyer, 2010

        Not a new problem
            E.g. threads, MPI, sockets, remote shell, . . .
            Generally tackles computation distribution, not data distribution
            The developer is in charge of the implementation details.
        MapReduce offers an abstraction of many of these mechanisms by
        imposing a structure on the program.
MapReduce Concepts


    [Diagram: five Mappers feeding five Reducers]

Origins of Map

      Map originally comes from the functional programming world
      Basic idea:
       for (int i = 0; i < arr.length(); i++) {
            result[i] = function(arr[i]);
       }

      where function is a function in the mathematical sense
Origins of Map

      Idea: isolate the loop, so we can write:
      result = map ( function , arr );

      What if you could pass functions around as values?
      map could be a function that takes as arguments
          a sequence
          a function
      and that returns a new sequence where every element is the
      result of applying the function on the corresponding element
      in the original sequence
      map can abstract many for loops
Origins of Reduce

      map does not cover all for loops
      For example, when you gradually aggregate the results:
       int total = 0;
       for (int i = 0; i < arr.length(); i++) {
            total = total + arr[i];
       }

      More generally:
       for (int i = 0; i < arr.length(); i++) {
            total = function(total, arr[i]);
       }

      reduce covers these ones:
      total = reduce ( function , arr );
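Both origins are directly available in Python; this sketch rewrites the two loop patterns above with `map` and `functools.reduce`:

```python
from functools import reduce

arr = [1, 2, 3, 4]

# map abstracts the element-wise loop: result[i] = function(arr[i])
result = list(map(lambda x: x * x, arr))        # [1, 4, 9, 16]

# reduce abstracts the aggregating loop: total = function(total, arr[i])
total = reduce(lambda acc, x: acc + x, arr, 0)  # 10
```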
map and reduce in MapReduce

      In the context of MapReduce, the mapped function must
      return key-value couples:

              map(function, [data, . . . ]) → [(key, value), . . . ]

      Before the reduction, the data has to be aggregated by key:

      [(key1, value1), (key1, value2), . . . ] → (key1, value1, value2, . . . )

      Reduce step acts on values for each key

             reduce(key1, value1, value2, . . . ) → (key1, value)

     Counting the words in a text
     map: word → (word, 1)
     Pair make_pair(String word) {
         return new Pair(word, 1);
     }

     Aggregation: (word, 1, 1, 1, . . . )
     Pair compute_sum(String word, List<Integer> values) {
         int sum = 0;
         for (int i : values) {
             sum += i;
         }
         return new Pair(word, sum);
     }
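Putting the three phases together, the word count can be simulated in a few lines of Python; `mapreduce`, `map_fn` and `reduce_fn` are illustrative names for a single-process sketch, not a real framework:

```python
from collections import defaultdict

# Word count as a tiny, single-process MapReduce: the map phase emits
# (word, 1) pairs, the framework groups values by key, and the reduce
# phase sums them.
def map_fn(word):
    return (word, 1)

def reduce_fn(word, values):
    return (word, sum(values))

def mapreduce(words):
    groups = defaultdict(list)
    for key, value in map(map_fn, words):   # map phase
        groups[key].append(value)           # aggregation by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce

counts = mapreduce("the cat saw the dog".split())
# counts == {"the": 2, "cat": 1, "saw": 1, "dog": 1}
```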

              Trivial: absolutely no side effect
              (or not: what about transfer times?)
              Not fully parallelizable (each step needs the result of the
              previous step)
Parallelizing reduce

       Reduce needs to be idempotent
           Mathematically: f (f (x)) = f (x)
       Computation can be tree-shaped:
       log N instead of N
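A tree-shaped reduction can be sketched as repeated pairwise combination, halving the list at each level so the depth is log₂ N instead of N sequential steps; note the sketch relies on the combining function being associative (as addition is):

```python
# Tree-shaped reduction: combine elements pairwise, level by level.
def tree_reduce(combine, items):
    while len(items) > 1:
        pairs = zip(items[0::2], items[1::2])     # one level of the tree
        combined = [combine(a, b) for a, b in pairs]
        if len(items) % 2:                        # odd element carries over
            combined.append(items[-1])
        items = combined
    return items[0]

total = tree_reduce(lambda a, b: a + b, [1, 2, 3, 4, 5])  # 15
```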
We lied!

      There is still one step to discuss: how do we aggregate values
      by key?
      Naive idea: put a barrier between map and reduce
           Wait for all maps to complete
           Get all results in one place, sort them
           Redistribute them for reduce

                         [Diagram: five Mappers, a barrier where results are collected and sorted, then five Reducers]

Parallelizing aggregation

       The naive approach:
           is simple
           does not require an idempotent reduce
           is not as parallel as it could be
       Other idea: consistent hashing and idempotence
           Can compute results incrementally (idempotence)
           No barrier: better parallelism (hashing)
           Can display current results (idempotence)
       Note: usually, the implementation sorts the intermediate
       key-value pairs generated by map and the final results by key.
       This can be exploited by choosing a meaningful key.
Example: Sorting people by name

      map: person → (name, person)
      reduce: (name, person1, person2, . . . ) →
      (name, person1, person2, . . . )
      The result is sorted by virtue of the MapReduce machinery
Example: Finding all (author,book) pairs

      There can be multiple authors per book!
          We need a polymorphic map function, say f , such that:
               f (author) → (name, author)
               f (book) → [(name, book), . . . ] (one pair per author name)
      Aggregation: (name; book∗ , author, book∗ )
      In the following code, Value is a superclass of Author, Book
      and List.
Example: Finding all (author,book) pairs
       Pair reduce(String authorName, List<Value> values) {
           Author a = null;
           Book prevbook = null;
           List<Pair> list = new List<Pair>();
           for (Value value : values) {
               if (value instanceof Author) {
                   a = (Author) value;
                   if (prevbook != null) {
                       list.append(new Pair(a, prevbook));
                       prevbook = null;
                   }
               } else if (value instanceof Book && a == null) {
                   if (prevbook != null) emit(prevbook);
                   prevbook = (Book) value;
               } else if (value instanceof Book && a != null) {
                   list.append(new Pair(a, (Book) value));
               } else if (value instanceof List<Pair>) {
                   list.append_all(value);
                   a = list.first().author;
               }
           }
           if (prevbook != null) emit(prevbook);
           if (!list.empty()) emit(list);
       }


      LightCloud is a distributed key-value store
          Implements distributed storage.
          “On-site” storage is provided by Tokyo Tyrant/Redis
      Tokyo Tyrant is a local key-value store
           Implements database management functions
               Network interface and concurrency control
               Database replication
          Actual storage is provided by Tokyo Cabinet
      Tokyo Cabinet
          Implements storage of key/value pairs
          Over a single file, for a single client.

              [Diagram: several nodes, each running Tokyo Tyrant on top of Tokyo Cabinet]
Tokyo Cabinet/Tyrant

      Tokyo Cabinet/Tyrant provide a very raw interface for storing
      key/value pairs in a given single file
      The desired on-disk layout must be chosen
          Extensible Hash Map, B-Tree, Fixed-size records, . . .
           Parameters of these structures can be tweaked for better
           performance
           Very demanding on the user
      The API consists of get and put and a few variants
          The data are opaque, unstructured blobs!

      Adds (horizontal) scalability to Tokyo Tyrant nodes by means
      of consistent hashing
          Mitigates the distribution problem
          However, no replication is performed; consistency is preferred
          over availability.
      The API is still get and put, over strings.

    MongoDB is a document oriented database
     JSON documents:

     {
           "name": "John Smith",
           "address": {
                "city": "Owatonna",
                "street": "Lily Road",
                "number": 32,
                "zip": 55060
           },
           "hobbies": ["yodeling", "ice skating"]
     }
Database Organisation

      Databases contain collections
      Collections contain documents and indexes
Physical layout

       Documents are stored as binary blobs (BSON)
           Documents are opaque for the database
           As a result of a query they are retrieved in their entirety
       Indexes are B-Trees referencing these documents.
            Allows finding documents based on the values they contain
            without explicitly opening the whole document.
Advanced querying

      Simple queries can be performed efficiently when an index is
      available
          E.g. db.employee.find({"address.city": "Owatonna"})
          with an index on "address.city"
      Larger jobs can be done by means of map-reduce
          map maps a document to the needed key-value pair.
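How such a simple query matches documents can be sketched in a few lines; the `lookup` and `find` helpers below are illustrative stand-ins (not MongoDB's actual implementation), with a plain list standing in for a collection:

```python
# Sketch of MongoDB-style matching: a query is a dict of
# dotted-path -> expected-value pairs, checked against nested documents.
def lookup(doc, dotted_path):
    for part in dotted_path.split("."):
        if not isinstance(doc, dict):
            return None
        doc = doc.get(part)
    return doc

def find(collection, query):
    return [d for d in collection
            if all(lookup(d, path) == v for path, v in query.items())]

employees = [
    {"name": "John Smith", "address": {"city": "Owatonna"}},
    {"name": "Jane Doe", "address": {"city": "Mankato"}},
]
matches = find(employees, {"address.city": "Owatonna"})  # one document
```

A real index would avoid scanning the whole collection; the matching semantics are the point here.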
Advanced querying

      However, there is no facility for:
           Joining documents
           Quantifying over other documents (i.e. EXISTS in SQL)
      Such operations are left to the user of the database!
           Processing outside the database is costly!
           It is therefore important to design the data model in such a
           way that it returns the appropriate data directly.

      MongoDB can shard documents over multiple servers
           Data are split into chunks
           A chunk has a starting and ending value.
            A server is responsible for multiple chunks.
      Individual collections and not whole databases are sharded
Example: Sharding Persons over the Age field on 3 servers
                Server 1 Server 2 Server 3
                 1–10      11–20     22–29
                 21–22     30–41     42–50
                 51–72                73+
To be efficient, each server must keep roughly the same
amount of data.
     MongoDB provides automated balancing (auto-sharding) as
     much as possible
 Shards are created explicitly by the database administrator
    shard = (collection, key)
    Well chosen, can improve query performance
    Otherwise, the load of each server can be very unbalanced
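Routing a query to the right shard can be sketched with a sorted list of chunk start keys; the ranges below loosely follow the Age example above (made non-overlapping for the sketch):

```python
import bisect

# Each chunk is a key range, represented by its start value, mapped to
# the server responsible for it; a lookup finds the chunk covering a key.
chunks = [  # (start of range, server), sorted by start
    (1, "server1"), (11, "server2"), (22, "server3"),
    (30, "server2"), (42, "server3"), (51, "server1"), (73, "server3"),
]

def server_for(age):
    starts = [start for start, _ in chunks]
    i = bisect.bisect_right(starts, age) - 1  # last chunk starting <= age
    return chunks[i][1]

assert server_for(15) == "server2"   # falls in chunk 11–21
assert server_for(80) == "server3"   # falls in chunk 73+
```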

      Introduction and history
      Data model and layout
            Adding nodes
            Handling problems
Cassandra — Introduction

      Created by Facebook
          Based on Dynamo
          Lead Dynamo engineer hired by Facebook
      Released as Apache project
          Source code released in July 2008
          Adopted by Apache in March 2009
          Became high priority in February 2010
Cassandra — Data model

      Databases are conceptually two-dimensional
      Disks are one-dimensional
      A table with rows (1, 2) and (3, 4) can be stored either
      row-oriented (1, 2, 3, 4) or column-oriented (1, 3, 2, 4);
      Cassandra is column-oriented
      No cost for NULL entries
      Easy column creation
          Column family ∼ table
          Super column ∼ columns
          Column ∼ column
      May be seen as a hash table with 4 or 5 dimensions:
      get(keyspace, key, column_family[, super_column], column)
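The "hash table with 4 or 5 dimensions" view can be sketched with nested dicts; the keyspace, keys and column names below are made up for illustration:

```python
# Nested dicts keyed by keyspace, row key, column family,
# (optionally super column,) and column.
data = {
    "app": {                       # keyspace
        "user:1000": {             # row key
            "profile": {           # column family
                "name": "john",    # column -> value
                "city": "Owatonna",
            }
        }
    }
}

def get(keyspace, key, column_family, column, super_column=None):
    cf = data[keyspace][key][column_family]
    if super_column is not None:   # 5th dimension, when present
        cf = cf[super_column]
    return cf[column]

assert get("app", "user:1000", "profile", "name") == "john"
```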
Cassandra — Distribution

      CAP Theorem:
          Availability and Partition tolerance are favoured
      Design goals
          Uniformity between nodes
      Consistent Hashing on a ring
          No virtual nodes
          Random placement
Cassandra — Replication and consistency

      Availability ⇒ more than one node needs a copy of each pair
      Responsible node chooses N other nodes to hold copies
          Way in which those are chosen can be changed
          Next ones on the ring, different geographic location, etc.
      Attribution table copied to each node
      Possibility of choosing R and W values
Cassandra — Timestamping

      Every piece of data has an associated timestamp
      Every key actually has an associated vector of
      (timestamp, value) pairs (truncated)
      Used to reach consistency with repair-on-read
      Query sequence:
          Identify the nodes that own the data for the key
          Route the request to the node and wait for the response
          If the reply does not arrive within the configured timeout, fail
          Figure out the latest response based on timestamps
          Schedule a repair if needed
      Repair algorithm can be customized
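The last two steps of the query sequence can be sketched as follows; the reply format `{node: (timestamp, value)}` is an assumption for illustration:

```python
# Sketch of repair-on-read: given (timestamp, value) replies from the
# replicas, return the newest value and list the stale replicas that
# should be scheduled for repair.
def read_repair(replies):
    latest_ts, latest_value = max(replies.values())  # newest timestamp wins
    stale = [node for node, (ts, _) in replies.items() if ts < latest_ts]
    return latest_value, stale

value, to_repair = read_repair({
    "node-a": (5, "old"),
    "node-b": (9, "new"),
    "node-c": (9, "new"),
})
assert value == "new" and to_repair == ["node-a"]
```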
Cassandra — Adding a node

           Each node must know the position of every other node (and
           all its replicas)
          Whenever a node moves or changes its replicas, it tells a
          number of other nodes, sending its whole replication table
          Routing information thus propagates
          Some nodes are preferred (seeds)
      When a new node is inserted, we must give it a keyspace
      and the address of a seed
          It chooses its position at random
          It contacts the seed to get a view of the current state
          It begins to move its data
Cassandra — Problem solving
      Overloaded node
               The keys are not uniformly distributed
               Some keys are accessed more than others
               The node runs on inferior hardware
               Overloaded nodes may move on the ring
      Unresponsive node
               The machine has crashed
               There is too much latency on the network
                Each node assigns a score to its neighbours
               Inverse logarithmic scale: 1 means 10% chance to wake up, 2
               means 1%, etc.
               Define a threshold after which the node is removed
      Can be mostly automated
