					       Lecture: Advanced Database Technologies                    Universität
       Summer 2011                                                 Konstanz
       Prof. Dr. Marc H. Scholl, Dr. Christian Grün. DBIS Group




   Advanced Database Technologies
NoSQL: Concepts, Document/Key-Value Stores

                                            Christian Grün

                    Database & Information Systems Group
                            Universität Konstanz

NoSQL: Concepts
✎ Reminder
•   Scalability: horizontal vs. vertical
•   CAP Theorem: Consistency, Availability, Partition Tolerance
•   ACID vs. BASE (Basically Available, Soft State, Eventual Consistency)
•   MapReduce Framework (divide and conquer)
Upcoming
• Hashing: Distributed Hash Tables, Consistent Hashing
• Concurrency: Multiversion Concurrency Control
• Logical Clocks: Vector Clocks

NoSQL: Concepts
✎ Replication
• duplicates and synchronizes data or computations
• improves reliability
• increases fault-tolerance
Not to be confused with…
• load balancing: distributes different computations across nodes
• backups: for archiving purposes
✎ Sharding
• horizontal fragmentation
• shard: single data partition
• distributes rows (tuples) on multiple machines
• example: to reduce access latency, data can be distributed by region
• may complicate sequential access
• MongoDB: native sharding support

NoSQL: Concepts
Distributed Hash Tables (DHT)
en.wikipedia.org/wiki/Distributed_hash_table
• efficient, decentralized storage of hash tables
• supports joining and leaving nodes, handles failures
History
• research was motivated by peer-to-peer systems: Gnutella, Napster, …
• early software solutions were either vulnerable to attacks or inefficient
• DHT is used in the BitTorrent protocol and the Kad network

NoSQL: Concepts
Consistent Hashing
Karger et al. [1997], Consistent Hashing and Random Trees
Task
• find machine that stores data for a specified key k
• trivial hash function to distribute data on n nodes: h(k,n) = k mod n
• if the number of nodes changes (n ± 1), nearly all data will have to be redistributed!
Challenge
• minimize the amount of data to be copied after a configuration change
• incorporate hardware characteristics into hashing model
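The cost of the trivial hash function can be checked with a few lines of Python. This is a minimal sketch that assumes integer keys and a change from 4 to 5 nodes; the function name and the key range are chosen for illustration only.

```python
def moved_fraction(num_keys: int, old_n: int, new_n: int) -> float:
    """Fraction of integer keys whose node changes under h(k, n) = k mod n."""
    moved = sum(1 for k in range(num_keys) if k % old_n != k % new_n)
    return moved / num_keys


# Going from 4 to 5 nodes: only keys with k mod 20 in {0, 1, 2, 3} stay put.
fraction = moved_fraction(10_000, 4, 5)
print(f"{fraction:.0%} of the keys change their node")  # 80%
```

Only one in five keys keeps its node, which is why a scheme that limits data movement is needed.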


NoSQL: Concepts
Consistent Hashing
sharplearningcurve.com/blog/post/2010/09/27/Consistent-Hashing.aspx
Solution
• include hash values of all nodes in hash structure
   (shown as Key1 and Key2)
• calculate hash value of the key to be added/retrieved
• choose node which occurs next in the hash structure
   (i.e., which has the smallest larger hash value)
• if no node with larger hash value found,
   choose first node
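The lookup rule above can be sketched as a sorted list of node hashes with a binary search. The class name `HashRing`, the node names, and the choice of MD5 as hash function are assumptions made for this illustration; they are not taken from the cited sources.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a position on the hash ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        # Include the hash values of all nodes in one sorted structure.
        self._ring = sorted((ring_hash(node), node) for node in nodes)
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Index of the first node whose hash is larger than the key's hash.
        idx = bisect.bisect_right(self._hashes, ring_hash(key))
        if idx == len(self._ring):
            idx = 0  # no larger hash value found: wrap around to the first node
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```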

NoSQL: Concepts
Consistent Hashing
sharplearningcurve.com/blog/post/2010/09/27/Consistent-Hashing.aspx
• if a new node is added, its hash value is added
  to the hash table
• the hash realm is repartitioned, and hash data
  will be transferred to new neighbor
• if node is dropped or gets lost, missing data is
  redistributed to adjacent nodes (replication issue)
☞ no need to update remaining nodes!
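The effect of adding a node can be observed in a small experiment. The sketch below assumes MD5 as ring hash and uses made-up node and key names; it checks that every key that changes its owner ends up on the newly added node, while all other assignments stay untouched.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def owner(key: str, nodes) -> str:
    """Return the node that follows the key on the ring (wrapping around)."""
    ring = sorted((ring_hash(node), node) for node in nodes)
    hashes = [h for h, _ in ring]
    idx = bisect.bisect_right(hashes, ring_hash(key)) % len(ring)
    return ring[idx][1]


keys = [f"key-{i}" for i in range(1000)]
old_nodes = ["node-a", "node-b", "node-c"]
new_nodes = old_nodes + ["node-d"]

before = {k: owner(k, old_nodes) for k in keys}
after = {k: owner(k, new_nodes) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Keys only ever move to the new neighbor, never between the old nodes.
assert all(after[k] == "node-d" for k in moved)
print(f"{len(moved)} of {len(keys)} keys were transferred to node-d")
```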



NoSQL: Concepts
Consistent Hashing
sharplearningcurve.com/blog/post/2010/09/27/Consistent-Hashing.aspx
• to avoid overly large gaps, multiple keys (“virtual nodes”) are added for each node
• number of added keys can be made dependent on node characteristics (bandwidth, CPU, …)
• further details are left to the implementation
   (e.g.: DeCandia et al. [2007], Dynamo: Amazon's
   Highly Available Key-value Store)
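A possible way to place several keys per node is sketched below. The weights (50 and 150 ring positions), the `node#i` naming scheme, and MD5 as hash function are assumptions for illustration; Dynamo's actual placement strategy is described in the cited paper and differs in detail.

```python
import bisect
import hashlib
from collections import Counter


def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def build_ring(positions_per_node):
    """positions_per_node: node name -> number of keys placed on the ring."""
    ring = sorted(
        (ring_hash(f"{node}#{i}"), node)
        for node, count in positions_per_node.items()
        for i in range(count)
    )
    hashes = [h for h, _ in ring]
    owners = [node for _, node in ring]
    return hashes, owners


def node_for(hashes, owners, key: str) -> str:
    idx = bisect.bisect_right(hashes, ring_hash(key)) % len(hashes)
    return owners[idx]


# The better-equipped machine gets three times as many ring positions.
hashes, owners = build_ring({"small": 50, "large": 150})
load = Counter(node_for(hashes, owners, f"key-{i}") for i in range(10_000))
print(load)
```

With more ring positions, a node is expected to receive a proportionally larger share of the keys; the exact split depends on the hash values.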




NoSQL: Concepts
Multiversion Concurrency Control
Bernstein and Goodman [1983], Multiversion Concurrency Control – Theory and Algorithms
Challenge
• speed up concurrent read and write transactions
• avoid locking of transactions, but guarantee consistency
Locking
• readers-writer lock (RWL): delay write operation while readers are active,
   then give writer exclusive permission
• pessimistic locking: block access to resource if it is expected to be changed;
   hope that write operation will be fast enough

NoSQL: Concepts
Multiversion Concurrency Control
Bernstein and Goodman [1983], Multiversion Concurrency Control – Theory and Algorithms
Solution
• unique ids (timestamp, integers) are assigned to entities and transactions
• copy entity (e.g., tuple or block) to be updated and create new version
• reading transaction will access old versions
• writing transaction is aborted if its id is older than id of accessed entity
Consequences
• database size increases with each update operation
• db may need to be reorganized in regular intervals
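The four steps of the solution can be sketched as a tiny in-memory store. This is a simplified illustration and not the full algorithm of Bernstein and Goodman: it ignores read timestamps and commit handling, and the names `MVCCStore` and `WriteConflict` are made up for this example.

```python
class WriteConflict(Exception):
    """Raised when a writer is older than the version it tries to replace."""


class MVCCStore:
    def __init__(self):
        self._versions = {}  # key -> list of (version id, value), ascending
        self._next_id = 0

    def begin(self) -> int:
        """Start a transaction and assign it a unique, increasing id."""
        self._next_id += 1
        return self._next_id

    def read(self, tx_id: int, key):
        # Readers see the newest version that is not younger than themselves.
        for version_id, value in reversed(self._versions.get(key, [])):
            if version_id <= tx_id:
                return value
        return None

    def write(self, tx_id: int, key, value) -> None:
        versions = self._versions.setdefault(key, [])
        if versions and versions[-1][0] > tx_id:
            raise WriteConflict(f"transaction {tx_id} is older than the entity")
        versions.append((tx_id, value))  # old versions are kept, not overwritten


store = MVCCStore()
t1 = store.begin()
store.write(t1, "x", "first")
t2 = store.begin()               # reader that started before the next update
t3 = store.begin()
store.write(t3, "x", "second")
print(store.read(t2, "x"))       # first  (old version, no lock needed)
print(store.read(t3, "x"))       # second
```

The version lists grow with every update, which is the reason for the periodic reorganization mentioned above.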

NoSQL: Concepts
Multiversion Concurrency Control
Bernstein and Goodman [1983], Multiversion Concurrency Control – Theory and Algorithms
Used in numerous classical DBMS: Oracle, Microsoft SQL Server, BerkeleyDB
Related: Versioning
• old versions can explicitly be made visible to user or query language
• example: revision control systems (CVS, Subversion, git)
Concurrent Writes
• CouchDB, git, etc. allow concurrent write operations
• if possible, results are automatically merged


NoSQL: Concepts
Logical Clocks
Challenge
• recognize order of distributed events and potential conflicts
• most obvious approach: attach timestamp ts of system clock to each event e
☞ error-prone, as clocks will never be fully synchronized
☞ insufficient, as we cannot catch causalities (needed to detect conflicts)
Clock Consistencies
• weak consistency (chronology): if e1 → e2, then ts(e1) < ts(e2)
• strong consistency (causality): if ts(e1) < ts(e2), then e1 → e2


NoSQL: Concepts
Vector Clocks
en.wikipedia.org/wiki/vector_clock
• an ID (counter) is assigned to each node
• the ID is incremented for each event
• events may occur internally, or may be received or sent messages
• each node has a vector clock that contains the last known IDs of all nodes
• the clock is attached to each message and sent to other nodes
• if a message is received, the local and received clocks are merged
  by taking the maximum of each entry
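The rules above translate into a small class. This is a minimal sketch; the class and method names are chosen for illustration, and sending as well as receiving a message each count as one event.

```python
class VectorClock:
    def __init__(self, node_id, node_ids):
        self.node_id = node_id
        self.counters = {n: 0 for n in node_ids}  # last known ID of every node

    def tick(self) -> None:
        """Record a local event by incrementing the own entry."""
        self.counters[self.node_id] += 1

    def send(self) -> dict:
        """Sending is an event; a copy of the clock travels with the message."""
        self.tick()
        return dict(self.counters)

    def receive(self, received: dict) -> None:
        """Merge entry by entry using the maximum, then count the receive event."""
        for node, counter in received.items():
            self.counters[node] = max(self.counters.get(node, 0), counter)
        self.tick()


nodes = ["A", "B", "C"]
alice, ben = VectorClock("A", nodes), VectorClock("B", nodes)

message = alice.send()   # {'A': 1, 'B': 0, 'C': 0}
ben.receive(message)     # Ben's clock: {'A': 1, 'B': 1, 'C': 0}
reply = ben.send()       # {'A': 1, 'B': 2, 'C': 0}
print(message, reply)
```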

NoSQL: Concepts
Vector Clocks
blog.basho.com/2010/01/29/why-vector-clocks-are-easy
A conflict occurs if maximum values are spread
across different vector clocks. Example:
[A1 B0 C0] Alice proposes a dinner on WED or THU to Ben and Cathy.
[A1 B2 C0] Ben tells Alice that he’d go for WED.
[A1 B0 C2] Cathy tells Ben that she prefers THU.
[A3 B2 C0] Alice tells Ben that WED has been chosen as favorite.
[A1 B3 C2] [A3 B3 C0] Ben has to decide between Alice’s and Cathy’s choice.
☞ Conflict: A1 < A3, but C2 > C0 (neither clock dominates the other)
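The conflict test can be written as an entry-wise comparison of two clocks. The function name `compare` and its return values are assumptions for this sketch; the two clocks are the ones from the dinner example.

```python
def compare(clock1: dict, clock2: dict) -> str:
    """Relate two vector clocks: 'equal', 'before', 'after' or 'conflict'."""
    nodes = set(clock1) | set(clock2)
    not_larger = all(clock1.get(n, 0) <= clock2.get(n, 0) for n in nodes)
    not_smaller = all(clock1.get(n, 0) >= clock2.get(n, 0) for n in nodes)
    if not_larger and not_smaller:
        return "equal"
    if not_larger:
        return "before"
    if not_smaller:
        return "after"
    return "conflict"  # maximum values are spread across both clocks


cathys_choice = {"A": 1, "B": 3, "C": 2}
alices_choice = {"A": 3, "B": 3, "C": 0}
print(compare(cathys_choice, alices_choice))  # conflict
```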

NoSQL: Categories
Selected Categories
nosql-databases.org
•   Document Stores
•   Key-Value Stores
•   Column Stores
•   Graph Databases
•   Object Databases

☞ no taxonomy exists that all parties agree upon
☞ might look completely different some years later

NoSQL: Document Stores
Overview
• basic entities (tuples) are documents
• schema-less storage → no null values (✎ why?)
• document format depends on implementation:
  XML, JSON, YAML, binary data, …
• more powerful than key/value stores: offers query and indexing facilities
• first document store (commercial): Lotus Notes, developed from 1984 on
• recent solutions: CouchDB and MongoDB (free), SimpleDB (commercial)
✎ Do XML databases belong to the same category?


NoSQL: Document Stores
CouchDB
“Cluster Of Unreliable Commodity Hardware DataBase”
• developed by Damien Katz (senior developer of Lotus
  Notes) to merge MapReduce concept with document stores
• written in Erlang: functional, concurrent language; hot swapping support
• interaction, APIs: RESTful API; JavaScript, PHP, Perl, Ruby, …
• document format: JavaScript Object Notation
• concurrency model: MVCC (✎ remember?)
• replication: incremental, bi-directional


NoSQL: Document Stores
Documents
• stored in B+ trees
• referenced by document and revision ids
JSON: JavaScript Object Notation
• text-based standard for representing simple data structures
• light-weight alternative to XML
• data types (✎ check example): numbers, strings, booleans, arrays, objects
Example
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address":
  {
    "city": "New York",
    "postalCode": "10021",
    "streetAddress": "21 2nd Street"
  },
  "phoneNumber":
  [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "fax",
      "number": "646 555-4567"
    }
  ]
}
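To check the data types, a shortened version of the example document can be parsed with Python's standard `json` module. The `"married"` field is not part of the example above; it is added here only to show a boolean value.

```python
import json

text = """
{
  "firstName": "John",
  "age": 25,
  "married": false,
  "address": { "city": "New York", "postalCode": "10021" },
  "phoneNumber": [ { "type": "home", "number": "212 555-1234" } ]
}
"""

document = json.loads(text)
for name, value in document.items():
    # JSON strings, numbers, booleans, objects and arrays map to
    # Python str, int/float, bool, dict and list.
    print(f"{name}: {type(value).__name__}")
```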

NoSQL: Document Stores
REST: Representational State Transfer
• “RESTful”: conforming to the REST constraints
• architecture style, using HTTP as protocol
• allows clients and server to exchange information in a unified way
• four operations: GET & POST (…known from web pages), PUT, DELETE
Example
$ curl -X PUT -d "{ \"key\": 123 }" http://localhost:7777/db/json1
{ "ok":true, "id":"json1", "rev":"242612878" }
$ curl http://localhost:7777/db/json1
{ "_id":"json1", "_rev":"242612878", "key": 123 }
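The same two requests can be prepared with Python's standard `urllib` module. This is a sketch under assumptions: host, port and database name are copied from the curl example, the helper names are made up, and `send` only works if a server is actually listening there, so it is defined but not called.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7777/db"  # host, port and database as in the curl example


def build_put_request(doc_id: str, document: dict) -> urllib.request.Request:
    """Prepare a PUT request that stores a JSON document under the given id."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_get_request(doc_id: str) -> urllib.request.Request:
    """Prepare a GET request that retrieves the document with the given id."""
    return urllib.request.Request(f"{BASE_URL}/{doc_id}", method="GET")


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


put_request = build_put_request("json1", {"key": 123})
get_request = build_get_request("json1")
print(put_request.get_method(), put_request.full_url)
print(get_request.get_method(), get_request.full_url)
```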


NoSQL: Document Stores
Views
• constructed by arbitrarily complex JavaScript functions,
  which can be executed in parallel (as map function)
• views can be indexed, indexes can be incrementally updated
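The idea of a view can be illustrated with a Python analogue. CouchDB itself expects the map function to be written in JavaScript; the function names and the two sample documents below are made up for this sketch.

```python
def map_by_category(doc):
    """Map function: produce one (key, value) row per category of a document."""
    for category in doc.get("categories", []):
        yield category, doc["title"]


def build_view(docs, map_fn):
    """Apply the map function to every document and sort the rows by key."""
    # Each document is processed independently, so this step could run in parallel.
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows)


docs = [
    {"title": "MongoDB", "categories": ["NoSQL", "Database"]},
    {"title": "CouchDB", "categories": ["NoSQL"]},
]
view = build_view(docs, map_by_category)
print(view)
# [('Database', 'MongoDB'), ('NoSQL', 'CouchDB'), ('NoSQL', 'MongoDB')]
```

Because the rows are sorted by key, a range of keys can be looked up efficiently, and a changed document only affects its own rows.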
Replication
• CouchDB instances are offline by default: synchronization happens
  in a pre-defined fashion if client goes online
• incremental: synchronization may be interrupted and continued later
• bi-directional: modifications may happen on either side

NoSQL: Document Stores
MongoDB
“huMONGOus DataBase”
•   developed by Dwight Merriman and Eliot Horowitz (DoubleClick, ShopWiki)
•   written in C++; interaction: JavaScript; APIs: Java, PHP, …
•   concurrency model: update-in-place
•   replication: master-slave
•   scalability: automatic sharding (✎ remember?)
•   document format: BSON (binary JSON)
•   prominent users: SourceForge, bit.ly, foursquare, diaspora, New York Times


NoSQL: Document Stores
BSON: Binary JSON
• data types: string, integer, double, boolean,
  date, byte array, BSON objects & arrays, null
• designed to be efficient in terms of memory consumption and scan speed
• BSON is a superset of JSON (except for some size limitations)
Replication: Master-Slave
Database is usually deployed on at least two servers:
• single master node performs reads and writes
• slave nodes are used for reads and backups

NoSQL: Document Stores
Database Communication
Powerful interface, based on JavaScript.
Example
> db.createCollection("test");
> db.test.insert({ title: "MongoDB", categories: ["NoSQL", "Database"] });
> // findOne() returns a single document; find() would return a cursor
> var doc = db.test.findOne({ categories: "NoSQL" });
> doc.categories = ["NoSQL", "Document Database"];
> db.test.save(doc);
> ...
> // sorted by title: skip the first 20 results, return the next 10
> db.test.find().sort({ title: 1 }).skip(20).limit(10);
> ...


NoSQL: Key-Value Stores
Overview
• simple common baseline: maps or dictionaries, storing keys and values
• also called: associative arrays, hash tables/maps
• keys are unique, values may have arbitrary type
• focus: high scalability (more important than consistency)
• traditional solution: BerkeleyDB, started in 1986
• revived by Amazon Dynamo in 2007 (proprietary)
• recent solutions: Redis, Voldemort, Tokyo Cabinet, Memcached
☞ (very) limited query facilities; usually get(key) and put(key, value)


NoSQL: Key-Value Stores
Amazon Dynamo
DeCandia et al. [2007], Dynamo: Amazon's Highly Available Key-value Store
• most important requirement: reliability
• runs on tens of thousands of servers (mostly commodity hardware) → components fail continuously
• one of several databases used at Amazon
• use of relational databases would “lead to inefficiencies and limit scale and
  availability”
Concepts
• replication and partitioning via consistent hashing (✎ remember?)
• consistency facilitated by object versioning, based on vector clocks

NoSQL: Key-Value Stores
Amazon Dynamo
DeCandia et al. [2007], Dynamo: Amazon's Highly Available Key-value Store
Arguments against RDBMS
• complex querying and management functionality is not needed
• systems with ACID properties tend to have poor availability
Consequences
• operations are limited to one key/value pair at a time:
   no cross-references, no multiple updates, no isolation guarantees
• services are only used internally by Amazon → no security layers needed
• optimistic replication scheme (“always writeable”)

NoSQL: Key-Value Stores
Amazon Dynamo
DeCandia et al. [2007], Dynamo: Amazon's Highly Available Key-value Store
Problem                        Technique                       Advantage
Partitioning                   Consistent hashing              Incremental scalability
High availability for writes   Vector clocks with              Version size is decoupled from
                               reconciliation during reads     update rates
Handling temporary failures    Sloppy quorum and               Provides high availability and durability
                               hinted handoff                  guarantee when some of the replicas are
                                                               not available
Recovering from permanent      Anti-entropy using              Synchronizes divergent replicas in the
failures                       Merkle trees                    background
Membership and failure         Gossip-based membership         Preserves symmetry and avoids having a
detection                      protocol and failure            centralized registry for storing membership
                               detection                       and node liveness information

NoSQL: Key-Value Stores
Redis
•   in-memory database, sponsored by VMware
•   written in ANSI C, bindings for numerous other languages
•   replication: master-slave; slaves can write and may have other slaves
•   persistence: snapshots (asynchronous transfer from main memory to disk)
•   values may be of types other than string: lists, (sorted) sets, hashes
•   database supports high-level operations: intersection, union, difference
•   performance: very fast, no notable difference between read and write
•   not suitable for querying data (limited to value retrievals)


NoSQL: Key-Value Stores
Memcached
•   memory caching system used to speed up database-driven applications
•   transient database: data is purged if it has not been requested for too long
•   central hash table may be distributed across multiple machines (…huge)
•   security: source code can be compiled to support SASL authentication
•   used by YouTube, Facebook, Twitter, Reddit, …
Client/Server Communication
• client calculates hash of the key to determine the target server and sends the request to it
• server computes second hash to locate the value in its local hash table
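The two hashing steps can be sketched as follows, assuming the usual Memcached arrangement in which the client selects the server and the server selects a slot in its own table. The class names, the bucket count and the hash functions (CRC32 on the client, Adler-32 on the server) are illustrative choices, not Memcached's actual implementation.

```python
import zlib


class CacheServer:
    """One cache node; a second hash selects the bucket inside this server."""

    def __init__(self, num_buckets: int = 8):
        self.buckets = [{} for _ in range(num_buckets)]

    def _bucket(self, key: str) -> dict:
        return self.buckets[zlib.adler32(key.encode("utf-8")) % len(self.buckets)]

    def set(self, key: str, value) -> None:
        self._bucket(key)[key] = value

    def get(self, key: str):
        return self._bucket(key).get(key)


class CacheClient:
    """The client hashes the key to decide which server is responsible."""

    def __init__(self, servers):
        self.servers = servers

    def _server(self, key: str) -> CacheServer:
        return self.servers[zlib.crc32(key.encode("utf-8")) % len(self.servers)]

    def set(self, key: str, value) -> None:
        self._server(key).set(key, value)

    def get(self, key: str):
        return self._server(key).get(key)


servers = [CacheServer() for _ in range(3)]
client = CacheClient(servers)
client.set("user:1", "Alice")
print(client.get("user:1"))   # Alice
print(client.get("user:2"))   # None (cache miss)
```

The servers do not need to know about each other; the distributed hash table exists only through the client's choice of server.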

NoSQL: Key-Value Stores
Project Voldemort
project-voldemort.com/design.php
• developed for LinkedIn network
• database API: get(key), put(key, value), delete(key)
• no filters, no joins, no key constraints, no triggers, …
Reasoning
• only efficient queries are supported, predictable performance
• service-orientation often disallows foreign key constraints
• data can be easily distributed


NoSQL: Document/Key-Value Stores
Summary
• document stores are helpful for handling and querying textual data
• key/value stores are needed when processing large amounts of map (key/value) data
• none of the approaches offers all-in-one solutions
  (which RDBMSs often claim for themselves)
• instead, new systems mainly tackle performance and scalability issues
• often, solutions can be combined (example: Amazon Dynamo collaborates
  with existing RDBMS solutions, such as MySQL)
✎ What about Column, Graph and Object Databases?
