NoSQL refers to non-relational databases. With the rise of Web 2.0 sites, traditional relational databases have proved inadequate for large-scale, highly concurrent, dynamic applications such as social-networking sites, exposing many problems that are hard to overcome. Non-relational databases, by contrast, have developed very rapidly precisely because of their own characteristics.
Per Holm (Per.Holm@cs.lth.se), Database Technology 2010/11

NoSQL Overview

- Relational databases can be used to solve all kinds of problems, but are maybe not the right solution to all problems.
- New applications (often web-centric) have new requirements:
  - Huge amounts of data (terabytes or petabytes)
  - Only simple queries (based on primary key, no joins)
  - Simple data structure (often)
  - Must scale well
- NoSQL = "Not only SQL". A better name would be "Not only relational".
- A mixture of ideas, concepts, tools, products, ...

Examples: Lots of Data

- Twitter: 95 million tweets per day (1100 per second) must be stored. MySQL earlier, now Cassandra (and more).
- Facebook: 500 million active users, half of them log in every day. Each user has 130 friends (on average). 30 billion pieces of content (links, texts, blog posts, photo albums) accessed every day. (Cassandra)
- LinkedIn: More than 90 million members, one new member every second. Two billion people searches per year. (Voldemort)

Buy a Bigger Computer Instead?

- Big computers can store lots of data ...
- But big computers are expensive, and you have to pay big license fees for a big Oracle installation.
- Even big computers can fail.
- Better to use a lot of cheap commodity PCs:
  - Replicate data so one or a few failing nodes don't matter.
  - Design the storage system so it can be expanded (during uptime) by adding PCs.

Is It New?

- Yes: the term NoSQL is from 2009.
- But NoSQL databases have been around longer than that.
- And before anything NoSQL there were object-oriented databases, hierarchical databases, network databases, ...

Different Types of Data Stores

- Key–Value: a distributed hash table. Arbitrary key type; the value is a "blob".
  The application program must be aware of the structure of the value. (Amazon Dynamo)
- Document: as key–value, but the value is a document, and the DBMS knows that. (MongoDB, CouchDB)
- Columns: the value is a set of columns, like in a relational database, but they do not necessarily follow a schema. (Google BigTable, Cassandra)
- Graph: the database is a set of nodes with properties, and a set of connections between the nodes (with properties). (Neo4J)

The CAP Theorem

- The CAP Theorem says that you cannot have all three of Consistency, Availability, and Partition tolerance:
  - Strong Consistency: all clients see the same version of the data, even on updates to the dataset, e.g. by means of the two-phase commit protocol.
  - High Availability: all clients can always find at least one copy of the requested data, even if some of the machines in a cluster are down.
  - Partition tolerance: the total system keeps its characteristics even when being deployed on different servers, transparently to the client.
- Many NoSQL systems sacrifice consistency and go for BASE (next slide).

No ACID, BASE Instead

- Transactions are no longer guaranteed to be ACID (atomic, consistent, isolated, durable). BASE is almost the opposite: basically available, soft state, eventually consistent.
- BASE is optimistic and accepts that the database consistency is in a state of flux.
- "Eventual consistency" (actually more like durability) means that inaccurate reads are permitted as long as the data is synchronized "eventually". (Compare with DNS: it takes time for changes to propagate.)

Amazon Dynamo

- Dynamo was developed by Amazon. First used for the shopping cart, now also for other applications.
- Goal: always available, writes never fail.
- Key–value store. Records are replicated on several computers.
- Read & write: only single records.
- Operations:
  - get(key) returns a value or a list of several versions of a value. The application must solve problems with inconsistencies.
  - put(key,value) writes a value.
- The key is hashed; the hash code determines on which nodes the value should be stored ("consistent hashing").

Cassandra

- First developed by Facebook, now a top-level Apache project.
- Key–value & replication like in Dynamo.
- But the value has structure: it contains columns (which are stored in column families, which may be stored in super columns). A column has a name, a value, and a timestamp. Columns may be sorted on value or on timestamp.
- Inbox search at Facebook: 50+ TB of data stored on 150 machines.
  - Term search: the user id is the key. Words in messages are the super columns; message id's become the columns.
  - Interaction search: the user id is the key. Recipient id's are the super columns; message id's become the columns.

Computing Model

- Not only storage should be distributed, but also computing. It is difficult to write parallel programs ...
- MapReduce is a new programming model.
- All data is treated as sets of key–value pairs. The key is a string, the value is a blob.
- All programs are sequences of alternating map and reduce functions:
  - The map function processes a key–value pair and generates one or more intermediate key–value pairs.
  - The reduce function merges all intermediate values associated with the same intermediate key.
- Map functions run in parallel on many computers, as do reduce functions.

MapReduce Data Flow

(Figure only: key–value pairs flowing from the map phase, grouped by key, into the reduce phase. Image not included.)

MapReduce Example

Compute word counts within a set of documents:
    map(key, value):
        // key: document name
        // value: document text
        for each word w in value:
            emitIntermediate(w, 1)

    reduce(key, values):
        // key: a word
        // values: a list of counts
        result = 0
        for each v in values:
            result += v
        emit(result)

MapReduce Figures (From Google)

- Execution on a cluster of 1800 machines: 2 × 2 GHz processors, 4 GB memory, 320 GB disk, Gigabit Ethernet. The figures are from the original MapReduce paper, 2004.
- Grep: scan through 10^10 100-byte records, searching for a three-character pattern. 150 seconds, including 60 seconds startup overhead.
- Sort: sort 10^10 100-byte records. 15 minutes.
- Google web search uses an index which is created with MapReduce.

MapReduce vs Traditional Databases

- Data has no explicit schema. The map and reduce functions must "understand" the data format. Users have to write procedural code to interpret and process the data. A step backwards?
- Higher-level programming languages for MapReduce: Pig, Hive.
- Data is stored in files in a distributed file system.
- All processing is sort based: this makes the programming easier, but may be a performance concern.

More Information

- http://en.wikipedia.org/wiki/Nosql
- http://nosql-databases.org/
- http://nosql.mypopescu.com/
- http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape/
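The word-count pseudocode above translates almost line for line into runnable Python. The sketch below simulates the three phases of a MapReduce job (map, shuffle/group-by-key, reduce) in a single process; the function and variable names are illustrative, not part of any real MapReduce framework:

```python
from collections import defaultdict

def map_doc(name, text):
    # map: emit an intermediate (word, 1) pair for every word in the document
    for word in text.split():
        yield (word, 1)

def reduce_counts(word, counts):
    # reduce: sum all intermediate counts for one word
    return (word, sum(counts))

def map_reduce(documents):
    # shuffle phase: group intermediate pairs by their key
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_doc(name, text):
            intermediate[key].append(value)
    # reduce phase: one reduce call per distinct intermediate key
    return dict(reduce_counts(k, vs) for k, vs in intermediate.items())

docs = {"d1": "to be or not to be", "d2": "not to worry"}
print(map_reduce(docs))
# → {'to': 3, 'be': 2, 'or': 1, 'not': 2, 'worry': 1}
```

In a real deployment the intermediate pairs would be partitioned across machines and the map and reduce calls would run in parallel; the data flow, however, is the same.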
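The "consistent hashing" that Dynamo uses to place values on nodes (described earlier) can be sketched as a hash ring: node names and keys are hashed onto the same circular space, and a key is stored on the first node whose position follows the key's hash. This is a minimal illustration, with MD5 as the hash and made-up node names, and without the virtual nodes and replication a real Dynamo ring has:

```python
import hashlib
from bisect import bisect_right

def ring_hash(s):
    # hash a string onto the ring (a large integer space)
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # each node sits on the ring at the hash of its name
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        # walk clockwise from the key's hash to the next node,
        # wrapping around at the end of the ring
        h = ring_hash(key)
        i = bisect_right([pos for pos, _ in self.ring], h) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("cart:alice"))  # the same key always maps to the same node
```

The point of the scheme is that adding or removing a node only moves the keys in one segment of the ring; with a naive `hash(key) % n_nodes` placement, almost every key would move whenever the cluster changed size.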