

									.NET and NoSQL
       Introducing Cassandra

{   John Zablocki
    Development Manager, HealthcareSource
    Organizer, Beantown ALT.NET
    Beantown ALT.NET
       New England Code Camp – 10/29/2011
       WP7 Location @ Dev Boston Meetup –
       DDD w/ Steve Bohlen @ Beantown ALT.NET –

Shameless Plugs
      NoSQL Overview
      Cassandra Basic Concepts
      Cassandra Data Model
      Client API
      Cassandra and .NET
      Questions?

    {   Not Only SQL
       Coined in 1998 by Carlos Strozzi to describe a
        database that did not expose a SQL interface
       In 2008, Eric Evans reintroduced the term to
        describe the growing non-RDBMS movement
       Broadly refers to a set of data stores that do not
        use SQL or a relational data model
       Popularized by large web presences such as
        Google, Facebook and Amazon

What is NoSQL?
NoSQL Databases
       NoSQL databases come in a variety of flavors *
           XML (myXMLDB, Tamino, Sedna)
           Tabular (HBase, Bigtable)
           Key/Value (Redis, Memcached with BerkeleyDB)
           Object (db4o, JADE)
           Graph (Trinity, neo4j, InfoGrid)
           Document store (CouchDB, MongoDB)
           Eventually Consistent Key/Value Store
            (Cassandra, Dynamo)

            * loose taxonomies

NoSQL Databases
Why NoSQL?
       RDBMS Administrators are highly paid
       Highly paid individuals often buy larger than
        average homes or cars
       Larger than average homes and cars require
        more energy than smaller homes and cars
       Therefore RDBMSs contribute to global
        warming more than NoSQL databases, which
        typically do not require the addition of a DBA

RDBMSs and the Environment
   RDBMSs often require high-end servers and
    are taxing on disks
   High-end servers consume more electricity than
    mid-range servers
   Taxed disks fail more often than untaxed disks
   Therefore RDBMSs require more energy and
    produce more waste (lots of hard drives in
    landfills) than NoSQL DBs, which run on mid-
    range servers.
Even More Why NoSQL?
       The current healthcare crisis requires talented
        software engineers to fix the outdated or non-
        existent IT systems of the hospital system
       Talented software engineers spend a great deal
        of time mapping objects to tables in RDBMSs
       Talented software engineers are unable to fix
        healthcare because they are too busy mapping
        objects to tables
       Therefore RDBMSs are causing illnesses

NoSQL and Healthcare
         Three disruptive technologies you should be
          paying attention to today…
             NoSQL databases and big data technologies -
              especially MongoDB, CouchDB, Cassandra,
              HBase, MapReduce and Hadoop
             Evented I/O Web Servers – especially Node.js
              and to a lesser extent Tornado
             Functional programming languages – especially
              Scala, F# and Erlang

Please Pardon the Interruption…
Introducing Cassandra
       Open source, Apache supported project
       Originally written by Facebook for Inbox search feature
            FB now uses a proprietary fork
       Written in Java. Yes, Java.
       Column-oriented with row-oriented properties
       Schemaless
       Data stored in sparse, multidimensional hashtables
            Sparse meaning that a row may have one or more columns, and rows need not share the same columns
       Distributed and Decentralized
       Highly Available and Fault Tolerant
       Elastic Scalability
       Tunable Consistency
       MapReduce via Hadoop

About Cassandra
       Column-Oriented
           Content stored by column, rather than by row:
                1,2,3;
           More efficient when an aggregate needs to be computed
            over many rows
           More efficient when writing new values for a column to
            all rows at once
           Better compression is possible, due to the fact that modern
            compression schemes make use of the similarity of
            adjacent data (column data is uniform)
           Less efficient for multi-column reads
           Less efficient for multi-column writes
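
The `1,2,3;` fragment above looks like the start of a column-wise serialization. As a rough sketch (the sample records are made up for illustration and are not from the slides), the same three records can be laid out row by row or column by column:

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The records below are hypothetical sample data.
rows = [
    (1, "Smith", 40000),
    (2, "Jones", 50000),
    (3, "Johnson", 44000),
]

# Row-oriented: each record is stored contiguously.
row_store = ";".join(",".join(str(v) for v in r) for r in rows)

# Column-oriented: all values of one column are stored contiguously.
columns = list(zip(*rows))
column_store = ";".join(",".join(str(v) for v in col) for col in columns)

# An aggregate over one column only needs that column's run of values.
total_salary = sum(columns[2])

print(row_store)
print(column_store)
print(total_salary)
```

Reading a whole record from the column layout means visiting every column run, which is why multi-column reads and writes are listed as less efficient.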

        Cassandra is meant to run on multiple nodes
            Single node is possible, but Cassandra’s benefits
             will not be realized
        Every node is identical
            No Master/Slave
            Peer-to-peer protocol keeps data in sync (gossip)

Distributed and Decentralized
       Periodic, pairwise interactions
       Bounded size information exchange
       One agent changes the state of another
       Reliable communication is not assumed
       Low frequency of interactions to minimize
        protocol costs
       Some form of randomness in peer selection

Gossip Protocol
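
The properties listed for the gossip protocol can be sketched with a small simulation. This is a simplified illustration, not Cassandra's actual gossiper: the `(version, value)` state, the node names and the function names are all assumptions made for the example.

```python
import random


def gossip_round(states, rng):
    """One round: every node contacts one random peer and both keep the newer state."""
    nodes = list(states)
    for node in nodes:
        # Some form of randomness in peer selection.
        peer = rng.choice([n for n in nodes if n != node])
        # Bounded exchange: a single (version, value) pair; the higher version wins.
        newest = max(states[node], states[peer])
        states[node] = newest
        states[peer] = newest


def run_gossip(node_count=8, max_rounds=200, seed=42):
    """Spread one update through the cluster; returns final states and rounds used."""
    rng = random.Random(seed)
    states = {f"node{i}": (0, "old") for i in range(node_count)}
    states["node0"] = (1, "new")  # a single node learns about an update
    rounds = 0
    while len(set(states.values())) > 1 and rounds < max_rounds:
        gossip_round(states, rng)
        rounds += 1
    return states, rounds


if __name__ == "__main__":
    final_states, rounds_used = run_gossip()
    print(rounds_used, final_states)
```

No node coordinates the exchange, yet the update reaches every node after a small number of rounds.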
        Vertical Scaling
             Throw hardware at the problem
             More memory, faster CPU, etc.
        Horizontal Scaling (Clustering)
             Add more machines
             Possibly partition the data across machines
        Elastic Scaling
             Horizontal cluster that can scale up and scale down
             New nodes can be brought online and begin
              serving requests with partial data
             New nodes come online without service disruption

Elastic Scalability
       Consistency - ensures transactions move a database
        from one consistent state to another
       Cassandra supports tunable consistency
           Strict (sequential) consistency – all nodes see all
            writes in the same order
                A read always returns the most recent write
           Causal consistency – potentially causally related
            operations seen by all nodes in the same order
                Concurrent writes are not causally related
                Timestamps used to determine the cause of events
           Weak (eventual) consistency – all updates will
            propagate to all nodes, but not immediately
       See Eric Brewer’s CAP Theorem

Tunable Consistency
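
One common way to reason about tunable consistency, which the slides do not spell out, is replica-count arithmetic: with N replicas, a write acknowledged by W replicas and a read that consults R replicas must overlap whenever R + W > N. A minimal sketch, with function names chosen for this example:

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(n, w, r):
    """True when every set of r read replicas overlaps every set of w written replicas."""
    return r + w > n


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    print("quorum of", n, "is", q)
    print("quorum writes + quorum reads overlap:", read_sees_latest_write(n, q, q))
    print("single-replica writes + reads overlap:", read_sees_latest_write(n, 1, 1))
```

Lower values of R and W trade the overlap guarantee for latency and availability, which corresponds to the weak (eventual) end of the range above.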
       Large-scale distributed systems have three
        competing requirements
            Consistency – all nodes see the same data at the
             same time
            Availability – All clients will always be able to read
             and write data and all requests will receive a
             response of success or failure
            Partition Tolerance – The system will continue to
             function, even in the face of network segmentation
       Theorem states that a distributed system can satisfy
        only 2 of these 3 properties at the same time

Brewer’s CAP Theorem
       Consistency and Availability
            Two-phase commit for distributed transactions
            System blocks on a network partition
       Consistency and Partition Tolerance
            Pessimistic locking
            Node failure hinders availability
       Availability and Partition Tolerance
            System always returns data, even if inaccurate
            Optimistic locking
            DNS, web caching

Brewer’s CAP Theorem
       Clusters (rings)
            Set of nodes that appear as a single server
            Single node is still a cluster
            Container for keyspaces
       Keyspaces
            Analogous to a relational database
            Has name and set of attributes to define keyspace-
             wide behavior
                 Replication factor (# of nodes that will have a copy of each row)
                 Replica placement strategy (how rows are copied)
                 Column Families

Cassandra Data Model
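
To make the ring, replication factor and replica placement ideas concrete, here is a minimal sketch of placing a row on a token ring and copying it to the next nodes clockwise. The hash function, ring size, node names and tokens are assumptions for illustration; a real cluster's partitioner and placement strategy are configurable.

```python
import bisect
import hashlib


def token(key, ring_size=2**32):
    """Map a row key to a position on the ring (toy hash, illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % ring_size


def replicas_for(key, node_tokens, replication_factor):
    """Return the nodes that hold copies of the row with this key.

    node_tokens maps a node name to the token it owns.
    """
    ring = sorted((t, name) for name, t in node_tokens.items())
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    # Remaining copies go to the next nodes clockwise.
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


if __name__ == "__main__":
    nodes = {"a": 0, "b": 2**30, "c": 2**31, "d": 3 * 2**30}
    print(replicas_for("Goodfellas", nodes, replication_factor=3))
```

With a replication factor of 3 on a four-node ring, losing any single node still leaves two copies of every row.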
       Column Families
            Analogous to a relational table
            Container for an ordered collection of rows
       Columns
            Basic data structure in Cassandra
            Consists of a name, value and clock (timestamp)
            Defined with a key name sorting rule (ascii, integer, etc.)
                 Value sorting is not possible
            Names and values stored as Java byte arrays
            May be indexed for queries
       Super Columns
            A special column with values that are maps of subcolumns
             (standard columns)
            Single level of nesting only
            Subcolumns are not indexed – read a supercolumn and all of
             its columns are read as well

Cassandra Data Model
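
The nesting described above (keyspace, column family, row, column) can be pictured as nested dictionaries. This is only a mental model written in Python; the helper names are made up, and the newest-timestamp-wins rule is stated here as an assumption about how the column clock is used.

```python
import time


def make_column(name, value, timestamp=None):
    """A column is a name, a value and a clock (timestamp)."""
    if timestamp is None:
        timestamp = int(time.time() * 1_000_000)
    return {"name": name, "value": value, "timestamp": timestamp}


# keyspace -> column family -> row key -> column name -> column
cluster = {"BeantownAltNet": {"movies": {}}}


def set_column(keyspace, column_family, row_key, name, value, timestamp=None):
    row = cluster[keyspace][column_family].setdefault(row_key, {})
    new = make_column(name, value, timestamp)
    old = row.get(name)
    # Keep whichever version of the column carries the newest timestamp.
    if old is None or new["timestamp"] >= old["timestamp"]:
        row[name] = new


def get_row(keyspace, column_family, row_key):
    row = cluster[keyspace][column_family].get(row_key, {})
    # Columns are ordered by name (the comparator), never by value.
    return [(name, row[name]["value"]) for name in sorted(row)]


if __name__ == "__main__":
    set_column("BeantownAltNet", "movies", "Goodfellas", "Year", 1990)
    set_column("BeantownAltNet", "movies", "Goodfellas", "Genre", "Drama")
    print(get_row("BeantownAltNet", "movies", "Goodfellas"))
```

Rows in the same column family need not have the same columns, which is what makes the structure sparse.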
       System keyspace stores metadata about the cluster,
        similar to the master db in SQL Server
       Peer-to-peer distribution model where behavior of
        each node is identical (no Master/Slave)
            New node added to cluster without disruption
            Accepts requests only after learning topology
       Gossip protocol where gossiper runs every second
        on a timer
            Each node has information about the others
       Anti-entropy is the replica synchronization
        mechanism in Cassandra
            Nodes exchange hashes of column family data in
             order to determine whether read-repair is needed

Cassandra Architecture
       Writes are immediately written to a commit log
        and subsequently written to an in-memory
        store called the memtable
       At a specified threshold objects in the
        memtable are flushed to disk to an immutable
        structure called a sorted string table (SSTable)
       Hinted handoffs allow nodes to receive a write
        intended for another node if that other node
        goes offline. The hint tells the receiving node
        to update the offline node when back online

Cassandra Architecture
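
The write path above (commit log, then memtable, then a flush to an immutable SSTable) can be sketched as a toy class. The class name, the entry-count threshold and the in-memory lists standing in for disk are assumptions for illustration only; hinted handoff is left out.

```python
class ToyNode:
    """Toy model of one node's write path; everything lives in memory."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []      # append-only record of writes
        self.memtable = {}        # in-memory store
        self.sstables = []        # immutable, sorted tables (oldest first)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # An SSTable is sorted by key and never modified after it is written.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Simplification: flushed entries no longer need the commit log.
        self.commit_log = []

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest SSTable first, so later writes shadow earlier ones.
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None


if __name__ == "__main__":
    node = ToyNode(flush_threshold=2)
    node.write("Goodfellas", "Drama")
    node.write("Casino", "Drama")
    print(node.sstables, node.read("Casino"))
```

Because SSTables are never updated in place, an overwrite simply lands in a newer table; merging the tables later is the job of compaction.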
       Compaction is the operation of merging SSTables
            Keys are merged
            Columns are combined
            Tombstones are discarded
            New index created
            Merged data are sorted
       Bloom filters are used to reduce disk access
            Fast probabilistic structures for testing whether an
             element is a member of a set (false positives are
             possible, false negatives are not)
       Tombstones are deletion markers on records
            All delete commands in Cassandra are soft deletes

Cassandra Architecture
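
A Bloom filter can be sketched in a few lines. The bit-array size, the number of hashes and the use of SHA-256 are arbitrary choices for this example, not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """Set-membership test that may report false positives but never false negatives."""

    def __init__(self, size_bits=1024, hash_count=3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive hash_count bit positions by salting the item with an index.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    keys = BloomFilter()
    keys.add("Goodfellas")
    print(keys.might_contain("Goodfellas"), keys.might_contain("Zoolander"))
```

A "definitely absent" answer lets a read skip an SSTable without touching disk, which is how the filter reduces disk access.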
Using Cassandra
     {   The Windows Experience
         Install the Java 1.6 (or later) SDK (the JDK)
         Set environment variable JAVA_HOME to
          the install path of the JDK
         Download the binaries from

         Unzip to Program Files (x86) or some other
          directory, optionally set PATH
         Set environment variable
          CASSANDRA_HOME to directory above
         In command line, navigate to bin under
          CASSANDRA_HOME and run cassandra

Installing Cassandra on Windows
       Command line interface
       Navigate to bin, under CASSANDRA_HOME
        and run cassandra-cli
       Generally useful for development, but not
        meant to be a full-blown client
       Allows for basic administration (creating
        keyspaces, column management, etc.)
       Commands must be terminated with a ;

       Connect to a server
           connect localhost/9160;
       Connect to a server at CLI start
           cassandra-cli --host localhost --port 9160
       System information commands
           show cluster name;
           show keyspaces;
           show api version;

       Create a keyspace
           create keyspace BeantownAltNet;
       Switch to keyspace
           use BeantownAltNet;
       Create a column family
           create column family movies with
            comparator=UTF8Type and
       View information about column family
           describe keyspace BeantownAltNet;

   See this JIRA issue and then run (v0.8):
      assume movies keys as ascii;

   Add a row of data
        set movies['Goodfellas']['Genre'] = 'Drama';
        set movies['Goodfellas']['Year'] = 1990;
   Count the columns
      count movies['Goodfellas'];

   Get the row and column
        get movies['Goodfellas'];
        get movies['Goodfellas']['Genre'];
       Create an index on Genre
           update column family movies with
            index_type:0, index_name:IdxGenre,
       Query by genre
           get movies where Genre = 'Drama';
       Remove a column
           del movies['Goodfellas']['Year'];
       Remove a row
           del movies['Goodfellas'];

        Thrift is used for Cassandra’s client API
        Effectively an RPC serialization mechanism
        Software framework for scalable, cross-language
         services development
        Combines software stack with code generation to
         build services
        Support for C++, Java, Python, PHP, Erlang, Perl,
         Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk
         and Ocaml
         struct UserProfile {
                    1: i32 uid,
                    2: string name,
                    3: string blurb
         }
         service UserStorage {
                    void store(1: UserProfile user),
                    UserProfile retrieve(1: i32 uid)
         }

          CQL is a DSL similar to SQL, meant to better abstract the
           details of server operations from clients (still
           requires Thrift)
          Currently, CQL drivers exist only for Java and Python
          CREATE KEYSPACE BeantownAltNet with
          CREATE COLUMNFAMILY movies (
              key VARCHAR PRIMARY KEY,
              genre VARCHAR,
              year INT);
          INSERT INTO movies (key, genre, year) VALUES
           ('Zoolander', 'Comedy', 2001);
          SELECT key, genre, year FROM movies;
          SELECT key, genre, year FROM movies WHERE

Cassandra Query Language (CQL)
      Command line CQL tool that ships with the Python
       CQL driver
      Windows installation
           Grab the precompiled windows Thrift binaries for
            Python and copy to site-packages
           Download cassandra-dbapi2 from
            and run - install
           easy_install pyreadline
      Run - python cqlsh localhost 9160

       strategy_class='SimpleStrategy' AND
      CREATE COLUMNFAMILY users (key
       VARCHAR PRIMARY KEY, nickname
      INSERT INTO users (key, nickname) VALUES
       ('jzablocki', 'zblock');
      SELECT * FROM users;

.NET and
     {   The Client Libraries
       Currently, there are three well-maintained,
        community-sponsored client libraries
            Cassandra-Sharp -
            Aquiles -
            FluentCassandra -
       No official Apache client

.NET Client Libraries
       Configured in App/Web.config
       Simple API over most common Thrift calls
       Additional support for Cassandra commands
        via Execute method and Client class
       Support for executing CQL


   Cassandra-Sharp Demo
       Configured in App/Web.config
       Simple wrapper over most common Thrift calls
       No direct support for executing CQL (though
        an internal class does have CQL execution)


Aquiles Demo
       Intended to be an idiomatic .NET Cassandra
        framework (i.e., more like .NET than Java)
       Makes use of the .NET 4.0 dynamic feature
       Raw Thrift commands are abstracted
       No current support for CQL
       Developed by Nick Berardi


FluentCassandra Demo
     {   Codd is Dead
       Materialized View
           Store redundant data for more efficient queries
            MovieGenres['Drama']['Goodfellas'] = null;
            MovieGenres['Drama']['Casino'] = null;
       Valueless Column
           All data necessary to satisfy a query is in the
            column. No value needed (see above)
       Aggregate Key
           Combine values with a delimiter to create a
            composite key
            ZipCodes['Wethersfield:CT'] = '06109';
            ZipCodes['Cambridge:MA'] = '02140';

Design Patterns
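
The three patterns can be mimicked with plain dictionaries to show their shape. This is a sketch only: the helper names are invented for the example, and dictionaries stand in for column families.

```python
# Materialized view with valueless columns: genre -> {title: None}.
# The column name (the title) carries all the information; the value is unused.
movie_genres = {}


def add_movie(title, genre):
    movie_genres.setdefault(genre, {})[title] = None


def titles_in_genre(genre):
    # Answering "which movies are dramas?" is a single row read.
    return sorted(movie_genres.get(genre, {}))


# Aggregate key: join several values with a delimiter to form one key.
zip_codes = {}


def aggregate_key(*parts, delimiter=":"):
    return delimiter.join(parts)


add_movie("Goodfellas", "Drama")
add_movie("Casino", "Drama")
zip_codes[aggregate_key("Wethersfield", "CT")] = "06109"
zip_codes[aggregate_key("Cambridge", "MA")] = "02140"

if __name__ == "__main__":
    print(titles_in_genre("Drama"))
    print(zip_codes[aggregate_key("Cambridge", "MA")])
```

The price of the materialized view is redundancy: the genre is stored both on the movie row and in the view, and the application has to keep the two in step.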
     – my blog
     – code projects
        samples - code from this presentation
        - O’Reilly’s Cassandra - The Definitive Guide

