  Storage and Analysis of
  Tera-scale Data : 1 of 2
      415 Database Class
          11/17/09
         Delip Rao
       delip@jhu.edu
What is the typical size of data you
   deal with on a daily basis?
• Google processes 20 petabytes of raw data a day
  – Works out to roughly 231 GB per second!
  – Numbers from 2008; they grow by the day.
    http://www.niallkennedy.com/blog/2008/01/google-mapreduce-stats.html
• 200 GB per day (March 2008)
• 2+ TB (compressed) per day in April 2009
• 4+ TB (compressed) per day in Oct 2009
  – 15 TB (uncompressed) a day (2009)
           And Many More …
• eBay: 50 TB / day
• NYSE: 1 TB / day
• CERN LHC: 15 PB / year
      In Today’s Class We Will ..
• Deal with scale from a completely different
  perspective
• Discuss problems with traditional approaches
• Discuss how to analyze large quantities of data
• Discuss how to store (physically) huge
  amounts of data in a scalable, reliable fashion
• Discuss a simple effective approach to store
  record-like data
             Dealing with scale
• MySQL will crawl with 500 GB of data
• “Enterprise” databases (Oracle, DB2)
  – Expensive $$$$$$$$
  – Do not scale well
     • Indexing painful
     • Aggregate operations (SELECT COUNT(*) ..) almost
       impossible
• Distributed databases
  – Expensive $$$$$$$$$$$$$$$$$$$$$$$$
  – Don’t scale well either
     Large Scale Data: Do we need
              databases?
• Traditional database design is inspired by
  decades of research on storage and retrieval.
• Complicated database systems == more tuning
  – A whole industry of “Database Administrators”
  – Result: Increased operational expenses
• Complicated indexing and transaction-processing
  algorithms are not needed if all we care about
  is analyzing the data
    Parallelize both data access and
                processing
• Over time, processing capacity has grown much
  faster than
  – Disk transfer rate (slow)
  – Disk seek time (even slower)
• Solution:
  – Process data on a cluster of nodes, each with
    its own independent CPUs and disks.
                  Overview
• MapReduce is a design pattern:
  – Manipulates large quantities of data
  – Abstracts away system-specific issues
  – Encourages cleaner software engineering
  – Inspired by functional programming primitives
           MapReduce by Example
•   Output: Word frequency histogram
•   Input: Text, read one line at a time
•   Single core design: Use a hash table
•   MapReduce:
      def mapper(line):
        for word in line.split():
           output(word, 1)

      def reducer(key, values):
        output(key, sum(values))
 Word Frequency Histogram (contd)

[Figure: the input is split into three chunks: “the quick
brown fox”, “the fox ate the rabbit”, and “the brown rabbit”.
Each chunk feeds a MAPPER, which emits (word, 1) pairs. The
SHUFFLE stage groups the pairs by word, and two REDUCERs sum
each group, emitting (ate, 1), (brown, 2), (fox, 2) and
(quick, 1), (rabbit, 2), (the, 4).]
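To make the shuffle concrete, here is a tiny single-process simulation (plain Java rather than Hadoop; all names are illustrative) that reproduces the figure’s output:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class MiniMapReduce {
      public static void main(String[] args) {
        String[] chunks = {"the quick brown fox", "the fox ate the rabbit",
                           "the brown rabbit"};
        // MAP: each chunk independently emits (word, 1) pairs;
        // SHUFFLE: the pairs are grouped by key
        Map<String, List<Integer>> groups = new TreeMap<String, List<Integer>>();
        for (String chunk : chunks) {
          for (String word : chunk.split(" ")) {
            if (!groups.containsKey(word)) {
              groups.put(word, new ArrayList<Integer>());
            }
            groups.get(word).add(1);
          }
        }
        // REDUCE: sum each group, e.g. (brown, 2), (the, 4)
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
          int sum = 0;
          for (int v : e.getValue()) sum += v;
          System.out.println("(" + e.getKey() + ", " + sum + ")");
        }
      }
    }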
            WordCount review
• Output: Word frequency histogram
• Input: Text, read one line at a time
  – Key: ignore, Value: Line of text
    def mapper(key, value):
      for word in value.split():
         output(word, 1)

    def reducer(key, values):
      output(key, sum(values))
     WordCount: In actual code
• Mapper, Reducer, and Driver (main) method
  (see the sketch below)
• Observe the benefits of abstraction from
  hardware dependence, reliability, and job
  distribution
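A minimal sketch of those three pieces, following the standard Hadoop WordCount tutorial and assuming the classic org.apache.hadoop.mapred API of this era:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {
      // Mapper: emit (word, 1) for every token in the input line
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reducer: sum the 1s collected for each word
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) sum += values.next().get();
          output.collect(key, new IntWritable(sum));
        }
      }

      // Driver: wire up the job; the framework handles distribution and retries
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }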
      “Thinking” in MapReduce
• Input is a sequence of key value pairs (records)
• Processing of any record is independent of the
  others
• Need to recast algorithms, and sometimes the
  data, to fit this model
  – Think of structured data (Graphs!)
     Example: Inverted Indexing
• Say you have a large collection (billions) of
  documents
• How do you efficiently find all documents that
  contain a certain word?
• Database solution:
       SELECT doc_id FROM doc_table
       WHERE doc_text CONTAINS 'word';
• Forget scalability. Very inefficient.
• Another demonstration of when not to use a DB
     Example: Inverted Indexing
• Well studied problem in Information Retrieval
  community
• More about this in 600.466 course (Spring)
• For now, we will build a simple index
  – Scan all documents in the collection
  – For each word, record the document in which it
    appears
• Can write a few lines of Perl/Python to do it
  – Simple but will take forever to finish
    (What is the complexity of this code?)
 “Thinking” in MapReduce (contd)
• Building inverted indexes
• Input: Collection of documents
• Output: For each word, find all documents
  with the word
    def mapper(filename, content):
      for word in content.split():
         output(word, filename)

    def reducer(key, values):
      output(key, unique(values))

    (What is the latency of this code?)
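The same job in Java, again assuming the old org.apache.hadoop.mapred API; a HashSet in the reducer plays the role of unique(values), and the driver is WordCount’s with Text as the output value class:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class InvertedIndex {
      public static class IndexMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          // The "filename" of the pseudocode comes from the input split
          String filename =
              ((FileSplit) reporter.getInputSplit()).getPath().getName();
          for (String word : value.toString().split("\\s+")) {
            output.collect(new Text(word), new Text(filename));
          }
        }
      }

      public static class IndexReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          // unique(values): collapse duplicate filenames for this word
          Set<String> docs = new HashSet<String>();
          while (values.hasNext()) docs.add(values.next().toString());
          output.collect(key, new Text(docs.toString()));
        }
      }
    }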
           Suggested Exercise
• Twitter has data in the following format:
<user_id, tweet_text, timestamp>
• Write MapReduce jobs for
  – Finding all users who tweeted on “Comic Con”
  – Ranking all users by frequency of their tweets
  – Finding how the number of tweets containing
    “iPhone” varies with time
            MapReduce vs RDBMS
              Traditional RDBMS         MapReduce
Data size     Gigabytes                 Petabytes
Access        Interactive & Batch       Batch
Updates       Read & write many times   Write once, read many
Structure     Static schema             Dynamic schema
Integrity     High                      Low
Scaling       Nonlinear                 Linear
The Apache Hadoop Zoo

[Figure: the Hadoop stack. PIG, CHUKWA, HIVE, and HBASE sit on
top; MAPREDUCE, HDFS, and ZOOKEEPER form the middle layer;
COMMON and AVRO are at the base.]
       Storing Large Data: HDFS
• Hadoop Distributed File System (HDFS)
• Very large distributed file system (~10PB)
• Assumes commodity hardware
  – Replication
  – Failure detection & Recovery
• Optimized for batch processing
• Single namespace for entire cluster
hdfs://node-21/user/smith/job21/input01.txt
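A minimal sketch of reading that file through the HDFS Java API (the path reuses the example above; error handling is omitted):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("hdfs://node-21/user/smith/job21/input01.txt");
        // The hdfs:// scheme resolves to the namenode; blocks come from datanodes
        FileSystem fs = FileSystem.get(path.toUri(), conf);
        FSDataInputStream in = fs.open(path);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);
        }
        reader.close();
      }
    }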
               HDFS Concepts
• Blocks – A single unit of storage
• Namenode (master)
  – manages namespace
     • Filesystem namespace tree + metadata
  – Maintains file to block mapping
• Datanodes (workers)
  – Perform block-level operations
            Storing record data
• HDFS is a filesystem: An abstraction for files
  with raw bytes and no structure
• A lot of real-world data occurs as tuples
  – Hence RDBMS. If only they were scalable …
• Google’s solution: BigTable (2004)
  – A scalable distributed multi-dimensional sorted
    map
• Currently used in 100+ projects inside Google
  – 70+ PB data; 30+ GB/s IO (Jeff Dean, LADIS ’09)
        Storing record data: HBase
• Open source clone of Google’s Bigtable
• Originally created at Powerset in 2007
• Used at: Yahoo!, Microsoft, Adobe, Twitter, …
• Distributed column-oriented database on top of
  HDFS
• Real time read/write random-access
• Not Relational and does not support SQL
• But can work with very large datasets
    – Billions of rows, millions of columns
           HBase: Data Model
• Data stored in labeled tables
  – Multi-dimensional sorted map
• Table has rows, columns.
• Cell: Intersection of a row & column
  – Cells are versioned (timestamped)
  – Contains an uninterpreted array of bytes
    (no type information)
• Primary key: The row key, which uniquely
  identifies a row
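Since the model is just a multi-dimensional sorted map, one way to picture it (an illustration only, not how HBase actually stores data) is as nested sorted maps:

    import java.util.TreeMap;

    public class DataModelSketch {
      public static void main(String[] args) {
        // row key -> (column "family:qualifier" -> (timestamp -> value))
        TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> table =
            new TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>>();

        // Insert one versioned, untyped cell
        table.put("row1", new TreeMap<String, TreeMap<Long, byte[]>>());
        table.get("row1").put("temperature:air", new TreeMap<Long, byte[]>());
        table.get("row1").get("temperature:air")
             .put(System.currentTimeMillis(), "15".getBytes());
        // Note: HBase orders versions newest-first; TreeMap sorts ascending
      }
    }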
     HBase: Data model (contd)
• Columns are grouped into column families
  – E.g., temperature:air, temperature:dew_point
• Thus a column name is family_name:identifier
• The column families are assumed to be known
  a priori
• However, new columns can be added to an
  existing family at run time
      ROW           COLUMN FAMILIES
                    temperature:            humidity:           …
      location_id   temperature:air         humidity:absolute   …
                    temperature:dew_point   humidity:relative
                                            humidity:specific
      HBase: Data model (contd)
• Tables are partitioned into regions
• Region: Subset of rows
• Regions are units that get distributed across a
  cluster
• Locking
  – Row updates are atomic
  – Updating a cell locks the entire row.
  – Simple to implement. Efficient.
  – Also, updates are rare
            HBase vs. RDBMS
• Scale: Billions of rows and millions of columns
• Traditional RDBMS:
  – Fixed schema
  – Good for small to medium volume applications
  – Scaling RDBMS involves violating Codd’s Rules,
    loosening ACID properties
 HBase schema design case study
• Store information about students, courses,
  course registration
• Relationships (Two one-to-many)
  – A student can take multiple courses
  – A course is taken by multiple students
   HBase schema design case study
• RDBMS solution

  STUDENTS            REGISTRATION     COURSES
  id (primary key)    student_id       id (primary key)
  name                course_id        title
  department_id       type             faculty_id
 HBase schema design case study
• HBase solution
   ROW            COLUMN FAMILIES
                  info:                course:
   <student_id>   info:name            course:<course_id> = type
                  info:department_id


   ROW            COLUMN FAMILIES
                  info:                student:
   <course_id>    info:title           student:<student_id> = type
                  info:faculty_id
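A sketch of recording one registration under this schema, assuming the 0.20-era HBase client API (table names, row keys, and the "for-credit" value are illustrative):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RegisterStudent {
      public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();

        // course:<course_id> = type on the student's row
        HTable students = new HTable(conf, "students");
        Put p1 = new Put(Bytes.toBytes("student_42"));
        p1.add(Bytes.toBytes("course"), Bytes.toBytes("cs415"),
               Bytes.toBytes("for-credit"));
        students.put(p1);

        // student:<student_id> = type on the course's row
        HTable courses = new HTable(conf, "courses");
        Put p2 = new Put(Bytes.toBytes("cs415"));
        p2.add(Bytes.toBytes("student"), Bytes.toBytes("student_42"),
               Bytes.toBytes("for-credit"));
        courses.put(p2);
      }
    }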
                 HBase: A real example
  • Search Engine Query Log

  ROW                 COLUMN FAMILIES
                      query:       cookie:             request:
  request_md5_hash    query:text   cookie:id           request:user_agent
                      query:lang                       request:ip_addr
                      ...                              request:timestamp

It is common practice to use a hash of the record as the row
key when no natural primary key exists. This is okay when the
data is accessed sequentially; there is no need to “look up”
individual rows.
          Suggested Exercise
• Write an RDBMS schema to model the User-
  Follower network in Twitter
• Now, write its HBase equivalent
                       Access to HBase
• via Java API
   – Map semantics: Put, Get, Scan, Delete
   – Versioning support
• via the HBase shell
   – Not a real query language; useful for “inspecting” a table
   – Not the preferred way to access HBase; typically the API is used

     $ hbase shell
     ...
     hbase> create 'test', 'data'
     hbase> list
     test
     …
     hbase> put 'test', 'row1', 'data:1', 'value1'
     hbase> put 'test', 'row2', 'data:1', 'value1'
     hbase> put 'test', 'row2', 'data:2', 'value2'
     hbase> put 'test', 'row3', 'data:3', 'value3'
     hbase> scan 'test'
     …
     hbase> disable 'test'
     hbase> drop 'test'
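The same session through the Java API, sketched against the 0.20-era client classes:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseApiExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "test");

        // put 'test', 'row1', 'data:1', 'value1'
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("data"), Bytes.toBytes("1"),
                Bytes.toBytes("value1"));
        table.put(put);

        // read the cell back
        Result row = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
            row.getValue(Bytes.toBytes("data"), Bytes.toBytes("1"))));

        // scan 'test'
        ResultScanner scanner = table.getScanner(new Scan());
        for (Result r : scanner) {
          System.out.println(r);
        }
        scanner.close();
      }
    }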
   A Word on HBase Performance
• Original HBase had performance issues
• HBase 0.20 (the latest release) is much faster
  – Open source development!!!
• Performance analysis by StumbleUpon.com
  – Website uses over 9 billion rows in a single HBase table
  – 1.2 million row reads/sec using just 19 nodes
  – Scalable with more nodes
  – Caching (new feature) further improves
    performance
     Are traditional databases really
                required?
• Will a bank store all its data in HBase or its
  equivalents?
• Unlikely, because Hadoop
  – has no notion of a transaction
  – has no security or access control like databases do
• Fortunately, batch processing large amounts of
  data does not require such guarantees
• Hot research topic: Integrating databases and
  MapReduce – “in-database MapReduce”
                  Summary
• (Traditional) Databases are not Swiss-Army knives
• Large data problems require radically different
  solutions
• Exploit the power of parallel I/O and computation
• MapReduce as a framework for building reliable
  distributed data processing applications
• Storing large data requires redesign from the
  ground up, starting with the filesystem (HDFS)
               Summary (contd)
• HDFS : A reliable open source distributed file
  system
• HBase : A sorted multi-dimensional map for
  record oriented data
   – Not Relational
   – No query language other than map semantics (Get
     and Put)
• Using MapReduce + HBase requires a fair bit of
  programming experience
   – Next class we will study Pig and Hive: a “data analyst
     friendly” interface to processing large data.
          Suggested Reading
• Jeffrey Dean and Sanjay Ghemawat,
  “MapReduce: Simplified Data Processing on
  Large Clusters”
• DeWitt and Stonebraker, “MapReduce: A major
  step backwards”
• Chu-Carroll, “Databases are hammers;
  MapReduce is a screwdriver”
• DeWitt and Stonebraker, “MapReduce II”
         Suggested Reading (contd)
• Hadoop Overview
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
• Who uses Hadoop?
http://wiki.apache.org/hadoop/PoweredBy
• HDFS Architecture
http://hadoop.apache.org/common/docs/current/hdfs_design.html
• Chang et al., “Bigtable: A Distributed Storage
  System for Structured Data”
http://labs.google.com/papers/bigtable.html
• HBase
http://hadoop.apache.org/hbase/

				