The Hadoop Distributed File System:
Architecture and Design
by Dhruba Borthakur
Presented by Bryant Yao
   What is it? It’s a file system!
    ◦ Supports most of the operations a normal file
      system would.
 Open source implementation of GFS
  (Google File System).
 Written in Java
 Designed primarily for GNU/Linux
    ◦ Some support for Windows
Design Goals
 HDFS is designed to store large files (think TB or PB).
 HDFS is designed for computer clusters made up of racks.

              [Diagram: a cluster of two racks, Rack 1 and Rack 2.]
   Write once, read many model
     ◦ Well suited to bulk reads across many files, less so to interactive access to single files.
   Streaming access of data
    ◦ Data is coming to you constantly and not in waves
 Make use of commodity computers
 Expect hardware to fail
 “Moving computation is cheaper than moving data”
Master/Slave Architecture
 1 master, many slaves
 The master manages the file system namespace and regulates access to
  files by clients.

 Data distributed across slaves. The slaves store the data as “blocks”.
 What is a block?
  ◦ A portion of a file.
  ◦ Files are broken down into blocks and stored as a sequence of them (sketched below).
                 [Diagram: File 1 is broken down into blocks A, B, and C.]
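To make this concrete, here is a minimal sketch, with hypothetical names (this is not HDFS source code), of how a file's length maps onto a sequence of fixed-size blocks:

    // Minimal sketch of file-to-block mapping; names are hypothetical,
    // not taken from the HDFS source.
    import java.util.ArrayList;
    import java.util.List;

    class BlockSplitter {
        static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB, the typical HDFS default

        // Returns the size of each block for a file of the given length.
        // Every block is full except possibly the last one.
        static List<Long> blockSizes(long fileLength) {
            List<Long> sizes = new ArrayList<>();
            for (long offset = 0; offset < fileLength; offset += BLOCK_SIZE) {
                sizes.add(Math.min(BLOCK_SIZE, fileLength - offset));
            }
            return sizes;
        }

        public static void main(String[] args) {
            // A 150MB file becomes blocks of 64MB, 64MB, and 22MB (A, B, C).
            System.out.println(blockSizes(150L * 1024 * 1024));
        }
    }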
Task Flow
   Master
   Handles metadata operations
    ◦ Stored in a transaction log called EditLog
   Manages datanodes
    ◦ Passes I/O requests to datanodes
    ◦ Informs the datanode when to perform block operations.
    ◦ Maintains a BlockMap which keeps track of which blocks
      each datanode is responsible for.
   Stores all files’ metadata in memory
    ◦ File attributes, number of replicas, file’s blocks, block
      locations, and checksum of a block.
    Stores a copy of the namespace in the FsImage on local disk
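A rough sketch of the kinds of in-memory structures described above; every type and field here is a hypothetical illustration, not the real namenode code:

    // Rough sketch of the master's in-memory metadata; all types here
    // are hypothetical illustrations, not the real namenode classes.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class NamenodeMetadata {
        // File attributes and the ordered blocks that make up each file.
        static class FileMeta {
            short replicationFactor;
            List<Long> blockIds;   // blocks in file order
        }

        // Namespace: full path -> file metadata (kept entirely in memory).
        Map<String, FileMeta> namespace = new HashMap<>();

        // BlockMap: block id -> the datanodes currently holding a replica.
        Map<Long, Set<String>> blockMap = new HashMap<>();

        // Checksums used to validate blocks read back from datanodes.
        Map<Long, Long> blockChecksums = new HashMap<>();
    }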
 Slave
 Handles data I/O.
 Handles block creation, deletion, and replication as instructed by the master.
 Local storage is optimized so block files are
  spread over multiple directories
    ◦ Storing all files in a single directory is inefficient
      on many local file systems.
Data Replication
 Makes copies of the data!
 Replication factor determines the number
  of copies.
    ◦ Specified per file at creation time and can be changed later; the namenode records it.
   Replication is pipelined!
Pipelining Data Replication
    Blocks are split into portions (4KB).

 [Diagram: a block split into portions A, B, and C flows through datanodes 1, 2, and 3
 in a pipeline. Datanode 1 forwards A to datanode 2 while receiving B; by the time
 datanode 1 receives C, datanode 2 is forwarding A to datanode 3.]
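A simplified sketch of the pipeline step on one datanode, assuming 4KB portions: the node stores each portion locally and forwards it downstream as soon as it arrives, rather than waiting for the whole block. This is illustrative, not the actual HDFS data-transfer protocol:

    // Simplified sketch of pipelined replication; stream handling is
    // illustrative, not the real HDFS data-transfer protocol.
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    class ReplicationPipeline {
        static final int PORTION_SIZE = 4 * 1024; // 4KB portions

        // Receive a block from 'upstream', store it locally, and forward
        // each portion to 'downstream' (the next datanode) as it arrives.
        static void receiveAndForward(InputStream upstream,
                                      OutputStream localDisk,
                                      OutputStream downstream) throws IOException {
            byte[] portion = new byte[PORTION_SIZE];
            int n;
            while ((n = upstream.read(portion)) != -1) {
                localDisk.write(portion, 0, n);       // persist locally
                if (downstream != null) {
                    downstream.write(portion, 0, n);  // forward without waiting for the block
                }
            }
            if (downstream != null) downstream.flush();
            localDisk.flush();
        }
    }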
Replication Policy
 Communication bandwidth between
  computers in the same rack is greater than
  between computers in different racks.
 We could replicate data across
  racks…but this would consume the most
  network bandwidth.
 We could replicate data across all
  computers in a rack…but if the rack dies
  we’re in the same position as before.
Replication Policy cont.
   Assume only three replicas are created.
    ◦ Split the replicas between 2 racks.
    ◦ Rack failure is rare so we’re still able to maintain good data
      reliability while minimizing bandwidth cost.
   Version 0.18.0
    ◦ 2 replicas in current rack (2 different nodes)
    ◦ 1 replica in remote rack
   Version 0.20.3.x
    ◦ 1 replica in current rack
     ◦ 2 replicas in remote rack (2 different nodes)
 What happens if replication factor is 2 or > 3?
    ◦ No answer in this paper.
    ◦ Some other papers state that the minimum is 3.
    ◦ The author wrote a separate paper stating every replica after
      the 3rd is placed randomly.
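A toy sketch of the 0.20-era rule above (1 replica on the writer's node, 2 on one remote rack); all names here are made up, not Hadoop's real BlockPlacementPolicy:

    // Toy sketch of the replica placement rule (1 local + 2 on one remote
    // rack); all names are hypothetical.
    import java.util.ArrayList;
    import java.util.List;

    class PlacementSketch {
        static class Node {
            String name, rack;
            Node(String name, String rack) { this.name = name; this.rack = rack; }
        }

        // Pick targets for 3 replicas: the writer's node, then two
        // distinct nodes on a single other rack.
        static List<Node> chooseTargets(Node writer, List<Node> cluster) {
            List<Node> targets = new ArrayList<>();
            targets.add(writer); // replica 1: current rack (the writer itself)
            for (Node n : cluster) {
                if (targets.size() >= 3) break;
                if (n.rack.equals(writer.rack)) continue;
                // replica 2: first node found on any remote rack;
                // replica 3: another node on that same remote rack.
                if (targets.size() == 1 || n.rack.equals(targets.get(1).rack)) {
                    targets.add(n);
                }
            }
            return targets;
        }
    }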
Reading Data
   Read the data that’s closest to you!
     ◦ If the block/replica of data you want is on the
       datanode/rack/data center you're on, read it from there.
   Read from datanodes directly.
    ◦ Can be done in parallel.
    The namenode is used to generate the list of
     datanodes which host a requested file, as
     well as to provide checksum values for validating
     blocks retrieved from the datanodes.
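In practice the Hadoop Java client handles this exchange for you; a minimal read example (the path is a placeholder):

    // Minimal HDFS read using the standard Hadoop Java client.
    // The path is a placeholder.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml, etc.
            FileSystem fs = FileSystem.get(conf);
            // The client asks the namenode for block locations, then
            // streams each block directly from a nearby datanode.
            try (FSDataInputStream in = fs.open(new Path("/example/input.txt"))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    System.out.write(buf, 0, n);
                }
                System.out.flush();
            }
        }
    }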
Writing Data
   Data is written once
   Split into blocks, typically of size 64MB
    ◦ The larger the block size, the less metadata
      stored by the namenode
   Data is written to a temporary local block
    on the client side and then flushed to a
    datanode, once the block is full.
    ◦ If a file is closed while the temporary block isn’t
      full, the remaining data is flushed to the datanode.
   If the namenode dies during file creation, the
    file is lost!
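A minimal write example using the standard Hadoop Java client; the path, payload, and replication factor are placeholders:

    // Minimal HDFS write using the standard Hadoop Java client.
    // Path, payload, and replication factor are placeholders.
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/example/output.txt");
            // Data is buffered client-side and shipped to the datanode
            // pipeline block by block; close() flushes the final partial block.
            try (FSDataOutputStream out = fs.create(path)) {
                out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
            }
            fs.setReplication(path, (short) 3); // replication factor is per file
        }
    }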
Hardware Failure
       Imagine a file is broken into 3 blocks
       spread over three datanodes.

      [Diagram: datanodes 1, 2, and 3 hold blocks A, B, and C respectively.]

        If the third datanode dies, we have
        no access to block C and cannot
        retrieve the file.

      [Diagram: datanode 3 has failed; block C is no longer reachable.]
Designing for Hardware Failure
 Data replication
 Safemode
    ◦ Heartbeat
    ◦ Block report
 Checkpoints
 Re-replication

      [Diagram: EditLog + FsImage together yield the current file system state.]
 FsImage is a snapshot of the file system
  taken before any subsequent changes occurred.
 EditLog is a log of all the changes to the
  namespace since the namenode's startup.
 Upon startup, the namenode applies the
  EditLog's changes to the FsImage to
  create an up-to-date version of itself.
    ◦ The resulting FsImage is the checkpoint.
   If either the FsImage or EditLog is
    corrupt, the HDFS will not start!
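A schematic sketch of this startup merge; the types and method names are invented for illustration only:

    // Schematic sketch of checkpointing at namenode startup; the types
    // and method names are invented for illustration only.
    import java.util.List;

    class CheckpointSketch {
        interface NamespaceImage {                // stands in for FsImage contents
            void apply(String editLogRecord);     // replay one logged change
            void saveAsNewFsImage();              // persist the merged result
        }

        // On startup: load the last FsImage, replay every EditLog record
        // on top of it, save the result as the new checkpoint, and start
        // with an empty EditLog.
        static void startupCheckpoint(NamespaceImage image, List<String> editLog) {
            for (String record : editLog) {
                image.apply(record);
            }
            image.saveAsNewFsImage(); // the merged image is the checkpoint
            editLog.clear();          // changes are now folded into the FsImage
        }
    }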
Heartbeat and Blockreport
   A heartbeat is a message sent from the
    datanode to the namenode.
    ◦ Periodically sent to the namenode, letting the
      namenode know it’s “alive.”
    ◦ If it’s dead, assume you can’t use it.
   Blockreport
    ◦ A list of blocks the datanode is handling.
   Upon startup, the namenode enters
    “safemode” to check the health status of the
    cluster. Only done once.
   Heartbeat is used to ensure all datanodes
    are available to use.
   Blockreport is used to check data integrity.
     ◦ If the number of replicas found differs from
       the number expected, the block is under- or
       over-replicated and must be corrected.
       Example: expected replicas A A A, but only A A found.
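A sketch of heartbeat-based liveness tracking on the namenode side; the timeout value and all names are assumptions, not HDFS defaults:

    // Sketch of heartbeat-based liveness tracking on the namenode side;
    // the timeout value and names are illustrative assumptions.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class HeartbeatMonitor {
        static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed timeout

        private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

        // Called whenever a datanode's periodic heartbeat arrives.
        void onHeartbeat(String datanodeId) {
            lastHeartbeat.put(datanodeId, System.currentTimeMillis());
        }

        // A datanode that has not reported recently is presumed dead; its
        // blocks become candidates for re-replication elsewhere.
        boolean isAlive(String datanodeId) {
            Long last = lastHeartbeat.get(datanodeId);
            return last != null && System.currentTimeMillis() - last < TIMEOUT_MS;
        }
    }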
 Can view the file system through the FS Shell or
  a web interface
 Communicates through TCP/IP
 File deletes are a move operation to a
  “trash” folder which auto-deletes files
  after a specified time (default is 6 hours).
 Rebalancer moves data away from datanodes
  which are close to filling up their
  local storage.
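A rough sketch of the rebalancer's selection rule; the 10% threshold and all names are assumptions, not the real Balancer implementation:

    // Rough sketch of the rebalancer's selection rule; the threshold and
    // all names are assumptions, not the real Balancer implementation.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class RebalancerSketch {
        static final double THRESHOLD = 0.10; // 10% band around the cluster average

        // Datanodes whose utilization is well above the cluster average
        // are sources; blocks move from them toward under-utilized nodes.
        static List<String> overUtilized(Map<String, Double> utilization) {
            double avg = utilization.values().stream()
                    .mapToDouble(Double::doubleValue).average().orElse(0.0);
            List<String> sources = new ArrayList<>();
            for (Map.Entry<String, Double> e : utilization.entrySet()) {
                if (e.getValue() > avg + THRESHOLD) {
                    sources.add(e.getKey());
                }
            }
            return sources;
        }
    }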
Relation with Search Engines
   Originally built for Nutch.
    ◦ Intended to be the backbone for a search engine.
   HDFS is the file system used by Hadoop.
     ◦ Hadoop also contains a MapReduce framework which has
       many applications, like indexing the web!
   Analyzing large amounts of data.
   Used by many, many companies
     ◦ Google, Yahoo!, Facebook, etc.
   It can store the web!
     ◦ Just kidding.
 The goal of this paper is to describe the
  system, not analyze it. It gives a great
  beginning overview.
 Probably could’ve been
  condensed/organized better.
 Some information is missing
    ◦ SecondaryNameNode
    ◦ CheckpointNode
    ◦ Etc.
Pros/Cons of HDFS
In and Beyond the Paper
   Pros
    ◦   It accomplishes everything it set out to do.
    ◦   Horizontally scalable – just add a new datanode!
    ◦   Cheap cheap cheap to build.
    ◦   Good for reading and storing large amounts of data.
   Cons
    ◦ Security
    ◦ No redundancy of namenode
         Single point of failure
    ◦ The namenode is not scalable
    ◦ Doesn’t handle small files well
    ◦ Still in development, many features missing
Thank you for listening!
