					What is Hadoop (Storage perspective)?

Hadoop is a Java framework (software platform) for storing and processing vast amounts
of data. It can be set up on commonly available computers.

Use Case

It can be used when the following requirements arise:

    Store terabytes of data: HDFS uses commonly available computers and storage
     devices and pools the storage space on all the systems into one large store.
    Streaming access to data: HDFS is designed for batch processing rather than
     interactive use. The emphasis is on high throughput of data access rather
     than low latency of data access.
    Large data sets: File sizes are typically in gigabytes. HDFS is tuned to support large
     files. It should provide high aggregate data bandwidth and scale to hundreds of
     nodes in a single cluster.
    WORM requirement: HDFS applications need a write-once-read-many access
     model for files. A file once created, written, and closed need not be changed. This
     assumption simplifies data coherency issues and enables high throughput data
     access.
    High availability: HDFS stores multiple copies of data on various systems in
     the cluster. This ensures availability of data even if some systems go down.


Hadoop is based on a master-slave architecture. An HDFS cluster consists of a single
Namenode (master server) that manages the file system namespace and regulates access
to files by clients. In addition, there are a number of Datanodes (Slaves), usually one per
node in the cluster, which manage storage attached to the nodes that they run on. HDFS
exposes a file system namespace and allows user data to be stored in files. Internally, a
file is split into one or more blocks and these blocks are stored in a set of Datanodes. The
Namenode executes file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping of blocks to Datanodes.
The Datanodes are responsible for serving read and write requests from the file system’s
clients. The Datanodes also perform block creation, deletion, and replication upon
instruction from the Namenode. The Namenode is the arbitrator and repository for all
HDFS metadata. The system is designed in such a way that user data never flows through
the Namenode.
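
The split of a file into blocks and the block-to-Datanode mapping kept by the Namenode can be pictured with a small sketch. This is only an illustration: the function names are made up, the 64 MB block size is the one quoted in this document, and the round-robin placement is an assumption chosen for simplicity, not the real HDFS placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted in this document


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the lengths of the blocks a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def map_blocks_to_datanodes(num_blocks, datanodes, replication):
    """Toy placement (an assumption, NOT the real HDFS policy): rotate through
    the Datanodes so that each block gets `replication` distinct holders."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of Datanodes")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(150 * 1024 * 1024)  # a 150 MB file
    print(len(blocks), "blocks")
    print(map_blocks_to_datanodes(len(blocks), ["DataNode-1", "DataNode-2"], 2))
```

Only this kind of mapping (metadata) lives on the Namenode; the block contents themselves go directly between clients and Datanodes.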


[Figure: architecture diagram showing the DataNodes]

Key Features

    File System Namespace: HDFS supports hierarchical file organization. It
     supports operations such as creating, removing, moving, and renaming files as well as
     directories. It does not implement permissions or quotas.
    Replication: HDFS stores files as a series of blocks. Blocks are replicated for fault
     tolerance. The replication factor and block size are configurable. Files in HDFS
     are write-once and have strictly one writer at any time. Replica placement
     is critical for performance. In large clusters, nodes are spread across
     racks, and the racks are connected via switches. Network bandwidth between
     nodes in the same rack is observed to be much higher than between nodes in
     different racks, so keeping replica traffic within a rack where possible saves
     inter-rack bandwidth. To minimize global bandwidth consumption and
     read latency, HDFS tries to satisfy a read request from the replica that is closest to
     the reader. If there is a replica on the same rack as the reader node, then that
     replica is preferred to satisfy the read request.
    File System Metadata: The Namenode uses the “EditLog” to record every change to file
     system metadata. The entire file system namespace is stored in a file called
     the FsImage.
    Robustness: Network or disk failure and data integrity.
     Datanodes send heartbeat messages to the Namenode; when the Namenode stops
     receiving them, the Datanode is marked dead. This may cause the replication factor of
     some blocks to fall below its target. The Namenode constantly monitors the replication
     count of each block and re-replicates a block when its count falls. This may
     happen because a replica is corrupted, a Datanode is dead, or the replication
     factor of a particular file has been increased. HDFS also rebalances data when free space
     on a node falls below a threshold value. A checksum is stored for each block
     and verified when the block is retrieved.
    NameNode failure: The FsImage and the EditLog are central data structures of
     HDFS. A corruption of these files can cause the HDFS instance to be non-
     functional. For this reason, the Namenode can be configured to support
     maintaining multiple copies of the FsImage and EditLog. Any update to either the
     FsImage or EditLog causes each of the FsImages and EditLogs to get updated
     synchronously. This synchronous updating of multiple copies of the FsImage and
     EditLog may degrade the rate of namespace transactions per second that a
     Namenode can support. However, this degradation is acceptable because even
     though HDFS applications are very data intensive in nature, they are not
     metadata intensive. When a Namenode restarts, it selects the latest consistent
     FsImage and EditLog to use. The Namenode machine is a single point of failure
     for an HDFS cluster. If the Namenode machine fails, manual intervention is
     necessary. Currently, automatic restart and failover of the Namenode software to
     another machine is not supported.
    Data organization: The default HDFS block size is 64 MB, and files follow write-once
     read-many semantics. The HDFS client does local caching: it first buffers file data
     in a temporary local file. Suppose the HDFS file has
     a replication factor of three. When the local file accumulates a full block of user
     data, the client retrieves a list of Datanodes from the Namenode. This list contains
     the Datanodes that will host a replica of that block. The client then flushes the
     data block to the first Datanode. The first Datanode starts receiving the data in
     small portions (4 KB), writes each portion to its local repository, and transfers that
     portion to the second Datanode in the list. The second Datanode, in turn, starts
     receiving each portion of the data block, writes that portion to its repository, and
     then flushes that portion to the third Datanode. Finally, the third Datanode writes
     the data to its local repository. Thus, a Datanode can be receiving data from the
     previous one in the pipeline and at the same time forwarding data to the next one
     in the pipeline; the data is pipelined from one Datanode to the next. When
     data is deleted it is not removed immediately; rather, it is moved to /trash,
     from where it can be either restored or permanently removed. How long data is
     kept in trash is configurable. The default value is 6 hours.
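
The write pipeline described under "Data organization" can be mimicked in a few lines. This is a minimal single-process sketch under stated assumptions: the Datanodes are plain in-memory buffers, the names and the function are hypothetical, and nothing here talks to a real cluster. It only shows that each 4 KB portion is written locally by a node and then handed to the next node in the list.

```python
PORTION_SIZE = 4 * 1024  # 4 KB portions, as described in the text


def pipeline_write(block, datanodes, portion_size=PORTION_SIZE):
    """Send `block` (bytes) through the chain of Datanodes, portion by portion.

    Returns a dict mapping each Datanode name to the bytes it stored.
    """
    stores = {name: bytearray() for name in datanodes}
    for offset in range(0, len(block), portion_size):
        portion = block[offset:offset + portion_size]
        # The portion travels down the chain: every node writes it to its
        # local store and then forwards the same portion to the next node.
        for name in datanodes:
            stores[name].extend(portion)
    return {name: bytes(data) for name, data in stores.items()}


if __name__ == "__main__":
    data = bytes(range(256)) * 40  # 10 KB of sample data
    replicas = pipeline_write(data, ["DataNode-1", "DataNode-2", "DataNode-3"])
    print({name: len(stored) for name, stored in replicas.items()})
```

In a real cluster the forwarding overlaps in time across nodes; the loop above serializes it, which keeps the sketch simple but hides the concurrency.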

How to access HDFS?

    DFSshell: From the shell, a user can create, remove, and rename directories as well as
     files. This is intended for applications that use scripting languages to interact with
     the stored data.
    Browser Interface: HDFS installation configures the web server to expose the
     HDFS namespace through a configurable TCP port. This allows a user to navigate
     the HDFS namespace and view the contents of its files using a web browser.
    For administration purposes, the DFSadmin command is also provided.

Setting up Hadoop

We used three Linux boxes running CentOS to set up a Hadoop cluster. The details are as follows:

Property           NameNode            DataNode-1            DataNode-2

Storage Space      -                   35 GB                 9 GB
Hostname           NameNode            DataNode-1            DataNode-2

[Figure: cluster layout with the name node (NameNode) between the two data nodes (DataNode-1 and DataNode-2)]

Step By Step Approach

                                     Multi Node Cluster
Step-1: (steps 1 to 5 need to be done on all nodes)

Set the host names of the three systems as indicated above. Add entries to the /etc/hosts
file as follows:
localhost localhost.localdomain localhost DataNode-1 DataNode-2 NameNode DataNode-3

Then run the following command on each of the three systems: "hostname XXX", and
reboot them (XXX corresponds to the hostname of each system).


Step-2:

Add a dedicated system user named hadoop.

[root@NameNode]# groupadd hadoop
[root@NameNode]# useradd -g hadoop hadoop
[root@NameNode]# passwd hadoop

Step-3:

Install the JDK (jdk-1_5_0_14-linux-i586.rpm) and Hadoop (hadoop-0.14.4.tar.gz) as user
hadoop in /home/hadoop.


Step-4:

Set up the Linux systems in such a way that any system can ssh to any other system
without a password. Copy the public keys of every system in the cluster (including its own) into
each system's authorized_keys file.


Step-5:

Set the JAVA_HOME variable in <hadoop install dir>/conf/ to the correct path.
In our case it was "export JAVA_HOME=/usr/java/jdk1.5.0_14/"

Step-6: (on NameNode)

          Add the following entry to the <HADOOP_INSTALL>/conf/masters file

          Add the following entry to the <HADOOP_INSTALL>/conf/slaves file
The conf/slaves file on the master is used only by scripts like bin/ or
bin/ for starting data nodes.

Step-7: (on data nodes)

Create a directory named hadoop-datastore (or any name of your choice) where Hadoop
stores all its data. The path of this directory must be given in the hadoop-site.xml
file as the value of the hadoop.tmp.dir property.

Step- 8:

Edit the conf/hadoop-site.xml file. On the NameNode the file looks as follows

         <description>The name of the file system: a URI whose scheme
         determines the file system implementation. The URI's
         scheme determines the config property (fs.SCHEME.impl)
         naming the FS implementation class. The URI's
         authority is used to determine the host, port, etc. for
         the file system.</description>
         <description>The host and port that the MapReduce job tracker
         runs at. If "local", then jobs are run in-process as a single map
         and reduce task.</description>
         <description>Number of replicas made when a file is created.</description>

On the data nodes one extra property is added,

         <description>Base for Hadoop temp directories.</description>

Step- 9:

Format Hadoop's distributed filesystem (HDFS) for the namenode. You need to do this
the first time you set up a Hadoop cluster. Do not format a running Hadoop namenode;
this will erase all your data in the HDFS filesystem. The command is

<HADOOP_INSTALL>/bin/hadoop namenode -format

The HDFS name table is stored on the namenode's (here: master) local filesystem in the
directory specified by The name table is used by the namenode to store
tracking and coordination information for the datanodes.

Run the command <HADOOP_INSTALL>/bin/ on the machine you want the
namenode to run on. This will bring up HDFS with the namenode running on the
machine you ran the previous command on, and datanodes on the machines listed in the
conf/slaves file.
Run the command <HADOOP_INSTALL>/bin/ on the namenode machine to
stop the cluster.


Hadoop comes with several web interfaces which are by default (see conf/hadoop-
default.xml) available at these locations:

       http://NameNode:50070/ - web UI for HDFS name node(s)

These web interfaces provide concise information about what's happening in your
Hadoop cluster. If you browse from a Windows system, you may have to update its hosts
file so that the host names resolve to their IP addresses.

From the NameNode you can perform management as well as file operations via DFSshell.

The command <HADOOP_INSTALL>/bin/hadoop dfs -help lists the operations
permitted by DFS. The command <HADOOP_INSTALL>/bin/hadoop dfsadmin -help
lists the administration operations supported.


To add a new data node on the fly, follow the above steps on the new node and execute
the following command on the new node to join the cluster.

bin/ --config <config_path> start datanode


To set up a client machine, install Hadoop on it and set the JAVA_HOME
variable in To copy data to HDFS from the client, use the -fs switch of dfs with
the URI of the namenode

bin/hadoop dfs -fs hdfs:// -mkdir remotecopy

bin/hadoop dfs -fs hdfs:// -copyFromLocal /home/devendra/jdk-1_5_0_14-linux-i586-rpm.bin remotecopy

1. I/O handling.

See appendix for some test scripts and the Log analysis.

2. Fault Tolerance

    Observations on a two data-node cluster with replication factor 2:
             The data was accessible even when one of the data nodes was down.
    Observations on a three data-node cluster with replication factor 2:
             The data was accessible when one of the data nodes was down.
             Some data was accessible when two nodes were down.
    Overflow condition: With a two data-node setup and nodes having free space of
     20 GB and 1 GB, we tried to copy 10 GB of data. The copy operation succeeded
     without any errors. A warning in the log messages indicated that only one
     copy was made. (Presumably, if one more data node were connected on the fly, the data
     would be replicated onto the new system; this still has to be tried out to be sure.)

    Accidental data loss: Even if data blocks are removed from one of the data nodes,
     they are synchronized again (this was observed).
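
Why only some data stays readable when two of three data nodes are down can be seen from a toy model. The sketch below is an illustration under assumptions: the rotating placement and both function names are invented for the example, and real HDFS placement differs. A block is readable as long as at least one node holding a replica is still up.

```python
def place_replicas(num_blocks, nodes, replication):
    """Toy placement (assumed, not the HDFS policy): block b is stored on
    `replication` consecutive nodes, starting at node b modulo the node count."""
    return {
        block: {nodes[(block + r) % len(nodes)] for r in range(replication)}
        for block in range(num_blocks)
    }


def readable_blocks(placement, down_nodes):
    """Blocks that still have at least one replica on a live node."""
    down = set(down_nodes)
    return [block for block, holders in placement.items() if holders - down]


if __name__ == "__main__":
    nodes = ["DataNode-1", "DataNode-2", "DataNode-3"]
    placement = place_replicas(6, nodes, replication=2)
    print("one node down: ", readable_blocks(placement, ["DataNode-1"]))
    print("two nodes down:", readable_blocks(placement, ["DataNode-1", "DataNode-2"]))
```

With replication factor 2, any single failure leaves every block readable, while a double failure loses exactly the blocks whose two replicas both sat on the failed pair.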
                       SCRIPT -1
The script copies 1 GB of data to the HDFS and back to the
local system indefinitely. The md5 checksum matches after
stopping the script. The script is executed from namenode.
#!/bin/sh
i=0
echo "[`date +%X`] :: start script" >>log
echo "size of movies directory is 1 GB" >>log
echo "[`date +%X`] :: creating a directory Movies" >>log
/home/hadoop/hadoop-0.14.4/bin/hadoop dfs -mkdir Movies
if [ $? -eq 0 ]
then
    echo "[`date +%X`] :: mkdir sucessful" >>log
fi
# copy the data in, delete the local copy, copy it back, delete it from HDFS
while [ 1 = 1 ]
do
    echo "------------------LOOP $i ------------------------" >>log
    echo "[`date +%X`] :: coping data into the directory" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: copy sucessful" >>log
    fi
    echo "[`date +%X`] :: removing copy of file " >>log
    rm -rf /home/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: remove sucessful" >>log
    fi
    echo "[`date +%X`] :: copying back to local system" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: move back sucessful" >>log
    fi
    echo "[`date +%X`] :: removing the file from hadoop" >>log
    /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/Movies
    if [ $? -eq 0 ]
    then
        echo "[`date +%X`] :: move back sucessful" >>log
    fi
    i=`expr $i + 1`
done


[03:48:52 PM] :: start script
size of movies directory is 1GB
[03:48:52 PM] :: creating a directory Movies
[03:48:54 PM] :: mkdir sucessful
------------------LOOP 0 ------------------------
[03:48:54 PM] :: coping data into the directory
[03:51:15 PM] :: copy sucessful
[03:51:15 PM] :: removing copy of file
[03:51:16 PM] :: remove sucessful
[03:51:16 PM] :: copying back to local system
[03:52:58 PM] :: move back sucessful
[03:52:58 PM] :: removing the file from hadoop
[03:53:01 PM] :: move back sucessful
------------------LOOP 1 ------------------------
[03:53:01 PM] :: coping data into the directory
[03:55:23 PM] :: copy sucessful
[03:55:23 PM] :: removing copy of file
[03:55:24 PM] :: remove sucessful
[03:55:24 PM] :: copying back to local system
[03:57:03 PM] :: move back sucessful
[03:57:03 PM] :: removing the file from hadoop
[03:57:06 PM] :: move back sucessful
------------------LOOP 2 ------------------------
[03:57:06 PM] :: coping data into the directory
[03:59:26 PM] :: copy successful

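Copying 1 GB of data from the local file system to Hadoop over a 100 Mbps LAN
took 140 seconds on average (the interval between "coping data into the
directory" and "copy sucessful" in the log). This works out to a speed of about 58 Mbps.
Copying 1 GB of data from Hadoop back to the local file system took 100 seconds on
average (the interval between "copying back to local system" and "move back
sucessful"). This works out to a speed of about 80 Mbps.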
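
The throughput figures can be re-derived from the log timestamps. The following is a small sketch, assuming 1 GB is counted as 1024 MB and that both timestamps fall on the same day; the helper names are made up for the example.

```python
from datetime import datetime

TIME_FORMAT = "%I:%M:%S %p"  # matches log stamps such as "03:48:54 PM"


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps on the same day."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


def throughput_mbps(megabytes, seconds):
    """Average throughput in megabits per second."""
    return megabytes * 8 / seconds


if __name__ == "__main__":
    # Timestamps taken from LOOP 0 of the log above.
    write_s = seconds_between("03:48:54 PM", "03:51:15 PM")
    read_s = seconds_between("03:51:16 PM", "03:52:58 PM")
    print("write: %d s, %.0f Mbps" % (write_s, throughput_mbps(1024, write_s)))
    print("read:  %d s, %.0f Mbps" % (read_s, throughput_mbps(1024, read_s)))
```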

Multithreaded script
Two threads are spawned, each of which copies data from the local system into Hadoop and
back from Hadoop to the local system in an endless loop. Logs are captured to analyze the I/O
performance. The script was run for 48 hours and 850 loops were executed.
#!/bin/sh
thread1() {
    i=0
    echo "thread1:[`date +%X`] :: start thread1" >>log
    echo "thread1:size of thread1 directory 640MB" >>log
    while [ 1 = 1 ]
    do
        echo "thread1:------------------LOOP $i ------------------------" >>log
        echo "thread1:[`date +%X`] :: coping data into the directory" >>log
        /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/thread1
        if [ $? -eq 0 ]
        then
            echo "thread1:[`date +%X`] :: copy sucessful" >>log
        fi
        echo "thread1:[`date +%X`] :: removing copy of file " >>log
        rm -rf /home/hadoop/thread1
        if [ $? -eq 0 ]
        then
            echo "thread1:[`date +%X`] :: remove sucessful" >>log
        fi
        echo "thread1:[`date +%X`] :: copying back to local system" >>log
        /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/thread1 /home/hadoop/
        if [ $? -eq 0 ]
        then
            echo "thread1:[`date +%X`] :: move back sucessful" >>log
        fi
        echo "thread1:[`date +%X`] :: removing the file from hadoop" >>log
        /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/thread1
        if [ $? -eq 0 ]
        then
            echo "thread1:[`date +%X`] :: deletion sucessful" >>log
        fi
        i=`expr $i + 1`
    done
}
thread2() {
    j=0
    echo "thread2:[`date +%X`] :: start thread2" >>log
    echo "thread2:size of thread2 directory 640MB" >>log
    while [ 1 = 1 ]
    do
        echo "thread2:------------------LOOP $j ------------------------" >>log
        echo "thread2:[`date +%X`] :: coping data into the directory" >>log
        /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyFromLocal /home/hadoop/thread2
        if [ $? -eq 0 ]
        then
            echo "thread2:[`date +%X`] :: copy sucessful" >>log
        fi
        echo "thread2:[`date +%X`] :: removing copy of file " >>log
        rm -rf /home/hadoop/thread2
        if [ $? -eq 0 ]
        then
            echo "thread2:[`date +%X`] :: remove sucessful" >>log
        fi
        echo "thread2:[`date +%X`] :: copying back to local system" >>log
        /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -copyToLocal /user/hadoop/thread2 /home/hadoop/
        if [ $? -eq 0 ]
        then
            echo "thread2:[`date +%X`] :: move back sucessful" >>log
        fi
        echo "thread2:[`date +%X`] :: removing the file from hadoop" >>log
        /home/hadoop/hadoop-0.14.4/bin/hadoop dfs -rmr /user/hadoop/thread2
        if [ $? -eq 0 ]
        then
            echo "thread2:[`date +%X`] :: deletion sucessful" >>log
        fi
        j=`expr $j + 1`
    done
}
# The launch mechanism is not shown in the source; running the two loops
# as background jobs is assumed here.
thread1 &
thread2 &
wait
Sample log output (each message is prefixed with the thread that wrote it):
thread1:[05:15:00 PM] :: start thread1
thread2:[05:15:00 PM] :: start thread2
thread1:size of thread1 directory 640 MB
thread2:size of thread2 directory 640MB
thread1:------------------LOOP 0 ------------------------
thread2:------------------LOOP 0 ------------------------
thread1:[05:15:00 PM] :: coping data into the directory
thread2:[05:15:00 PM] :: coping data into the directory
thread1:[05:17:19 PM] :: copy sucessful
thread1:[05:17:19 PM] :: removing copy of file
thread1:[05:17:20 PM] :: remove sucessful
thread1:[05:17:20 PM] :: copying back to local system
thread2:[05:17:32 PM] :: copy sucessful
thread2:[05:17:32 PM] :: removing copy of file
thread2:[05:17:33 PM] :: remove sucessful
thread2:[05:17:33 PM] :: copying back to local system
thread2:[05:19:23 PM] :: move back sucessful
thread2:[05:19:23 PM] :: removing the file from hadoop
thread1:[05:19:26 PM] :: move back sucessful
thread1:[05:19:26 PM] :: removing the file from hadoop
thread2:[05:19:28 PM] :: deletion sucessful
thread1:[05:19:29 PM] :: deletion sucessful
thread1:------------------LOOP 1 ------------------------
thread1:[05:19:29 PM] :: coping data into the directory
thread2:------------------LOOP 1 ------------------------
thread2:[05:19:29 PM] :: coping data into the directory
thread1:[05:21:43 PM] :: copy sucessful
thread1:[05:21:43 PM] :: removing copy of file
thread1:[05:21:44 PM] :: remove sucessful
thread1:[05:21:44 PM] :: copying back to local system
thread2:[05:21:48 PM] :: copy sucessful
thread2:[05:21:48 PM] :: removing copy of file
thread2:[05:21:49 PM] :: remove sucessful
thread2:[05:21:49 PM] :: copying back to local system
thread1:[05:23:44 PM] :: move back sucessful
thread1:[05:23:44 PM] :: removing the file from hadoop
thread1:[05:23:49 PM] :: deletion sucessful
thread1:------------------LOOP 2 ------------------------
thread1:[05:23:49 PM] :: coping data into the directory
thread2:[05:23:49 PM] :: move back sucessful

Timings read from the log: in loop 0, thread1 took 139 seconds and thread2 took 152 seconds
to write, and the read back took 110 seconds for thread2 and about 125 seconds for thread1.
In loop 1, the read back took 120 seconds.

NOTE: Copying started at the same time for thread1 and thread2 and also finished at about the same time.
That means that 1.28 GB of data was transferred in 152 seconds, an average throughput of about 70 Mbps.
Does this mean that the more systems there are in the cluster, the higher the speed (though there
would definitely be a saturation point beyond which the speed comes down)?


MapReduce is a programming model for processing and generating large data sets. Users
specify a map function that processes a key/value pair to generate a set of intermediate
key/value pairs, and a reduce function that merges all intermediate values associated with
the same intermediate key. Many real-world tasks are expressible in this model.
Programs written in this functional style are automatically parallelized and executed on a
large cluster of commodity machines. The run-time system(hadoop) takes care of the
details of partitioning the input data, scheduling the program's execution across a set of
machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any experience with parallel and
distributed systems to easily utilize the resources of a large distributed system.
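
The model can be tried out on one machine before involving a cluster. Below is a minimal in-memory sketch, not Hadoop's implementation: the function names are chosen for illustration, and the grouping step stands in for the partitioning and shuffling that the run-time system performs across machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit an intermediate (word, 1) pair for every word in the text."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every record, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


if __name__ == "__main__":
    records = [("a.txt", "foo bar foo"), ("b.txt", "bar quux")]
    print(run_mapreduce(records, map_fn, reduce_fn))
```

Because map calls are independent of one another and each reduce call sees only one key, both phases can be spread over many machines without changing the two user-written functions.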

For implementation details of map-reduce in hadoop follow the link below

Follow the link below for clear understanding on MapReduce
Sample MapReduce implementation

The program mimics the wordcount example, i.e. it reads text files and counts how
often words occur. The input is text files and the output is text files, each line of which
contains a word and the count of how often it occurred, separated by a tab. The "trick"
behind the following Python code is that we use Hadoop Streaming to pass
data between our Map and Reduce code via STDIN (standard input) and
STDOUT (standard output). We simply use Python's sys.stdin to read input data
and print our own output to sys.stdout.
Save the files in and respectively (requires Python 2.4 or greater)
in /home/hadoop and give them executable permissions. The
MapReduce daemons must be started before submitting jobs: "bin/"

It will read data from STDIN (standard input), split it into words and output a list of lines
mapping words to their (intermediate) counts to STDOUT (standard output)

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words; split() without arguments
    # never returns empty strings
    words = line.split()
    # emit one record per word
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for the reducer script;
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

It will read the results of the map step from STDIN (standard input), sum the
occurrences of each word to a final count, and output its results to STDOUT (standard
output).

#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from the map step
        word, count = line.split()
        # convert count (currently a string) to int
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # the line was malformed or count was not a number,
        # so silently ignore/discard this line
        pass

# sort the words lexicographically;
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))

Test the code as follows

[hadoop@NameNode ~]$echo "foo foo quux labs foo bar quux" |
/home/hadoop/ | /home/hadoop/

bar        1
foo        3
labs       1
quux       2

Implementation on hadoop

Copy some large plain text files (typically gigabytes in size) into a local directory, say
test. Copy the data into HDFS

[hadoop@NameNode ~]$ hadoop dfs -copyFromLocal /path/to/test test

Run the mapreduce job
[hadoop@NameNode ~]$ bin/hadoop jar contrib/hadoop-streaming.jar -mapper /home/hadoop/ -reducer /home/hadoop/ -input test/* -output mapreduce-output
The results can be viewed at http://localhost:50030/ or one can copy the output to the local
system with "hadoop dfs -copyToLocal mapreduce-output"

Inverted Index: Example of a mapreduce job

Suppose there are three documents with some text content and we have to compute the
inverted index using map-reduce.

Doc-1                    Doc-2                  Doc-3
Hello                    Hello                  World
World                    India                   is
welcome                                          welcoming
to                                                India

                                    Map Phase

<Hello,Doc-1>             <Hello,Doc-2>           <World,Doc-3>
<World,Doc-1>              <India,Doc-2>          <is,Doc-3>
<welcome,Doc-1>                                   <welcoming,Doc-3>
<to,Doc-1>                                        <India,Doc-3>

                                    Reduce Phase

                        <Hello,[ Doc-1,Doc-2 ] >
                        <World,[Doc-1,Doc-3 ] >
                        <welcome,[ Doc-1,Doc-3 ] >
                        <India, [Doc-2,Doc-3] >

Words such as "to" and "is" are considered noise and should be filtered out appropriately.
The "welcome" entry assumes that "welcoming" in Doc-3 is reduced to its stem.
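
The same job can be sketched in the streaming style used for the word count, with sorted intermediate pairs grouped by key. This is an illustrative sketch only: the stop-word list and function names are assumptions, and no stemming is applied, so "welcome" and "welcoming" stay separate entries here.

```python
from itertools import groupby

STOP_WORDS = {"to", "is"}  # noise words, filtered out in the map phase


def map_phase(doc_id, text):
    """Emit a (word, doc_id) pair for every non-noise word in the document."""
    for word in text.split():
        if word.lower() not in STOP_WORDS:
            yield word, doc_id


def reduce_phase(pairs):
    """Group the sorted (word, doc_id) pairs by word, as the shuffle would,
    and collect the documents each word occurs in."""
    index = {}
    for word, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        index[word] = sorted({doc_id for _, doc_id in group})
    return index


def build_index(docs):
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    return reduce_phase(pairs)


if __name__ == "__main__":
    docs = {
        "Doc-1": "Hello World welcome to",
        "Doc-2": "Hello India",
        "Doc-3": "World is welcoming India",
    }
    for word, doc_ids in build_index(docs).items():
        print(word, doc_ids)
```

Sorting before grouping mirrors what happens between the map and reduce steps in a streaming job, where the reducer receives its input ordered by key.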
