From Wikipedia, the free encyclopedia

Apache Hadoop

Developer(s)        Apache Software Foundation
Stable release      0.20.204 / September 11, 2011 (2011-09-11)
Preview release     0.21.0 / August 23, 2010 (2010-08-23)
Development status  Active
Written in          Java
Operating system    Cross-platform
Type                Distributed File System
License             Apache License 2.0
Website             hadoop.apache.org

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license.[1] It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

Hadoop is a top-level Apache project being built and used by a global community of contributors,[2] written in the Java programming language. Yahoo! has been the largest contributor[3] to the project, and uses Hadoop extensively across its businesses.[4]

Hadoop was created by Doug Cutting,[5] who named it after his son’s toy elephant.[6] It was originally developed to support distribution for the Nutch search engine project.[7]

Architecture

Hadoop consists of the Hadoop Common, which provides access to the filesystems supported by Hadoop. The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop Community.

For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is, and, failing that, on the same rack/switch, so reducing backbone traffic. The Hadoop Distributed File System (HDFS) uses this when replicating data, to try to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure so that even if these events occur, the data may still be readable.[8]

[Figure: A multi-node Hadoop cluster]

A small Hadoop cluster will include a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes, and compute-only worker nodes; these are normally only used in non-standard applications. Hadoop requires JRE 1.6 or higher. The standard startup and shutdown scripts require ssh to be set up between nodes in the cluster.

In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the filesystem index, and a secondary NameNode that can generate snapshots of the namenode’s memory structures, thus preventing filesystem corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate filesystem, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the filesystem-specific equivalent.

Filesystems

Hadoop Distributed File System

The HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. Each
node in a Hadoop instance typically has a single datanode; a cluster of datanodes forms the HDFS cluster. This is typical rather than required, since a node does not need to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate with each other. The HDFS stores large files (an ideal file size is a multiple of 64 MB[9]) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. The HDFS was designed to handle very large files.[9]

The HDFS does not provide high availability, because an HDFS filesystem instance requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster.[10] The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the Primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the Primary Namenode and builds snapshots of the Primary Namenode’s directory information, which are then saved to local/remote directories. These checkpointed images can be used to restart a failed Primary Namenode without having to replay the entire journal of filesystem actions, then edit the log to create an up-to-date directory structure. Since the Namenode is the single point for storage and management of metadata, it can be a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation is a new addition which aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate Namenodes.

An advantage of using the HDFS is data awareness between the jobtracker and tasktracker. The jobtracker schedules map/reduce jobs to tasktrackers with an awareness of the data location. For example, if node A contains data (x,y,z) and node B contains data (a,b,c), the jobtracker schedules node B to perform map/reduce tasks on (a,b,c) and node A to perform map/reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other filesystems this advantage is not always available. This can have a significant impact on job completion times, which has been demonstrated when running data intensive jobs.[11]

Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems.

File access can be achieved through the native Java API, the Thrift API to generate a client in the language of the users’ choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, or browsed through the HDFS-UI webapp over HTTP.

Other Filesystems

By May 2011, the list of supported filesystems included:
• HDFS: Hadoop’s own rack-aware filesystem.[12] This is designed to scale to tens of petabytes of storage and runs on top of the filesystems of the underlying operating systems.
• Amazon S3 filesystem. This is targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. There is no rack-awareness in this filesystem, as it is all remote.
• CloudStore (previously Kosmos Distributed File System), which is rack-aware.
• FTP Filesystem: this stores all its data on remotely accessible FTP servers.
• Read-only HTTP and HTTPS file systems.

Hadoop can work directly with any distributed file system that can be mounted by the underlying operating system simply by using a file:// URL; however, this comes at a price: the loss of locality. To reduce network traffic, Hadoop needs to know which servers are closest to the data; this is information which Hadoop-specific filesystem bridges can provide.

Out-of-the-box, this includes Amazon S3, and the CloudStore filestore, through s3:// and kfs:// URLs directly.

A number of third party filesystem bridges have also been written, none of which are currently in Hadoop distributions. These may offer superior availability or scalability, and possibly a more general-purpose filesystem than HDFS, which is biased towards large files and only offers a subset of the expected semantics of a POSIX filesystem: no locking, or writing to anywhere other than the tail of a file.
• In 2009 IBM discussed running Hadoop over the IBM General Parallel File System.[13] The source code was published in October 2009.[14]
• In April 2010, Parascale published the source code to run Hadoop against the Parascale filesystem.[15]
• In April 2010, Appistry released a Hadoop filesystem driver for use with its own CloudIQ Storage product.[16]
• In June 2010, HP discussed a location-aware IBRIX Fusion filesystem driver.[17]
• In May 2011, MapR Technologies, Inc. announced the availability of an alternate filesystem for Hadoop, which replaced the HDFS file system with a full random-access read/write file system, with advanced features like snapshots and mirrors, and got rid of the single point of failure issue of the default HDFS NameNode.

JobTracker and TaskTracker: the MapReduce engine

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.

If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off. In earlier versions of Hadoop, all active work was lost when a JobTracker restarted.

Known limitations of this approach are:
• The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability.
• If one TaskTracker is very slow, it can delay the entire MapReduce job - especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative-execution enabled, however, a single task can be executed on multiple slave nodes.

Scheduling

By default Hadoop uses FIFO scheduling, with five optional scheduling priorities, to schedule jobs from a work queue.[18] In version 0.19 the job scheduler was refactored out of the JobTracker, and the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler) was added.[19]

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS for production jobs. The fair scheduler has three basic concepts.[20]
1. Jobs are grouped into Pools.
2. Each pool is assigned a guaranteed minimum share.
3. Excess capacity is split between jobs.

By default jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.

Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.[21]
• Jobs are submitted into queues.
• Queues are allocated a fraction of the total resource capacity.
• Free resources are allocated to queues beyond their total capacity.
• Within a queue a job with a high level of priority will have access to the queue’s resources.

There is no preemption once a job is running.

Other applications

The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and the Apache Hive Data Warehouse system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, that is very data-intensive, and that is able to work on pieces of the data in parallel. As of October 2009, commercial applications of Hadoop[22] included:
• Log and/or clickstream analysis of various kinds
• Marketing analytics
• Machine learning and/or sophisticated data mining
• Image processing
• Processing of XML messages
• Web crawling and/or text processing
• General archiving, including of relational/tabular data, e.g. for compliance
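The batch-oriented, data-parallel style of work described above, such as log analysis, can be illustrated with a minimal in-memory sketch of the map, shuffle, and reduce phases. This is plain Python in a single process, not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log format are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical log format, assumed for illustration: "<client> <path> <status>"
SAMPLE_LOG = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "10.0.0.3 /index.html 200",
]


def map_phase(line):
    """Emit one (status, 1) pair per log line, as a mapper would."""
    _client, _path, status = line.split()
    yield status, 1


def shuffle(pairs):
    """Group values by key; a MapReduce engine does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts for one key, as a reducer would."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce over the input lines in one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))
```

Because each log line is mapped independently and each key is reduced independently, the same job could in principle be split across many nodes, which is the property the paragraph above relies on.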
Prominent users

Yahoo!

On February 19, 2008, Yahoo! Inc. launched what it claimed was the world’s largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.[23]

There are multiple Hadoop clusters at Yahoo!, and no HDFS filesystems or MapReduce jobs are split across multiple datacenters. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine.

On June 10, 2009, Yahoo! made available the source code to the version of Hadoop it runs in production.[24] Yahoo! contributes back all work it does on Hadoop to the open-source community; the company’s developers also fix bugs and provide stability improvements internally, and release this patched source code so that other users may benefit from their effort.

Facebook

In 2010 Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage.[25] On July 27, 2011 they announced that the data had grown to 30 PB.[26]

Other users

Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations. Some of the notable users include:[2]
• 1&1
• A9.com
• About.com
• Amazon.com
• AOL
• Apple[27]
• Booz Allen Hamilton
• Cerner
• ChaCha
• comScore[28]
• EHarmony
• eBay
• Federal Reserve Board of Governors
• foursquare
• Fox Interactive Media
• Freebase
• Hewlett-Packard
• IBM
• ImageShack
• ISI
• Joost
• Last.fm
• LinkedIn[29]
• Microsoft[30]
• Meebo
• Mendeley
• Metaweb
• Netflix[31]
• The New York Times
• Ning
• Outbrain
• Playdom (now part of Disney Interactive Media Group)
• Powerset (now part of Microsoft)
• Rackspace
• Razorfish
• StumbleUpon[32]
• Twitter
• Mitula[33]

Hadoop on Amazon EC2/S3 services

It is possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).[34] As an example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).[35]

There is support for the S3 filesystem in Hadoop distributions, and the Hadoop team generates EC2 machine images after every release. From a pure performance perspective, Hadoop on S3/EC2 is inefficient, as the S3 filesystem is remote and delays returning from every write operation until the data is guaranteed not to be lost. This removes the locality advantages of Hadoop, which schedules work near data to save on network load.

Amazon Elastic MapReduce

Elastic MapReduce was introduced by Amazon in April 2009. Provisioning of the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3 are automated by Elastic MapReduce. Apache Hive, which is built on top of Hadoop for providing data warehouse services, is also offered in Elastic MapReduce.[36]

Support for using Spot Instances was added in August 2011.[37] Elastic MapReduce is fault tolerant for slave failures,[38] and it is recommended to only run the Task Instance Group on spot instances to take advantage of the lower cost while maintaining availability.[39]
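As a rough worked example, the New York Times figures quoted earlier in this section imply a cost per instance-hour and an aggregate input rate. The sketch below only rearranges the numbers given in the text; it assumes that all 100 instances ran for the full 24 hours and that 1 TB means 10^12 bytes, neither of which the source states, and the function name `summarize` is hypothetical.

```python
# Figures quoted in the text for the New York Times job
INSTANCES = 100
HOURS = 24
COST_USD = 240.0
INPUT_TB = 4
OUTPUT_PDFS = 11_000_000


def summarize(instances, hours, cost_usd, input_tb, output_pdfs):
    """Derive per-instance-hour and aggregate rates from the quoted totals.

    Assumes every instance ran for the whole period and 1 TB = 10**12 bytes.
    """
    instance_hours = instances * hours
    seconds = hours * 3600
    return {
        "instance_hours": instance_hours,
        "usd_per_instance_hour": cost_usd / instance_hours,
        "aggregate_mb_per_s": input_tb * 1e12 / 1e6 / seconds,
        "pdfs_per_instance_hour": output_pdfs / instance_hours,
    }


if __name__ == "__main__":
    for name, value in summarize(
        INSTANCES, HOURS, COST_USD, INPUT_TB, OUTPUT_PDFS
    ).items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the job used 2,400 instance-hours, which works out to about $0.10 per instance-hour.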
Hadoop at Google and IBM

IBM and Google announced an initiative in 2007 to use Hadoop to support university courses in distributed computer programming.[40]

In 2008 this collaboration, the Academic Cloud Computing Initiative (ACCI), partnered with the National Science Foundation to provide grant funding to academic researchers interested in exploring large-data applications. This resulted in the creation of the Cluster Exploratory (CLuE) program.[41]

Running Hadoop in compute farm environments

Hadoop can also be used in compute farms and high-performance computing environments. Instead of setting up a dedicated Hadoop cluster, an existing compute farm can be used if the resource manager of the cluster is aware of the Hadoop jobs, and thus Hadoop jobs can be scheduled like other jobs in the cluster.

Grid Engine Integration

Integration with Sun Grid Engine was released in 2008, and running Hadoop on Sun Grid (Sun’s on-demand utility computing service) was possible.[42] In the initial implementation of the integration, the CPU-time scheduler had no knowledge of the locality of the data. Unfortunately, this meant that processing was not always done on the same rack as the data, losing a key feature of the Hadoop runtime. An improved integration with data-locality was announced during the Sun HPC Software Workshop ’09.[43]

In 2008-2009 Sun released the Hadoop Live CD OpenSolaris project, which allows running a fully functional Hadoop cluster using a live CD.[44] This distribution includes Hadoop 0.19; as of April 2010 there has not been an updated release.

Condor Integration

The Condor High-Throughput Computing System integration was presented at the Condor Week conference in 2010.[45]

Commercially supported Hadoop-related products

There are a number of companies offering commercial implementations and/or providing support for Hadoop.[46]
• Cloudera offers CDH (Cloudera’s Distribution including Apache Hadoop) and Cloudera Enterprise.[47]
• IBM offers InfoSphere BigInsights[48] based on Hadoop in both a basic and enterprise edition.[49]
• In March 2011, Platform Computing announced support for the Hadoop MapReduce API in its Symphony software.[50]
• In May 2011, MapR Technologies, Inc. announced the availability of their distributed filesystem and MapReduce engine, the MapR Distribution for Apache Hadoop.[51] The MapR product includes most Hadoop eco-system components and adds capabilities such as snapshots, mirrors, NFS access and full read-write file semantics.[52]
• Silicon Graphics International offers Hadoop optimized solutions based on the SGI Rackable and CloudRack server lines with implementation services.[53]
• EMC released EMC Greenplum Community Edition and EMC Greenplum HD Enterprise Edition in May 2011. The community edition, with optional for-fee technical support, consists of Hadoop, HDFS, HBase, Hive, and the ZooKeeper configuration service. The enterprise edition is an offering based on the MapR product, and offers proprietary features such as snapshots and wide area replication.[54][55]
• In June 2011, Yahoo! and Benchmark Capital formed Hortonworks Inc., whose focus is on making Hadoop more robust and easier to install, manage and use for enterprise users.[56]
• Google added AppEngine-MapReduce to support running Hadoop 0.20 programs on Google App Engine.[57][58]
• In October 2011, Oracle announced the Big Data Appliance, which integrates Hadoop, Oracle Enterprise Linux, the R programming language, and a NoSQL database with the Exadata hardware.[59][60]
• Dovestech has released Ocean Sync Hadoop Management Software Freeware Edition. The software allows users to control and monitor all aspects of a Hadoop cluster.[61]
• Grand Logic’s JobServer[62] product allows developers and admins to deploy, manage and monitor their Hadoop infrastructure, with support for Hadoop job processing and HDFS file/content management.

ASF’s view on the use of "Hadoop" in product names

The Apache Software Foundation has stated that only software officially released by the Apache Hadoop Project can be called Apache Hadoop or Distributions of Apache Hadoop.[63] The naming of products and derivative works from other vendors and the term "compatible" are somewhat controversial within the Hadoop developer community.[64]
Papers

Some papers influenced the birth and growth of Hadoop and big data processing. Here is a partial list:
• 2004 Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat from Google Lab. This paper inspired Doug Cutting to develop an open-source implementation of the Map-Reduce framework. He named it Hadoop, after his son’s toy elephant.
• 2005 From Databases to Dataspaces: A New Abstraction for Information Management, the authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system’s understanding of the data.
• 2006 Bigtable: A Distributed Storage System for Structured Data from Google Lab.
• 2008 H-store: a high-performance, distributed main memory transaction processing system
• 2009 MAD Skills: New Analysis Practices for Big Data
• 2011 Apache Hadoop Goes Realtime at Facebook

See also

• Nutch - an effort to build an open source search engine based on Lucene and Hadoop. Also created by Doug Cutting.
• Datameer Analytics Solution (DAS) – data source integration, storage, analytics engine and visualization
• HBase - BigTable-model database.
• Hypertable - HBase alternative
• MapReduce - Hadoop’s fundamental data filtering algorithm
• Apache Mahout - Machine Learning algorithms implemented on Hadoop
• Apache Cassandra - A column-oriented database that supports access from Hadoop
• HPCC - LexisNexis Risk Solutions High Performance Computing Cluster
• Sector/Sphere - Open source distributed storage and processing
• Cloud computing
• Big data
• Data Intensive Computing

References

[1] "Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework." Hadoop Overview
[2] Applications and organizations using Hadoop
[3] Hadoop Credits Page
[4] Yahoo! Launches World’s Largest Hadoop Production Application
[5] Hadoop creator goes to Cloudera
[6] Ashlee Vance (2009-03-17). "Hadoop, a Free Software Program, Finds Uses Beyond Search". New York Times. http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html. Retrieved 2010-01-20.
[7] "Hadoop contains the distributed computing platform that was formerly a part of Nutch. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of MapReduce." About Hadoop
[8] http://hadoop.apache.org/common/docs/r0.20.2/hdfs_user_guide.html#Rack+Awareness
[9] The Hadoop Distributed File System: Architecture and Design
[10] Improve Namenode startup performance. "Default scenario for 20 million files with the max Java heap size set to 14 GB: 40 minutes. Tuning various Java options such as young size, parallel garbage collection, initial Java heap size: 14 minutes"
[11] Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters. April 2010
[12] HDFS Users Guide - Rack Awareness
[13] "Cloud analytics: Do we really need to reinvent the storage stack?". IBM. 2009-06. http://www.usenix.org/events/hotcloud09/tech/full_papers/ananthanarayanan.pdf.
[14] "HADOOP-6330: Integrating IBM General Parallel File System implementation of Hadoop Filesystem interface". IBM. 2009-10-23. https://issues.apache.org/jira/browse/HADOOP-6330.
[15] "HADOOP-6704: add support for Parascale filesystem". Parascale. 2010-04-14. https://issues.apache.org/jira/browse/HADOOP-6330.
[16] "Replace HDFS with CloudIQ Storage". Appistry,Inc. 2010-07-06. http://www.appistry.com/community/wiki/display/cloudiq43/Replace+HDFS+with+CloudIQ+Storage.
[17] "High Availability Hadoop". HP. 2010-06-09. http://www.slideshare.net/steve_l/high-availability-hadoop.
[18] job
[19] https://issues.apache.org/jira/browse/HADOOP-3412 #HADOOP-3412 Refactor the scheduler out of the JobTracker - ASF JIRA
[20] Hadoop Fair Scheduler Design Document
[21] Capacity Scheduler Guide
[22] "How 30+ enterprises are using Hadoop", in DBMS2
[23] Yahoo! Launches World’s Largest Hadoop Production Application (Hadoop and Distributed Computing at Yahoo!)
[24] Hadoop and Distributed Computing at Yahoo!
[25] [4]
[26] [5]
[27] "Apple Embraces Hadoop". http://www.theregister.co.uk/2010/12/01/apple_embraces_hadoop/. Retrieved 2011-04-14.
[28] "Using Hadoop to tackle Big Data at comScore". http://www.cloudera.com/videos/hw10_video_using_hadoop_to_tackle_big_data_at_comscore.
[29] "Building a terabyte-scale data cycle at LinkedIn with Hadoop and Project Voldemort". http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/. Retrieved 2011-04-14.
[30] "Microsoft Expands Data Platform With SQL Server 2012, New Investments for Managing Any Data, Any Size, Anywhere". http://www.microsoft.com/Presspass/press/2011/oct11/10-12PASS1PR.mspx. Retrieved 2011-10-13.
[31] "Use Case Study of Hive/Hadoop". http://www.slideshare.net/evamtse/hive-user-group-presentation-from-netflix-3182010-3483386. Retrieved 2011-04-14.
[32] "HBase at StumbleUpon". http://www.stumbleupon.com/devblog/hbase_at_stumbleupon/. Retrieved 2010-06-26.
[33] "Mitula Search/Hadoop". http://www.mitula.co.uk. Retrieved 2011-09-06.
[34] http://aws.typepad.com/aws/2008/02/taking-massive.html Running Hadoop on Amazon EC2/S3
[35] Gottfrid, Derek (November 1, 2007). "Self-service, Prorated Super Computing Fun!". The New York Times. http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?scp=1&sq=self%20service%20prorated&st=cse. Retrieved May 4, 2010.
[36] Amazon Elastic MapReduce Developer Guide
[37] Amazon Elastic MapReduce Now Supports Spot Instances
[38] Amazon Elastic MapReduce FAQs
[39] Using Spot Instances with EMR on YouTube
[40] Google Press Center: Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges
[41] NSF, Google, IBM form CLuE
[42] "Creating Hadoop pe under SGE". Sun Microsystems. 2008-01-16. http://blogs.sun.com/ravee/entry/creating_hadoop_pe_under_sge.
[43] "HDFS-Aware Scheduling With Grid Engine". Sun Microsystems. 2009-09-10. http://wikis.sun.com/display/SunHPC09/Sun+HPC+Software+Workshop+’09+Wiki.
[44] "OpenSolaris Project: Hadoop Live CD". Sun Microsystems. 2008-08-29. http://opensolaris.org/os/project/livehadoop/.
[45] "Condor integrated with Hadoop’s Map Reduce". University of Wisconsin–Madison. 2010-04-15. http://www.cs.wisc.edu/condor/CondorWeek2010/condor-presentations/thain-condor-hadoop.pdf.
[46] Why the Pace of Hadoop Innovation Has to Pick Up
[47] Cloudera’s Distribution including Apache Hadoop
[48] IBM InfoSphere BigInsights
[49] IBM InfoSphere BigInsights Enterprise Edition analytics platform enables new class of solutions for gaining rapid insight through large-scale analysis of diverse data
[50] Platform Computing Announces Support for MapReduce
[51] MapR Distribution for Apache Hadoop
[52] http://mapr.com/products/mapr-editions/m5-edition.html
[53] Hadoop optimized solutions from SGI
[54] Greenplum Community
[55] Greenplum HD: Enterprise-Ready Apache Hadoop
[56] Yahoo! and Benchmark Capital to Form Hortonworks to Increase Investment in Hadoop Technology and Accelerate Innovation and Adoption
[57] appengine-mapreduce - Google App Engine API for running MapReduce jobs
[58] Google I/O 2011: App Engine MapReduce on YouTube
[59] Oracle Unveils the Oracle Big Data Appliance
[60] Oracle rolls its own NoSQL and Hadoop
[61] http://www.oceansync.com OceanSync.com Hadoop Management
[62] http://www.grandlogic.com/content/html_docs/js_features.shtml
[63] Defining Hadoop
[64] Defining Hadoop Compatibility: revisited

Bibliography

• Lam, Chuck (July 28, 2010). Hadoop in Action (1st ed.). Manning Publications. p. 325. ISBN 1-935-18219-6.
• Venner, Jason (June 22, 2009). Pro Hadoop (1st ed.). Apress. p. 440. ISBN 1-430-21942-4. http://www.apress.com/book/view/1430219424.
• White, Tom (June 16, 2009). Hadoop: The Definitive Guide (1st ed.). O’Reilly Media. p. 524.
ISBN 0-596-52197-9. http://oreilly.com/catalog/9780596521974.

External links

• Official Hadoop Homepage
• Introducing Apache Hadoop: The Modern Data Operating System - lecture given at Stanford University by Co-Founder and CTO of Cloudera, Amr Awadallah (video archive).
Retrieved from "http://en.wikipedia.org/w/index.php?title=Apache_Hadoop&oldid=468610972"
Categories:
• Hadoop
• Free software programmed in Java
• Free system software
• Distributed file systems
• Cloud computing
• Cloud infrastructure
This page was last modified on 30 December 2011 at 15:35. Text is available under the Creative Commons Attribution-
ShareAlike License; additional terms may apply. See Terms of use for details. Wikipedia® is a registered trademark of
the Wikimedia Foundation, Inc., a non-profit organization.