From Wikipedia, the free encyclopedia

Apache Hadoop

Developer(s): Apache Software Foundation
Stable release: 1.0.0 / December 27, 2011 (2011-12-27)
Preview release: 0.22.0 / December 10, 2011 (2011-12-10)
Development status: Active
Written in: Java
Operating system: Cross-platform
Type: Distributed File System
License: Apache License 2.0
Website: hadoop.apache.org

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license.[1] It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

Hadoop is a top-level Apache project being built and used by a global community of contributors,[2] written in the Java programming language. Yahoo! has been the largest contributor[3] to the project, and uses Hadoop extensively across its businesses.[4]

Hadoop was created by Doug Cutting,[5] who named it after his son’s toy elephant.[6] It was originally developed to support distribution for the Nutch search engine project.[7]

Architecture

Hadoop consists of the Hadoop Common, which provides access to the filesystems supported by Hadoop. The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop Community.

For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is, and, failing that, on the same rack/switch, so reducing backbone traffic. The Hadoop Distributed File System (HDFS) uses this when replicating data, to try to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure so that even if these events occur, the data may still be readable.[8]

[Figure: A multi-node Hadoop cluster]

A small Hadoop cluster will include a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes, and compute-only worker nodes; these are normally only used in non-standard applications. Hadoop requires JRE 1.6 or higher. The standard startup and shutdown scripts require ssh to be set up between nodes in the cluster.

In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the filesystem index, and a secondary NameNode that can generate snapshots of the namenode’s memory structures, thus preventing filesystem corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate filesystem, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the filesystem-specific equivalent.
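The location-aware preference described in the Architecture section (run work on the node that holds the data, otherwise on the same rack, otherwise anywhere) can be sketched roughly as follows. This is an illustrative sketch only, not Hadoop's actual scheduler code: the function names `locality_rank` and `pick_worker`, the `rack_of` mapping, and the node and rack names are all hypothetical.

```python
from typing import Dict, List, Optional


def locality_rank(worker: str, data_node: str, rack_of: Dict[str, str]) -> int:
    """Return 0 for the node holding the data, 1 for the same rack, 2 otherwise."""
    if worker == data_node:
        return 0
    if rack_of[worker] == rack_of[data_node]:
        return 1
    return 2


def pick_worker(free_workers: List[str], data_node: str,
                rack_of: Dict[str, str]) -> Optional[str]:
    """Pick the free worker closest to the data, or None if no worker is free."""
    if not free_workers:
        return None
    # min() keeps the first worker among equally ranked candidates.
    return min(free_workers, key=lambda w: locality_rank(w, data_node, rack_of))


# Hypothetical three-node cluster spread over two racks.
RACK_OF = {"node1": "rackA", "node2": "rackA", "node3": "rackB"}

if __name__ == "__main__":
    # The data lives on node1; the candidate lists simulate which workers are free.
    print(pick_worker(["node3", "node2", "node1"], "node1", RACK_OF))
    print(pick_worker(["node3", "node2"], "node1", RACK_OF))
    print(pick_worker(["node3"], "node1", RACK_OF))
```

With the data on node1, the sketch prefers node1 itself, then node2 on the same rack, and only falls back to node3 on the other rack when nothing closer is free. Ranking by distance keeps traffic off the backbone whenever a closer worker exists.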

Filesystems

Hadoop Distributed File System

HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single datanode; a cluster of datanodes forms the HDFS cluster. This is typical rather than required, because a node does not need to have a datanode present. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate with each other. HDFS stores large files (an ideal file size is a multiple of 64 MB[9]) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. HDFS was designed to handle very large files.[9]

HDFS does not provide high availability, because an HDFS filesystem instance requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster.[10] The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the Primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the Primary Namenode and builds snapshots of the Primary Namenode’s directory information, which is then saved to local/remote directories. These checkpointed images can be used to restart a failed Primary Namenode without having to replay the entire journal of filesystem actions, then edit the log to create an up-to-date directory structure. Since the Namenode is the single point for storage and management of metadata, it can be a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation is a new addition which aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate Namenodes.

An advantage of using HDFS is data awareness between the jobtracker and tasktracker. The jobtracker schedules map/reduce jobs to tasktrackers with an awareness of the data location. For example, if node A contained data (x,y,z) and node B contained data (a,b,c), the jobtracker would schedule node B to perform map/reduce tasks on (a,b,c) and node A to perform map/reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other filesystems this advantage is not always available. This can have a significant impact on job completion times, which has been demonstrated when running data-intensive jobs.[11]

Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems.

File access can be achieved through the native Java API, the Thrift API to generate a client in the language of the users’ choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, or browsed through the HDFS-UI webapp over HTTP.

Other Filesystems

By May 2011, the list of supported filesystems included:
• HDFS: Hadoop’s own rack-aware filesystem.[12] This is designed to scale to tens of petabytes of storage and runs on top of the filesystems of the underlying operating systems.
• Amazon S3 filesystem. This is targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. There is no rack-awareness in this filesystem, as it is all remote.
• CloudStore (previously Kosmos Distributed File System), which is rack-aware.
• FTP Filesystem: this stores all its data on remotely accessible FTP servers.
• Read-only HTTP and HTTPS file systems.

Hadoop can work directly with any distributed file system that can be mounted by the underlying operating system simply by using a file:// URL; however, this comes at a price: the loss of locality. To reduce network traffic, Hadoop needs to know which servers are closest to the data; this is information which Hadoop-specific filesystem bridges can provide.

Out-of-the-box, this includes Amazon S3, and the CloudStore filestore, through s3:// and kfs:// URLs directly.

A number of third party filesystem bridges have also been written, none of which are currently in Hadoop distributions. These may offer superior availability or scalability, and possibly a more general-purpose filesystem than HDFS, which is biased towards large files and only offers a subset of the expected semantics of a POSIX filesystem: no locking, and no writing to anywhere other than the tail of a file.
• In 2009 IBM discussed running Hadoop over the IBM General Parallel File System.[13] The source code was published in October 2009.[14]
• In April 2010, Parascale published the source code to run Hadoop against the Parascale filesystem.[15]
• In April 2010, Appistry released a Hadoop filesystem driver for use with its own CloudIQ Storage product.[16]
• In June 2010, HP discussed a location-aware IBRIX Fusion filesystem driver.[17]
• In May 2011, MapR Technologies, Inc. announced the availability of an alternate filesystem for Hadoop, which replaced the HDFS file system with a full random-access read/write file system, with advanced features like snapshots and mirrors, and which removed the single point of failure of the default HDFS NameNode.

JobTracker and TaskTracker: the MapReduce engine

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser.

If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.

Known limitations of this approach are:
• The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability.
• If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.

Scheduling

By default Hadoop uses FIFO scheduling, with five optional scheduling priorities, to schedule jobs from a work queue.[18] In version 0.19 the job scheduler was refactored out of the JobTracker, which added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).[19]

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS for production jobs. The fair scheduler has three basic concepts.[20]
1. Jobs are grouped into Pools.
2. Each pool is assigned a guaranteed minimum share.
3. Excess capacity is split between jobs.

By default jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.

Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.[21]
• Jobs are submitted into queues.
• Queues are allocated a fraction of the total resource capacity.
• Free resources are allocated to queues beyond their total capacity.
• Within a queue a job with a high level of priority will have access to the queue’s resources.

There is no preemption once a job is running.

Other applications

The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and the Apache Hive Data Warehouse system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, that is very data-intensive, and able to work on pieces of the data in parallel. As of October 2009, commercial applications of Hadoop[22] included:
• Log and/or clickstream analysis of various kinds
• Marketing analytics
• Machine learning and/or sophisticated data mining
• Image processing
• Processing of XML messages
• Web crawling and/or text processing
• General archiving, including of relational/tabular data, e.g. for compliance

Prominent users

Yahoo!

On February 19, 2008, Yahoo! Inc. launched what it claimed was the world’s largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.[23]

There are multiple Hadoop clusters at Yahoo!, and no HDFS filesystems or MapReduce jobs are split across multiple datacenters. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine.

On June 10, 2009, Yahoo! made available the source code to the version of Hadoop it runs in production.[24] Yahoo! contributes back all work it does on Hadoop to the open-source community. The company’s developers also fix bugs and provide stability improvements internally, and release this patched source code so that other users may benefit from their effort.

Facebook

In 2010 Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage.[25] On July 27, 2011 they announced that the data had grown to 30 PB.[26]

Other users

Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations. Some of the notable users include:[2]
• 1&1
• A9.com
• About.com
• Amazon.com
• American Airlines
• AOL
• Apple[27]
• Booz Allen Hamilton
• Cerner
• ChaCha
• comScore[28]
• EHarmony
• eBay
• Federal Reserve Board of Governors
• foursquare
• Fox Interactive Media
• Freebase
• Hewlett-Packard
• IBM
• InMobi[29]
• ImageShack
• ISI
• Joost
• Last.fm
• LinkedIn[30]
• Microsoft[31]
• Meebo
• Mendeley
• Metaweb
• Netflix[32]
• The New York Times
• Ning
• Outbrain
• Playdom (now part of Disney Interactive Media Group)
• Powerset (now part of Microsoft)
• Rackspace
• Razorfish
• StumbleUpon[33]
• Twitter
• Mitula[34]

Hadoop on Amazon EC2/S3 services

It is possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).[35] As an example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).[36]

There is support for the S3 filesystem in Hadoop distributions, and the Hadoop team generates EC2 machine images after every release. From a pure performance perspective, Hadoop on S3/EC2 is inefficient, as the S3 filesystem is remote and delays returning from every write operation until the data is guaranteed not to be lost. This removes the locality advantages of Hadoop, which schedules work near data to save on network load.

Amazon Elastic MapReduce

Elastic MapReduce was introduced by Amazon in April 2009. Provisioning of the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3 are automated by Elastic MapReduce. Apache Hive, which is built on top of Hadoop for providing data warehouse services, is also offered in Elastic MapReduce.[37]

Support for using Spot Instances was later added in August 2011.[38] Elastic MapReduce is fault tolerant for slave failures,[39] and it is recommended to only run the Task Instance Group on spot instances to take advantage of the lower cost while maintaining availability.[40]

Hadoop at Google and IBM

IBM and Google announced an initiative in 2007 to use Hadoop to support university courses in distributed computer programming.[41]

In 2008 this collaboration, the Academic Cloud Computing Initiative (ACCI), partnered with the National Science Foundation to provide grant funding to academic researchers interested in exploring large-data applications. This resulted in the creation of the Cluster Exploratory (CLuE) program.[42]

Running Hadoop in compute farm environments

Hadoop can also be used in compute farms and high-performance computing environments. Instead of setting up a dedicated Hadoop cluster, an existing compute farm can be used if the resource manager of the cluster is aware of the Hadoop jobs, so that Hadoop jobs can be scheduled like other jobs in the cluster.

Grid Engine Integration

Integration with Sun Grid Engine was released in 2008, and running Hadoop on Sun Grid (Sun’s on-demand utility computing service) was possible.[43] In the initial implementation of the integration, the CPU-time scheduler has no knowledge of the locality of the data. Unfortunately, this means that the processing is not always done on the same rack as the data, even though data locality is a key feature of the Hadoop runtime. An improved integration with data-locality was announced during the Sun HPC Software Workshop ’09.[44]

In 2008-2009 Sun released the Hadoop Live CD OpenSolaris project, which allows running a fully functional Hadoop cluster using a live CD.[45] This distribution includes Hadoop 0.19; as of April 2010 there has not been an updated release.

Condor Integration

The Condor High-Throughput Computing System integration was presented at the Condor Week conference in 2010.[46]

Commercially supported Hadoop-related products

There are a number of companies offering commercial implementations and/or providing support for Hadoop.[47]
• Cloudera offers CDH (Cloudera’s Distribution including Apache Hadoop) and Cloudera Enterprise.[48]
• IBM offers InfoSphere BigInsights[49] based on Hadoop in both a basic and enterprise edition.[50]
• Zettaset offers a new version of its Big Data Management Platform,[51] based on Hadoop. Zettaset’s Big Data Platform delivers high availability via NameNode failover, a streamlined UI, Network Time Protocol and built-in security via Kerberos authentication.
• In March 2011, Platform Computing announced support for the Hadoop MapReduce API in its Symphony software.[52]
• In May 2011, MapR Technologies, Inc. announced the availability of their distributed filesystem and MapReduce engine, the MapR Distribution for Apache Hadoop.[53] The MapR product includes most Hadoop eco-system components and adds capabilities such as snapshots, mirrors, NFS access and full read-write file semantics.[54]
• Silicon Graphics International offers Hadoop optimized solutions based on the SGI Rackable and CloudRack server lines with implementation services.[55]
• EMC released EMC Greenplum Community Edition and EMC Greenplum HD Enterprise Edition in May 2011. The community edition, with optional for-fee technical support, consists of Hadoop, HDFS, HBase, Hive, and the ZooKeeper configuration service. The enterprise edition is an offering based on the MapR product, and offers proprietary features such as snapshots and wide area replication.[56][57]
• In June 2011, Yahoo! and Benchmark Capital formed Hortonworks Inc., whose focus is on making Hadoop more robust and easier to install, manage and use for enterprise users.[58]
• Google added AppEngine-MapReduce to support running Hadoop 0.20 programs on Google App Engine.[59][60]
• In October 2011, Oracle announced the Big Data Appliance, which integrates Hadoop, Oracle Enterprise Linux, the R programming language, and a NoSQL database with the Exadata hardware.[61][62]
• Dovestech has released Ocean Sync Hadoop Management Software Freeware Edition. The software allows users to control and monitor all aspects of a Hadoop cluster.[63]
• Grand Logic’s JobServer[64] product allows developers and admins to deploy, manage and monitor their Hadoop infrastructure, with support for Hadoop job processing and HDFS file/content management.

ASF’s view on the use of "Hadoop" in product names

The Apache Software Foundation has stated that only software officially released by the Apache Hadoop Project can be called Apache Hadoop or Distributions of Apache Hadoop.[65] The naming of products and derivative works from other vendors and the term "compatible" are somewhat controversial within the Hadoop developer community.[66]

Papers

Some papers influenced the birth and growth of Hadoop and big data processing. Here is a partial list:
• 2004 MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat from Google Lab. This paper inspired Doug Cutting to develop an open-source implementation of the Map-Reduce framework. He named it Hadoop, after his son’s toy elephant.
• 2005 From Databases to Dataspaces: A New Abstraction for Information Management, the authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system’s understanding of the data.
• 2006 Bigtable: A Distributed Storage System for Structured Data from Google Lab.
• 2008 H-store: a high-performance, distributed main memory transaction processing system
• 2009 MAD Skills: New Analysis Practices for Big Data
• 2011 Apache Hadoop Goes Realtime at Facebook

See also

• Nutch - an effort to build an open source search engine based on Lucene and Hadoop. Also created by Doug Cutting.
• Datameer Analytics Solution (DAS) - data source integration, storage, analytics engine and visualization
• HBase - BigTable-model database.
• Hypertable - HBase alternative
• MapReduce - Hadoop’s fundamental data filtering algorithm
• Apache Mahout - Machine Learning algorithms implemented on Hadoop
• Apache Cassandra - A column-oriented database that supports access from Hadoop
• HPCC - LexisNexis Risk Solutions High Performance Computing Cluster
• Sector/Sphere - Open source distributed storage and processing
• Cloud computing
• Big data
• Data Intensive Computing

References

[1] "Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework." Hadoop Overview
[2] Applications and organizations using Hadoop
[3] Hadoop Credits Page
[4] Yahoo! Launches World’s Largest Hadoop Production Application
[5] Hadoop creator goes to Cloudera
[6] Ashlee Vance (2009-03-17). "Hadoop, a Free Software Program, Finds Uses Beyond Search". New York Times. http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html. Retrieved 2010-01-20.
[7] "Hadoop contains the distributed computing platform that was formerly a part of Nutch. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of MapReduce." About Hadoop
[8] http://hadoop.apache.org/common/docs/r0.20.2/hdfs_user_guide.html#Rack+Awareness
[9] The Hadoop Distributed File System: Architecture and Design
[10] Improve Namenode startup performance. "Default scenario for 20 million files with the max Java heap size set to 14 GB: 40 minutes. Tuning various Java options such as young size, parallel garbage collection, initial Java heap size: 14 minutes"
[11] Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters. April 2010.
[12] HDFS Users Guide - Rack Awareness
[13] "Cloud analytics: Do we really need to reinvent the storage stack?". IBM. 2009-06. http://www.usenix.org/events/hotcloud09/tech/full_papers/ananthanarayanan.pdf.
[14] "HADOOP-6330: Integrating IBM General Parallel File System implementation of Hadoop Filesystem interface". IBM. 2009-10-23. https://issues.apache.org/jira/browse/HADOOP-6330.
[15] "HADOOP-6704: add support for Parascale filesystem". Parascale. 2010-04-14. https://issues.apache.org/jira/browse/HADOOP-6704.
[16] "Replace HDFS with CloudIQ Storage". Appistry, Inc. 2010-07-06. http://www.appistry.com/community/wiki/display/cloudiq43/Replace+HDFS+with+CloudIQ+Storage.
[17] "High Availability Hadoop". HP. 2010-06-09. http://www.slideshare.net/steve_l/high-availability-hadoop.
[18] job
[19] HADOOP-3412: Refactor the scheduler out of the JobTracker - ASF JIRA. https://issues.apache.org/jira/browse/HADOOP-3412
[20] Hadoop Fair Scheduler Design Document
[21] Capacity Scheduler Guide
[22] "How 30+ enterprises are using Hadoop", in DBMS2
[23] Yahoo! Launches World’s Largest Hadoop Production Application (Hadoop and Distributed Computing at Yahoo!)
[24] Hadoop and Distributed Computing at Yahoo!
[25] [4]
[26] [5]
[27] "Apple Embraces Hadoop". http://www.theregister.co.uk/2010/12/01/apple_embraces_hadoop/. Retrieved 2011-04-14.
[28] "Using Hadoop to tackle Big Data at comScore". http://www.cloudera.com/videos/hw10_video_using_hadoop_to_tackle_big_data_at_comscore.
[29] "InMobi Ranked as a Top 10 Contributor to Apache Hadoop". http://www.inmobi.com/inmobiblog/2011/10/07/inmobi-ranked-as-a-top-10-contributor-to-apache-hadoop/. Retrieved 2011-10-07.
[30] "Building a terabyte-scale data cycle at LinkedIn with Hadoop and Project Voldemort". http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/. Retrieved 2011-04-14.
[31] "Microsoft Expands Data Platform With SQL Server 2012, New Investments for Managing Any Data, Any Size, Anywhere". http://www.microsoft.com/Presspass/press/2011/oct11/10-12PASS1PR.mspx. Retrieved 2011-10-13.
[32] "Use Case Study of Hive/Hadoop". http://www.slideshare.net/evamtse/hive-user-group-presentation-from-netflix-3182010-3483386. Retrieved 2011-04-14.
[33] "HBase at StumbleUpon". http://www.stumbleupon.com/devblog/hbase_at_stumbleupon/. Retrieved 2010-06-26.
[34] "Mitula Search/Hadoop". http://www.mitula.co.uk. Retrieved 2011-09-06.
[35] Running Hadoop on Amazon EC2/S3. http://aws.typepad.com/aws/2008/02/taking-massive.html
[36] Gottfrid, Derek (November 1, 2007). "Self-service, Prorated Super Computing Fun!". The New York Times. http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?scp=1&sq=self%20service%20prorated&st=cse. Retrieved May 4, 2010.
[37] Amazon Elastic MapReduce Developer Guide
[38] Amazon Elastic MapReduce Now Supports Spot Instances
[39] Amazon Elastic MapReduce FAQs
[40] Using Spot Instances with EMR on YouTube
[41] Google Press Center: Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges
[42] NSF, Google, IBM form CLuE
[43] "Creating Hadoop pe under SGE". Sun Microsystems. 2008-01-16. http://blogs.sun.com/ravee/entry/creating_hadoop_pe_under_sge.
[44] "HDFS-Aware Scheduling With Grid Engine". Sun Microsystems. 2009-09-10. http://wikis.sun.com/display/SunHPC09/Sun+HPC+Software+Workshop+’09+Wiki.
[45] "OpenSolaris Project: Hadoop Live CD". Sun Microsystems. 2008-08-29. http://opensolaris.org/os/project/livehadoop/.
[46] "Condor integrated with Hadoop’s Map Reduce". University of Wisconsin–Madison. 2010-04-15. http://www.cs.wisc.edu/condor/CondorWeek2010/condor-presentations/thain-condor-hadoop.pdf.
[47] Why the Pace of Hadoop Innovation Has to Pick Up
[48] Cloudera’s Distribution including Apache Hadoop
[49] IBM InfoSphere BigInsights
[50] IBM InfoSphere BigInsights Enterprise Edition analytics platform enables new class of solutions for gaining rapid insight through large-scale analysis of diverse data
[51] [6]
[52] Platform Computing Announces Support for MapReduce
[53] MapR Distribution for Apache Hadoop
[54] http://mapr.com/products/mapr-editions/m5-edition.html
[55] Hadoop optimized solutions from SGI
[56] Greenplum Community
[57] Greenplum HD: Enterprise-Ready Apache Hadoop
[58] Yahoo! and Benchmark Capital to Form Hortonworks to Increase Investment in Hadoop Technology and Accelerate Innovation and Adoption
[59] appengine-mapreduce - Google App Engine API for running MapReduce jobs
[60] Google I/O 2011: App Engine MapReduce on YouTube
[61] Oracle Unveils the Oracle Big Data Appliance
[62] Oracle rolls its own NoSQL and Hadoop
[63] OceanSync.com Hadoop Management. http://www.oceansync.com
[64] http://www.grandlogic.com/content/html_docs/js_features.shtml
[65] Defining Hadoop
[66] Defining Hadoop Compatibility: revisited

Bibliography

• Lam, Chuck (July 28, 2010). Hadoop in Action (1st ed.). Manning Publications. p. 325. ISBN 1-935-18219-6.
• Venner, Jason (June 22, 2009). Pro Hadoop (1st ed.). Apress. p. 440. ISBN 1-430-21942-4. http://www.apress.com/book/view/1430219424.
• White, Tom (June 16, 2009). Hadoop: The Definitive Guide (1st ed.). O’Reilly Media. p. 524. ISBN 0-596-52197-9. http://oreilly.com/catalog/9780596521974.
• Holmes, Alex (Softbound print: Fall 2012). Hadoop In Practice (1st ed.). Manning Publications. p. 425. ISBN 9781617290237. http://www.manning.com/holmes/.

External links

• Official Hadoop Homepage
• Introducing Apache Hadoop: The Modern Data Operating System, a lecture given at Stanford University by Co-Founder and CTO of Cloudera, Amr Awadallah (video archive).

Retrieved from "http://en.wikipedia.org/w/index.php?title=Apache_Hadoop&oldid=473736623"
Categories:
• Hadoop
• Free software programmed in Java
• Free system software
• Distributed file systems
• Cloud computing
• Cloud infrastructure
This page was last modified on 28 January 2012 at 20:00. Text is available under the Creative Commons Attribution-
ShareAlike License; additional terms may apply. See Terms of use for details. Wikipedia® is a registered trademark of
the Wikimedia Foundation, Inc., a non-profit organization.