From Wikipedia, the free encyclopedia Apache Hadoop









Apache Hadoop

Developer(s): Apache Software Foundation
Stable release: 1.0.0 / December 27, 2011 (2011-12-27)
Preview release: 0.22.0 / December 10, 2011 (2011-12-10)
Development status: Active
Written in: Java
Operating system: Cross-platform
Type: Distributed File System
License: Apache License 2.0
Website: hadoop.apache.org

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license.[1] It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

Hadoop is a top-level Apache project, written in the Java programming language, that is being built and used by a global community of contributors.[2] Yahoo! has been the largest contributor[3] to the project, and uses Hadoop extensively across its businesses.[4]

Hadoop was created by Doug Cutting,[5] who named it after his son’s toy elephant.[6] It was originally developed to support distribution for the Nutch search engine project.[7]

Architecture

Hadoop consists of the Hadoop Common, which provides access to the filesystems supported by Hadoop. The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop Community.

For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is, and, failing that, on the same rack/switch, thereby reducing backbone traffic. The Hadoop Distributed File System (HDFS) uses this when replicating data, to try to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure so that even if these events occur, the data may still be readable.[8]

[Figure: A multi-node Hadoop cluster]

A small Hadoop cluster will include a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes; these are normally only used in non-standard applications. Hadoop requires JRE 1.6 or higher. The standard startup and shutdown scripts require ssh to be set up between nodes in the cluster.

In a larger cluster, HDFS is managed through a dedicated NameNode server that hosts the filesystem index, and a secondary NameNode that can generate snapshots of the NameNode’s memory structures, thus preventing filesystem corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate filesystem, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the filesystem-specific equivalent.
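The location-awareness preference described in this section (run work on the node holding the data, failing that on the same rack, and only then elsewhere) can be illustrated with a small sketch. This is not Hadoop code: the function names `pick_worker` and `readable_after_rack_failure`, the node names, and the `RACK_OF` topology map are all hypothetical, chosen only to show the idea under the assumption that each node belongs to exactly one rack.

```python
from typing import Dict, List, Optional, Tuple


def pick_worker(
    replica_nodes: List[str],
    free_nodes: List[str],
    rack_of: Dict[str, str],
) -> Tuple[Optional[str], str]:
    """Choose a free node for a task, preferring data locality.

    Hypothetical helper: prefers a node that stores the data, then a node
    on the same rack as any replica, then any free node at all.
    """
    replicas = set(replica_nodes)
    # 1. Node-local: a free node that already holds a copy of the data.
    for node in free_nodes:
        if node in replicas:
            return node, "node-local"
    # 2. Rack-local: a free node behind the same switch as some replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Off-rack: any free node; data must cross the backbone network.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"


def readable_after_rack_failure(
    replica_nodes: List[str],
    rack_of: Dict[str, str],
    failed_rack: str,
) -> bool:
    """True if at least one replica sits outside the failed rack."""
    return any(rack_of[n] != failed_rack for n in replica_nodes)


# Assumed toy topology: two racks with two nodes each.
RACK_OF = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}

if __name__ == "__main__":
    print(pick_worker(["n1"], ["n3", "n2"], RACK_OF))
    print(readable_after_rack_failure(["n1", "n3"], RACK_OF, "rack1"))
```

In this toy setup, a block stored only on `n1` is scheduled on `n2` (same rack) when `n1` is busy, and a block replicated on `n1` and `n3` remains readable if `rack1` loses power, which is the reason for keeping copies on different racks.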












Filesystems

Hadoop Distributed File System

HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single datanode; a cluster of datanodes forms the HDFS cluster. This arrangement is typical rather than universal, because a node is not required to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate with each other. HDFS stores large files (an ideal file size is a multiple of 64 MB[9]) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. HDFS was designed to handle very large files.[9]

HDFS does not provide high availability, because an HDFS filesystem instance requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster.[10] The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the Primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the Primary Namenode and builds snapshots of the Primary Namenode’s directory information, which are then saved to local/remote directories. These checkpointed images can be used to restart a failed Primary Namenode without having to replay the entire journal of filesystem actions and then edit the log to create an up-to-date directory structure. Since the Namenode is the single point for storage and management of metadata, it can be a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation is a new addition which aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate Namenodes.

An advantage of using HDFS is data awareness between the jobtracker and tasktracker. The jobtracker schedules map/reduce jobs to tasktrackers with an awareness of the data location. For example, if node A contained data (x,y,z) and node B contained data (a,b,c), the jobtracker would schedule node B to perform map/reduce tasks on (a,b,c) and node A to perform map/reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other filesystems this advantage is not always available. This can have a significant impact on job completion times, which has been demonstrated when running data-intensive jobs.[11]

Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems.

File access can be achieved through the native Java API, through the Thrift API (to generate a client in the language of the users’ choosing: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, or OCaml), through the command-line interface, or by browsing through the HDFS-UI webapp over HTTP.

Other Filesystems

By May 2011, the list of supported filesystems included:

• HDFS: Hadoop’s own rack-aware filesystem.[12] This is designed to scale to tens of petabytes of storage and runs on top of the filesystems of the underlying operating systems.
• Amazon S3 filesystem. This is targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. There is no rack-awareness in this filesystem, as it is all remote.
• CloudStore (previously Kosmos Distributed File System), which is rack-aware.
• FTP Filesystem: this stores all its data on remotely accessible FTP servers.
• Read-only HTTP and HTTPS file systems.

Hadoop can work directly with any distributed file system that can be mounted by the underlying operating system simply by using a file:// URL; however, this comes at a price: the loss of locality. To reduce network traffic, Hadoop needs to know which servers are closest to the data; this is information which Hadoop-specific filesystem bridges can provide.

Out-of-the-box, this includes Amazon S3 and the CloudStore filestore, through s3:// and kfs:// URLs directly.

A number of third-party filesystem bridges have also been written, none of which are currently in Hadoop distributions. These may offer superior availability or scalability, and possibly a more general-purpose filesystem than HDFS, which is biased towards large files and only offers a subset of the expected semantics of a POSIX filesystem: no locking, and no writing to anywhere other than the tail of a file.








• In 2009 IBM discussed running Hadoop over the IBM General Parallel File System.[13] The source code was published in October 2009.[14]
• In April 2010, Parascale published the source code to run Hadoop against the Parascale filesystem.[15]
• In April 2010, Appistry released a Hadoop filesystem driver for use with its own CloudIQ Storage product.[16]
• In June 2010, HP discussed a location-aware IBRIX Fusion filesystem driver.[17]
• In May 2011, MapR Technologies, Inc. announced the availability of an alternate filesystem for Hadoop, which replaced the HDFS file system with a full random-access read/write file system, added advanced features like snapshots and mirrors, and removed the single point of failure of the default HDFS NameNode.

JobTracker and TaskTracker: the MapReduce engine

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.

If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.

Known limitations of this approach are:

• The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability.
• If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.

Scheduling

By default Hadoop uses FIFO scheduling, with five optional scheduling priorities, to schedule jobs from a work queue.[18] In version 0.19 the job scheduler was refactored out of the JobTracker, which added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).[19]

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS for production jobs. The fair scheduler has three basic concepts.[20]

1. Jobs are grouped into Pools.
2. Each pool is assigned a guaranteed minimum share.
3. Excess capacity is split between jobs.

By default, jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.

Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features which are similar to the fair scheduler.[21]

• Jobs are submitted into queues.
• Queues are allocated a fraction of the total resource capacity.
• Free resources are allocated to queues beyond their total capacity.
• Within a queue, a job with a high level of priority will have access to the queue’s resources.

There is no preemption once a job is running.

Other applications

The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and the Apache Hive Data Warehouse system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, that is very data-intensive, and that is able to work on pieces of the data in parallel. As of October 2009, commercial applications of Hadoop[22] included:

• Log and/or clickstream analysis of various kinds
• Marketing analytics
• Machine learning and/or sophisticated data mining
• Image processing










• Processing of XML messages
• Web crawling and/or text processing
• General archiving, including of relational/tabular data, e.g. for compliance

Prominent users

Yahoo!

On February 19, 2008, Yahoo! Inc. launched what it claimed was the world’s largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.[23]

There are multiple Hadoop clusters at Yahoo!, and no HDFS filesystems or MapReduce jobs are split across multiple datacenters. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine.

On June 10, 2009, Yahoo! made available the source code to the version of Hadoop it runs in production.[24] Yahoo! contributes back all work it does on Hadoop to the open-source community. The company’s developers also fix bugs and provide stability improvements internally, and release this patched source code so that other users may benefit from their effort.

Facebook

In 2010 Facebook claimed that it had the largest Hadoop cluster in the world, with 21 PB of storage.[25] On July 27, 2011 the company announced that the data had grown to 30 PB.[26]

Other users

Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations. Some of the notable users include:[2]

• 1&1
• A9.com
• About.com
• Amazon.com
• American Airlines
• AOL
• Apple[27]
• Booz Allen Hamilton
• Cerner
• ChaCha
• comScore[28]
• EHarmony
• eBay
• Federal Reserve Board of Governors
• foursquare
• Fox Interactive Media
• Freebase
• Hewlett-Packard
• IBM
• InMobi[29]
• ImageShack
• ISI
• Joost
• Last.fm
• LinkedIn[30]
• Microsoft[31]
• Meebo
• Mendeley
• Metaweb
• Netflix[32]
• The New York Times
• Ning
• Outbrain
• Playdom (now part of Disney Interactive Media Group)
• Powerset (now part of Microsoft)
• Rackspace
• Razorfish
• StumbleUpon[33]
• Twitter
• Mitula[34]

Hadoop on Amazon EC2/S3 services

It is possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).[35] As an example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).[36]

There is support for the S3 filesystem in Hadoop distributions, and the Hadoop team generates EC2 machine images after every release. From a pure performance perspective, Hadoop on S3/EC2 is inefficient, as the S3 filesystem is remote and delays returning from every write operation until the data is guaranteed not to be lost. This removes the locality advantages of Hadoop, which schedules work near data to save on network load.

Amazon Elastic MapReduce

Elastic MapReduce was introduced by Amazon in April 2009. Provisioning of the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3 are automated by Elastic MapReduce. Apache Hive, which is built on top of Hadoop for providing data warehouse services, is also offered in Elastic MapReduce.[37]

Support for using Spot Instances was added in August 2011.[38] Elastic MapReduce is fault tolerant for slave failures,[39] and it is recommended to run only the Task Instance Group on spot instances to take advantage of the lower cost while maintaining availability.[40]

Hadoop at Google and IBM

IBM and Google announced an initiative in 2007 to use Hadoop to support university courses in distributed computer programming.[41]

In 2008 this collaboration, the Academic Cloud Computing Initiative (ACCI), partnered with the National Science Foundation to provide grant funding to academic researchers interested in exploring large-data applications. This resulted in the creation of the Cluster Exploratory (CLuE) program.[42]

Running Hadoop in compute farm environments

Hadoop can also be used in compute farms and high-performance computing environments. Instead of setting up a dedicated Hadoop cluster, an existing compute farm can be used if the resource manager of the cluster is aware of the Hadoop jobs, so that Hadoop jobs can be scheduled like other jobs in the cluster.

Grid Engine Integration

Integration with Sun Grid Engine was released in 2008, and running Hadoop on Sun Grid (Sun’s on-demand utility computing service) was possible.[43] In the initial implementation of the integration, the CPU-time scheduler has no knowledge of the locality of the data. Unfortunately, this means that the processing is not always done on the same rack as the data, even though such locality is a key feature of the Hadoop runtime. An improved integration with data locality was announced during the Sun HPC Software Workshop ’09.[44]

In 2008-2009 Sun released the Hadoop Live CD OpenSolaris project, which allows running a fully functional Hadoop cluster using a live CD.[45] This distribution includes Hadoop 0.19; as of April 2010 there has not been an updated release.

Condor Integration

The Condor High-Throughput Computing System integration was presented at the Condor Week conference in 2010.[46]

Commercially supported Hadoop-related products

There are a number of companies offering commercial implementations and/or providing support for Hadoop.[47]

• Cloudera offers CDH (Cloudera’s Distribution including Apache Hadoop) and Cloudera Enterprise.[48]
• IBM offers InfoSphere BigInsights[49] based on Hadoop in both a basic and enterprise edition.[50]
• Zettaset offers a new version of its Big Data Management Platform[51] based on Hadoop. Zettaset’s Big Data Platform delivers high availability via NameNode failover, a streamlined UI, Network Time Protocol, and built-in security via Kerberos authentication.
• In March 2011, Platform Computing announced support for the Hadoop MapReduce API in its Symphony software.[52]
• In May 2011, MapR Technologies, Inc. announced the availability of their distributed filesystem and MapReduce engine, the MapR Distribution for Apache Hadoop.[53] The MapR product includes most Hadoop eco-system components and adds capabilities such as snapshots, mirrors, NFS access and full read-write file semantics.[54]
• Silicon Graphics International offers Hadoop optimized solutions based on the SGI Rackable and CloudRack server lines with implementation services.[55]
• EMC released EMC Greenplum Community Edition and EMC Greenplum HD Enterprise Edition in May 2011. The community edition, with optional for-fee technical support, consists of Hadoop, HDFS, HBase, Hive, and the ZooKeeper configuration service. The enterprise edition is an offering based on the MapR product, and offers proprietary features such as snapshots and wide area replication.[56][57]
• In June 2011, Yahoo! and Benchmark Capital formed Hortonworks Inc., whose focus is on making Hadoop more robust and easier to install, manage and use for enterprise users.[58]
• Google added AppEngine-MapReduce to support running Hadoop 0.20 programs on Google App Engine.[59][60]
• In October 2011, Oracle announced the Big Data Appliance, which integrates Hadoop, Oracle Enterprise Linux, the R programming language, and a NoSQL database with the Exadata hardware.[61][62]
• Dovestech has released Ocean Sync Hadoop Management Software Freeware Edition. The software allows users to control and monitor all aspects of a Hadoop cluster.[63]
• Grand Logic’s JobServer[64] product allows developers and admins to deploy, manage and monitor their Hadoop infrastructure, with support for Hadoop job processing and HDFS file/content management.










ASF’s view on the use of "Hadoop" in product names

The Apache Software Foundation has stated that only software officially released by the Apache Hadoop Project can be called Apache Hadoop or Distributions of Apache Hadoop.[65] The naming of products and derivative works from other vendors and the term "compatible" are somewhat controversial within the Hadoop developer community.[66]

Papers

Some papers influenced the birth and growth of Hadoop and big data processing. Here is a partial list:

• 2004 MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat from Google Lab. This paper inspired Doug Cutting to develop an open-source implementation of the MapReduce framework. He named it Hadoop, after his son’s toy elephant.
• 2005 From Databases to Dataspaces: A New Abstraction for Information Management. The authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system’s understanding of the data.
• 2006 Bigtable: A Distributed Storage System for Structured Data from Google Lab.
• 2008 H-store: a high-performance, distributed main memory transaction processing system
• 2009 MAD Skills: New Analysis Practices for Big Data
• 2011 Apache Hadoop Goes Realtime at Facebook

See also

• Nutch - an effort to build an open source search engine based on Lucene and Hadoop. Also created by Doug Cutting.
• Datameer Analytics Solution (DAS) - data source integration, storage, analytics engine and visualization
• HBase - BigTable-model database.
• Hypertable - HBase alternative
• MapReduce - Hadoop’s fundamental data filtering algorithm
• Apache Mahout - Machine Learning algorithms implemented on Hadoop
• Apache Cassandra - A column-oriented database that supports access from Hadoop
• HPCC - LexisNexis Risk Solutions High Performance Computing Cluster
• Sector/Sphere - Open source distributed storage and processing
• Cloud computing
• Big data
• Data Intensive Computing

References

[1] "Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework." Hadoop Overview
[2] Applications and organizations using Hadoop
[3] Hadoop Credits Page
[4] Yahoo! Launches World’s Largest Hadoop Production Application
[5] Hadoop creator goes to Cloudera
[6] Ashlee Vance (2009-03-17). "Hadoop, a Free Software Program, Finds Uses Beyond Search". New York Times. http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html. Retrieved 2010-01-20.
[7] "Hadoop contains the distributed computing platform that was formerly a part of Nutch. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of MapReduce." About Hadoop
[8] http://hadoop.apache.org/common/docs/r0.20.2/hdfs_user_guide.html#Rack+Awareness
[9] The Hadoop Distributed File System: Architecture and Design
[10] Improve Namenode startup performance. "Default scenario for 20 million files with the max Java heap size set to 14 GB: 40 minutes. Tuning various Java options such as young size, parallel garbage collection, initial Java heap size: 14 minutes"
[11] [1] Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters. April 2010
[12] HDFS Users Guide - Rack Awareness
[13] "Cloud analytics: Do we really need to reinvent the storage stack?". IBM. 2009-06. http://www.usenix.org/events/hotcloud09/tech/full_papers/ananthanarayanan.pdf.
[14] "HADOOP-6330: Integrating IBM General Parallel File System implementation of Hadoop Filesystem interface". IBM. 2009-10-23. https://issues.apache.org/jira/browse/HADOOP-6330.








[15] "HADOOP-6704: add support for Parascale filesystem". Parascale. 2010-04-14. https://issues.apache.org/jira/browse/HADOOP-6704.
[16] "Replace HDFS with CloudIQ Storage". Appistry, Inc. 2010-07-06. http://www.appistry.com/community/wiki/display/cloudiq43/Replace+HDFS+with+CloudIQ+Storage.
[17] "High Availability Hadoop". HP. 2010-06-09. http://www.slideshare.net/steve_l/high-availability-hadoop.
[18] job
[19] HADOOP-3412: Refactor the scheduler out of the JobTracker - ASF JIRA. https://issues.apache.org/jira/browse/HADOOP-3412
[20] [2] Hadoop Fair Scheduler Design Document
[21] [3] Capacity Scheduler Guide
[22] "How 30+ enterprises are using Hadoop", in DBMS2
[23] Yahoo! Launches World’s Largest Hadoop Production Application (Hadoop and Distributed Computing at Yahoo!)
[24] Hadoop and Distributed Computing at Yahoo!
[25] [4]
[26] [5]
[27] "Apple Embraces Hadoop". http://www.theregister.co.uk/2010/12/01/apple_embraces_hadoop/. Retrieved 2011-04-14.
[28] "Using Hadoop to tackle Big Data at comScore". http://www.cloudera.com/videos/hw10_video_using_hadoop_to_tackle_big_data_at_comscore.
[29] "InMobi Ranked as a Top 10 Contributor to Apache Hadoop". http://www.inmobi.com/inmobiblog/2011/10/07/inmobi-ranked-as-a-top-10-contributor-to-apache-hadoop/. Retrieved 2011-10-07.
[30] "Building a terabyte-scale data cycle at LinkedIn with Hadoop and Project Voldemort". http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/. Retrieved 2011-04-14.
[31] "Microsoft Expands Data Platform With SQL Server 2012, New Investments for Managing Any Data, Any Size, Anywhere". http://www.microsoft.com/Presspass/press/2011/oct11/10-12PASS1PR.mspx. Retrieved 2011-10-13.
[32] "Use Case Study of Hive/Hadoop". http://www.slideshare.net/evamtse/hive-user-group-presentation-from-netflix-3182010-3483386. Retrieved 2011-04-14.
[33] "HBase at StumbleUpon". http://www.stumbleupon.com/devblog/hbase_at_stumbleupon/. Retrieved 2010-06-26.
[34] "Mitula Search/Hadoop". http://www.mitula.co.uk. Retrieved 2011-09-06.
[35] http://aws.typepad.com/aws/2008/02/taking-massive.html Running Hadoop on Amazon EC2/S3
[36] Gottfrid, Derek (November 1, 2007). "Self-service, Prorated Super Computing Fun!". The New York Times. http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?scp=1&sq=self%20service%20prorated&st=cse. Retrieved May 4, 2010.
[37] Amazon Elastic MapReduce Developer Guide
[38] Amazon Elastic MapReduce Now Supports Spot Instances
[39] Amazon Elastic MapReduce FAQs
[40] Using Spot Instances with EMR on YouTube
[41] Google Press Center: Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges
[42] NSF, Google, IBM form CLuE
[43] "Creating Hadoop pe under SGE". Sun Microsystems. 2008-01-16. http://blogs.sun.com/ravee/entry/creating_hadoop_pe_under_sge.
[44] "HDFS-Aware Scheduling With Grid Engine". Sun Microsystems. 2009-09-10. http://wikis.sun.com/display/SunHPC09/Sun+HPC+Software+Workshop+’09+Wiki.
[45] "OpenSolaris Project: Hadoop Live CD". Sun Microsystems. 2008-08-29. http://opensolaris.org/os/project/livehadoop/.
[46] "Condor integrated with Hadoop’s Map Reduce". University of Wisconsin–Madison. 2010-04-15. http://www.cs.wisc.edu/condor/CondorWeek2010/condor-presentations/thain-condor-hadoop.pdf.
[47] Why the Pace of Hadoop Innovation Has to Pick Up
[48] Cloudera’s Distribution including Apache Hadoop
[49] IBM InfoSphere BigInsights
[50] IBM InfoSphere BigInsights Enterprise Edition analytics platform enables new class of solutions for gaining rapid insight through large-scale analysis of diverse data
[51] [6]
[52] Platform Computing Announces Support for MapReduce
[53] MapR Distribution for Apache Hadoop
[54] http://mapr.com/products/mapr-editions/m5-edition.html
[55] Hadoop optimized solutions from SGI
[56] Greenplum Community
[57] Greenplum HD: Enterprise-Ready Apache Hadoop
[58] Yahoo! and Benchmark Capital to Form Hortonworks to Increase Investment in Hadoop Technology and Accelerate Innovation and Adoption
[59] appengine-mapreduce - Google App Engine API for running MapReduce jobs
[60] Google I/O 2011: App Engine MapReduce on YouTube








[61] Oracle Unveils the Oracle Big Data Appliance
[62] Oracle rolls its own NoSQL and Hadoop
[63] http://www.oceansync.com OceanSync.com Hadoop Management
[64] http://www.grandlogic.com/content/html_docs/js_features.shtml
[65] Defining Hadoop
[66] Defining Hadoop Compatibility: revisited

Bibliography

• Lam, Chuck (July 28, 2010). Hadoop in Action (1st ed.). Manning Publications. p. 325. ISBN 1-935-18219-6.
• Venner, Jason (June 22, 2009). Pro Hadoop (1st ed.). Apress. p. 440. ISBN 1-430-21942-4. http://www.apress.com/book/view/1430219424.
• White, Tom (June 16, 2009). Hadoop: The Definitive Guide (1st ed.). O’Reilly Media. p. 524. ISBN 0-596-52197-9. http://oreilly.com/catalog/9780596521974.
• Holmes, Alex (Softbound print: Fall 2012). Hadoop In Practice (1st ed.). Manning Publications. p. 425. ISBN 9781617290237. http://www.manning.com/holmes/.

External links

• Official Hadoop Homepage
• Introducing Apache Hadoop: The Modern Data Operating System, a lecture given at Stanford University by Co-Founder and CTO of Cloudera, Amr Awadallah (video archive).









Retrieved from "http://en.wikipedia.org/w/index.php?title=Apache_Hadoop&oldid=473736623"



Categories:

• Hadoop

• Free software programmed in Java

• Free system software

• Distributed file systems

• Cloud computing

• Cloud infrastructure





This page was last modified on 28 January 2012 at 20:00. Text is available under the Creative Commons Attribution-

ShareAlike License; additional terms may apply. See Terms of use for details. Wikipedia® is a registered trademark of

the Wikimedia Foundation, Inc., a non-profit organization.


