
Apache_Hadoop


Shared by: roy ashbrook
posted: 12/31/2011
From Wikipedia, the free encyclopedia Apache Hadoop









Apache Hadoop

Developer(s): Apache Software Foundation
Stable release: 0.20.204 / September 11, 2011 (2011-09-11)
Preview release: 0.21.0 / August 23, 2010 (2010-08-23)
Development status: Active
Written in: Java
Operating system: Cross-platform
Type: Distributed File System
License: Apache License 2.0
Website: hadoop.apache.org

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license.[1] It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

Hadoop is a top-level Apache project, written in the Java programming language, that is built and used by a global community of contributors.[2] Yahoo! has been the largest contributor[3] to the project, and uses Hadoop extensively across its businesses.[4]

Hadoop was created by Doug Cutting,[5] who named it after his son’s toy elephant.[6] It was originally developed to support distribution for the Nutch search engine project.[7]

Architecture

Hadoop consists of the Hadoop Common, which provides access to the filesystems supported by Hadoop. The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop Community.

For effective scheduling of work, every Hadoop-compatible filesystem should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is, and, failing that, on the same rack/switch, which reduces backbone traffic. The Hadoop Distributed File System (HDFS) uses this when replicating data, to try to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure so that even if these events occur, the data may still be readable.[8]

[Figure: A multi-node Hadoop cluster]

A small Hadoop cluster will include a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes; these are normally only used in non-standard applications. Hadoop requires JRE 1.6 or higher. The standard startup and shutdown scripts require ssh to be set up between nodes in the cluster.

In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the filesystem index, and a secondary NameNode that can generate snapshots of the namenode’s memory structures, thus preventing filesystem corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate filesystem, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the filesystem-specific equivalent.

Filesystems

Hadoop Distributed File System

The HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single datanode; a cluster of datanodes forms the HDFS cluster. This is typical rather than required: a node does not need to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate with each other. The HDFS stores large files (an ideal file size is a multiple of 64 MB[9]) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. The HDFS was designed to handle very large files.[9]

The HDFS does not provide high availability, because an HDFS filesystem instance requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster.[10] The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the Primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the Primary Namenode and builds snapshots of the Primary Namenode’s directory information, which are then saved to local/remote directories. These checkpointed images can be used to restart a failed Primary Namenode without having to replay the entire journal of filesystem actions and then edit the log to create an up-to-date directory structure. Since the Namenode is the single point for storage and management of metadata, it can be a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation is a new addition which aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate Namenodes.

An advantage of using the HDFS is data awareness between the jobtracker and tasktracker. The jobtracker schedules map/reduce jobs to tasktrackers with an awareness of the data location. For example, if node A contained data (x,y,z) and node B contained data (a,b,c), the jobtracker would schedule node B to perform map/reduce tasks on (a,b,c) and node A to perform map/reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other filesystems this advantage is not always available. This can have a significant impact on job completion times, which has been demonstrated when running data-intensive jobs.[11]

Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems.

File access can be achieved through the native Java API, the Thrift API to generate a client in the language of the users’ choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, or browsed through the HDFS-UI webapp over HTTP.

Other Filesystems

By May 2011, the list of supported filesystems included:
• HDFS: Hadoop’s own rack-aware filesystem.[12] This is designed to scale to tens of petabytes of storage and runs on top of the filesystems of the underlying operating systems.
• Amazon S3 filesystem. This is targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. There is no rack-awareness in this filesystem, as it is all remote.
• CloudStore (previously Kosmos Distributed File System), which is rack-aware.
• FTP Filesystem: this stores all its data on remotely accessible FTP servers.
• Read-only HTTP and HTTPS file systems.

Hadoop can work directly with any distributed file system that can be mounted by the underlying operating system simply by using a file:// URL; however, this comes at a price: the loss of locality. To reduce network traffic, Hadoop needs to know which servers are closest to the data; this is information which Hadoop-specific filesystem bridges can provide.

Out-of-the-box, this includes Amazon S3 and the CloudStore filestore, through s3:// and kfs:// URLs directly.

A number of third party filesystem bridges have also been written, none of which are currently in Hadoop distributions. These may offer superior availability or scalability, and possibly a more general-purpose filesystem than HDFS, which is biased towards large files and only offers a subset of the expected semantics of a POSIX filesystem: no locking, and no writing to anywhere other than the tail of a file.
• In 2009 IBM discussed running Hadoop over the IBM General Parallel File System.[13] The source code was published in October 2009.[14]
• In April 2010, Parascale published the source code to run Hadoop against the Parascale filesystem.[15]








• In April 2010, Appistry released a Hadoop filesystem driver for use with its own CloudIQ Storage product.[16]
• In June 2010, HP discussed a location-aware IBRIX Fusion filesystem driver.[17]
• In May 2011, MapR Technologies, Inc. announced the availability of an alternate filesystem for Hadoop, which replaces the HDFS file system with a full random-access read/write file system, adds advanced features like snapshots and mirrors, and removes the single point of failure of the default HDFS NameNode.

JobTracker and TaskTracker: the MapReduce engine

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser.

If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the filesystem. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.

Known limitations of this approach are:
• The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability.
• If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.

Scheduling

By default Hadoop uses FIFO, with five optional scheduling priorities, to schedule jobs from a work queue.[18] In version 0.19 the job scheduler was refactored out of the JobTracker, which added the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler).[19]

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS for production jobs. The fair scheduler has three basic concepts.[20]
1. Jobs are grouped into Pools.
2. Each pool is assigned a guaranteed minimum share.
3. Excess capacity is split between jobs.

By default jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.

Capacity scheduler

The capacity scheduler was developed by Yahoo!. The capacity scheduler supports several features which are similar to the fair scheduler.[21]
• Jobs are submitted into queues.
• Queues are allocated a fraction of the total resource capacity.
• Free resources are allocated to queues beyond their total capacity.
• Within a queue a job with a high level of priority will have access to the queue’s resources.

There is no preemption once a job is running.

Other applications

The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and the Apache Hive Data Warehouse system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, that is very data-intensive, and that is able to work on pieces of the data in parallel. As of October 2009, commercial applications of Hadoop[22] included:
• Log and/or clickstream analysis of various kinds
• Marketing analytics
• Machine learning and/or sophisticated data mining
• Image processing
• Processing of XML messages
• Web crawling and/or text processing
• General archiving, including of relational/tabular data, e.g. for compliance
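
As a rough illustration of the slot-based allocation described in the MapReduce engine section (work goes to the tracker nearest to the data that has a free slot, with no regard for machine load), here is a minimal Python sketch. The `Tracker` class, the `pick_tracker` function and the three distance levels are hypothetical names chosen for this example; they are not Hadoop's actual classes or API.

```python
from dataclasses import dataclass


@dataclass
class Tracker:
    """Hypothetical stand-in for a TaskTracker: a name, a rack, and free slots."""
    name: str
    rack: str
    free_slots: int


def pick_tracker(trackers, data_node, data_rack):
    """Pick the tracker nearest to the data that still has a free slot.

    Distance 0 = same node as the data, 1 = same rack, 2 = anywhere else.
    System load is deliberately ignored, mirroring the limitation in the text.
    """
    def distance(tracker):
        if tracker.name == data_node:
            return 0
        if tracker.rack == data_rack:
            return 1
        return 2

    candidates = [t for t in trackers if t.free_slots > 0]
    if not candidates:
        return None  # no free slot anywhere; the task has to wait
    best = min(candidates, key=distance)
    best.free_slots -= 1  # every active task occupies one slot
    return best


trackers = [
    Tracker("node-a", "rack-1", 0),  # holds the data but is full
    Tracker("node-b", "rack-1", 1),  # same rack as the data
    Tracker("node-c", "rack-2", 4),  # different rack
]
first = pick_tracker(trackers, data_node="node-a", data_rack="rack-1")
second = pick_tracker(trackers, data_node="node-a", data_rack="rack-1")
print(first.name, second.name)
```

In this sketch the first task falls back to the rack-local tracker because the node holding the data has no free slot, and the second task has to leave the rack once the rack-local slot is taken.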


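
The three fair scheduler concepts listed earlier (pools, guaranteed minimum shares, excess capacity split between jobs) can be sketched in a few lines of Python. This is a simplified model under stated assumptions: slots are treated as divisible, excess capacity is split evenly per running job, and per-job demand is ignored. The function name `fair_allocation` and the pool names are hypothetical; the real scheduler is described in the design document cited as [20].

```python
def fair_allocation(total_slots, pools):
    """Split slots between pools.

    pools maps a pool name to (min_share, running_jobs).
    Each pool first receives its guaranteed minimum share; whatever is
    left over is divided evenly between all running jobs (an assumption
    of this sketch), and each pool gets the portion for its own jobs.
    """
    allocation = {name: float(min_share) for name, (min_share, _) in pools.items()}
    excess = total_slots - sum(allocation.values())
    if excess < 0:
        raise ValueError("minimum shares exceed the cluster capacity")

    total_jobs = sum(jobs for _, jobs in pools.values())
    if total_jobs == 0:
        return allocation  # nothing is running, so the excess stays idle

    for name, (_, jobs) in pools.items():
        allocation[name] += excess * jobs / total_jobs
    return allocation


# A production pool with a guaranteed share, plus the default pool
# that uncategorized jobs fall into.
shares = fair_allocation(100, {"production": (40, 1), "default": (0, 3)})
print(shares)
```

With 100 slots, the production pool keeps its 40 guaranteed slots, and the remaining 60 are shared across the four running jobs, 15 each.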


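
The default HDFS replica placement described in the Hadoop Distributed File System section (replication value 3: two copies on the same rack, one on a different rack) can be illustrated with a small Python sketch. It is deliberately deterministic and picks the first available nodes; the function name `place_replicas` and the rack and node names are hypothetical, and HDFS itself applies further rules that are not modelled here.

```python
def place_replicas(nodes_by_rack, first_rack):
    """Choose three nodes: two on first_rack and one on a different rack.

    nodes_by_rack maps a rack name to the list of datanodes on that rack.
    """
    same_rack = nodes_by_rack.get(first_rack, [])
    if len(same_rack) < 2:
        raise ValueError("need at least two nodes on the first rack")

    other_racks = [
        rack for rack in sorted(nodes_by_rack)
        if rack != first_rack and nodes_by_rack[rack]
    ]
    if not other_racks:
        raise ValueError("need at least one node on a different rack")

    off_rack_node = nodes_by_rack[other_racks[0]][0]
    return [same_rack[0], same_rack[1], off_rack_node]


cluster = {
    "rack-1": ["node-1", "node-2", "node-3"],
    "rack-2": ["node-4", "node-5"],
}
replicas = place_replicas(cluster, "rack-1")
print(replicas)
```

Because one copy always sits on a second rack, losing a whole rack to a power or switch failure still leaves a readable replica, which is the goal stated in the Architecture section.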






Prominent users

Yahoo!

On February 19, 2008, Yahoo! Inc. launched what it claimed was the world’s largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.[23]

There are multiple Hadoop clusters at Yahoo!, and no HDFS filesystems or MapReduce jobs are split across multiple datacenters. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine.

On June 10, 2009, Yahoo! made available the source code to the version of Hadoop it runs in production.[24] Yahoo! contributes back all work it does on Hadoop to the open-source community. The company’s developers also fix bugs and provide stability improvements internally, and release this patched source code so that other users may benefit from their effort.

Facebook

In 2010 Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage.[25] On July 27, 2011 they announced that the data had grown to 30 PB.[26]

Other users

Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations. Some of the notable users include:[2]
• 1&1
• A9.com
• About.com
• Amazon.com
• AOL
• Apple[27]
• Booz Allen Hamilton
• Cerner
• ChaCha
• comScore[28]
• EHarmony
• eBay
• Federal Reserve Board of Governors
• foursquare
• Fox Interactive Media
• Freebase
• Hewlett-Packard
• IBM
• ImageShack
• ISI
• Joost
• Last.fm
• LinkedIn[29]
• Microsoft[30]
• Meebo
• Mendeley
• Metaweb
• Netflix[31]
• The New York Times
• Ning
• Outbrain
• Playdom (now part of Disney Interactive Media Group)
• Powerset (now part of Microsoft)
• Rackspace
• Razorfish
• StumbleUpon[32]
• Twitter
• Mitula[33]

Hadoop on Amazon EC2/S3 services

It is possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).[34] As an example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).[35]

There is support for the S3 filesystem in Hadoop distributions, and the Hadoop team generates EC2 machine images after every release. From a pure performance perspective, Hadoop on S3/EC2 is inefficient, as the S3 filesystem is remote and delays returning from every write operation until the data is guaranteed not to be lost. This removes the locality advantages of Hadoop, which schedules work near data to save on network load.

Amazon Elastic MapReduce

Elastic MapReduce was introduced by Amazon in April 2009. Provisioning of the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 and S3 are automated by Elastic MapReduce. Apache Hive, which is built on top of Hadoop for providing data warehouse services, is also offered in Elastic MapReduce.[36]

Support for using Spot Instances was added in August 2011.[37] Elastic MapReduce is fault tolerant for slave failures,[38] and it is recommended to run only the Task Instance Group on spot instances to take advantage of the lower cost while maintaining availability.[39]

• IBM

• ImageShack

• ISI

• Joost










Hadoop at Google and IBM

IBM and Google announced an initiative in 2007 to use Hadoop to support university courses in distributed computer programming.[40]

In 2008 this collaboration, the Academic Cloud Computing Initiative (ACCI), partnered with the National Science Foundation to provide grant funding to academic researchers interested in exploring large-data applications. This resulted in the creation of the Cluster Exploratory (CLuE) program.[41]

Running Hadoop in compute farm environments

Hadoop can also be used in compute farms and high-performance computing environments. Instead of setting up a dedicated Hadoop cluster, an existing compute farm can be used if the resource manager of the cluster is aware of the Hadoop jobs, so that Hadoop jobs can be scheduled like other jobs in the cluster.

Grid Engine Integration

Integration with Sun Grid Engine was released in 2008, which made it possible to run Hadoop on Sun Grid (Sun’s on-demand utility computing service).[42] In the initial implementation of the integration, the CPU-time scheduler had no knowledge of the locality of the data. As a result, processing was not always done on the same rack as the data, even though this locality is a key feature of the Hadoop runtime. An improved integration with data locality was announced during the Sun HPC Software Workshop ’09.[43]

In 2008-2009 Sun released the Hadoop Live CD OpenSolaris project, which allows running a fully functional Hadoop cluster using a live CD.[44] This distribution includes Hadoop 0.19; as of April 2010 there has not been an updated release.

Condor Integration

The Condor High-Throughput Computing System integration was presented at the Condor Week conference in 2010.[45]

Commercially supported Hadoop-related products

There are a number of companies offering commercial implementations and/or providing support for Hadoop.[46]
• Cloudera offers CDH (Cloudera’s Distribution including Apache Hadoop) and Cloudera Enterprise.[47]
• IBM offers InfoSphere BigInsights[48] based on Hadoop in both a basic and enterprise edition.[49]
• In March 2011, Platform Computing announced support for the Hadoop MapReduce API in its Symphony software.[50]
• In May 2011, MapR Technologies, Inc. announced the availability of their distributed filesystem and MapReduce engine, the MapR Distribution for Apache Hadoop.[51] The MapR product includes most Hadoop eco-system components and adds capabilities such as snapshots, mirrors, NFS access and full read-write file semantics.[52]
• Silicon Graphics International offers Hadoop optimized solutions based on the SGI Rackable and CloudRack server lines with implementation services.[53]
• EMC released EMC Greenplum Community Edition and EMC Greenplum HD Enterprise Edition in May 2011. The community edition, with optional for-fee technical support, consists of Hadoop, HDFS, HBase, Hive, and the ZooKeeper configuration service. The enterprise edition is an offering based on the MapR product, and offers proprietary features such as snapshots and wide area replication.[54][55]
• In June 2011, Yahoo! and Benchmark Capital formed Hortonworks Inc., whose focus is on making Hadoop more robust and easier to install, manage and use for enterprise users.[56]
• Google added AppEngine-MapReduce to support running Hadoop 0.20 programs on Google App Engine.[57][58]
• In October 2011, Oracle announced the Big Data Appliance, which integrates Hadoop, Oracle Enterprise Linux, the R programming language, and a NoSQL database with the Exadata hardware.[59][60]
• Dovestech has released Ocean Sync Hadoop Management Software Freeware Edition. The software allows users to control and monitor all aspects of a Hadoop cluster.[61]
• Grand Logic’s JobServer[62] product allows developers and admins to deploy, manage and monitor their Hadoop infrastructure, with support for Hadoop job processing and HDFS file/content management.

ASF’s view on the use of "Hadoop" in product names

The Apache Software Foundation has stated that only software officially released by the Apache Hadoop Project can be called Apache Hadoop or Distributions of Apache Hadoop.[63] The naming of products and derivative works from other vendors and the term "compatible" are somewhat controversial within the Hadoop developer community.[64]












Papers

Some papers influenced the birth and growth of Hadoop and big data processing. Here is a partial list:
• 2004 Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat from Google Lab. This paper inspired Doug Cutting to develop an open-source implementation of the Map-Reduce framework. He named it Hadoop, after his son’s toy elephant.
• 2005 From Databases to Dataspaces: A New Abstraction for Information Management. The authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system’s understanding of the data.
• 2006 Bigtable: A Distributed Storage System for Structured Data from Google Lab.
• 2008 H-store: a high-performance, distributed main memory transaction processing system
• 2009 MAD Skills: New Analysis Practices for Big Data
• 2011 Apache Hadoop Goes Realtime at Facebook

See also

• Nutch - an effort to build an open source search engine based on Lucene and Hadoop. Also created by Doug Cutting.
• Datameer Analytics Solution (DAS) - data source integration, storage, analytics engine and visualization
• HBase - BigTable-model database.
• Hypertable - HBase alternative
• MapReduce - Hadoop’s fundamental data filtering algorithm
• Apache Mahout - Machine Learning algorithms implemented on Hadoop
• Apache Cassandra - A column-oriented database that supports access from Hadoop
• HPCC - LexisNexis Risk Solutions High Performance Computing Cluster
• Sector/Sphere - Open source distributed storage and processing
• Cloud computing
• Big data
• Data Intensive Computing

References

[1] "Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework." Hadoop Overview
[2] Applications and organizations using Hadoop
[3] Hadoop Credits Page
[4] Yahoo! Launches World’s Largest Hadoop Production Application
[5] Hadoop creator goes to Cloudera
[6] Ashlee Vance (2009-03-17). "Hadoop, a Free Software Program, Finds Uses Beyond Search". New York Times. http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html. Retrieved 2010-01-20.
[7] "Hadoop contains the distributed computing platform that was formerly a part of Nutch. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of MapReduce." About Hadoop
[8] http://hadoop.apache.org/common/docs/r0.20.2/hdfs_user_guide.html#Rack+Awareness
[9] The Hadoop Distributed File System: Architecture and Design
[10] Improve Namenode startup performance. "Default scenario for 20 million files with the max Java heap size set to 14 GB: 40 minutes. Tuning various Java options such as young size, parallel garbage collection, initial Java heap size: 14 minutes"
[11] Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters. April 2010
[12] HDFS Users Guide - Rack Awareness
[13] "Cloud analytics: Do we really need to reinvent the storage stack?". IBM. 2009-06. http://www.usenix.org/events/hotcloud09/tech/full_papers/ananthanarayanan.pdf.
[14] "HADOOP-6330: Integrating IBM General Parallel File System implementation of Hadoop Filesystem interface". IBM. 2009-10-23. https://issues.apache.org/jira/browse/HADOOP-6330.
[15] "HADOOP-6704: add support for Parascale filesystem". Parascale. 2010-04-14. https://issues.apache.org/jira/browse/HADOOP-6330.
[16] "Replace HDFS with CloudIQ Storage". Appistry, Inc. 2010-07-06. http://www.appistry.com/community/wiki/display/cloudiq43/Replace+HDFS+with+CloudIQ+Storage.
[17] "High Availability Hadoop". HP. 2010-06-09. http://www.slideshare.net/steve_l/high-availability-hadoop.








[18] job
[19] https://issues.apache.org/jira/browse/HADOOP-3412 #HADOOP-3412 Refactor the scheduler out of the JobTracker - ASF JIRA
[20] Hadoop Fair Scheduler Design Document
[21] Capacity Scheduler Guide
[22] "How 30+ enterprises are using Hadoop", in DBMS2
[23] Yahoo! Launches World’s Largest Hadoop Production Application (Hadoop and Distributed Computing at Yahoo!)
[24] Hadoop and Distributed Computing at Yahoo!
[25] [4]
[26] [5]
[27] "Apple Embraces Hadoop". http://www.theregister.co.uk/2010/12/01/apple_embraces_hadoop/. Retrieved 2011-04-14.
[28] "Using Hadoop to tackle Big Data at comScore". http://www.cloudera.com/videos/hw10_video_using_hadoop_to_tackle_big_data_at_comscore.
[29] "Building a terabyte-scale data cycle at LinkedIn with Hadoop and Project Voldemort". http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/. Retrieved 2011-04-14.
[30] "Microsoft Expands Data Platform With SQL Server 2012, New Investments for Managing Any Data, Any Size, Anywhere". http://www.microsoft.com/Presspass/press/2011/oct11/10-12PASS1PR.mspx. Retrieved 2011-10-13.
[31] "Use Case Study of Hive/Hadoop". http://www.slideshare.net/evamtse/hive-user-group-presentation-from-netflix-3182010-3483386. Retrieved 2011-04-14.
[32] "HBase at StumbleUpon". http://www.stumbleupon.com/devblog/hbase_at_stumbleupon/. Retrieved 2010-06-26.
[33] "Mitula Search/Hadoop". http://www.mitula.co.uk. Retrieved 2011-09-06.
[34] http://aws.typepad.com/aws/2008/02/taking-massive.html Running Hadoop on Amazon EC2/S3
[35] Gottfrid, Derek (November 1, 2007). "Self-service, Prorated Super Computing Fun!". The New York Times. http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?scp=1&sq=self%20service%20prorated&st=cse. Retrieved May 4, 2010.
[36] Amazon Elastic MapReduce Developer Guide
[37] Amazon Elastic MapReduce Now Supports Spot Instances
[38] Amazon Elastic MapReduce FAQs
[39] Using Spot Instances with EMR on YouTube
[40] Google Press Center: Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges
[41] NSF, Google, IBM form CLuE
[42] "Creating Hadoop pe under SGE". Sun Microsystems. 2008-01-16. http://blogs.sun.com/ravee/entry/creating_hadoop_pe_under_sge.
[43] "HDFS-Aware Scheduling With Grid Engine". Sun Microsystems. 2009-09-10. http://wikis.sun.com/display/SunHPC09/Sun+HPC+Software+Workshop+’09+Wiki.
[44] "OpenSolaris Project: Hadoop Live CD". Sun Microsystems. 2008-08-29. http://opensolaris.org/os/project/livehadoop/.
[45] "Condor integrated with Hadoop’s Map Reduce". University of Wisconsin–Madison. 2010-04-15. http://www.cs.wisc.edu/condor/CondorWeek2010/condor-presentations/thain-condor-hadoop.pdf.
[46] Why the Pace of Hadoop Innovation Has to Pick Up
[47] Cloudera’s Distribution including Apache Hadoop
[48] IBM InfoSphere BigInsights
[49] IBM InfoSphere BigInsights Enterprise Edition analytics platform enables new class of solutions for gaining rapid insight through large-scale analysis of diverse data
[50] Platform Computing Announces Support for MapReduce
[51] MapR Distribution for Apache Hadoop
[52] http://mapr.com/products/mapr-editions/m5-edition.html
[53] Hadoop optimized solutions from SGI
[54] Greenplum Community
[55] Greenplum HD: Enterprise-Ready Apache Hadoop
[56] Yahoo! and Benchmark Capital to Form Hortonworks to Increase Investment in Hadoop Technology and Accelerate Innovation and Adoption
[57] appengine-mapreduce - Google App Engine API for running MapReduce jobs
[58] Google I/O 2011: App Engine MapReduce on YouTube
[59] Oracle Unveils the Oracle Big Data Appliance
[60] Oracle rolls its own NoSQL and Hadoop
[61] http://www.oceansync.com OceanSync.com Hadoop Management
[62] http://www.grandlogic.com/content/html_docs/js_features.shtml
[63] Defining Hadoop
[64] Defining Hadoop Compatibility: revisited

Bibliography

• Lam, Chuck (July 28, 2010). Hadoop in Action (1st ed.). Manning Publications. p. 325. ISBN 1-935-18219-6.
• Venner, Jason (June 22, 2009). Pro Hadoop (1st ed.). Apress. p. 440. ISBN 1-430-21942-4. http://www.apress.com/book/view/1430219424.
• White, Tom (June 16, 2009). Hadoop: The Definitive Guide (1st ed.). O’Reilly Media. p. 524. ISBN 0-596-52197-9. http://oreilly.com/catalog/9780596521974.

External links

• Official Hadoop Homepage
• Introducing Apache Hadoop: The Modern Data Operating System - lecture given at Stanford University by Co-Founder and CTO of Cloudera, Amr Awadallah (video archive).









Retrieved from "http://en.wikipedia.org/w/index.php?title=Apache_Hadoop&oldid=468610972"



Categories:

• Hadoop

• Free software programmed in Java

• Free system software

• Distributed file systems

• Cloud computing

• Cloud infrastructure





This page was last modified on 30 December 2011 at 15:35. Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. See Terms of use for details. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.


