GRid_Web by liwenting


									Cluster / Grid with Web and
    Semantic Services
    Dr G Sudha Sadasivam
          Professor, CSE
    PSG College of Technology
        Coimbatore- 641 004
•   Web Services
•   SOA
•   Semantics
•   Grid Architecture
•   3rd Generation Grid Architecture
•   Semantic Grid
•   Cluster Architecture- Hadoop
•   Amazon Web Services
•   Work at Grid and Cloud Computing Lab -
                                1. Web Service
A service is a set of actions that form a coherent whole from the point of view of
service providers and service requesters (for example, arranging a birthday party).

Web services provide a standard means of interoperating between different
software applications, running on a variety of platforms and/or frameworks in a
transparent and loosely coupled manner

A Web service is a software system that
   • supports interoperable machine-to-machine interaction
   • has an interface described in a machine-processable format (WSDL)
   • communicates using standard SOAP messages, typically over HTTP
   • uses an XML serialization in conjunction with other Web-related standards
   • can be published in a UDDI registry
   • is identified by a URI
Web service is an entity that can be:
                • Described (using WSDL)
                • Published
                • Discovered
                • Invoked by a client
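
To make the "invoked by a client" step concrete, here is a minimal sketch of a SOAP request being built by a requester and read back by a provider, using only the Python standard library. The envelope namespace is the SOAP 1.1 one; the operation name ConvertCurrency, its parameters and the example.org namespace are assumptions made for illustration, not part of any real service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.org/currency"                 # hypothetical service namespace


def build_request(operation, params):
    """Return a SOAP envelope (as bytes) that calls `operation` with `params`."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def read_request(message):
    """Parse the envelope back into (operation, params), as a provider would."""
    envelope = ET.fromstring(message)
    call = envelope.find(f"{{{SOAP_NS}}}Body")[0]
    operation = call.tag.split("}", 1)[1]
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


message = build_request("ConvertCurrency", {"amount": 100, "source": "USD", "target": "INR"})
print(read_request(message))
```

In a real deployment the message format would come from the service's WSDL and the bytes would be posted over HTTP; both are left out of this sketch.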
W3C technology standardization process
Web Service Interactions
• A Web service is an abstract notion that is implemented by a
  concrete agent.
• Elements
   – The provider entity is the person or organization that provides an
     appropriate agent to implement a particular service.
   – A requester entity is a person or organization that wishes to make use of
     a provider entity's Web service.
   – Registry – to register the services

• Web Service Discovery:
   – Before message exchange, the requester entity and the provider entity
     must first agree on both the semantics and the mechanics of the
     message exchange
   – The service description (WSD) (message formats, datatypes, transport
     protocols, and transport serialization formats) represents a contract
     governing the mechanics of interacting with a particular service.
   – The semantics represents a contract governing the meaning
     (consequence and purpose) of that interaction.
                            2. SOA
• Aim: Alignment of Business needs with IT
• Architectural style of building enterprise solutions based on
• SOA is a blueprint that governs creation, deployment,
  execution and management of reusable business services.
• WSA is an instance of SOA (Architecture – independent of
• Services provide independent, loosely coupled, transparent,
  composable invocation of tasks in a standard way.

• SOA separates functions into distinct units (services),
  which can be distributed over a network and can be
  combined and reused to create business applications.
  These services communicate with each other by passing
  data from one service to another, or by coordinating an
  activity between two or more services.

• Guiding principles – Reusability, Open standards
Alignment of Business needs with IT

Requirements mapping to architectural principles
Guiding Principles                                   Realization
Ability to model and execute the business processes  Business process modeling and BPEL4WS
Common shared services across various applications   Service Oriented Architecture
Facilitate integration between airlines' services    Enterprise Service Bus
Consistent UI for all integrating airlines           Portal Framework
Independently scale the components of the            Decoupling of the layers independently
  architecture

Architecture Style Employed: SOA Mediator Pattern with ESB Broker variation


• Services created using an SOA and provided by an
  organisation’s IT should directly support the services that
  the organisation provides to its customers. (BP – IT)

SOA is a blueprint that governs creation, deployment, execution and
management of reusable business services. It aligns Business and Technology.

[Figure: An SO business delivers services to its customers through
human-mediated service, self-service and system-system service delivery.
Each is linked by a contract to the Service Oriented Architecture; further
labels: new system, composite system.]
                   SOA roles
• Business Role: SOA is viewed as a set of services
  that a business wants to expose to customers and
• Architectural Role: SOA is an architectural style
  which requires a service provider, requestor and a
  service description. It provides services that fosters
  modularity, encapsulation, loose coupling, separation
  of concerns, reuse, composable and single
• Implementation Role: SOA is a complete
  programming model (process) with standards, tools,
  methods and techniques, technologies.
                        SOA suite
• Model and capture business processes
• Integrate the services using ESB and orchestrate the services into BP
• Develop, connect and bind services to build composite applications
• Deploy composite applications and perform service level management
• Apply runtime policies to services and govern them
• Activity monitoring to gain real-time information on BP
• A service is a manager entity that consists of a collection of components
  that work together to deliver the business function (currency
  conversion/airline reservations)
• A service maps to a business function but a component maps to business
  entities and the business rules that operate on them.
• Bank teller application
   – components - loan component, savings bank component (with
      withdrawal / deposit), account manager (to create new accounts).
   – Service - the interfaces of all components (group) can be composed
      and exposed as services - creation of new accounts, withdrawal and
      deposit services and loan service.
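
The bank teller example can be sketched in a few lines of Python. This is only an illustration of the component/service split described above: the class and method names are hypothetical, and the loan component is left out to keep the sketch short.

```python
class SavingsComponent:
    """Component: owns the account-balance entity and its business rules."""

    def __init__(self):
        self.balances = {}

    def deposit(self, account_id, amount):
        self.balances[account_id] = self.balances.get(account_id, 0) + amount
        return self.balances[account_id]

    def withdraw(self, account_id, amount):
        if amount > self.balances.get(account_id, 0):
            raise ValueError("insufficient funds")
        self.balances[account_id] -= amount
        return self.balances[account_id]


class AccountManager:
    """Component: creates new accounts."""

    def __init__(self, savings):
        self.savings = savings
        self.next_id = 1

    def create_account(self):
        account_id = f"ACC{self.next_id}"
        self.next_id += 1
        self.savings.balances[account_id] = 0
        return account_id


class BankTellerService:
    """Service: exposes business functions by composing component interfaces."""

    def __init__(self):
        self.savings = SavingsComponent()
        self.accounts = AccountManager(self.savings)

    def open_account(self, opening_deposit):
        account_id = self.accounts.create_account()
        self.savings.deposit(account_id, opening_deposit)
        return account_id

    def withdraw(self, account_id, amount):
        return self.savings.withdraw(account_id, amount)


teller = BankTellerService()
account = teller.open_account(500)
remaining = teller.withdraw(account, 200)
print(account, remaining)
```

The consumer only sees open_account and withdraw, which map to business functions; the components behind them can change without affecting the service contract.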

                      SERVICE DESCRIPTION

               CHOREOGRAPHY    DYNAMIC
        UI, Business processes, Service Layer,
            Component Layer, Object Layer
PRESENTATION – portal for aggregation
       of contents to users

                                        Business Process Layer
                                        Automation logic
                                        Orchestration of services.
                                        Service layer – collection
                                        of units of work (interfaces)
                                        Processing logic
                                        Component layer – operations
                                        that are units of work.

                                        Object layer / legacy –
                                        Messages for
                    Terms in SOA
•   Services
•   Service provider
•   Service consumer (or Service requestor)
•   Service locator or service registry
•   Service broker – passes service requests to one or
    more service providers.
                           SOA LIFE CYCLE
                           Expose                     CREATION OF
Business                                              EXISTING / new
Drivers                    Incremental                COMPONENTS

                Consume             Compose             COMBINE

                                                      USE SERVICES

  Consumer view :
  Service identification
  Service Categorisation
  Service exposure

  Provider view :
  Component identification
  Component Specification
  Service realisation
  Service management
  Standards Implementation
•   standardisation
•   Faster time to market
•   Operational efficiency and adaptability
•   Agility to collaborate
•   Continuous improvement
•   Aligns business to IT
•   Ease of introducing new technologies
•   Return on Investment (ROI)
•   Vendor diversity
•   Services – encapsulation, loose coupling, contract,
    reuse, composability, autonomy, dynamic,
    higher granularity
                     Business Process

Service Registry

                    Service Description

                   Service Communication
                       Protocol (ESB)

                      Transport layer
    Problems in Web services (Point – Point)
 • Service consumers need to be modified whenever the service
   provider interface changes. (dynamic)
 • Every consumer should have a suitable protocol adapter for each
   provider it is connected to. (interoperability)

• ESB acts as a mediator that transforms, routes, notifies and augments
• It provides virtualization of the enterprise resources.
• The Enterprise Service Bus is an enterprise-class messaging bus.
• It has the following facilities:
                  messaging infrastructure
                  message transformation facility between consumer and
                  Content-based routing between service consumers and
                  Capability to convert transport protocols between
                  consumer and provider.
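
A toy mediator illustrates how content-based routing and message transformation decouple consumers from providers. This is a sketch under stated assumptions, not a real ESB product: the ServiceBus class, the two booking providers and the message fields are all hypothetical.

```python
class ServiceBus:
    """Toy mediator: routes on message content and transforms between formats."""

    def __init__(self):
        self.routes = []  # (predicate, transform, provider) triples

    def register(self, predicate, transform, provider):
        self.routes.append((predicate, transform, provider))

    def send(self, message):
        for predicate, transform, provider in self.routes:
            if predicate(message):
                # Transform to the provider's format, so the consumer
                # never depends on the provider interface directly.
                return provider(transform(message))
        raise LookupError("no provider accepts this message")


# Hypothetical providers that expect different input formats.
def domestic_booking(request):
    return f"domestic booking for {request['passenger']}"


def international_booking(request):
    return f"international booking for {request['name']} ({request['passport']})"


bus = ServiceBus()
bus.register(
    lambda m: m["country"] == "IN",
    lambda m: {"passenger": m["traveller"]},
    domestic_booking,
)
bus.register(
    lambda m: m["country"] != "IN",
    lambda m: {"name": m["traveller"], "passport": m["passport"]},
    international_booking,
)

domestic = bus.send({"traveller": "Asha", "country": "IN"})
international = bus.send({"traveller": "Asha", "country": "FR", "passport": "X123"})
print(domestic)
print(international)
```

If a provider changes its input format, only the transform registered on the bus changes; the consumer keeps sending the same message.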
                          SOA based Web services
Business Process (BPEL)
Service Registry (UDDI)
Service Description (XML, WSDL)
Service Communication Protocol
Transport layer (HTTP, JMS, SMTP)
Alongside the stack: Policy (WSPolicy), Security (WSSecurity),
Transaction (WSTransaction), Management (WSManageability)
[Figure: SAHANA architecture. Responders reach the system from office systems,
laptops/PDAs/cell phones and Web clients over wired, mobile and Internet channels
(channel access). Labels in the figure: Person, Family, Search; Org, Services, Vol;
Camp, Person, Search; Requests, Aids, Match; Shelter, Place; SMS, Alerts; search
procedures; DDoS and load balancing. Modules: Missing person, Org Reg, Camps Reg,
Request Mgmt, Shelter Reg, Mobile.]

• Missing person’s registry with efficient search
• Organisation registry with efficient match and
  volunteer coordination
• Camps registry
• Request management registry with inventory
  management and optimisation – search
• Shelter registry
• Messaging alerts
• Damages registry
• Grid management module to manage
  coordination efforts among districts and relief
• Bulletin board – user area
          SOA – screen shots
        1. Organisation Registry

• New Organization Registration with the
• Maintaining details about each
  organization with unique ID
• Updating Organization’s services
• When an Organization wants to provide a service, it
  must provide the Organization name, city and branch to
  the system
• By default, every Organization that registers for the
  first time has to provide a single service
• On successful registration, an automatically
  generated Organization ID is displayed to the
  Organization authority
• To update the services provided, both the Organization
  ID and password are validated
• The available services are displayed in a form,
  from which the service provider selects any
  additional services
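
The registration and update flow can be sketched as follows. This is an illustrative Python model only, not the SAHANA implementation: the class, the ID format and the idea of setting the password at registration time are assumptions.

```python
import hashlib
import itertools


class OrganisationRegistry:
    def __init__(self):
        self._ids = itertools.count(1)
        self._orgs = {}

    @staticmethod
    def _digest(password):
        # Illustration only; a real system would use a salted, slow password hash.
        return hashlib.sha256(password.encode()).hexdigest()

    def register(self, name, city, branch, password, service):
        """First-time registration: exactly one service, returns the generated ID."""
        org_id = f"ORG{next(self._ids):04d}"
        self._orgs[org_id] = {
            "name": name, "city": city, "branch": branch,
            "password": self._digest(password), "services": [service],
        }
        return org_id

    def add_service(self, org_id, password, service):
        """Update services only after the ID and password are validated."""
        org = self._orgs.get(org_id)
        if org is None or org["password"] != self._digest(password):
            raise PermissionError("invalid organisation ID or password")
        if service not in org["services"]:
            org["services"].append(service)
        return list(org["services"])


registry = OrganisationRegistry()
org_id = registry.register("Relief Trust", "Coimbatore", "Main", "s3cret", "Food supply")
services = registry.add_service(org_id, "s3cret", "Medical aid")
print(org_id, services)
```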

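[Figure: Registration sequence. The provider sends ORG NAME and CITY to the
registration system, which returns the ORGANIZATION ID.]

[Figure: Service update sequence. The provider sends ORG ID AND PASSWORD to the
registration system; after RECORD RETRIEVAL the system returns the VALIDATION
RESULT and the SERVICES LIST; the provider sends the SELECTED SERVICE and
receives the UPDATED FORM.]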
• Service Provider registers to the system
• Service provider login validation
• Services updating
3 X Forms
High Throughput Computing
  Distributed Computing, loosely coupled
  Disparate Autonomous heterogenous systems
  Computation intensive – Sharing , single adm

High Performance Computing
  Tightly coupled, fine grain parallelism
  Homogenous Systems
  high computing power, short period
  Low latency communication

P2P
  Mainly for file sharing
  Geographically dispersed peers
  Autonomous nodes
  Decentralised

Clusters
  Resource sharing
  Close to each other
  Usually homogenous
  Centralised control, cooperative working

Shared Memory Computing
  Parallel systems, multicore
  Divide and conquer
  synchronization
  Tightly Coupled

GRID
  Heterogeneous systems, HTC
  VO – trust groups, dynamic, cross organisational
  Geographically dispersed Resource sharing
  Scientific, distribution of work among all resources
  Virtualisation: System integration

Web Services
  Application integration
  Separation of concerns
  Data integration, interop

CLOUD
  Heterogeneous systems , HPC
  On demand resource provisioning over Internet
  Data centric with grid backbone, utility value
  Elastic , Business, full utilization of resources
  Virtualisation: Viewing a single system as multiple resources
  Multi tenancy: Sharing a resource among multiple clients
Some Characteristics of Grids
  Owned by multiple                     Connected by
    organizations &                     heterogeneous,
        individuals                     multi-level networks

Different security                         Different resource
    requirements                           management
       & policies                          policies

           Unreliable                   Geographically
       resources and                    separated
                        Resources are
  Stages to using the Grid – Classical
    write (code) to solve problem
                        “compile” against middleware

                              submit to Grid      security

        Stage data

                          Deploy to
 Steering and                                  Select
 visualisation                                 resources
         Technical capabilities
• Resource modeling
• Monitoring and notification
• Allocation
• Provisioning, life-cycle management, and
• Accounting and auditing
• security
              Overall GRID Architecture                                     G2



                   Connectivity                            Transport
                       Fabric                                 Link

   2/2/2010    Source: The Anatomy of the Grid, Foster, Kesselman and Tuecke

Fabric layer: Provides the resources for shared access
Connectivity layer: Core communication and authentication protocols
Resource layer: Protocols for secure negotiation, initiation, monitoring,
control and accounting on individual resources.
Collective Layer: Protocols and services to capture interactions among a
collection of resources.
Application Layer: User applications that operate within VO environment.
                 G3- Services - OGSA
• Service based infrastructure for grid
• Grid aims to integrate, virtualize, and manage resources
   and services within distributed, heterogeneous, dynamic
  “virtual organizations”
• Standardization is critical to create interoperable, portable,
secure robust, scalable and reusable components and
• Goal is to standardize grid services by specifying set of
standard interfaces.
• Aims to develop a common , standard and open architecture
for grid based applications.
• Service-oriented architecture, based on the Open Grid
Services Architecture (OGSA), addresses this need for
standardization by defining a set of core capabilities and
behaviors that address key concerns in Grid systems.
• OGSA is based on Grid Service ( extension of web service) .
• OGSA realizes the logical middle layer in terms of
  services, the interfaces these services expose, the
  individual and collective state of resources belonging to
  these services, and the interaction between these
  services within a service-oriented architecture (SOA).
• The architecture is not layered,
• Services are loosely coupled peers that work either
  singly or as part of an interacting group of services
• Requirements not met in Web services were implemented
  as Grid services conforming to OGSI specifications
• OGSI specification defines
   – How grid service instances are named and referenced
   – How the interfaces and behaviors are common to all
      Grid services
   – How to specify additional interfaces, behaviors and
• Introduces Service Data Elements (SDEs)
• portType inheritance
• Grid Service Handle (GSH)
• Grid Service Reference (GSR)
• Factory
• Handle resolver
• Notification
• Service groups (light-weight registries)
Service relationships
Grid vs Web services

• Web Services
   • Messages exchange
   • Documents
   • No notion of “pointer”
   • Service orientation?
• Grid Services
   • The architecture encourages everything to be exposed
      through an interface rather than being sent as a
   • GSH is the “pointer”
   • Object orientation? (CORBA?)
• 2-level naming scheme – GSH and GSR
• SDE – Web services static discovery vs SDE –
• Instantiation and life cycle management - factory
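
The two-level naming scheme and factory-based instantiation can be sketched as below. This is a simplified Python model rather than the OGSI interfaces themselves: the class names, the gsh:// handle format and the host names are hypothetical. The point it illustrates is that the handle (GSH) stays fixed while the reference (GSR) it resolves to may change.

```python
import itertools


class HandleResolver:
    """Maps a permanent handle (GSH) to a current reference (GSR)."""

    def __init__(self):
        self._references = {}

    def bind(self, handle, reference):
        self._references[handle] = reference

    def resolve(self, handle):
        return self._references[handle]


class Factory:
    """Creates service instances and registers their handles."""

    def __init__(self, resolver, host):
        self.resolver = resolver
        self.host = host
        self._serial = itertools.count(1)

    def create_service(self):
        n = next(self._serial)
        handle = f"gsh://example.org/counter/{n}"  # hypothetical GSH format
        reference = {
            "endpoint": f"http://{self.host}/counter/{n}",
            "service_data": {"status": "created"},  # stands in for SDEs
        }
        self.resolver.bind(handle, reference)
        return handle


resolver = HandleResolver()
factory = Factory(resolver, "node-a.example.org")
gsh = factory.create_service()
first_endpoint = resolver.resolve(gsh)["endpoint"]

# The instance moves to another host: the GSR changes, the GSH does not.
resolver.bind(gsh, {"endpoint": "http://node-b.example.org/counter/1",
                    "service_data": {"status": "migrated"}})
second_endpoint = resolver.resolve(gsh)["endpoint"]
print(gsh, first_endpoint, second_endpoint)
```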

        2. CREATE

                 G4- Grid WSRF
OGSA services defined and implemented as Web
        Grid Computing : Transition from OGSI to
                3. Semantic Web

• information management
   – Keywords,
   – Statistical,
   – Natural Language,
   – Semantic Web
• Semantic Web architecture
  – automated conversion and storage of unstructured text
    in machine-processable format
  – automatically extract and process the concepts and
    context in the database –uses intelligent techniques
  – Uses metadata to capture meaning of the information
To capture Knowledge
• Metadata
• Ontology –
   – formal specification of information
   – A network of concepts, relationships, and constraints
     that provide context for data and information as well as
   – classes (concepts) and relationships (hierarchy) in the
     domain. It provides a shared understanding of the
   – Ontology languages - XML, RDF, OWL
• Logic –
   – formal languages for representing knowledge with
   – Reasoners to infer conclusions
• Agents
   – Pieces of software that work autonomously and
   – Eg- search personalisation
Semantic Web Architecture
• Unicode
   – International encoding standard
   – Any language can be used on the web using one
     standardized form.
• Uniform Resource Identifier (URI)
   – uniquely identify resources (e.g., doc)
   – URL+URN
• XML
   – language to write structured web documents with user
     defined vocabulary
   – To send documents across the Web
• RDF
   – Data model (representation) of web objects
   – XML based syntax
• RDF Schema
   – Has modeling to organise web objects into hierarchies
     (taxonomies) – class, subclass, properties, domain and
     range restriction
• OWL
   – Based on RDF
   – Used to write ontology
• Logic Layer
   – Application specific declarative knowledge – RIF and
• Proof layer
   – Deductive process
   – SPARQL can be used for querying ontologies and
     knowledge bases – SQL like
• Trust layer
   – Users trust using Web services
• triples subject-predicate-object in RDF
• Joe Smith has homepage
   – (subject)
      is intended to identify Joe Smith
   – (predicate)
   – (object) is Joe's homepage
"Joe has family name Smith"

RDF graph describing Joe Smith
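
A graph of this kind can be sketched as a set of subject-predicate-object tuples. The URIs below are hypothetical placeholders on example.org, since the slide's own identifiers are not available here, and the match helper is an illustrative stand-in for a query language such as SPARQL.

```python
# Hypothetical URIs used only for illustration.
JOE = "http://example.org/people#joe"
HOMEPAGE = "http://example.org/terms#homepage"
FAMILY_NAME = "http://example.org/terms#familyName"

triples = {
    (JOE, HOMEPAGE, "http://example.org/~joe/"),
    (JOE, FAMILY_NAME, "Smith"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if subject in (None, t[0])
        and predicate in (None, t[1])
        and obj in (None, t[2])
    )


print(match(triples, subject=JOE, predicate=FAMILY_NAME))
```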
RDFS for the
company ( resource)
identified by URI;
Name is Webify Solutions,
e-mail address is, and
phone number is 1-800-4WEBIFY.
• Classes - named class, intersection classes, union
  classes, complement classes, restrictions, and
  enumerated classes
• Properties
   – Object type
   – Data type
   – Property types
       • Functional
       • Inverse functional
       • Symmetric
       • Transitive
• Individuals – instances of classes and properties relate
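
Property types matter because a reasoner can derive new facts from them. The sketch below applies only the symmetric and transitive rules to a small set of facts; it is a toy forward-chaining loop, not an OWL reasoner, and the property and individual names are made up for the example.

```python
def infer(facts, transitive=(), symmetric=()):
    """Apply symmetric and transitive property rules until nothing new appears."""
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            if p in symmetric:
                new.add((o, p, s))
            if p in transitive:
                for s2, p2, o2 in known:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= known:
            return known
        known |= new


facts = {
    ("Coimbatore", "locatedIn", "TamilNadu"),
    ("TamilNadu", "locatedIn", "India"),
    ("Asha", "colleagueOf", "Ravi"),
}
closure = infer(facts, transitive={"locatedIn"}, symmetric={"colleagueOf"})
print(("Coimbatore", "locatedIn", "India") in closure)
```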
                     Need for ontology in IT
• Bank
   – Offers a number of services which can use the same data but with
   – New services can be added – but reuse existing data / functionality
• An ontology-driven approach
   – can capture and represent its total product knowledge in a
     language-neutral form
   – deploy the knowledge in a central repository (shared).
   – a single, unified view of data across its applications.
   – precise retrieval of information and seamless enterprise
   – business processes and various data sources can map to
     each other through a common meta-model.
   – shared ontology
       •   eliminates point-to point integration
       •   simplifies application integration
       •   reduces data redundancy and
       •   provides the same semantic meaning across applications,
       •   eases the bank's maintenance and upgrades.
            Need for semantic web

– WWW has vast amount of heterogenous information
   • Searching is based on contents
   • Semantic meaning attached to content items describes
     the information precisely
   • Relevancy of information extraction can be improved.
– Provided services can be tagged with meaning;
   • Web-based software agents can dynamically find these
     services on the fly and use them to your benefit or in
     collaboration with other services.
            Need for semantics in SOA

• In SOA service representations of the available services
  must be maintained.
   – Metadata to discover and organize services
   – Metadata to model and assemble services
   – metadata to encapsulate business logic for dynamic
   – Metadata manage with metadata.
• Ontology provide a very powerful and flexible way to
  aggregate, visualize, and normalize service metadata
• Ontology enhance service discovery, modeling, assembly,
  mediation, and semantic interoperability
• Semantic technologies provide an abstraction layer above
  existing IT technologies, one that enables the bridging and
  interconnection of data, content, and processes across
  business and IT silos.
              Semantics for Business
• A business ontology is a formal specification of business
  concepts and their interrelationships that facilitates
  machine reasoning and inference.
• A business ontology ties systems together using
  metadata, much as a database ties together discrete
  pieces of data.
• Organizations can provide a single, unified view of data
  across their applications,
• Allows for precise retrieval of information,
• simplifies enterprise and SOA integration,
• reduces data redundancy, and
• Provides uniform semantic meaning across applications.
• eases development, maintenance, and upgrades across
  the enterprise.
                       Grid semantics

• The Grid’s vision - sharing diverse resources in a flexible,
  coordinated and secure manner through dynamic formation and
  disbanding of virtual communities, strongly depends on metadata.
  Ad hoc expression and use of metadata causes chronic dependency
  on human intervention
• The Semantic Grid is an extension of the Grid in which rich resource
  metadata is exposed and handled explicitly, and shared and
  managed via Grid protocols.
• It exposes semantically rich information associated with grid
  resources to build more intelligent grid services
• The layering of an explicit semantic infrastructure over the Grid
  Infrastructure leads to increased interoperability and greater
• Reference Architecture that extends OGSA (standardisation) to
  support the explicit handling of semantics, and defines the
  associated knowledge services to support a spectrum of service
• S-OGSA defines a model (abstraction), the capabilities (what) and
  the mechanisms (how) for the Semantic Grid.
• Metadata – to label grid resources and
  entities with concepts (data file according
  to appln domain)
• Rules and classification-based reasoning
  can be used to generate new metadata
  from existing metadata. (VO membership)
• S-OGSA has
  – Model (elements and relationships)
  – Capabilities (services for the components)
  – Mechanisms (elements to deliver the service)
       S-OGSA entities and relationships
• Grid entities (id in grid)
• Knowledge entities (K-entities) – Grid entities to operate
  on knowledge.
• Semantic Bindings – association between grid and
  knowledge entities.
• Semantic grid entities – entities subject to semantic
  bindings, or semantic bindings, knowledge entity.
• Fabric layer – resources are virtualised
  through Web services
• Grid middleware with services – OGSA
  interact with one another. It deploys web
  services with port types through which
  resources are accessed
• OGSA is extended with light weight
  semantics and knowledge services to
  support a spectrum of service capabilities
• Top – application layer
• Semantics of middleware and fabric layers
  are considered.
• Services
  – Semantic provisioning services
    • Knowledge provisioning services
    • Semantic binding provisioning services
  – Semantic aware grid services
    • Consume semantic bindings and take actions
      based on knowledge and metadata
       Semantic aware authorisation service

Subject – John Doe, object – resource
Semantic bindings based on match
Ontology service provides knowledge to understand semantic bindings
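
One way to picture a semantic aware authorisation decision is shown below. It is a sketch under several assumptions: the concept hierarchy, the bindings, the policy table and the second subject are invented for illustration, and a plain dictionary stands in for the ontology service.

```python
# Ontology: each concept maps to its parent concept (hypothetical vocabulary).
ONTOLOGY = {"Professor": "Staff", "Staff": "VOMember", "Student": "VOMember"}

# Semantic bindings: grid entity -> concept that annotates it.
BINDINGS = {"John Doe": "Professor", "Jane Roe": "Student"}

# Policy: resource -> concept a subject must belong to.
POLICY = {"cluster-queue": "Staff"}


def subsumed_by(concept, ancestor):
    """Walk up the hierarchy to see whether `concept` is a kind of `ancestor`."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)
    return False


def authorise(subject, resource):
    """Grant access when the subject's bound concept satisfies the policy."""
    concept = BINDINGS.get(subject)
    required = POLICY.get(resource)
    if concept is None or required is None:
        return False
    return subsumed_by(concept, required)


print(authorise("John Doe", "cluster-queue"))
```

The decision uses the meaning attached to the subject (a Professor is a kind of Staff) rather than a list of user names, which is what the semantic binding adds.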
What is Hadoop?
  It is a framework for running applications that produce and process
   huge amounts of data on large clusters of commodity hardware
  Apache Software Foundation Project

  Open source

  Amazon’s EC2, Google

  alpha (0.21) release available for download

Hadoop Includes
  HDFS - a distributed filesystem

  Map/Reduce - Hadoop implements this programming model. It
   is an offline computing engine

Moving computation is more efficient than moving large
• Data intensive applications with Petabytes of data.
• Web pages - 20+ billion web pages x 20KB = 400+ TB
   – One computer can read 30-35 MB/sec from disk
     ~four months to read the web
   – same problem with 1000 machines, roughly 3 hours
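
The figures can be checked with a short calculation. The sketch assumes decimal units (1 KB = 1000 bytes), the upper read rate of 35 MB/s, and perfectly parallel reads across machines; under those assumptions one machine needs a little over four months and 1000 machines need roughly three hours.

```python
PAGES = 20e9        # 20 billion pages
PAGE_SIZE = 20e3    # 20 KB per page, in bytes
READ_RATE = 35e6    # 35 MB/s from one disk

total_bytes = PAGES * PAGE_SIZE                          # 4e14 bytes = 400 TB
one_machine_days = total_bytes / READ_RATE / 86_400      # seconds -> days
cluster_hours = total_bytes / (READ_RATE * 1000) / 3_600  # 1000 machines, in hours

print(f"{total_bytes / 1e12:.0f} TB, {one_machine_days:.0f} days, {cluster_hours:.1f} hours")
```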
Single-thread performance doesn’t matter
We have large problems and total throughput/price more
    important than peak performance
Stuff Breaks – more reliability
• If you have one server, it may stay up three years (1,000 days)
• If you have 10,000 servers, expect to lose ten a day
“Ultra-reliable” hardware doesn’t really help
At large scales, super-fancy reliable hardware still fails, albeit
    less often; software still needs to be fault-tolerant

Commodity machines without fancy hardware give better price
  – performance ratio.

           Fundamental Dynamics
           (Pace of change of the digital
                Digital power =
computing x communication x storage x content

Factor          Law           Growth
computing       Moore's law   doubles every 18 months
communication   fiber law     doubles every 9 months
storage         disk law      doubles every 12 months
content         community     2^n, where n is # people
(Source: Ian Foster’s Talk)
            HDFS Why? Seek vs Transfer

BTree (Relational DBS)
  – operate at seek rate, log(N) seeks/access
  -- memory / stream based
sort/merge flat files (MapReduce)
  – operate at transfer rate, log(N) transfers/sort
 -- Batch based

• Fault tolerant, scalable, Efficient, reliable distributed
  storage system
• Moving computation to place of data
• Single cluster with computation and data.
• Process huge amounts of data.
• Scalable: store and process petabytes of data.
• Economical
• Data Model
   – Data is organized into files and directories
   – Files are divided into uniform sized blocks and
     distributed across cluster nodes
   – Replicate blocks to handle hardware failure
   – Checksums of data for corruption detection
     and recovery
   – Expose block placement so that computes
     can be migrated to data
• large streaming reads and small random reads
• Files are broken into large blocks.
   – Typically 128 MB block size
   – Blocks are replicated for reliability
   – One replica on local node,
      another replica on a remote rack,
      Third replica on local rack,
      Additional replicas are randomly placed
• Understands rack locality
   – Data placement exposed so that computation can be
      migrated to data
• Client talks to both NameNode and DataNodes
   – Data is not sent through the namenode, clients
      access data directly from DataNode
   – Throughput of file system scales nearly linearly with
      the number of nodes.
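
Block splitting and replica placement can be sketched as below. The placement follows the order stated on this slide (local node, a node on a remote rack, another node on the local rack); actual HDFS placement policies differ between versions, and the function names, topology and node names here are hypothetical.

```python
import random

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as stated above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block of a file."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(writer, topology, rng):
    """Pick three nodes: local node, remote-rack node, other local-rack node.

    `topology` maps rack name -> list of node names; `writer` is the local node.
    """
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    second = rng.choice(topology[remote_rack])
    third = rng.choice([n for n in topology[local_rack] if n != writer])
    return [writer, second, third]


topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
blocks = split_into_blocks(300 * 1024 * 1024)
replicas = place_replicas("n1", topology, random.Random(0))
print(len(blocks), replicas)
```

Because the placement is known, a scheduler can run a task on any of the three nodes that already hold the block instead of copying the block to the task.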
Block Placement
Hadoop Cluster Architecture:
• DFS Master “Namenode”
  – Manages the file system namespace
  – Controls read/write access to files
  – Manages block replication
  – Checkpoints namespace and journals
    namespace changes for reliability

Metadata of Name node in Memory
  – The entire metadata is in main memory
  – No demand paging of FS metadata

Types of Metadata:
  List of files, file and chunk namespaces; list of
    blocks, location of replicas; file attributes etc.
• Serve read/write requests from clients
• Perform replication tasks upon instruction by
Data nodes act as:
1) A Block Server
   – Stores data in the local file system
   – Stores metadata of a block (e.g. CRC)
   – Serves data and metadata to Clients
2) Block Report: Periodically sends a report of all
  existing blocks to the NameNode
3) Periodically sends heartbeat to NameNode (detect
  node failures)
4) Facilitates Pipelining of Data (to other specified
• Map/Reduce Master “Jobtracker”
  – Accepts MR jobs submitted by users
  – Assigns Map and Reduce tasks to Tasktrackers
  – Monitors task and tasktracker status,
    re-executes tasks upon failure
• Map/Reduce Slaves “Tasktrackers”
  – Run Map and Reduce tasks upon instruction
    from the Jobtracker
  – Manage storage and transmission of
    intermediate output.
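
The Jobtracker behaviour of assigning tasks and re-executing them upon failure can be sketched as below. This is a toy, single-process illustration: run_job, flaky_runner and the tracker names are hypothetical, and a real Jobtracker schedules tasks concurrently and also considers data locality.

```python
def run_job(tasks, trackers, run_task, max_attempts=3):
    """Assign each task to a tracker; on failure, retry it on the next tracker."""
    results = {}
    for i, task in enumerate(tasks):
        for attempt in range(max_attempts):
            tracker = trackers[(i + attempt) % len(trackers)]
            try:
                results[task] = run_task(tracker, task)
                break  # task succeeded, move on to the next one
            except RuntimeError:
                continue  # re-execute the task on another tracker
        else:
            raise RuntimeError(f"task {task} failed {max_attempts} times")
    return results


def flaky_runner(tracker, task):
    """Stand-in for task execution: tracker 'tt1' is assumed to be down."""
    if tracker == "tt1":
        raise RuntimeError("tasktracker not responding")
    return f"{task}@{tracker}"


results = run_job(["m0", "m1"], ["tt1", "tt2"], flaky_runner)
print(results)
```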

Secondary NameNode
• Copies FsImage and Transaction Log from
  NameNode to a temporary directory
• Merges FsImage and Transaction Log into
  a new FsImage in temporary directory
• Uploads new FsImage to the NameNode
  – Transaction Log on NameNode is purged
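
The checkpoint steps above (copy, merge, upload, purge) can be sketched in a few lines. This is an assumption-laden toy model: the image is a plain dictionary from path to block list, the log is a list of (operation, path, value) tuples, and only "create" and "delete" operations are modelled.

```python
def checkpoint(fsimage, edit_log):
    """Merge the transaction log into a copy of the image.
    Returns the new image and an empty (purged) log."""
    new_image = dict(fsimage)  # work on a copy, like the temporary directory
    for op, path, value in edit_log:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image, []


fsimage = {"/a": ["blk_1"]}
edit_log = [("create", "/b", ["blk_2"]), ("delete", "/a", None)]
new_fsimage, edit_log = checkpoint(fsimage, edit_log)
print(new_fsimage, edit_log)
```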
                        HDFS Architecture

• NameNode: maps (filename, offset) -> block-id and block -> DataNode
• DataNode: maps block -> local disk
• Secondary NameNode: periodically merges edit logs
Block is also called chunk
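
The two NameNode mappings listed above can be pictured as two lookup tables. The sketch below is illustrative only: the file name, block ids and DataNode names are invented, and a 128 MB block size is assumed.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size

# NameNode side: file -> ordered block ids, block id -> DataNodes with a replica
file_to_blocks = {"/logs/day1": ["blk_1", "blk_2"]}
block_to_datanodes = {
    "blk_1": ["dn1", "dn3", "dn4"],
    "blk_2": ["dn2", "dn3", "dn5"],
}


def locate(filename, offset):
    """Return (block id, DataNodes, offset inside the block) for a byte offset."""
    index, within = divmod(offset, BLOCK_SIZE)
    block_id = file_to_blocks[filename][index]
    return block_id, block_to_datanodes[block_id], within


block_id, nodes, within = locate("/logs/day1", BLOCK_SIZE + 10)
print(block_id, nodes, within)
```

The client would then read the bytes directly from one of the returned DataNodes, which keeps the data path away from the NameNode.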
                Software Model - ???
• Parallel programming improves performance and
• In a parallel program, the processing is broken up into
  parts, each of which can be executed concurrently
• Identify whether the problem can be parallelised (the Fibonacci
  series cannot, since each term depends on the previous terms)
• Matrix operations on independent elements can be parallelised
               CALCULATING PI
• The area of the square, denoted As, is (2r)^2 = 4r^2.
• The area of the circle, denoted Ac, is pi * r^2.
• Hence Ac / As = pi / 4, so
  pi = 4 * (number of points in the circle) /
       (number of points in the square)
• MAP: count the number of generated points that are
  both in the circle and in the square
• REDUCE: PI = 4 * ratio, where ratio = points in the
  circle / points in the square
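
A minimal single-machine sketch of this estimate is given below, using Python's built-in map and functools.reduce as stand-ins for the two phases. The task format (seed, number of points), the function names and the sample sizes are assumptions made for the illustration; a radius of 1 is used.

```python
import random
from functools import reduce


def map_count_hits(task):
    """MAP: generate n random points in the square [-1, 1] x [-1, 1]
    and count those that also fall inside the unit circle."""
    seed, n = task
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, n


def reduce_totals(a, b):
    """REDUCE step: add up (hits, points) pairs from the map tasks."""
    return a[0] + b[0], a[1] + b[1]


tasks = [(seed, 50_000) for seed in range(4)]  # four independent map tasks
hits, total = reduce(reduce_totals, map(map_count_hits, tasks))
pi_estimate = 4 * hits / total
print(pi_estimate)
```

Each map task is independent of the others, which is what makes the problem easy to parallelise.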

• Restricted parallel programming model meant
  for large clusters
  – User implements Map() and Reduce()
• File
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• Map
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
• The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
The output of the first combine:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second combine:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
Thus the output of the job (reduce) is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
• Map()
  – Input <filename, file text>
  – Parses file and emits <word, count> pairs
    • e.g. <"hello", 1>
• Reduce()
  – Sums all values for the same key and emits
    <word, TotalCount>
    • e.g. <"hello", (3 5 2 7)> => <"hello", 17>
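
The word count walk-through can be reproduced with a small in-memory sketch. It is not Hadoop API code: the function names are illustrative, each input line is treated as one map task, and the second line is spelled "Goodbye" to match the emitted pairs.

```python
from collections import Counter, defaultdict


def map_words(line):
    """MAP: emit a <word, 1> pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """COMBINE: local pre-aggregation of one map's output, sorted by key."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())


def reduce_counts(combined_outputs):
    """REDUCE: group values by word across all maps and sum them."""
    grouped = defaultdict(list)
    for pairs in combined_outputs:
        for word, n in pairs:
            grouped[word].append(n)
    return sorted((word, sum(values)) for word, values in grouped.items())


lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
combined = [combine(map_words(line)) for line in lines]
result = reduce_counts(combined)
print(combined)
print(result)
```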
                  MR model
• Map()
   – Process a key/value pair to generate
     intermediate key/value pairs
• Reduce()
   – Merge all intermediate values associated with
     the same key
• Users implement interface of two primary methods:
      1. Map: (key1, val1) → (key2, val2)
      2. Reduce: (key2, [val2]) → [val3]
• Map - corresponds to the group-by clause (which selects the
  key) of an SQL aggregate query
• Reduce - corresponds to the aggregate function (e.g., average)
  that is computed over all the rows with the same group-by
  attribute (key).
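
The SQL analogy can be made concrete with a small generic driver. This is a sketch under assumptions: map_reduce is a hypothetical helper written for this example, and the rows (department, marks) are invented data. It computes roughly what SELECT dept, AVG(marks) ... GROUP BY dept would.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Map: (key1, val1) -> [(key2, val2)]; Reduce: (key2, [val2]) -> val3."""
    groups = defaultdict(list)
    for key1, val1 in records:
        for key2, val2 in map_fn(key1, val1):
            groups[key2].append(val2)  # shuffle: group values by key2
    return {key2: reduce_fn(key2, values) for key2, values in groups.items()}


# (row id, (department, marks)) records; the values are made up
rows = [(1, ("cse", 80)), (2, ("ece", 70)), (3, ("cse", 90))]

averages = map_reduce(
    rows,
    map_fn=lambda row_id, row: [(row[0], row[1])],          # key = group-by attribute
    reduce_fn=lambda dept, marks: sum(marks) / len(marks),  # aggregate = average
)
print(averages)
```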
                     Cloud need
• ‘Era of tera’
   – ever-growing datasets,
   – Changing demands/loads
   – unpredictable traffic patterns, and
   – the demand for faster response times.
• Elasticity – use and relinquish resources as per demand
• Software applications should be internet accessible
• Large scale applications –
  the cloud provides a large number of machines when
  needed, distributes work among them, provisions new
  machines on failure, scales automatically, and
  relinquishes machines when not needed
• Almost zero upfront infrastructure investment
• Just-in-time Infrastructure
• More efficient resource utilization
• Usage-based costing
• Potential for shrinking the processing time
• Less time for development
Basis – automated elasticity - on-demand and
  elastic nature
Example – e-ticketing application
The Amazon Web Services (AWS) cloud provides a highly reliable
  and scalable infrastructure for deploying web-scale solutions,
  with minimal support and administration costs, and good
• Amazon Elastic Compute Cloud (Amazon EC2) is a web
  service that provides resizable compute capacity in the cloud.
   • Operating system, application software and associated configuration
     settings can be bundled in an Amazon Machine Image (AMI).
   • Scale up / down is done by provisioning / decommissioning multiple
     instances using simple web service calls
   • On-Demand Instances / Reserved Instances / Spot Instances
• Amazon S3 is used to store and retrieve input/output datasets.
   – store / retrieve large amounts of data as objects in buckets (containers)
     on the web using standard HTTP
   – Copies can be made in 14 locations using CloudFront
• Amazon Simple Queue Service (Amazon SQS) is a reliable,
  highly scalable, distributed queue for storing messages as they
  travel between computers and application components
• Amazon SimpleDB is a web service for real-time lookup
  and simple querying of structured data
• Amazon Relational Database Service (Amazon RDS)
  provides an easy way to setup, operate and scale a
  relational database in the cloud
• On-demand Hadoop cluster - distributed processing,
  automatic parallelization, and job scheduling
• Amazon Elastic MapReduce provides a hosted Hadoop
  framework running on the web-scale infrastructure of
  Amazon Elastic Compute Cloud
• Amazon Virtual Private Cloud (Amazon VPC) extends
  corporate network into a private cloud contained within
• Availability Zones are distinct locations engineered to be
  insulated from failures in other Availability Zones and provide
  inexpensive, low latency network connectivity to other
  Availability Zones in the same Region.
• Elastic IP addresses let you allocate a static IP address and
  programmatically assign it to an instance.
• CloudWatch can monitor an Amazon EC2 instance for resource
  utilization, operational performance, and overall demand
  patterns.
• The Auto Scaling feature is used to create Auto Scaling groups.
• Incoming traffic can be distributed using elastic load balancing
• Amazon Elastic Block Store (EBS) volumes provide network-
  attached persistent storage to Amazon EC2 instances.
• AWS offers payment and billing services.
• Amazon CloudFront provides a high performance, globally
  distributed content delivery system
GrepTheWeb Application
         Cloud Services best practices
• Design for failure and nothing will fail - design,
  implement and deploy for automated recovery from
  failure
• In AWS
   – Failover gracefully using Elastic IPs
   – Utilize multiple Availability Zones
   – Maintain an Amazon Machine Image
   – Utilize Amazon CloudWatch
• Decouple the components – based on the SOA design
  principle of loosely coupled components
   – Message queues: If one component fails the system
      will buffer the messages and get them processed when
      the component comes back up.
1)   SQS for decoupling and buffering
2)   Service interfaces for components
3)   AMI created
4)   Stateless applications
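
The buffering behaviour of a message queue between decoupled components can be sketched with Python's standard queue module. This is an in-process stand-in only, not Amazon SQS and not its API; the producer, the consumer and the message names are invented for the illustration.

```python
import queue

# In-process stand-in for a durable message queue between two components
message_queue = queue.Queue()


def producer(items):
    """Upstream component: keeps enqueueing work regardless of the consumer."""
    for item in items:
        message_queue.put(item)


def consumer():
    """Downstream component: drains whatever has been buffered."""
    processed = []
    while True:
        try:
            item = message_queue.get_nowait()
        except queue.Empty:
            return processed
        processed.append(item.upper())
        message_queue.task_done()


producer(["job-1", "job-2"])   # consumer is "down": messages accumulate
buffered = message_queue.qsize()
producer(["job-3"])
done = consumer()              # consumer comes back up and clears the backlog
print(buffered, done)
```

Because neither side calls the other directly, the producer is unaffected while the consumer is unavailable, provided the consumer itself is stateless.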
• Implement elasticity
• Think parallel
The beauty of the cloud shines when you
  combine elasticity and parallelization
• Keep dynamic data closer to the
  compute and static data closer to the
PSG-Yahoo Grid and Cloud Computing Lab
             2008 till date
• 54 rack servers – SC145 & PowerEdge
• 40 end connectors
• 10 client nodes
• Hadoop
• Globus
• OpenVZ
• Xen
•   Courses conducted – 10
•   Papers published – 11
•   Internship – 3
•   Placement – 3
•   PhD – 4
•   Conference talks - 3
• An Efficient Approach to Task Scheduling in
  Computational Grids
• Data Discovery in Grid using Content Based Searching
• P2P Information Retrieval Framework for Digital Library
  System using Hadoop DFS.
• Integration of Xen and Hadoop framework
• DNA sequencing using Hadoop data grids
• DNA sequencing in public clouds
• Virtualisation – using Xen and OpenVZ - a comparison
  of performance
• Grid Security – a tree based dynamic approach
• Study of some existing scheduling algorithms
• Grid Task Scheduling using PPSO
• Content based Image Retrieval
• Modification of fairshare scheduling in Hadoop
• Two level scheduler for clouds
• Hybrid Search using content based and semantic
