Google OS

Anatomy of Google Service Platform

March 2, 2007
Jaesun Han (jshan0000@gmail.com)
Contact: http://www.web2hub.com



Contents

• Overview of Web 2.0 Technologies
• Google Service Platform
    Google File System (GFS)
    Bigtable
    MapReduce
    Chubby
• Case Study: Google Services over Platform
    Google Analytics
    Google Earth
    Personalized Search



Web 2.0 Technology Map



Web 2.0 Technology Layer

[Figure: layered technology map, top to bottom]

• Client Layer: XHTML, CSS, RSS, Atom, Microformats, OpenAPI, RIA (Ajax, Flex, XUL, XAML, Mashup Gadget)
• Front-End Layer: PHP, Python, Ruby, RoR, Dojo, DWR, REST, JSON, Atlas, GWT, SOAP, Apache, PHP, MySQL
• Data Processing Layer (Raw Data and External Data Source in, Processed Data out): Recommendation (Collaborative Filtering), Ranking, Clustering, Data Mining, Personalization, Social Network Analysis
• Platform Layer (Distributed/Parallel Processing, Distributed Storage, Distributed File System, DB, Cluster Management): Cluster Computing, Beowulf, Grid, Globus, Condor, P2P, DHT, MPI, Utility Computing, Virtualization, Autonomic Computing

Google Service Platform



Platform Layer

Google Service Platform

[Figure labels: Services, Service Library, Google OS, Google Service Platform, Google Cluster]

Service Software Technology
  Search engine, email server, IM server, map database, various Web sites …

System Software Technology
  Google Linux, Google File System, MapReduce Library, Chubby, Bigtable, Intelligent System, Programming Model (River, TACC), Replication/Redundancy …

Hardware Technology (Google Cluster)
  Clusters, geographic distribution, automated setup, automated backup, standard components, commodity drives, flexible co-location, easy-access design …
  • 450,000 or more servers (NYT)
  • All PC servers, each less than $1,000
  • 40 or more pizza-box servers per rack

Advantages
  • Easy development
  • Scalability
  • Robustness

Google Service Platform

Computation
  • MapReduce (OSDI 2004): distributed data processing library

Storage
  • Bigtable (OSDI 2006): distributed storage system for structured data
  • GFS (SOSP 2003): distributed file system
  • Chubby (OSDI 2006): distributed lock manager

GFS



GFS: Overview

Scalable distributed file system for large, distributed, data-intensive applications
  • Runs on inexpensive commodity hardware
  • Delivers high aggregate performance to a large number of clients

Features
  • User-level distributed file system
  • Centralized architecture
      metadata: between the client and a single master
      data: between the client and the chunkservers
  • Fixed, large chunk size of 64 MB
  • Non-standard file system interface (not the POSIX API)
      create, delete, open, close, read, write, snapshot, and record append
  • Three replicas of each chunk
  • No client caching of file data (clients do cache metadata such as chunk locations)
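
Because the chunk size is fixed at 64 MB, a client can work out which chunk holds a given byte before it asks the master for that chunk's location. The following is a minimal sketch of that arithmetic only; `ChunkPosition` and `LocateByte` are names made up for this illustration and are not part of any GFS client library.

```cpp
#include <cassert>
#include <cstdint>

// Fixed GFS chunk size from the slide: 64 MB.
constexpr std::uint64_t kChunkSize = 64ULL * 1024 * 1024;

// Hypothetical helper type: which chunk of a file holds a byte,
// and where inside that chunk the byte lives.
struct ChunkPosition {
  std::uint64_t chunk_index;
  std::uint64_t offset_in_chunk;
};

// Translates a byte offset within a file into (chunk index, offset in chunk).
ChunkPosition LocateByte(std::uint64_t file_offset) {
  return {file_offset / kChunkSize, file_offset % kChunkSize};
}
```

For example, byte offset 200 MB falls in chunk index 3, at 8 MB into that chunk.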



GFS: Architecture

[Figure: GFS architecture]

• In-memory data structures (lookup table, map)
    file and chunk namespaces
    mapping from files to chunks
    chunk locations
• Operation log
    file creation/deletion
    file renaming
    chunk addition/deletion
• Separation of control flow and data flow for file read/write



GFS: Write



• Pipelined data delivery, to fully utilize each machine's network bandwidth
• Locations of the primary and the other replicas
• Primary lease (initial timeout = 60 s): ordering of write requests for the same chunk



GFS: Relaxed Consistency

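[Figure: two clients, A and B, write concurrently to a region of offsets 0 to 7 that spans chunk1 and chunk2; one write starts at offset 2 and the other at offset 4. The replicas of chunk2 are shown for two orderings.]

• case 1 (chunk2: B -> A): Undefined
• case 2 (chunk2: A -> B): Consistent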



GFS: Atomic Record Appends

[Figure: clients A and B issue record appends to chunk1 (offsets 0 to 3; labels "write from 1" and "write from 2"). The replicas of chunk1 are shown for two orderings, case 1 (B -> A) and case 2 (A -> B), both normally and with an error in writing B.]

• max record size: ¼ of the max chunk size (padding)
• writing at the exact offset
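
The record-size limit and the padding rule can be sketched as a small decision function. This is an illustration under assumptions, not GFS code: `AppendAction`, `AppendDecision`, and `DecideAppend` are hypothetical names. The rule shown (pad the current chunk and retry on the next one when the record does not fit) follows the published GFS design as I understand it.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kChunkSize = 64ULL * 1024 * 1024;  // 64 MB chunk
constexpr std::uint64_t kMaxRecordSize = kChunkSize / 4;   // 1/4 of a chunk

// Hypothetical outcome of one record-append attempt on the last chunk of a file.
enum class AppendAction { kRejectTooLarge, kPadAndRetryOnNextChunk, kAppendHere };

struct AppendDecision {
  AppendAction action;
  std::uint64_t offset;  // meaningful only for kAppendHere
};

// chunk_used: bytes already occupied in the last chunk.
AppendDecision DecideAppend(std::uint64_t chunk_used, std::uint64_t record_size) {
  if (record_size > kMaxRecordSize) return {AppendAction::kRejectTooLarge, 0};
  if (chunk_used + record_size > kChunkSize)
    return {AppendAction::kPadAndRetryOnNextChunk, 0};
  // The system, not the client, picks the offset: the current end of the chunk.
  return {AppendAction::kAppendHere, chunk_used};
}
```

Limiting records to a quarter of a chunk bounds how much of a chunk can be wasted on padding.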



Bigtable



Bigtable: Overview

Motivation
  • Lots of structured and semi-structured data
      web crawl data, satellite images, user data, email, …
  • No commercial system big enough

Bigtable
  • Distributed storage system for structured data
  • A sparse, distributed, persistent, multi-dimensional sorted map
  • Goals
      wide applicability, scalability, high performance, and high availability
  • Target workloads
      from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users
  • Applications
      more than 60 Google products and projects (Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth)



Bigtable: Data Model

Table
  • Row Key: atomic read/write for a single row key
  • Column Key: (Column Family:Qualifier)
      Column Family: the basic unit of access control
  • Timestamp
  • Tablets: the unit of distribution and load balancing

Indexing
  (row:string, column:string, time:int64) -> string
  (com.cnn.www, anchor:my.look.ca, t8) -> "CNN.com"

[Figure: example table with row keys, column keys, and timestamps; a range of rows, such as the one containing com.web2hub.www, forms a tablet]
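
The indexing scheme (row, column, timestamp) -> string can be modeled with an ordinary sorted map. The sketch below is a single-process toy, not Bigtable's implementation; `CellKey`, `Table`, and `ReadLatest` are hypothetical names, and sorting newer timestamps first is a choice made here so that the latest version is found first.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <tuple>

// Key of one cell: (row, column, timestamp).
struct CellKey {
  std::string row;
  std::string column;  // "family:qualifier"
  std::int64_t timestamp;

  // Order by row, then column, then newest timestamp first.
  bool operator<(const CellKey& other) const {
    return std::tie(row, column, other.timestamp) <
           std::tie(other.row, other.column, timestamp);
  }
};

using Table = std::map<CellKey, std::string>;

// Returns the most recent value stored under (row, column), or nullptr if none.
const std::string* ReadLatest(const Table& table, const std::string& row,
                              const std::string& column) {
  // The largest possible timestamp sorts first within (row, column).
  auto it = table.lower_bound(
      CellKey{row, column, std::numeric_limits<std::int64_t>::max()});
  if (it == table.end() || it->first.row != row || it->first.column != column)
    return nullptr;
  return &it->second;
}
```

Because the map is sorted by row key first, all cells of one row are adjacent, which is what makes contiguous row ranges (tablets) a natural unit of distribution.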



Bigtable: API

Writing to Bigtable

    // Open the table
    Table *T = OpenOrDie("/bigtable/web/webtable");

    // Write a new anchor and delete an old anchor
    RowMutation r1(T, "com.cnn.www");
    r1.Set("anchor:www.c-span.org", "CNN");
    r1.Delete("anchor:www.abc.com");
    Operation op;
    Apply(&op, &r1);

Reading from Bigtable

    // Print every version of every anchor column of one row
    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
      printf("%s %s %lld %s\n",
             scanner.RowName(),
             stream->ColumnName(),
             stream->MicroTimestamp(),
             stream->Value());
    }

Metadata operations
  • Create/delete tables and column families, change metadata

Several other features of the API
  • Single-row transactions: atomic read-modify-write sequences
  • Execution of client-supplied scripts (written in Sawzall)



Bigtable: SSTable

• Used internally to store Bigtable data
• Immutable, sorted file of key-value pairs
• Data blocks + an index
    the block size is 64 KB, but configurable
    the index is used to locate blocks
    the index is loaded into memory when the SSTable is opened

[Figure: an SSTable as a sequence of 64 KB blocks of key-value pairs, followed by an index]
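
Locating a block through the in-memory index can be sketched as a binary search. The index layout is an assumption made for this example (one entry per block, holding the last key stored in that block); the slide only says that an index is used to locate blocks. `BlockIndexEntry` and `FindBlock` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed index entry: the last (largest) key stored in one data block.
struct BlockIndexEntry {
  std::string last_key;
  std::size_t block_number;
};

// Returns the number of the only block that can contain `key`,
// or -1 if `key` sorts after every key in the SSTable.
int FindBlock(const std::vector<BlockIndexEntry>& index, const std::string& key) {
  // First block whose last key is >= key; the index is sorted by last_key.
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const BlockIndexEntry& entry, const std::string& k) {
        return entry.last_key < k;
      });
  if (it == index.end()) return -1;
  return static_cast<int>(it->block_number);
}
```

With the index held in memory, a lookup needs to read at most one block from disk.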



Bigtable: Tablet & Locality Group

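[Figure: a tablet, its locality groups, and the SSTables that store them in GFS]

• Tablet: com.cnn.www (abc.html ~ help.html), 100~200 MB
• Locality groups: Locality Group1 (contents), Locality Group2 (anchor), Locality Group3 (language, checksum)
• SSTables: SSTable1 (100 MB), SSTable2 (50 MB), SSTable3 (30 MB)
• GFS chunks: 64 MB each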



Bigtable: Tablet Location

Features
  • Three-level hierarchy (served on tablet servers, not the master)
  • The client library caches and prefetches tablet locations

METADATA row
  row key (tablet's table id + its end row) -> tablet location
  ex) webtable:com.cnn.www

128 MB = 2^17 x 1 KB rows

Addressing 2^34 tablets
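
The tablet count follows from the row arithmetic: a 128 MB tablet of location rows of about 1 KB each holds 2^17 rows, and two such levels below the root give 2^17 x 2^17 = 2^34 addressable tablets. A short compile-time check of that arithmetic (the constant names are made up for this example):

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMetadataTabletBytes = 128ULL << 20;  // 128 MB
constexpr std::uint64_t kMetadataRowBytes = 1ULL << 10;       // about 1 KB per row

// Rows (tablet locations) that fit into one 128 MB tablet.
constexpr std::uint64_t kRowsPerTablet = kMetadataTabletBytes / kMetadataRowBytes;

// The root tablet points at 2^17 tablets, each of which points at 2^17 user tablets.
constexpr std::uint64_t kAddressableTablets = kRowsPerTablet * kRowsPerTablet;

static_assert(kRowsPerTablet == (1ULL << 17), "one tablet holds 2^17 rows");
static_assert(kAddressableTablets == (1ULL << 34), "two levels address 2^34 tablets");
```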



Bigtable: Tablet Assignment

[Figure: tablet assignment among the Cluster Management System, a tablet server (tab_svr10), Chubby, and the Bigtable master]

1) Cluster Management System: start a server
2) Tablet server: create a lock (a new exclusive lock, /servers/tab_svr10, in Chubby)
3) Tablet server: acquire the lock
4) Master: monitor /servers
5) Master: assign tablets
6) Master: check lock status
7) Tablet server: failure or losing lock
8) Master: acquire & delete the lock
9) Master: reassign unassigned tablets



Bigtable: Master Failure

Tablet Changes
  • Create, Delete, Merge: initiated by master
  • Split: initiated by tablet server

[Figure: master startup among the Cluster Management System, the new master, Chubby, and the tablet servers]

0) Cluster Management System: start a master
1) Master: acquire master lock (/servers/master in Chubby)
2) Master: scan /servers & get live servers list
3) Master: check assigned tablets on the tablet servers
4) Master: scan METADATA tablets
5) Master: reassign unassigned tablets



Bigtable: Read/Write

• A single commit log per tablet server
• memtable (sorted buffer), after these mutations:
    t1: Set("anchor:www.c-span.org", "CNN")
    t2: Delete("anchor:www.abc.com")
    t3: Set("anchor:www.abc.com", "ABC")
  memtable contents:
    anchor:www.abc.com -> ABC
    anchor:www.abc.com -> null
    anchor:www.c-span.org -> CNN
• SSTables: anchor v1.0, anchor v2.0, anchor v3.0, anchor v4.0
• Read on a merged view of the memtable and the SSTables

Fast writing: a mutation is appended to the commit log and then inserted into the in-memory memtable
Efficient reading: a merged view of sorted data structures
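
The read path can be illustrated with a simplified point lookup: consult the memtable first, then the SSTables from newest to oldest, and treat a deletion marker as "not found". This is a sketch only; a real merged view iterates over all sorted structures at once. `Entry`, `SortedRun`, and `Read` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One stored cell; `deleted` marks a deletion entry (shown as null on the slide).
struct Entry {
  bool deleted;
  std::string value;
};

// A sorted structure: either the memtable or one immutable SSTable.
using SortedRun = std::map<std::string, Entry>;

// Returns the current value of `column`, or nullptr when the column is absent
// or when its newest entry is a deletion.
const std::string* Read(const SortedRun& memtable,
                        const std::vector<SortedRun>& sstables_newest_first,
                        const std::string& column) {
  auto hit = memtable.find(column);
  if (hit != memtable.end())
    return hit->second.deleted ? nullptr : &hit->second.value;
  for (const SortedRun& sstable : sstables_newest_first) {
    hit = sstable.find(column);
    if (hit != sstable.end())
      return hit->second.deleted ? nullptr : &hit->second.value;
  }
  return nullptr;
}
```

The newest structure that mentions a column wins, which is why a deletion in the memtable hides an older value that still sits in an SSTable.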



Bigtable: Compactions

[Figure: a memtable on top of SSTables v1.0 to v5.0]

• minor compaction: memtable -> a new SSTable (v6.0)
• major compaction: memtable + all SSTables -> only one SSTable
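
The two compactions can be sketched over in-memory maps. This is a toy model under assumptions, not Bigtable code: `Entry`, `SortedRun`, `MinorCompaction`, and `MajorCompaction` are hypothetical names, and dropping deletion entries during the major compaction reflects the published design as I understand it.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>
#include <vector>

struct Entry {
  bool deleted;  // true for a deletion entry
  std::string value;
};
using SortedRun = std::map<std::string, Entry>;

// Minor compaction: write the memtable out as a new SSTable and start a fresh memtable.
void MinorCompaction(SortedRun& memtable, std::vector<SortedRun>& sstables_newest_first) {
  sstables_newest_first.insert(sstables_newest_first.begin(), memtable);
  memtable.clear();
}

// Major compaction: merge the memtable and all SSTables into exactly one SSTable.
void MajorCompaction(SortedRun& memtable, std::vector<SortedRun>& sstables_newest_first) {
  MinorCompaction(memtable, sstables_newest_first);
  SortedRun merged;
  // Walk from oldest to newest so that newer entries overwrite older ones.
  for (auto run = sstables_newest_first.rbegin(); run != sstables_newest_first.rend(); ++run)
    for (const auto& kv : *run) merged[kv.first] = kv.second;
  // Nothing older remains, so deletion entries can be dropped.
  for (auto it = merged.begin(); it != merged.end();)
    it = it->second.deleted ? merged.erase(it) : std::next(it);
  sstables_newest_first.assign(1, merged);
}
```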



MapReduce



MapReduce: Overview

Motivation
  • Input data is large
  • Lots of machines: hundreds of thousands of PC servers

MapReduce
  • Programming model and implementation for processing large data sets in parallel
  • Parallelization, fault tolerance, data distribution, and load balancing are handled in a MapReduce library
  • map & reduce functions
      map (k1, v1) -> list (k2, v2)
      reduce (k2, list (v2)) -> list (v2)

Usage Examples
  • Distributed Grep, Count of URL Access Frequency, Reverse Web-Link Graph, Term-Vector per Host, Inverted Index, Distributed Sort
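
The map and reduce signatures can be tried out in a single process. The sketch below counts words with a stand-in for the library that runs map, groups the intermediate pairs by key, and runs reduce; it shows the programming model only and involves no distribution or fault tolerance. `Map`, `Reduce`, and `WordCount` are names chosen for this example.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, int>;

// map (k1, v1) -> list (k2, v2): emit (word, 1) for every word of one document.
std::vector<KeyValue> Map(const std::string& /*doc_name*/, const std::string& text) {
  std::vector<KeyValue> out;
  std::istringstream words(text);
  std::string word;
  while (words >> word) out.emplace_back(word, 1);
  return out;
}

// reduce (k2, list (v2)) -> list (v2): sum all counts of one word.
std::vector<int> Reduce(const std::string& /*word*/, const std::vector<int>& counts) {
  int sum = 0;
  for (int c : counts) sum += c;
  return {sum};
}

// Single-process stand-in for the library: map, group by key, reduce.
std::map<std::string, int> WordCount(const std::map<std::string, std::string>& docs) {
  std::map<std::string, std::vector<int>> grouped;
  for (const auto& doc : docs)
    for (const KeyValue& kv : Map(doc.first, doc.second))
      grouped[kv.first].push_back(kv.second);
  std::map<std::string, int> result;
  for (const auto& group : grouped)
    result[group.first] = Reduce(group.first, group.second).front();
  return result;
}
```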



MapReduce: Data Processing Flow



MapReduce: Architecture

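[Figure: MapReduce execution, running alongside other MapReduce programs]

• (0) split input files
• map: (k1,v1) -> list(k2,v2)
• partitioning of the intermediate data: hash(key) mod R
• notifying
• reduce: (k2,list(v2)) -> list(v2)
• global writing
• input files and output files are stored over GFS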
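
The partitioning rule hash(key) mod R decides which of the R reduce tasks receives an intermediate key. A minimal sketch, using `std::hash` as a stand-in for whatever hash function the library actually uses (an assumption); `Partition` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Picks the reduce task (0 .. R-1) for an intermediate key: hash(key) mod R.
std::size_t Partition(const std::string& key, std::size_t num_reduce_tasks) {
  return std::hash<std::string>{}(key) % num_reduce_tasks;
}
```

Every map task applies the same function, so all values of one key end up at the same reduce task.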



MapReduce: Code Example

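    #include "mapreduce/mapreduce.h"

    // User's map function
    class WordCounter : public Mapper {
     public:
      virtual void Map(const MapInput& input) {
        const string& text = input.value();
        const int n = text.size();
        for (int i = 0; i < n; ) {
          // Skip past leading whitespace
          while ((i < n) && isspace(text[i])) i++;
          // Find word end
          int start = i;
          while ((i < n) && !isspace(text[i])) i++;
          if (start < i) Emit(text.substr(start, i - start), "1");
        }
      }
    };
    REGISTER_MAPPER(WordCounter);

    // User's reduce function
    class Adder : public Reducer {
      virtual void Reduce(ReduceInput* input) {
        // Add up all values that share the same key
        int64 value = 0;
        while (!input->done()) {
          value += StringToInt(input->value());
          input->NextValue();
        }
        Emit(IntToString(value));
      }
    };
    REGISTER_REDUCER(Adder);

    int main(int argc, char** argv) {
      ParseCommandLineFlags(argc, argv);
      MapReduceSpecification spec;
      // Store the list of input files into "spec"
      for (int i = 1; i < argc; i++) {
        MapReduceInput* input = spec.add_input();
        input->set_format("text");
        input->set_filepattern(argv[i]);
        input->set_mapper_class("WordCounter");
      }
      // Specify the output files
      MapReduceOutput* out = spec.output();
      out->set_filebase("/gfs/test/freq");
      out->set_num_tasks(100);
      out->set_format("text");
      out->set_reducer_class("Adder");
      // Optional: do partial sums within map tasks
      out->set_combiner_class("Adder");
      // Tuning parameters
      spec.set_machines(2000);
      spec.set_map_megabytes(100);
      spec.set_reduce_megabytes(100);
      // Now run it
      MapReduceResult result;
      if (!MapReduce(spec, &result)) abort();
      return 0;
    }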



MapReduce: Fault-tolerance

Worker Failure
  • Re-execution of the failed worker's tasks (map or reduce)
  • Completed map tasks (output on local disk) -> re-executed
  • Completed reduce tasks (output on GFS) -> no need of re-execution

Master Failure
  • Periodic checkpointing of the master data structures -> re-execution

Semantics in the Presence of Failure
  • Guarantees atomic commits of map and reduce task outputs
      map output: by the master's confirmation
      reduce output: by the atomic rename operation of GFS



Chubby



Chubby: Overview

Distributed lock service
  • Target: loosely-coupled distributed systems
      a moderately large number of small machines connected by a high-speed network
  • Goals
      reliability, availability, and easy-to-understand semantics
      throughput and storage capacity are considered secondary

Similar to a simple file system, but different in that it offers
  • whole-file reads and writes
  • advisory locks and event notifications

Usage in both GFS and Bigtable
  • for master election
  • for discovering servers and finding the master
  • as a well-known location to store a small amount of metadata
  • as the root of their distributed data structures



Chubby: System structure

[Figure: Chubby system structure]

• Clients: Bigtable master, tablet servers, GFS master, chunkservers, …
• Chubby cell: replicas, each with a simple database
• Distributed consensus protocol
    master election
    database update
• DNS server: replicas list



Chubby: Interface

Similar to a file system interface
  Example) /ls/datacenter000/servers/svr_10980
  • /ls: stands for lock service
  • /datacenter000: the Chubby cell's name
      /local: the client's local Chubby cell
      /global: the global Chubby cell
  • /servers/svr_10980: interpreted within the named Chubby cell

Node (file & directory) metadata
  • three ACL filenames (for reading, writing, and changing ACL names)
  • four monotonically increasing 64-bit numbers
      an instance number, a content generation number, a lock generation number, an ACL generation number

Handles
  • returned when clients open nodes
  • include check digits, a sequence number, and mode information
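
The structure of a Chubby name (/ls, then the cell name, then a path interpreted inside that cell) can be shown with a small parser. This is an illustration only, not part of any Chubby client library; `ChubbyName` and `ParseChubbyName` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type: the parts of a name such as
// /ls/datacenter000/servers/svr_10980.
struct ChubbyName {
  bool valid;
  std::string cell;  // "datacenter000", or the special names "local" and "global"
  std::string path;  // "/servers/svr_10980", interpreted within the cell
};

ChubbyName ParseChubbyName(const std::string& name) {
  const std::string prefix = "/ls/";  // "ls" stands for lock service
  if (name.compare(0, prefix.size(), prefix) != 0) return {false, "", ""};
  const std::size_t cell_end = name.find('/', prefix.size());
  const std::string cell =
      cell_end == std::string::npos
          ? name.substr(prefix.size())
          : name.substr(prefix.size(), cell_end - prefix.size());
  if (cell.empty()) return {false, "", ""};
  const std::string path =
      cell_end == std::string::npos ? std::string("/") : name.substr(cell_end);
  return {true, cell, path};
}
```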



Chubby: Global cell



[Figure: the global cell (/ls/global) and a local cell (/ls/cellname)]

The subtree /ls/global/master is mirrored to the subtree /ls/cell/slave. Mirrored files include:
  • Chubby's own ACLs
  • Advertisement of presence to monitoring services
  • Pointers to allow clients to locate large data sets such as Bigtable cells
  • Many configuration files for other systems



Chubby: API

APIs
  • Open(), Close(), Poison()
  • GetContentsAndStat(), GetStat(), ReadDir()
      contents and metadata are read atomically and in their entirety
  • SetContents(), SetACL()
      written atomically and in their entirety
  • Delete()
  • Acquire(), TryAcquire(), Release()
  • GetSequencer(), SetSequencer(), CheckSequencer()

Usage example: primary election
  1) All potential primaries call Open() and Acquire()
  2) The primary calls SetContents() to write its identity
  3) All replicas are notified by an event and call GetContentsAndStat()
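
The election steps can be walked through with an in-memory stand-in for one lock file. `LockFile` and `RunElection` are hypothetical and only mimic the idea behind TryAcquire(), SetContents(), and GetContentsAndStat(); the real Chubby client API, its handles, and its event notifications are not modeled.

```cpp
#include <cassert>
#include <string>

// Hypothetical in-memory stand-in for one Chubby lock file.
class LockFile {
 public:
  // Idea of TryAcquire(): succeeds only if nobody holds the lock yet.
  bool TryAcquire(const std::string& client) {
    if (!holder_.empty()) return false;
    holder_ = client;
    return true;
  }
  // Idea of SetContents(): replaces the whole file contents.
  void SetContents(const std::string& contents) { contents_ = contents; }
  // Idea of GetContentsAndStat(), returning the contents only.
  const std::string& GetContents() const { return contents_; }

 private:
  std::string holder_;
  std::string contents_;
};

// Every candidate runs this; the result is the identity of the elected primary.
std::string RunElection(LockFile& file, const std::string& me) {
  // The lock is advisory: only the winner chooses to write its identity.
  if (file.TryAcquire(me)) file.SetContents(me);
  return file.GetContents();
}
```

The first candidate to get the lock becomes the primary, and every later candidate learns the primary's identity by reading the file.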



Chubby: Database & Backup

Database Implementation
  • The first version: a replicated version of Berkeley DB
  • Now: a simple database written in-house
      write-ahead logging, snapshotting, and atomic operations

Backup
  • Every few hours, the master writes a snapshot of its DB to a GFS file server in a different building
  • Usage
      disaster recovery
      initializing the DB of a newly replaced replica



Overall View



Bird’s View Revisited

[Figure: bird's-eye view of the platform, top to bottom]

• Batch Clients and Runtime Clients (C++, Sawzall, Java, Python), through the Client Interface
• Cluster Management System
• MapReduce, Workqueue (Scheduler), Bigtable, Chubby
• DB, GFS
• (Linux File System, Multithreading)



Google OS



Server Process View

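[Figure: processes running on a single server, with the Cluster Management System and the Chubby cell alongside]

• Cluster Management System: Global Scheduler, with a Local Scheduler on the server
• Chubby Cell
• Computation (MapReduce): worker pool running map (M) and reduce (R) tasks
• Database (Bigtable): Bigtable Master, Tablet Server; a tablet with SSTable (LG1), SSTable (LG2), SSTable (LG3)
• Storage (GFS): GFS Master, Chunkserver with chunks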



Google Services over Google Platform



Google Analytics



Embedded JavaScript



_uacct = "xxxxxxxxxx"; urchinTracker();



Google Analytics

raw click table (~200 TB)
  • row: tuple (URL, time), e.g. com.abc.www:0001, com.abc.www:0027, com.abc.www:0050
  • column: session info

summary table (~20 TB)
  • row: website, e.g. com.abc.www
  • column: summary

[Figure: MapReduce job from the raw click table to the summary table]

• Input: tablet:com.abc.www (a.html~o.html) and tablet:com.abc.www (p.html~z.html), each an SSTable (GFS file)
• Map (analyzing): value: each session info -> key: website's URL, value: analyzed info
• Reduce (aggregating)



Google Earth



preprocessing & consolidating



Google Earth

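(CF: number of column families, LG: number of locality groups)

imagery table (~70 TB, CF: 8, LG: 3)
  • row: geographic segment, e.g. (x1,y1),(x2,y2)
  • column: image sources

index table (~500 GB, CF: 7, LG: 2)
  • row: geographic segment, e.g. (x1,y1),(x2,y2)
  • column: final images

[Figure: MapReduce job from raw images to final images]

• Input: GFS (raw images); tablet:(x0,y0),(x4,y4), an SSTable (GFS file)
• Map (preprocessing): value: image source -> key: segment, value: final image
• Reduce (consolidating & indexing)
• Output: GFS (final images)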



Personalized Search



user histories

(web queries, click URLs, search keywords, …)



User Profile



Personalized Search

user table (~4 TB, CF: 93, LG: 11)
  • row: userid, e.g. jaesun_han, jisoo1004, jk_tong
  • columns: web queries, search keywords, click URLs, user profile

[Figure: MapReduce job that generates user profiles]

• Input: tablet:ja ~ jn, an SSTable (GFS file)
• Map (analyzing): value: web queries, value: click URLs -> key: userid, value: user history
• Reduce (generating profile)



Q&A



