Anatomy of Google Service Platform
March 2, 2007 Jaesun Han (jshan0000@gmail.com) Contact : http://www.web2hub.com
Contents
• Overview of Web 2.0 Technologies
• Google Service Platform
  • Google File System (GFS)
  • Bigtable
  • MapReduce
  • Chubby
• Case Study: Google Services over Platform
  • Google Analytics
  • Google Earth
  • Personalized Search
Web 2.0 Technology Map
Web 2.0 Technology Layer
• Client Layer and Front-End Layer: XHTML, CSS, RSS, Atom, Microformats, OpenAPI, RIA (Ajax, Flex, Atlas, GWT, XUL, XAML, Mashup, Gadget), Dojo, DWR, REST, JSON, SOAP, PHP, Python, Ruby, RoR, Apache, MySQL
• Data Processing Layer (raw data, external data sources, distributed/parallel processing, processed data, DB): Recommendation (Collaborative Filtering), Ranking, Clustering, Data mining, Personalization, Social Network Analysis
• Platform Layer (distributed storage, distributed file system, cluster management): Cluster Computing, Beowulf, Grid, Globus, Condor, P2P, DHT, MPI, Utility Computing, Virtualization, Autonomous Computing
Google Service Platform
[Figure: the Web 2.0 technology map repeated, with the Platform Layer (distributed storage, distributed file system, cluster management) marked as the Google Service Platform]
Services and Service Library (Service Software Technology)
• Search engine, Email server, IM server, Map database, Various Web sites …
Google OS (System Software Technology)
• Google Linux, Google File System, MapReduce Library, Chubby, BigTable
• Intelligent System, Programming Model (River, TACC), Replication/Redundancy …
Google Cluster (Hardware Technology)
• Clusters, Geographic distribution, Automated Setup, Automated Backup, Standard components, Commodity drives, Flexible co-location, Easy-access design …
• 450,000 or more servers (NYT)
• All PC servers less than $1,000
• 40 or more pizza box servers per rack
Advantages
• Easy Development
• Scalability
• Robustness
Google Service Platform
Computation
• MapReduce (OSDI 2004): distributed data processing library
Storage
• Bigtable (OSDI 2006): distributed storage system for structured data
• GFS (SOSP 2003): distributed file system
• Chubby (OSDI 2006): distributed lock manager
GFS
GFS: Overview
• Scalable distributed file system for large distributed data-intensive applications
  • Runs on inexpensive commodity hardware
  • Delivers high aggregate performance to a large number of clients
Features
• User-level distributed file system
• Centralized architecture
  • metadata: client <-> a single master
  • data: client <-> chunkservers
• Fixed, large chunk size of 64MB
• Non-standard file system interface (not the POSIX API)
  • create, delete, open, close, read, write, snapshot and record append
• Three replicas of each chunk
• No client caching of file data (clients cache only metadata such as chunk locations)
GFS: Architecture
Master
• In-memory data structures (lookup tables)
  • file and chunk namespaces
  • mapping from files to chunks
  • chunk locations
• Operation log
  • file creation/deletion
  • file renaming
  • chunk addition/deletion
Separation of control flow and data flow
• control flow: client <-> master (metadata lookup)
• data flow: client <-> chunkservers (file read/write)
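The metadata lookup can be illustrated with the fixed 64MB chunk size: before asking the master for chunk locations, a client can turn a byte offset into a chunk index. The following is a minimal sketch of that arithmetic only; `ChunkLookup` and `LocateChunk` are hypothetical names and not part of any GFS client library.

```cpp
#include <cstdint>
#include <string>

// Fixed chunk size from the slides: 64 MB.
constexpr std::uint64_t kChunkSize = 64ULL * 1024 * 1024;

// Hypothetical shape of a lookup a client could send to the master.
struct ChunkLookup {
  std::string file_name;
  std::uint64_t chunk_index;   // which chunk of the file holds the byte
  std::uint64_t chunk_offset;  // byte offset inside that chunk
};

// Translate (file name, byte offset) into (file name, chunk index, offset in chunk).
inline ChunkLookup LocateChunk(const std::string& file_name,
                               std::uint64_t byte_offset) {
  return ChunkLookup{file_name, byte_offset / kChunkSize,
                     byte_offset % kChunkSize};
}
```

For example, a byte offset of 200MB falls in chunk index 3, at 8MB into that chunk. The data itself would then be read from a chunkserver, not from the master.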
GFS: Write
• Pipelined data delivery to fully utilize each machine's network bandwidth
• The master tells the client the locations of the primary and the other replicas
• Primary lease (initial timeout = 60s): the primary orders write requests for the same chunk
GFS: Relaxed Consistency
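[Figure: two clients, A and B, write concurrently over byte offsets 0 to 7 (one write starts at offset 2, the other at offset 4). The data is split across chunk1 (offsets 0 to 3) and chunk2 (offsets 4 to 7), and the replicas of chunk2 may apply the two writes in either order.]
• case 1 (chunk2: B -> A): Undefined
• case 2 (chunk2: A -> B): Consistent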
GFS: Atomic Record Appends
[Figure: ordinary writes ("write from 1", "write from 2") compared with record appends of A and B on chunk1 (offsets 0 to 3) and the replicas of chunk1]
Record Append
• records A and B are appended to chunk1 and to the replicas of chunk1
• case 1 (B -> A), case 2 (A -> B)
• error in writing B
• writing at the exact offset
• max record size: ¼ of the max chunk size (padding)
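The padding rule can be sketched as follows. This is a simplified model that assumes the behavior described in the GFS paper: the primary picks the offset, and when a record does not fit in the current chunk, the chunk is padded and the client retries on the next chunk. `ChunkState`, `AppendResult` and `AppendStatus` are hypothetical names introduced for this illustration.

```cpp
#include <cstdint>

constexpr std::uint64_t kChunkSize = 64ULL * 1024 * 1024;  // 64 MB chunk
constexpr std::uint64_t kMaxRecordSize = kChunkSize / 4;   // 1/4 of the chunk

enum class AppendStatus { kAppended, kRetryOnNextChunk, kRecordTooLarge };

struct AppendResult {
  AppendStatus status;
  std::uint64_t offset;  // offset chosen by the primary (valid when kAppended)
};

// Hypothetical model of the primary's view of the last chunk of a file.
struct ChunkState {
  std::uint64_t used = 0;  // bytes already written, including padding

  AppendResult RecordAppend(std::uint64_t record_size) {
    if (record_size > kMaxRecordSize) {
      return {AppendStatus::kRecordTooLarge, 0};
    }
    if (used + record_size > kChunkSize) {
      used = kChunkSize;  // pad the rest of the chunk
      return {AppendStatus::kRetryOnNextChunk, 0};
    }
    const std::uint64_t offset = used;  // the same offset goes to all replicas
    used += record_size;
    return {AppendStatus::kAppended, offset};
  }
};
```

Limiting records to a quarter of the chunk size bounds how much space a single padding step can waste.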
Bigtable
Bigtable: Overview
Motivation
• Lots of structured and semi-structured data
  • web crawl data, satellite images, user data, email, …
• No commercial system big enough
Bigtable
• Distributed storage system for structured data
• A sparse, distributed, persistent multi-dimensional sorted map
Goals
• wide applicability, scalability, high performance, and high availability
Target workloads
• from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users
Applications
• more than 60 Google products and projects (Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth)
Bigtable: Data Model
Table
• Row Key: atomic read/write for a single row key
• Column Key (Column Family:Qualifier)
  • Column Family: the basic unit of access control
• Timestamp
Tablets
• the unit of distribution and load balancing
• example: the tablet containing row com.web2hub.www
Indexing
• (row:string, column:string, time:int64) -> string
• (com.cnn.www, anchor:my.look.ca, t8) -> “CNN.com”
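The indexing scheme can be illustrated with an ordinary in-memory sorted map. This is a single-process sketch of the data model only, assuming that versions of a cell are kept newest first; `CellKey`, `ToyTable` and `ReadLatest` are hypothetical names, not the Bigtable API.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <string>

// (row:string, column:string, time:int64) -> string
struct CellKey {
  std::string row;
  std::string column;  // "family:qualifier"
  std::int64_t timestamp;
};

// Sort by row, then column, then newest timestamp first.
struct CellKeyLess {
  bool operator()(const CellKey& a, const CellKey& b) const {
    if (a.row != b.row) return a.row < b.row;
    if (a.column != b.column) return a.column < b.column;
    return a.timestamp > b.timestamp;
  }
};

using ToyTable = std::map<CellKey, std::string, CellKeyLess>;

// Return the most recent value stored for (row, column), or nullptr if none.
inline const std::string* ReadLatest(const ToyTable& table,
                                     const std::string& row,
                                     const std::string& column) {
  const CellKey probe{row, column, std::numeric_limits<std::int64_t>::max()};
  auto it = table.lower_bound(probe);
  if (it == table.end() || it->first.row != row ||
      it->first.column != column) {
    return nullptr;
  }
  return &it->second;
}
```

Because rows are kept in sorted order, a contiguous row range such as one tablet corresponds to a contiguous range of this map.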
Bigtable: API
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

// Read every version of every anchor in one row
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
API
• Writing to Bigtable
• Reading from Bigtable
• Metadata operations: create/delete tables and column families, change metadata
Several other features of the API
• single-row transactions: atomic read-modify-write sequences
• execution of client-supplied scripts (written in Sawzall)
Bigtable: SSTable
• Used internally to store Bigtable data
• Immutable, sorted file of key-value pairs
• Data blocks + an index
  • block size is 64KB, but configurable
  • the index is used to locate blocks
  • the index is loaded into memory when the SSTable is opened
[Figure: an SSTable made of 64K blocks of key-value pairs, followed by an index]
Bigtable: Tablet & Locality Group
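[Figure: a tablet, its locality groups, its SSTables and the GFS chunks underneath]
• Tablet: com.cnn.www (abc.html ~ help.html), 100~200MB
• Locality groups: Locality Group1 (contents), Locality Group2 (anchor), Locality Group3 (language, checksum)
• SSTables: SSTable1 (100MB), SSTable2 (50MB), SSTable3 (30MB)
• SSTables are stored in 64MB GFS chunks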
Bigtable: Tablet Location
Features
• Three-level hierarchy (served on tablet servers, not the master)
• The client library caches and prefetches tablet locations
Location row
• row key (tablet's table id + its end row) -> tablet location
• ex) webtable:com.cnn.www
Capacity
• 128MB = 2^17 x 1KB rows
• Addressing 2^34 tablets
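The capacity figure follows from two levels of 128MB location tablets with roughly 1KB per row: 2^17 rows at the first level, each pointing to a tablet with 2^17 rows of its own. A small compile-time check of that arithmetic, with constant names chosen only for this illustration:

```cpp
#include <cstdint>

constexpr std::uint64_t kLocationTabletBytes = 128ULL * 1024 * 1024;  // 128 MB
constexpr std::uint64_t kLocationRowBytes = 1024;                     // ~1 KB per row

// Rows that fit in one location tablet: 2^27 / 2^10 = 2^17.
constexpr std::uint64_t kRowsPerLocationTablet =
    kLocationTabletBytes / kLocationRowBytes;

// Two levels of such tablets address 2^17 * 2^17 = 2^34 user tablets.
constexpr std::uint64_t kAddressableTablets =
    kRowsPerLocationTablet * kRowsPerLocationTablet;

static_assert(kRowsPerLocationTablet == (1ULL << 17), "2^17 rows per tablet");
static_assert(kAddressableTablets == (1ULL << 34), "2^34 addressable tablets");
```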
Bigtable: Tablet Assignment
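[Figure: tablet assignment among the Cluster Management System, the tablet servers, Chubby and the Bigtable master]
1) The cluster management system starts a tablet server (tab_svr10)
2) The tablet server creates a lock file in Chubby (/servers/tab_svr10)
3) The tablet server acquires the new exclusive lock
4) The master monitors /servers
5) The master assigns tablets to the tablet server
6) The master checks the lock status of the tablet server
7) The tablet server fails or loses its lock
8) The master acquires and deletes the lock
9) The master reassigns the unassigned tablets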
Bigtable: Master Failure
Tablet Changes
• Create, Delete, Merge: initiated by master
• Split: initiated by tablet server
Master startup
[Figure: a new master interacting with Chubby and the tablet servers]
0) start a master
1) acquire the master lock in Chubby (/servers/master)
2) scan /servers and get the live server list
3) check the tablets already assigned to the tablet servers
4) scan METADATA tablets
5) reassign unassigned tablets
Bigtable: Read/Write
• A single commit log per tablet server
• memtable (sorted buffer) holding recent mutations
  • t1: Set(“anchor:www.c-span.org”, “CNN”)
  • t2: Delete(“anchor:www.abc.com”)
  • t3: Set(“anchor:www.abc.com”, “ABC”)
  • resulting entries: anchor:www.abc.com -> ABC, anchor:www.abc.com -> null (deletion), anchor:www.c-span.org -> CNN
• SSTables: anchor v1.0, anchor v2.0, anchor v3.0, anchor v4.0
• Reads are served from a merged view of the memtable and the SSTables
Fast writing: a mutation is appended to the commit log and applied to the in-memory memtable
Efficient reading: a merged view of sorted data structures
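The write and read paths can be sketched in a few lines. This is a simplified single-process model, assuming one value per column (no timestamps) and a deletion marker in place of the "null" entry; `ToyTablet`, `Entry` and `SortedRun` are hypothetical names.

```cpp
#include <map>
#include <string>
#include <vector>

struct Entry {
  bool deleted;       // true = deletion marker
  std::string value;
};
using SortedRun = std::map<std::string, Entry>;  // sorted by column key

struct ToyTablet {
  std::vector<std::string> commit_log;  // stands in for the shared commit log
  SortedRun memtable;                   // newest data, in memory
  std::vector<SortedRun> sstables;      // immutable runs, oldest first

  void Set(const std::string& column, const std::string& value) {
    commit_log.push_back("Set " + column);  // log first, then memtable
    memtable[column] = Entry{false, value};
  }

  void Delete(const std::string& column) {
    commit_log.push_back("Delete " + column);
    memtable[column] = Entry{true, ""};
  }

  // Merged view: the newest source that mentions the column wins.
  bool Read(const std::string& column, std::string* out) const {
    const Entry* found = nullptr;
    auto hit = memtable.find(column);
    if (hit != memtable.end()) found = &hit->second;
    for (auto run = sstables.rbegin();
         found == nullptr && run != sstables.rend(); ++run) {
      auto it = run->find(column);
      if (it != run->end()) found = &it->second;
    }
    if (found == nullptr || found->deleted) return false;
    *out = found->value;
    return true;
  }
};
```

Replaying t1, t2 and t3 from the slide against this model leaves anchor:www.abc.com readable as ABC, because the later Set shadows the earlier Delete.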
Bigtable: Compactions
[Figure: SSTables v1.0 to v5.0 plus the memtable, and the SSTable v6.0 produced by compaction]
• Minor compaction: memtable -> a new SSTable
• Major compaction: memtable + all SSTables -> only one SSTable
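Both compactions can be sketched on a toy tablet. This is a simplified model that assumes a major compaction can drop deletion markers, since no older data remains for them to hide; `ToyTablet`, `MinorCompaction` and `MajorCompaction` are hypothetical names.

```cpp
#include <map>
#include <string>
#include <vector>

struct Entry {
  bool deleted;       // true = deletion marker
  std::string value;
};
using SortedRun = std::map<std::string, Entry>;

struct ToyTablet {
  SortedRun memtable;               // newest data
  std::vector<SortedRun> sstables;  // immutable runs, oldest first
};

// Minor compaction: freeze the memtable into a new SSTable.
inline void MinorCompaction(ToyTablet* t) {
  if (t->memtable.empty()) return;
  t->sstables.push_back(t->memtable);
  t->memtable.clear();
}

// Major compaction: merge memtable + all SSTables into exactly one SSTable.
inline void MajorCompaction(ToyTablet* t) {
  SortedRun merged;
  for (const SortedRun& run : t->sstables) {  // oldest first, newer overwrites
    for (const auto& kv : run) merged[kv.first] = kv.second;
  }
  for (const auto& kv : t->memtable) merged[kv.first] = kv.second;
  for (auto it = merged.begin(); it != merged.end();) {
    if (it->second.deleted) {
      it = merged.erase(it);  // nothing older is left to shadow
    } else {
      ++it;
    }
  }
  t->sstables.clear();
  t->sstables.push_back(merged);
  t->memtable.clear();
}
```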
MapReduce
MapReduce: Overview
Motivation
• Input data is large
• Lots of machines: hundreds of thousands of PC servers
MapReduce
• Programming model and implementation for parallel processing of large data sets
• Parallelization, fault-tolerance, data distribution, and load balancing are handled in a MapReduce library
• map & reduce functions
  • map (k1, v1) -> list(k2, v2)
  • reduce (k2, list(v2)) -> list(v2)
Usage Examples
• Distributed Grep, Count of URL Access Frequency, Reverse Web-Link Graph, Term-Vector per-Host, Inverted Index, Distributed Sort
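The two function signatures can be exercised without a cluster. The sketch below is a sequential, single-process word count that only mirrors the map and reduce contract; `MapWords`, `ReduceSum` and `RunWordCount` are hypothetical names, and the real library distributes these calls across many machines.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, std::string>;

// map(k1, v1) -> list(k2, v2): emit (word, "1") for every word in the text.
std::vector<KeyValue> MapWords(const std::string& /*doc_name*/,
                               const std::string& text) {
  std::vector<KeyValue> out;
  const std::size_t n = text.size();
  for (std::size_t i = 0; i < n;) {
    while (i < n && std::isspace(static_cast<unsigned char>(text[i]))) i++;
    const std::size_t start = i;
    while (i < n && !std::isspace(static_cast<unsigned char>(text[i]))) i++;
    if (start < i) out.emplace_back(text.substr(start, i - start), "1");
  }
  return out;
}

// reduce(k2, list(v2)) -> list(v2): sum the counts for one word.
std::vector<std::string> ReduceSum(const std::string& /*word*/,
                                   const std::vector<std::string>& counts) {
  long long total = 0;
  for (const std::string& c : counts) total += std::stoll(c);
  return {std::to_string(total)};
}

// Sequential stand-in for the library: map, group by key, reduce.
std::map<std::string, std::string> RunWordCount(
    const std::map<std::string, std::string>& docs) {
  std::map<std::string, std::vector<std::string>> grouped;
  for (const auto& doc : docs) {
    for (const KeyValue& kv : MapWords(doc.first, doc.second)) {
      grouped[kv.first].push_back(kv.second);
    }
  }
  std::map<std::string, std::string> result;
  for (const auto& g : grouped) {
    result[g.first] = ReduceSum(g.first, g.second).front();
  }
  return result;
}
```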
MapReduce: Data Processing Flow
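MapReduce: Architecture
[Figure: MapReduce execution, running alongside other MapReduce programs]
• (0) split input files (over GFS)
• map: (k1,v1) -> list(k2,v2)
• partitioning of map output: hash(key) mod R
• notifying the master
• reduce: (k2,list(v2)) -> list(v2)
• global writing of reduce output (over GFS)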
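The partitioning step, hash(key) mod R, decides which of the R reduce tasks receives each intermediate key. A minimal sketch, assuming `std::hash` as the hash function (the hash actually used is not stated in the slides); `Partition` and `PartitionKeys` are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// hash(key) mod R: index of the reduce task that receives this key.
inline std::size_t Partition(const std::string& key,
                             std::size_t num_reduce_tasks) {
  return std::hash<std::string>{}(key) % num_reduce_tasks;
}

// Split a list of intermediate keys into R buckets, one per reduce task.
inline std::vector<std::vector<std::string>> PartitionKeys(
    const std::vector<std::string>& keys, std::size_t num_reduce_tasks) {
  std::vector<std::vector<std::string>> buckets(num_reduce_tasks);
  for (const std::string& key : keys) {
    buckets[Partition(key, num_reduce_tasks)].push_back(key);
  }
  return buckets;
}
```

Every occurrence of the same key lands in the same bucket, which is what lets a single reduce task see all values for that key.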
MapReduce: Code Example
// User's map function: emit (word, "1") for every word in the input
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i])) i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;
      if (start < i) Emit(text.substr(start, i-start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

// User's reduce function: add up all values for the same key
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;

  // Store the list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");
  // Optional: do partial sums within map tasks
  out->set_combiner_class("Adder");

  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}
MapReduce: Fault-tolerance
Worker Failure
• Re-execution of the failed worker's tasks (map or reduce)
  • Completed map tasks (output on local disk): re-executed
  • Completed reduce tasks (output on GFS): no need of re-execution
Master Failure
• Periodic checkpointing of the master data structures
• Re-execution
Semantics in the Presence of Failure
• Guarantee atomic commits of map and reduce task outputs
  • map output: committed by the master's confirmation
  • reduce output: committed by the atomic rename operation of GFS
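The worker-failure rule can be written as a small decision function. This sketch assumes, as in the MapReduce paper, that tasks still in progress on the failed worker are also rescheduled; the enum and function names are hypothetical.

```cpp
enum class TaskKind { kMap, kReduce };
enum class TaskState { kIdle, kInProgress, kCompleted };

// Decide whether a task assigned to a failed worker must run again.
inline bool MustReexecuteOnWorkerFailure(TaskKind kind, TaskState state) {
  if (state == TaskState::kInProgress) return true;  // work was lost midway
  if (state == TaskState::kCompleted) {
    // Map output lives on the failed worker's local disk, so it is gone.
    // Reduce output is already stored in GFS, so it survives.
    return kind == TaskKind::kMap;
  }
  return false;  // idle tasks never started on the failed worker
}
```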
Chubby
Chubby: Overview
Distributed lock service
• Target: loosely-coupled distributed systems
  • a moderately large number of small machines connected by a high-speed network
Goals
• reliability, availability, and easy-to-understand semantics
• throughput and storage capacity are considered secondary
Similar to a simple file system, with these differences
• whole-file reads and writes
• augmented with advisory locks and event notifications
Usage in both GFS and Bigtable
• for master election
• for discovering servers and finding the master
• as a well-known location to store a small amount of metadata
• as the root of their distributed data structures
Chubby: System structure
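[Figure: a Chubby cell and its clients]
• Clients: Bigtable master, tablet servers, GFS master, chunkservers, …
• Replicas: each keeps a simple database
• Distributed consensus protocol among the replicas
  • master election
  • database update
• DNS server: replicas list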
Chubby: Interface
Similar to a file system interface
• Example) /ls/datacenter000/servers/svr_10980
  • /ls: stands for lock service
  • /datacenter000: Chubby cell's name
    • /local: client's local Chubby cell
    • /global: global Chubby cell
  • /servers/svr_10980: interpreted within the named Chubby cell
Node (file & directory) metadata
• three ACL filenames (reading, writing and changing ACL names)
• four monotonically increasing 64-bit numbers
  • an instance number, a content generation number, a lock generation number, an ACL generation number
Handles
• returned when clients open nodes
• include check digits, a sequence number, mode information
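The naming scheme (/ls, then the cell name, then a path interpreted inside that cell) can be illustrated with a small parser. This is only a sketch of the string layout; `ChubbyName` and `ParseChubbyName` are hypothetical names and not part of the Chubby client library.

```cpp
#include <cstddef>
#include <string>

struct ChubbyName {
  bool valid;
  std::string cell;  // e.g. "datacenter000", "local" or "global"
  std::string path;  // interpreted within that cell, e.g. "/servers/svr_10980"
};

// Split "/ls/<cell>/<path>" into its cell name and in-cell path.
inline ChubbyName ParseChubbyName(const std::string& name) {
  const std::string prefix = "/ls/";
  if (name.compare(0, prefix.size(), prefix) != 0) {
    return ChubbyName{false, "", ""};
  }
  const std::size_t cell_end = name.find('/', prefix.size());
  if (cell_end == std::string::npos) {
    const std::string cell = name.substr(prefix.size());
    return ChubbyName{!cell.empty(), cell, "/"};  // the cell's root
  }
  const std::string cell =
      name.substr(prefix.size(), cell_end - prefix.size());
  return ChubbyName{!cell.empty(), cell, name.substr(cell_end)};
}
```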
Chubby: Global cell
[Figure: the global cell (/ls/global) and a local cell (/ls/cellname)]
• The subtree /ls/global/master is mirrored to the subtree /ls/cell/slave
• Contents of the global cell
  • Chubby's own ACLs
  • Advertisement of presence to monitoring services
  • Pointers to allow clients to locate large data sets such as Bigtable cells
  • Many configuration files for other systems
Chubby: API
APIs
• Open(), Close(), Poison()
• GetContentsAndStat(), GetStat(), ReadDir()
  • contents and metadata are read atomically and in entirety
• SetContents(), SetACL()
  • written atomically and in entirety
• Delete()
• Acquire(), TryAcquire(), Release()
• GetSequencer(), SetSequencer(), CheckSequencer()
Usage example: primary election
• All potential primaries call Open() and Acquire()
• The primary calls SetContents() to write its identity
• All replicas are notified by an event and call GetContentsAndStat()
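The election pattern can be sketched against an in-memory stand-in for one lock file. `ToyLockFile` is a hypothetical single-process model, not the Chubby API: it borrows only the method names TryAcquire, Release and SetContents from the list above, and it replaces event notification with a direct read.

```cpp
#include <string>

// Hypothetical in-memory stand-in for one Chubby lock file.
class ToyLockFile {
 public:
  // Take the exclusive lock if nobody holds it.
  bool TryAcquire(const std::string& client) {
    if (!holder_.empty()) return false;
    holder_ = client;
    return true;
  }

  void Release(const std::string& client) {
    if (holder_ == client) holder_.clear();
  }

  // Only the lock holder may write the file contents.
  bool SetContents(const std::string& client, const std::string& contents) {
    if (holder_ != client) return false;
    contents_ = contents;
    return true;
  }

  const std::string& GetContents() const { return contents_; }

 private:
  std::string holder_;
  std::string contents_;
};

// Every candidate runs this: the lock winner writes its identity, and
// everyone (winner or not) learns the primary by reading the file.
inline std::string ElectPrimary(ToyLockFile* file, const std::string& self) {
  if (file->TryAcquire(self)) file->SetContents(self, self);
  return file->GetContents();
}
```

If the primary releases the lock, the next candidate to run the same routine becomes the new primary and overwrites the identity stored in the file.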
Chubby: Database & Backup
Database Implementation
• The first version: a replicated version of Berkeley DB
• Now: a simple database written for Chubby
  • write-ahead logging, snapshotting and atomic operations
Backup
• Every few hours, the master writes a snapshot of its DB to a GFS file server in a different building
• Usage
  • disaster recovery
  • initializing the DB of a newly replaced replica
Overall View
Bird’s View Revisited
[Figure: layered view of the platform]
• Batch Clients and Runtime Clients
• Client Interface: C++, Sawzall, Java, Python
• MapReduce, Workqueue (Scheduler), Bigtable
• Chubby, DB, GFS
• Google OS (Linux File System, Multithreading)
• Cluster Management System
Server Process View
[Figure: processes on a single server, together with the Cluster Management System and a Chubby cell]
• Computation (MapReduce): Global Scheduler, Local Scheduler, worker pool running map (M) and reduce (R) tasks
• Database (Bigtable): Bigtable Master, Tablet Server, Tablet, SSTables (LG1), (LG2), (LG3)
• Storage (GFS): GFS Master, Chunkserver, chunks
Google Services over Google Platform
Google Analytics
• Embedded JavaScript
Google Analytics
[Figure: MapReduce over Bigtable tables]
• raw click table (~200TB)
  • row: tuple(URL,time), e.g. com.abc.www:0001, com.abc.www:0027, com.abc.www:0050
  • column: session info
• summary table (~20TB)
  • row: website, e.g. com.abc.www
  • column: summary
• tablets: tablet:com.abc.www (a.html~o.html) and tablet:com.abc.www (p.html~z.html), each an SSTable (GFS file)
• Map (analyzing)
  • value: each session info
  • key: website’s URL, value: analyzed info
• Reduce (aggregating)
Google Earth
• preprocessing & consolidating
Google Earth
[Figure: MapReduce over Bigtable tables and GFS]
• imagery table (~70TB, CF:8, LG:3)
  • row: geographic segment, e.g. (x1,y1),(x2,y2)
  • column: image sources
• index table (~500GB, CF:7, LG:2)
  • row: geographic segment, e.g. (x1,y1),(x2,y2)
  • column: final images
• tablet:(x0,y0),(x4,y4), an SSTable (GFS file)
• Map (preprocessing)
  • value: image source
  • key: segment, value: final image
• Reduce (consolidating & indexing)
• GFS (raw images), GFS (final images)
Personalized Search
• user histories (web queries, click URLs, search keywords, …) -> User Profile
Personalized Search
[Figure: MapReduce over the user table]
• user table (~4TB, CF:93, LG:11)
  • row: userid, e.g. jaesun_han, jisoo1004, jk_tong
  • columns: web queries, search keywords, click URLs, user profile
• tablet:ja ~ jn, an SSTable (GFS file)
• Map (analyzing)
  • value: web queries
  • value: click URLs
  • key: userid, value: user history
• Reduce (generating profile)
Q&A