					Interesting Techniques of

Large Scale Internet Service Seminar Week #3 SPARCS ’01 lacrimosa

Contents
 



Part 0. Introduction to Google
Part I. Google’s Search Method
  Google Vs. Yahoo
  Information Retrieval
  Google’s IR
Part II. Google Linux Cluster
  Challenges for web search
  Google’s Hardware

Part 0. Introduction to Google


 

 

Founded: 1998.09 by Stanford Ph.D. students Larry Page and Sergey Brin
Mission: Organize the world's information
Traffic: 150M queries/day, 1000+ queries/sec
Initial funding: $25 million (1999.06)

Part I.

Google's Search Method

Google Vs. Yahoo (1/2)


Why does Google find more useful results than Yahoo?





Yahoo - Subject directory - Editors build the index and put each site into a category.
Google - Web search - A “spider”, “robot” or “bot” crawls the web to build the index.

Google Vs. Yahoo (2/2)
                Yahoo’s Directory Service    Google’s Web Search
Maintained by   Human                        Robot
Directory       Categorized                  Weakly classified
DB              Smaller                      Vast
Result          Fewer                        Huge (sometimes excessive)
Quality         Higher quality               Vulnerable to garbage pages

Web Characteristic - Search Engine’s View
     



Huge and completely heterogeneous
Uncontrolled
Ill-formed
Multilingual
Commercial & illegal
Irrelevant pages or misinformation
Old versions & duplicates

How to find a USEFUL page? → Let’s see Google’s solutions!



Traditional IR
 

IR: Information Retrieval
Ranking by frequency of the query term

[Figure: three documents ranked by how often the query term "SPARCS" appears, most frequent first]

Library, news, articles → Traditional IR is good
Recall the characteristics of the modern web → Is traditional IR still good?
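The frequency-based ranking described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not a real IR system: the function names, the tokenizer and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter


def term_frequency(text, term):
    """Count case-insensitive occurrences of `term` among the words of `text`."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)[term.lower()]


def rank_by_term_frequency(documents, query):
    """Return names of documents containing `query`, most occurrences first.

    `documents` maps a (hypothetical) document name to its text.
    Ties are broken alphabetically so the order is deterministic.
    """
    scored = [(term_frequency(text, query), name) for name, text in documents.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for count, name in scored if count > 0]


# Made-up sample documents, for illustration only.
docs = {
    "doc1": "SPARCS home. SPARCS is a club. Join sparcs!",
    "doc2": "A page that mentions SPARCS once.",
    "doc3": "An unrelated page.",
}
ranking = rank_by_term_frequency(docs, "SPARCS")
```

Note how easy this score is to inflate: a page only has to repeat the query term, which is one reason the approach suits curated collections better than the open web.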

Google’s IR - Link Structure

[Figure: a "Circles" page at kaist.ac.kr and a "Reference" page at kldp.org each link to sparcs.org with <a href=…> SPARCS</a>]

Rank by NUMBER of the referring pages.
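Ranking by the number of referring pages amounts to counting in-links. A minimal sketch follows; the link table is a toy reconstruction of the slide's example, and the function name is hypothetical.

```python
from collections import Counter


def count_inlinks(links):
    """Count, for every page, how many distinct other pages link to it.

    `links` maps a page to the list of pages it links to.
    Self-links and repeated links from the same page are ignored.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


# Toy link structure modelled on the slide: two pages refer to sparcs.org.
links = {
    "kaist.ac.kr": ["sparcs.org"],
    "kldp.org": ["sparcs.org"],
    "sparcs.org": [],
}
inlinks = count_inlinks(links)
best_page, best_count = inlinks.most_common(1)[0]
```

A pure count treats every referring page as equally trustworthy, which is the weakness the next slide addresses.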
[Figure: pages A and B and their referring pages; labels: Silly Page, Old page, Yahoo!, dummy.html, /~person/links.html]

Rank by QUALITY of the referring pages.

Google’s IR - Page Rank

[Figure: Page A (4 out-links) and Page B (3 out-links) each link to page P]


PR(P) = β * {1/4 * PR(A) + 1/3 * PR(B)} + (1 - β)
What is β? Random surfer model: with probability β the surfer follows a random out-link of the current page, and with probability 1 - β jumps to some other page.
→ The higher β, the more content in the page.
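The formula above can be applied to every page at once and iterated until the values settle. The following is a minimal sketch of that iteration using the slide's (unnormalized) form PR(P) = β * Σ PR(Q)/outlinks(Q) + (1 - β). The function name, the default β = 0.85, the iteration count and the toy graph are assumptions for illustration, not Google's production algorithm.

```python
def pagerank(links, beta=0.85, iterations=100):
    """Iterate PR(P) = beta * sum(PR(Q) / outlinks(Q)) + (1 - beta).

    `links` maps each page to the list of pages it links to.
    Pages without out-links simply pass on no rank in this sketch.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {page: 1.0 - beta for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # dangling page: contributes nothing here
            share = beta * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): A links to P and B, B links to P, P links to A.
toy_links = {"A": ["P", "B"], "B": ["P"], "P": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph P ends up ranked above B: P is referred to by both A and B, while B receives only half of A's rank.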

Advantages






Accuracy: A higher-ranked page is likely to be a useful one.
Fairness: Tampering with the result is HARD → no one can buy a higher rank.
Organized: Easy to follow the link structure.
Performance?



Part II.

Google Linux Cluster

Google Facts


   


  

Traffic: 150M queries/day, 1000+ queries/sec
Size: 3+ billion documents indexed
Interface languages: 82
Servers: 15,000
World office locations: 12
Data centers: 6
Disk storage: in petabytes (1 PB = 10^15 bytes)
Update: once a month
Update amount: 10+ terabytes
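As a quick sanity check, the two traffic figures are consistent with each other: spreading 150 million queries evenly over one day gives an average well above 1000 queries per second. The variable names below are chosen for this example only.

```python
QUERIES_PER_DAY = 150_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Average load if queries arrived at a perfectly even rate.
average_qps = QUERIES_PER_DAY / SECONDS_PER_DAY
print(f"average load: {average_qps:.0f} queries/sec")
```

Real traffic is not evenly spread, so the peak rate would be higher than this average.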

Challenges
  

Administering 15,000 servers
Debugging performance problems
Handling bit errors

System Overview

Hardware Architecture

[Figure: Internet → 256Gbps Switch → Load Balancer → 00-node cluster, 00-node cluster, 00-node cluster]

Local Cluster Architecture
2 x 100Mbps Ethernet switch
40 ~ 80 rack-mounted 1U or 2U PC servers

CPU: Pentium III 400-800MHz
HDD: 40-80 GB EIDE
OS: 100% Linux
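Combining the per-server disk size with the 15,000 servers quoted under Google Facts gives a rough idea of the raw disk capacity. The sketch below assumes one disk per server, which the slides do not state, and uses decimal units (1 PB = 1,000,000 GB).

```python
SERVERS = 15_000
DISK_GB_RANGE = (40, 80)   # per-disk capacity from the slide
DISKS_PER_SERVER = 1       # assumption: not stated in the slides
GB_PER_PB = 1_000_000

# Raw capacity at the low and high ends of the disk size range.
low_pb, high_pb = (
    SERVERS * DISKS_PER_SERVER * gb / GB_PER_PB for gb in DISK_GB_RANGE
)
print(f"raw disk capacity: {low_pb:.1f} to {high_pb:.1f} PB")
```

Under that assumption the total is on the order of one petabyte, in line with the "petabytes" figure given for disk storage.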

Ideal Hardware for Google


 

 

Shorter pipelines - bit-level manipulation leads to unpredictable branches
64-bit address space
Caches don't need to be large - temporal data locality is poor
TCP acceleration (not urgent)
Fast interconnect - but it needs to be cheap ($100/port including router)

Helpful for Google hardware
   

On-chip multiprocessing
SMT / hyper-threading
Integration (CPU/RAM)
100℃ tolerance chips

Not helpful for Google hardware


 



Denser packaging - space isn't expensive
Larger SMP machines
Variable voltage/clocking - doesn't change peak load
SIMD features - no byte-level parallelism

Summary


  

Google uses lots of PCs
Replication / Partition
Google needs throughput, not peak speed.
Google wishes CPUs would use less power.
Cheap is good.

Reference












About Google: http://www.google.com/about.html
Page Rank Overview: http://www.google.com/technology/index.html
Intel’s Success Stories: http://www.intel.com/eBusiness/casestudies/snapshots/google.htm
Photos: http://bohnsack.com/photos/sc2002/page/2/
Search Engine Method: http://www.eas.asu.edu/~p2pcom/seminar/070902-lintao.ppt
Urs Hoelzle’s Lecture Video: http://www.cs.washington.edu/info/videos/asx/colloq/UHoelzle_2002_11_05.asx


				