Embed
Email

web-search

Document Sample

Shared by: wuzhenguang
Categories
Tags
Stats
views:
0
posted:
1/5/2012
language:
pages:
40
IELM 231: IT for Logistics and Manufacturing

Course Agenda

Introduction

IT applications design: Human-Computer Interface

Fundamental IT tools: sorting, searching

The Client-Server architecture, Interacting applications

IT in logistics, Case study 1: web-search

Search robots

Data processing, Data storage/retrieval (DB, indexes)

Data presentation: page ranking techniques



IT in logistics, Case study 2: web-based auctions

How auctions work

Web issues: session tracking

Web issues: secure communications

Web issues: cash transactions

The Internet and Web searching



Part 1. The infrastructure of the web

- The architecture of the Internet

- The ISO/OSI 7-layer communication protocol

- Client-server architecture for applications



Part 2. How to use a web search engine [case study: Google]



- Basic searching

- Advanced searching (Booleans) and filtering (setting up local engines)





Part 3. How search engines work



- Crawling the internet

- Indexing of information, database design

- Page ranking

History of search engines



1993: First web robot – World Wide Web Wanderer

Matthew Gray, Physics student from MIT

Objective: track all pages on web to monitor growth of the web



1994: First search engine – WebCrawler

Brian Pinkerton, CS student from U of Washington

Objective: download web pages, store the links linked to keyword-searchable DB



1994: Jerry’s Guide to the Internet

Jerry Yang, David Filo, Stanford University

Objective: crawl for web pages, organize them by content into hierarchies

 Yet Another Hierarchical Officious Oracle (Yahoo)



1994-97: Lycos, Infoseek, AltaVista, Excite, LookSmart (meta engine)



1998: Google (Sergey Brin, Larry Page, CS students, Stanford University)

Basics of web searching: two modes of searching



Web directories









[Boolean] keyword search

Basics of web directories





A hierarchy of main topics is formed (by search company)



Each topic may have several sub-topics, etc.







All pages ‘known’ to engine are linked to the hierarchy by humans (editors)





Users search for ‘main’ articles in any subtopic by ‘browsing’





Pages deep inside a hierarchy can be searched using ‘keyword search’

Basics of web searching: user’s view



Basic search modes



Any page with the search term(s) occurring in it is returned



What if you search of multiple words?



implicit AND: page must contain all the words in search term





Exact phrase search



Typically by quoting the search term: “now is the time”



What happens if you use: now is the time





Typically: is and the will be omitted, results will differ from above!

Basics of web searching: user’s view..



Boolean (Set theoretic) operations on search results:



Intersection: AND (implicit)  manufacturing die casting



Union: OR  manufacturing die OR shell casting



Difference (NOT): ‘-’  manufacturing cold forging -hot





Some useful Google™ search modes:



Definition of a word or phrase  define:search



Page contains numbers in a given range  ming emperor 1400..1500





Simple calculator functions  150*(1+0.05)**5-150

Basics of web searching: user’s view…



Filtering



Anchor: the text in an HTML tag that carries the hyperlink to a page

Example:

IELM 330 web page





Anchors have special significance in Google (we shall see this later)



Anchor text often indicates the content of the page it links



Search term must occur in the anchor text  inanchor:manufacturing-wiki



Search term must occur in the title of page  allintitle:joneja ajay



Focused search



Search only on pages that occur within a particular site  IELM 330 site:ust.hk

The Internet and Web searching



Part 1. The infrastructure of the web

- The architecture of the Internet

- The ISO/OSI 7-layer communication protocol

- Client-server architecture for applications



Part 2. How to use a web search engine [case study: Google]



- Basic searching

- Advanced searching (Booleans) and filtering (setting up local engines)





Part 3. How search engines work



- Crawling the internet

- Indexing of information, database design

- Page ranking

Collecting information on the internet: Crawlers





How do search engines know which page has what data?





- By the use of robots (also called spiders or crawlers)

- By asking sites to tell them who they are (registering)





Crawling:

Start with a small list of web-pages (seeds) := OPEN

For each URL in OPEN:

Fetch page;

Process page (index its contents  see later);

Add all web-links (URL’s) found on the page to OPEN

Problems in crawling



Too many web-sites > 100 million

Each website has many pages (estimated total: ~ 25 billion)



Some pages may not be linked from any seed



Dynamic page generation (e.g. DB based, or rapidly changing)





Suppose a web-site generates a page based on some inputs from user



How can we program a crawler to get this content ?



What potential problems can be faced by this technique?



Art gallery website:

- 3 methods to sort thumbnails; 3 choices of font size; 2 skins 

18 URL’s (using HTTP GET with different options) for same page!

Crawler behavior





Basic crawler is easy to program

Policies used by crawlers determine their efficiency:





Selection policy: which page to download (among all URL’s in OPEN)







Breadth first is better than depth first why?

Crawler behavior..

Re-visit policy: how often to re-visit a page (to account for dynamic content)



How often to re-visit? Depends on how frequently the page is changed



out-dated data  cost [but how to estimate this cost?]



1, if p is identical to local copy

Freshness of a page p in repository at time t = 0, otherwise



0, if p is identical to local copy

Age of a page p in repository at time t =

t- time when p was modified, otherwise









[source: wikipedia]

Crawler behavior...



Policies used by crawlers determine their efficiency:



Politeness policy: how not to overload resources





Crawlers can send download requests faster than the server, the network

or the indexer can handle





Typical policy: wait n seconds before sending next page request to same server





n = 10 sec … 180 sec

Crawler behavior….



Policies used by crawlers determine their efficiency:



Parallelization policy: how to use multiple crawlers in a coordinated search





advantage: the search engine can use multiple computers to build indexes





problems:

- how to coordinate the merge the results of the multiple computers

- how to ensure that same page is not downloaded by more than one process

Search engines: Indexing



What to do with the pages downloaded by the crawler?





Processing the data:

- categorization (mime-types)

- text processing

- delete stop words [Why?] [How?]

- Identify and handle words

- Identify and handle phrases [How many ?]

- Handling similar words

- plurals, past/present, mis-spellings: stemming

Search engines: Indexing..

Categorization (mime-types)



The indexer must know what type of data it received, and handle it accordingly:



A/V mime-types: audio/mpeg, audio/x-wav, audio/x-ms-wma, video/mpeg, …



Currently: check meta data (anchor text, ALT-tags, HTTP Header, …

Future: Digital signal processing  voice recognition  text  index



Image mime-types: image/gif, image/jpeg, image/png, image/tiff, …



Currently: check meta data (anchor text, ALT-tags, HTTP Header, …)



Application mime-types: application/javascript, application/xhtml+xml,

application/octet-stream, application/x-shockwave-flash, …



Currently: If interpreted language  handle as text, otherwise meta data



Other mime-types: e.g. 3D data etc  meta data, filenames, etc.

Search engines: Indexing…



Text processing



- Identify and handle words



what is a word?

- English: dictionary

- Programming: contiguous alphanumeric with ‘_’

- hyphenated [how does Google handle hyphenated search-terms?]

- special characters



- delete stop words



Stop words: very common words in text, such as the, a, an, on, … ?



How to define? language experts (human) make the list



How to handle? string-match  do not index stop words

Search engines: Indexing….



- Identify and handle phrases



what is a phrase?

- English: dictionary of phrases, human experts

- How to identify: string match

- How many?

- Too many  Index is too large

- Too few  real-time Boolean operations during search (takes time!)





Google n-gram dataset: ~24 GB compressed text files



Number of tokens: 1,024,908,267,229

Number of unigrams: 13,588,391

Number of bigrams: 314,843,401

Number of trigrams: 977,069,902

Number of fourgrams: 1,313,818,354

Number of fivegrams: 1,176,470,663

Search engines: Indexing…..



- Handling similar words

- plurals, past/present, mis-spellings: stemming



- Search for ‘expert’ == ? == Search for ‘experts’



Stemming: To reduce a given word to a ‘canonical’ root word



Basic methods:

- Use a dictionary

- Analyze sentence, guess the meaning of the word used, find root from look-up table

- Use simple processing rules (English)

- if word ends in ‘ed’, ‘ing’, ‘ly’  remove ‘ed’, ‘ing’, ‘ly’

- How to identify mis-spellings?

- Dictionary lookup and match  expensive

- Look-up table with list of commonly misspelled words

Search engines: Indexing……



Inverted indexes: How to store processed data



Most common storage structure: Inverted Index



ngram Documents



storm URL1, URL6, …, URL_n



teacup URL1, …, URL_k









How to store?



- Database file (lookup time may be long)



- Hash (faster lookup on average)



Challenge: Updating an index [Why?] [How?]



Challenge: Merging indexes [Why?  Parallel processing] [How?]

Search engines: presenting results to user



Typical query may match 100 millions web pages (hits)









Typical user may look at perhaps 3-5 pages to find the information they want



Successful search engine  must provide a link to ‘best’ page among top few





How to rank the relative likelihood that a page will provide ‘wanted’ info?



- Google uses PageRank™

Search engines: PageRank





How to rank the relative likelihood that a page will provide ‘wanted’ info?



- Google uses PageRank™





PageRank: Invented by PhD student, Larry Page



Similar older ideas, e.g. link analysis (graph theory), citation index





Main idea of PageRank:

1. Importance of page is related to how many other pages link to it, and

2. How important are those pages



Is PageRank effective? [Google’s market capitalization: US$ 200B!]

Page Ranking



1 3

Consider a web with 4 pages





2 4



Arrow from A  B => Page A has a link to Page B



Let: importance score of page k = score of k = xk



Simplest scheme: xk = number of back-links to page k



x1 = 2, x2 = 1, x3 = 3, x4 = 2 Problem ?



does it account for condition 2?



Main Reference for PageRank notes: Bryan and Leise, The 25B eigenvector: The Linear Algebra Behind Google

Page Ranking.. 1 3







Let: importance score of page k = score of k = xk 2 4





Suppose page j is more important (e.g. yahoo.com) 

those pages that are linked from page j should be important 

additional importance of page linked from page j  xj



xj

Let Lk be the set of page k’s backlinks, then: xk  n

jLk j







where nj = number of outgoing links on page j



Note: we ignore any link from a page to itself

Page Ranking… 1 3





xj

Let Lk be the set of page k’s backlinks, then: xk  n

jLk

2 4

j









x1 = x3 + x4 /2

x1 0 0 1 1/2 x1

x2 = x1 /3 1/3 0 0 0

x2 x2

=

x3 = x1 /3 + x2/2 + x4 /2 x3 1/3 1/2 0 1/2 x3

x4 1/3 1/2 0 0 x4

x4 = x1 /3 + x2 /2

x A x



Recall from Linear Algebra that for an nXn square matrix A, a scalar l is an

eigenvalue if there is an eigenvector x ≠ 0 such that Ax = lx



PageRank of pages is the eigenvector of A corresponding to eigenvalue = 1

Page Ranking…. 1 3





x1 0 0 1 1/2 x1

2 4

x2 1/3 0 0 0 x2

=

x3 1/3 1/2 0 1/2 x3

x4 1/3 1/2 0 0 x4



To find the eigenvector: solve the system using Gaussian elimination



12 12/31 .387

4 .129

In our example, any scalar multiple of: Normalize: 4/31 =

9 9/31 .29

6 6/31 .194





Problem: Is it necessary that the link matrix of any web has eigenvalue = 1?

Page Ranking …..



Problem: Is it necessary that the link matrix of any web has eigenvalue = 1?



Dangling node: A web page that has no outgoing links



Theorem:

The link matrix of a web with no dangling nodes always has 1 as an eigenvalue



Examine matrix A:

Aij = 1/nj, if page j links to page i

Aij = 0, otherwise

 sum of all entries in column j = total (unweighted) contribution of page j = 1





Column stochastic matrix: A square matrix with no negative entries, and

sum of each column = 1

Page Ranking …...



Column stochastic matrix: A square matrix with no negative entries, and

sum of each column = 1



 Our link matrix is a column stochastic matrix



Theorem: Every column stochastic matrix has 1 as an eigenvalue



Proof:

1. (linear algebra): square matrix A and its Transpose, AT have same eigenvalues

2. Let e = an nX1 vector with all entries = 1

3. A is column stochastic  AT e = e

 AT has an eigenvector = e, (with eigenvalue = 1)

 A has an eigenvalue = 1.

QED

Page Ranking issues



Problems with the Link Matrix approach, and How to handle them





Non-unique eigenvector for eigenvalue = 1



Dangling nodes





Computational effort and time



Other factors that also contribute to ‘importance of the page’

Page Ranking issues: non-unique eigenvector



Consider the web in the figure 0 1 0 0 0

1 0 0 0 0

1 3

A= 0 0 0 1 ½

5 0 0 1 0 ½

2 4 0 0 0 0 0





What are the eigenvectors for A, corresponding to eigenvalue = 1 ?



½ 0 .375

½ 0 .375

Any linear combination of: e.g.:

0 ½ .125

0 ½ .125

0 0 0



What is the relative importance of the pages??

Page Ranking issues: non-unique eigenvector..



Consider the web in the figure 0 1 0 0 0

1 0 0 0 0

1 3

A= 0 0 0 1 ½

5 0 0 1 0 ½

2 4 0 0 0 0 0





Intuitive reason for the problem: this web can be partitioned into two independent sub-webs



Mathematically: Non-unique eigenvector results when A has a block diagonal form



A1 0 … 0

0 A2 … 0 Solution:

A= - Reorganize A into block diagonal form

0 …

0 0 … Ak - Find Page Rank for each block independently

Page Ranking issues: Dangling nodes



Suppose a web has dangling nodes (any page with no outgoing links)





 Corresponding column of A is all 0’s





 The matrix A is not column stochastic (it is column sub-stochastic)





 For column sub-stochastic matrices, there must exist at least one

eigenvalue, 0 < l < 1, whose eigenvector has all non-negative entries





[Note: this eigenvector is called the Perron eigenvector]

[Note 2: we will omit the proof of its existence in this course]





 We could use the Perron eigenvector to get page ranks (but we won’t)

Dangling nodes: A probabilistic view of the Link matrix



Consider an imaginary, randomized web-surfer:

He starts at a random page in the web and spends time t on the page.

He then jumps randomly to one of the links in that page





The Page Rank of a page = expected time spent on that page by the randomized surfer!





However: if our “randomized surfer” ends up on a dangling node, then he is

stuck there for ever ?!



One solution: If we are stuck at a dangling node, jump to a random page in the web





How to model this view-point into our matrix A ?



The S matrix: For any 0-column ( dangling node) in A, make all entries = 1/n



Question: Is the matrix S column stochastic ?

Page Ranking Issues: Other factors influencing rank



Main idea (Google’s damping factor):

Allow our randomized surfer to make a purely random jump “occasionally”





How to implement ?

1/n 1/n … 1/n

1/n 1/n … 1/n

What is the link matrix of a purely random surfer? R=



Note: R is column stochastic 1/n 1/n … 1/n





Consider the Google matrix, G = d S + (1 – d) R d  [0, 1]



Note: G is column stochastic (Why?)



Effect of using G: With probability d, we follow the randomized surfer S, and with

probability (1 – d), we make a purely random jump.

Page Ranking Issues: Other factors influencing rank



Main idea (Google’s damping factor):

Allow our randomized surfer to make a purely random jump “occasionally”





Google matrix, G = d S + (1 – d) R





Google found that using matrix G, and damping factor d = 0.85 “works very

well” ( gives ranking that tends to rank “useful pages” higher)

Page Ranking Issues: Computing the eigenvector



The Rank of page n is the n-th element of the eigenvector, I, of the G matrix

for eigenvalue = 1





Size of G: approx 25Billion X 25 Billion !!!



 Using Gaussian elimination will take months on a supercomputer!





Properties of G:

Large size (25Billion X 25 Billion)

Sparse (approx 10 non-zero entries in typical non-dangling node column)

1 is the largest eigenvalue for column stochastic matrix G



We can use the Power Method to compute the I-vector

Page Ranking Issues: Computing the eigenvector..



The Power Method to compute the eigenvector of the dominant eigenvalue



Start with an arbitrary rank vector, I0

Iterate:

I1 = G I0 , …, …, Ik+1 = G Ik



Ik converges to the eigenvector of the dominant eigenvalue





 We only need to do matrix multiply for sparse matrix (~10 multiply per column)

[NOTE: by writing G = d S + (1 – d) R, the effect of R is merely to add a fixed

quantity to each row of I  faster computation]





Brin and Page: Typically, Google needs 50-100 iterations to get a good I-vector

Typically, Google computers take few days to calculate

Typically Google updates rankings once per month (Google dance)

Concluding notes



Why is web-search important?

- For researchers, only practical way to find information from web





- For E-commerce companies: higher rank  larger sales (why?)





This is the reason for high stock value of Google, Baidu, …





Current trend:

- Page Ranking method dominates Directory-structured search engines





Technology note:

- Page Ranking: a practical application of abstract Linear Algebra concepts!

References and Further Reading





Books:

Sherman and Price, The Invisible Web: Uncovering Information Sources Search

Engines Can’t See, Cyber Age Books, 2001

[Note: some information in this book is already obsolete]









Web sources:

1. Googleguide: google search guide



2. David Austin’s article on how Google PageRank works









Next: Auctions, Web-transactions, Ebay



Related docs
Other docs by wuzhenguang
Is Air Quality a Problem in My Home
Views: 8  |  Downloads: 0
IHRM Chapter 6
Views: 9  |  Downloads: 0
37.10593
Views: 7  |  Downloads: 0
December_break
Views: 8  |  Downloads: 0
Lectures for 2nd Edition
Views: 9  |  Downloads: 0
Google Chart
Views: 30  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!