Embed
Email

Google

Document Sample
Google
Shared by: Shoaib Ahmed
Stats
views:
2
posted:
1/19/2012
language:
pages:
51
Hidden ways to get into Google

We are living in the age of the Internet and online businesses are thriving as people want to buy

goods or services from the comfort of their homes/ offices and avoid physical shopping. There is

stiff competition leading to a congestion of e-commerce sites and all the site owners are

clamoring to attain top search engine rankings.



Some webmasters are constantly looking for innovative and hitherto untried ways to maximize

the traffic to their sites through opportunities (other than SEO) that Google provides. Thus, there

are the more imaginative ones who believe in the presence of some hidden ways to get into

Google and keep exploring them - taking advantage of every possible resource.





Google Base

To those who may not be aware, Google Base came into being towards the end of 2005 and is

a location where a user can add a variety of information like product details, research

documents, articles, press releases etc which will be made searchable by Google. It is believed

that anyone having a Google account can avail of this service. If that be so, it is certainly worth

the effort as Google base can fetch visitors for your content.









Google Local

Again, placing your product details in Google Local can attract the attention of quite a few

prospective buyers. It is a fact that users frequently visit Google Local searching for

product/or service in their own vicinity. Assuming you run a small business and offer mass

consumption products/services, Google Local can be the ideal place to attract more customers

in a cost-effective way.



Google Maps

Google Product Search

Please know that Google Product Search provides visitors with a directory of products category-

wise, a searchable index of online products, and the facility to narrow searches by price range.

By placing your products into Google Product Search, you can attract the right buying

community to your website. Needless to say, for many online business owners, there is the

likelihood that Google Product Search already has few or all of your products listed. The more

assured way is to get into the Google Product Search directory is to furnish them with a data

feed that lists your product details in an enticing manner and also mentioning the relevant URLs.









Google Blog Search

Lest you forget, one of the positive ways to get some additional listings in Google is the Google

Blog Search. If you have a website, then the obvious question is what prevents you from

starting a blog. Please understand that by starting a blog, you will be indexed on a number of

blog searches, including Google Blog search. This is one effective mode of marketing that is

mostly untapped, and it is for you to exploit the situation and go for the kill.









Google Video

If you are not already aware, Google Video lets you to have your own video show and this is a

form of multimedia that is gaining in popularity – though now in nascent stages. You can utilize

the video facility to demonstrate your product for viewers to better understand.









Google News

Last but not least in importance, you can place your write-up and even your picture into Google

News. You may have to initially sign become a contributor. If accepted as a contributor, you will

be able to submit articles along with pictures. Your content will reach millions of searchers who

daily visit Google News

Once you recognize the multiple faces of Google, you will be able to use many, if not all of

them, to your business advantage. It is common knowledge that to achieve more, you have to

be inventive and put in the extra efforts.









Optimizing sites for TV

Google has now officially launched Google TV. The purpose is to provide users with a remote

interface but with all the major web functions in tact. This means you can access a site on the

Google TV in the same way you do on a desktop. To achieve this, the webmasters will be

required to optimize their sites to make it compatible for proper viewing in Google TV.



The term ‘Optimizing sites for TV’ needs to be explained at least for those who may not be

familiar with this concept. The phrase simply means a user should be able to comfortably sit in a

favorite couch to watch his/her site on television. The text has to be sufficiently bold so that it

can be viewed from an ideal distance sitting in the couch without straining one’s eyes. The site

navigation is to be made possible through button arrows on the remote instead of the standard

mouse/touchpad usage.



Inasmuch as mobile phones make your site accessible to persons on the move, Google TV

makes your site viewable to people relaxing on the couch. Google TV is a platform that

synthesizes your current TV programming with the web. It is indeed exciting to know that your

favorite web and your favorite television program are made available at once from the comfort of

your sofa.



This is technically possible as Google TV has a fully functioning built-in web browser to enable

users to easily visit your site from their television. Now that this has become feasible you may

obviously want to give your users a richer and more enhanced TV viewing experience – what is

now familiarly referred to as the “10-foot UI” (user interface). They will be several feet away from

the screen and use a remote with a keyboard and a pointing device and not a mouse.



Webmasters wanting to obtain a general idea about their site’s appearance on television should

know that mere appearance alone is no proof that the site can be comfortably navigated by TV

viewers who use a remote rather than a mouse. However, there can still be some techniques to

know how the site would look on television.

On a large monitor, adjust your window size to measure 1920 x 1080 and in a browser, visit

your site at full screen. You can then zoom the browser to 1.5 times the usual size. This

zooming can be achieved in varied ways with different keyboards. Then move back three times

the distance between yourself and the monitor and check out your site. If you want to see your

site with the real thing, then go ahead and buy the Google TV enabled devices now widely

available.



It is certainly not compulsory for every site to be optimized - at least not immediately, as the light

of Google TV is yet to shadow the desktop version of websites. The webmaster must deal with

typography that includes the selection of fonts, sizes, color, letter-spacing, special effects etc.

Choosing the right colors is very important when optimizing a site for viewing on Google TV. If

you are less experienced, then you can opt for simple image sliders, rotators which are, as it is,

available on the net.



But if you are aware of coding, then you may as well go for a complete flash or jQuery site. Use

AJAX wherever needed - mainly while changing the tabs. This will mean the user need not

meddle too often on the remote control, as the page will be updated easily by just moving the

dock on the remote.









Google Toolbar PageRank Update List (TBPR)

Google updates it's toolbar PageRank and Backlink data once in two months or quarterly.

This page provides you a list of dates when PageRank was updated visibly in Google toolbar.

Search Engine Genie offers you the most updated information on PageRank updates; we

monitor all PageRank updates done by Google on regular basis.



Monitoring PageRank update is a passion for us and we enjoy doing it. There are no real

benefits with this list, you should take this list for granted as it doesn't serve any purpose when it

comes to search engine optimization and search engine rankings.



We have a page very similar to this list. Its a list of PageRank 10 sites compiled and sorted out

using in built algorithm whenever there is a PageRank update. Our PageRank 10 sites list gives

you clear picture of how well we monitor PageRank updates.







Google has never announced a Google PageRank update schedule as to when updates would

occur. Truth is Google updates PageRank whenever they are feeling like. Google PageRank

update frequency is not constant it may vary as every month, 2, 3, or 4 months.



After the Google PageRank updation on 25th June 2009 it took almost 4 months for next

Google PageRank update in 2009 which was on October 30th. Google PageRank update

December 2009 was done before New Year that is 30th which was the last update of 2009.



By Referring to our previous Google PageRank update you can see five updates taken place in

year 2007, 2008 and 2009. So we can assume next Google PageRank update in 2010 will also

have five updates. Next Google PageRank update most probably will take place by the end of

February 2010. If Google follow 3 months once update, then the next PageRank update will

take place by 30th March 2010



The below list gives you the updation of google toolbar PageRank since it started, to the lastest

updation it has made.









Toolbar PageRank UPDATED ON YEAR





Toolbar PageRank updated on July 19th 2011





Toolbar PageRank updated on June 27th 2011





Toolbar PageRank updated on January 20th 2011







Toolbar PageRank UPDATED ON YEAR - 2010





Toolbar PageRank updated on April 3rd 2010







Toolbar PageRank UPDATED ON YEAR - 2009





Toolbar PageRank updated on December 30th 2009





Toolbar PageRank updated on October 30th 2009





Toolbar PageRank updated on June 25th 2009

Toolbar PageRank updated on May 27th 2009





Toolbar PageRank updated on April 1st 2009







Toolbar PageRank UPDATED ON YEAR - 2008





Toolbar PageRank updated on December 31st 2008





Toolbar PageRank updated on September 26th 2008





Toolbar PageRank updated on July 25th 2008





Toolbar PageRank updated on April 30th 2008





Toolbar PageRank updated on February 29th 2008







Toolbar PageRank UPDATED ON YEAR - 2007





Toolbar PageRank updated on December 12th 2007





Toolbar PageRank updated on October 28th 2007





Toolbar PageRank updated on April 30th 2007





Toolbar PageRank updated on January 25th 2007





Toolbar PageRank updated on January10th 2007

Toolbar PageRank UPDATED ON YEAR - 2006





Toolbar PageRank updated on September 28th 2006





TBPR updated on July 30th 2006





Toolbar PageRank updated on April 4th 2006





Toolbar PageRank updated on April 7th 2006





Toolbar PageRank updated on February 18th 2006





Toolbar PageRank updated on May 27th 2006





Toolbar PageRank updated on January 30th 2006





Toolbar PageRank updated on January 4th 2006







Toolbar PageRank UPDATED ON YEAR - 2005





Toolbar PageRank updated on December 19th 2005





Toolbar PageRank updated on November 4th 2005





Toolbar PageRank updated on October 27th 2005





Toolbar PageRank updatedon October 19th 2005

Toolbar PageRank updated on October 15th 2005





Toolbar PageRank updated on September 4th 2005





Toolbar PageRank updated on July 15th 2005





Toolbar PageRank updated on June 11th 2005





Toolbar PageRank updated on June 8th 2005





Toolbar PageRank updated on May 27th 2005





Toolbar PageRank updated on May 24th 2005





Toolbar PageRank updated on July 25th 2005





Toolbar PageRank updated on April 22nd 2005





Toolbar PageRank updated on March 4th 2005





Toolbar PageRank updated on March 3rd 2005





Toolbar PageRank updated on Februaru 23rd 2005





Toolbar PageRank updated on February 4th 2005





Toolbar PageRank updated on February 3rd 2005





Toolbar PageRank updated on January10th 2005

Toolbar PageRank updated on January 1st 2005







Toolbar PageRank UPDATED ON YEAR - 2004





Toolbar PageRank updated on December 26th 2004





BPR updated on November 25th 2004





Toolbar PageRank updated on October 28th 2004





Toolbar PageRank updated on October 18th 2004





Toolbar PageRank updated on October 17th 2004





Toolbar PageRank updated on October 16th 2004





Toolbar PageRank updated on October 6th 2004





Toolbar PageRank updated on September 10th 2004





Toolbar PageRank updated on September 8th 2004





Toolbar PageRank updated on August 30th 2004





Toolbar PageRank updated on August 10th 2004





Toolbar PageRank updated on August 9th 2004

Toolbar PageRank updated on July 16th 2004





Toolbar PageRank updated on June 22nd 2004





Toolbar PageRank updated on May 31th 2004





Toolbar PageRank updated on Aprilt 23rd 2004





Toolbar PageRank updated on April 7th 2004





Toolbar PageRank updated on March 16th 2004





Toolbar PageRank updated on February 11th 2004





Toolbar PageRank updated on January 26th 2004





Toolbar PageRank updated on January 11th 2004







Toolbar PageRank UPDATED ON YEAR - 2003





Toolbar PageRank updated on December 23rd 2003





Toolbar PageRank updated on December 6th 2003





Toolbar PageRank updated on November 16th 2003





Toolbar PageRank updated on October 26th 2003

Toolbar PageRank updated on October 2nd 2003





Toolbar PageRank updatedon September 21th 2003





Toolbar PageRank updated on April 30th 2003





Toolbar PageRank updated on August 29th 2003





Toolbar PageRank updated on August 8th 2003





Toolbar PageRank updated on June 15th 2003





Toolbar PageRank updated on May 6th 2003





Toolbar PageRank updated on April 11th 2003





Toolbar PageRank updatedon March 6th 2003





Toolbar PageRank updated on January 25th 2003





Toolbar PageRank updated on January 1st 2003







Toolbar PageRank UPDATED ON YEAR - 2002





Toolbar PageRank updated on November 27th 2002





Toolbar PageRank updated on October 31th 2002

Toolbar PageRank updated on September 26th 2002





Toolbar PageRank updated on August 21th 2002





Toolbar PageRank updated on July 25th 2002





Toolbar PageRank updated on June 23rd 2002





Toolbar PageRank updated on May 24th 2002





Toolbar PageRank updated on April 25th 2002





Toolbar PageRank updated on April 6th 2002





Toolbar PageRank updated on February 20th 2002





Toolbar PageRank updated on January 25th 2002







Toolbar PageRank UPDATED ON YEAR - 2001





Toolbar PageRank updated on December 27th 2001





Toolbar PageRank updated on November 25th 2001





Toolbar PageRank updated on October 28th 2001





Toolbar PageRank updated on August 21th 2001

Toolbar PageRank updated on September 16th 2001





Toolbar PageRank updated on August 19th 2001





Toolbar PageRank updated on July 19th 2001





Toolbar PageRank updated on June 22rd 2001





Toolbar PageRank updated on May 21th 2001





Toolbar PageRank updated on April 23rd 2001





Toolbar PageRank updated on April 6th 2001





Toolbar PageRank updated on March 26th 2001





Toolbar PageRank updated on February 19th 2001





Toolbar PageRank updated on January 21th 2001







Toolbar PageRank UPDATED ON YEAR - 2000





Toolbar PageRank updated on December 19th 2000





Toolbar PageRank updated on Nonember 18th 2000





Toolbar PageRank updated on Octobert 22nd 2000

Toolbar PageRank updated on August 29th 2000





Toolbar PageRank updated on July 26th 2000



Google’s Data centers





IP Addresses and Domains of Google's Data Centers

The progression of a Google Dance could basically be watched by querying the IP addresses of Google's

data centers. But queries on the IP addresses are normally redirected to www.google.com . However,

Google has domains, which resolve to the single data centers' IP addresses. These domains as well as

their IP addresses are shown in the following list.





Google Data Centers List June 2008



209.85.137.nnn - gfe-mg/mg2/mg3

209.85.141.nnn - gfe-pq/pq2/pq3

209.85.147.nnn - gfe-wa/wa2/wa3

209.85.149.nnn - gfe-hr

209.85.153.nnn - gfe-im

209.85.159.nnn - gfe-fh/fh2/fh3

209.85.161.nnn - gfe-mk/mk2/mk3/mk4

209.85.163.nnn - gfe-?/?2/?3/?4 - (NO GFE name allocated - direct access by IP address only - where

nnn is 107,104,99,147,103, etc).

209.85.165.nnn - gfe-eo/eo2/eo3/eo4

209.85.167.nnn - gfe-da/da2/da3/da4

209.85.169.nnn - gfe-rc/rc2/rc3/rc4

209.85.171.nnn - gfe-cg/cg2/cg3/cg4

209.85.173.nnn - gfe-mh/mh2/mh3/mh4

209.85.175.nnn - gfe-tc/tc2/tc3

209.85.193.nnn - gfe-br/br2/br3/br4 - (access by direct IP address and by br-in-fnnn.google.com URL

format only - where nnn is 107,104,99,147,103, etc - NO ACCESS by GFE name).

209.85.195.nnn - gfe-rg/rg2 - (access by direct IP address and by rg-in-fnnn.google.com URL format only

- where nnn is 107,104,99 (but not 147,103), etc - NO ACCESS by GFE name).

209.85.197.nnn - gfe-bd

209.85.201.nnn - gfe-wf/wf2/wf3/wf4

209.85.203.nnn - gfe-pn/pn2/pn3/pn4

209.85.207.nnn - gfe-ya/ya2/ya3/ya4

209.85.209.nnn - gfe-yb

209.85.215.nnn - gfe-ye

209.85.237.nnn - gfe-? - (NO GFE name allocated - direct access by IP address only - where nnn is

107,104 (but not 99,147,103), etc).

Those without a GFE name or which resolve only by direct IP address are the most interesting.



Domain IP Address

www-ex.google.com 216.239.33.100

www-sj.google.com 216.239.35.100

www-va.google.com 216.239.37.100

www-dc.google.com 216.239.39.100

www-ab.google.com 216.239.51.100

www-in.google.com 216.239.53.100

www-zu.google.com 216.239.55.100

www-cw.google.com 216.239.57.100

www-fi.google.com 216.239.41.100

www-gv.google.com 216.239.59.100



Older Data Center List click here

Note:

Searches at www-zu and www-sj are currently redirected to other data centers. Since results for searches

at their IP addresses fluctuate heavily during a Google Dance, also these searches seem to be internally

routed to other data centers.

Those that keep an eye on Google's index updates often think that the Google Dance is over, when they

see the new index at www.google.com or when they don't see the old index at www.google.com for some

time. In fact, the update is not finished until all the domains listed above provide results from the new

index.



The index updates at the single data centers seem to happen at one point in time. As soon as one data

center shows results from the new index, it won't switch back to the old index. This happens most likely

because the index is redundant at each data center and at first, only one part of the servers (eventually

half of them) is updated. During this period, only the other half of the servers is active and provides

search results.



As soon as the update of the first half of servers is finished, they become active and provide search

results while the other half receives the new index. Thus, from the user's perspective, the update of one

data centers happens at one point in time.



www-ab.google.com, www-cw.google.com, www-dc.google.com, www-ex.google.com, www-

fi.google.com, www-in.google.com, www-sj.google.com, www-va.google.com and www-zu.google.com

point at Google's datacentres.



When those nine return the same backlinks, that would normally be the sign that the update has finished.

It's best to think of www.google.com, www2.google.com, www3.google.com and

toolbarqueries.google.com as load balancing domains that can be pointed at datacentres.



www2.google.com and www3.google.com have been pointed at datacentres with the new index some

time before www.google.com starts to show it. There's no reason why any of these four domains

shouldn't swap between old and new indices during the update process, hence the term 'Google dance'.



The unusual thing this time is that the link: search and Toolbar PageRank returned by the nine www-

domains appear to be old. It would seem sensible to expect a second half to this update, unless Google

have something else up their sleeve.







GOOGLE CHROME



The search giant is introducing the beta version of the latest platform in 100 countries as it bids

to take on Internet Explorer and Firefox.

I was as blindsided by Google Chrome as everyone, but there's no negate that its release was a

significant occasion. Chrome is more than just a browser; it represents Google's most recent

entry in the ongoing discussion of standards and paramount practices for the Web as an

application platform. As such, it may be the company's most imperative offering since Gears

which is an open source project that enables more powerful web applications, by accumulation

of new features to your web browser:



"From my outlook, Google Chrome and Gears are entering the Web from two directions," says

Google's Adam Goodman in Scott McCloud's comic-strip introduction to Chrome. "The [Chrome]

browser project is an endeavor to make the Web better for users. The Gears team wants to

make the Web enhanced for developers."

We are so fortunate, Google Chrome ships with all of the functionality of the Gears plug-in for

Firefox and Internet Explorer previously baked in. But I don't agree that users are the only

beneficiaries of the other endeavor that's gone into Chrome. There's lots more going on beneath

the hood of the new browser that should concern developers, too.

Roaring engines



For learners, Google's verdict to use the WebKit HTML rendering engine may not be

astounding, but it's still noteworthy. WebKit, an improved version of the open source KHTML

project, is the rendering engine used by Apple for its Safari browser. It's also the engine found in

both Apple's iPhone and Google Android, arguably the two most important mobile Web

platforms today. WebKit's HTML and JavaScript code began as a branch of the KHTML and

KJS libraries from KDE. This website is also the home of S60's S60 WebKit development.



That means Google Chrome isn't yet another browser to support,



Pretty, it's one more vote in favor of making WebKit a primary target for new Web development

projects. It only makes sense to test against the engine that's available on the extensive range

of platforms and devices.



And esteem isn't the only reason why WebKit was a good choice. Current WebKit builds in a

very hasty manner. As a result, pages normally pop up more quickly in Chrome than in Firefox

3. More prominently, WebKit leads the pack in Web standards compliance (with Opera a close

second). The more developers get on board with writing abundant standards-compliant code,

free of undocumented tricks and browser-specific hacks, the better it will be for everyone.



I enthusiastically recommend reading the introductory comic strip for more impending into how

Google rebuilt the browser from the ground up. Everything, from the browser security model to

how plug-ins is managed, has been crafted with the extreme care. Google Chrome really is the

next step in the ongoing project to create the eventual client for the Web application platform

Google PageRank Explained





(C)2002 eFactory

A Survey of Google's PageRank

Within the past few years, Google has become the far most utilized search engine worldwide. A

decisive factor therefore was, besides high performance and ease of use, the superior quality of

search results compared to other search engines. This quality of search results is substantially

based on PageRank, a sophisticated method to rank web documents.

The aim of this page is to provide a broad survey of all aspects of PageRank. The contents of

these pages primarily rest upon papers by Google founders Lawrence Page and Sergey Brin

from their time as graduate students at Stanford University.

It is often argued that, especially considering the dynamic of the internet, too much time has

passed since the scientific work on PageRank, as that it still could be the basis for the ranking

methods of the Google search engine. There is no doubt that within the past years most likely

many changes, adjustments and modifications regarding the ranking methods of Google have

taken place, but PageRank was absolutely crucial for Google's success, so that at least the

fundamental concept behind PageRank should still be constitutive.



The PageRank Concept

Since the early stages of the world wide web, search engines have developed different methods

to rank web pages. Until today, the occurence of a search phrase within a document is one

major factor within ranking techniques of virtually any search engine. The occurence of a search

phrase can thereby be weighted by the length of a document (ranking by keyword density) or by

its accentuation within a document by HTML tags.

For the purpose of better search results and especially to make search engines resistant

against automatically generated web pages based upon the analysis of content specific ranking

criteria (doorway pages), the concept of link popularity was developed. Following this concept,

the number of inbound links for a document measures its general importance. Hence, a web

page is generally more important, if many other web pages link to it. The concept of link

popularity often avoids good rankings for pages which are only created to deceive search

engines and which don't have any significance within the web, but numerous webmasters elude

it by creating masses of inbound links for doorway pages from just as insignificant other web

pages.

Contrary to the concept of link popularity, PageRank is not simply based upon the total number

of inbound links. The basic approach of PageRank is that a document is in fact considered the

more important the more other documents link to it, but those inbound links do not count

equally. First of all, a document ranks high in terms of PageRank, if other high ranking

documents link to it.

So, within the PageRank concept, the rank of a document is given by the rank of those

documents which link to it. Their rank again is given by the rank of documents which link to

them. Hence, the PageRank of a document is always determined recursively by the PageRank

of other documents. Since - even if marginal and via many links - the rank of any document

influences the rank of any other, PageRank is, in the end, based on the linking structure of the

whole web. Although this approach seems to be very broad and complex, Page and Brin were

able to put it into practice by a relatively trivial algorithm



The PageRank Algorithm

The original PageRank algorithm was described by Lawrence Page and Sergey Brin in several

publications. It is given by

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where

PR(A) is the PageRank of page A,

PR(Ti) is the PageRank of pages Ti which link to page A,

C(Ti) is the number of outbound links on page Ti and

d is a damping factor which can be set between 0 and 1.



So, first of all, we see that PageRank does not rank web sites as a whole, but is determined for

each page individually. Further, the PageRank of page A is recursively defined by the

PageRanks of those pages which link to page A.

The PageRank of pages Ti which link to page A does not influence the PageRank of page A

uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the

number of outbound links C(T) on page T. This means that the more outbound links a page T

has, the less will page A benefit from a link to it on page T.

The weighted PageRank of pages Ti is then added up. The outcome of this is that an additional

inbound link for page A will always increase page A's PageRank.

Finally, the sum of the weighted PageRanks of all pages Ti is multiplied with a damping factor d

which can be set between 0 and 1. Thereby, the extend of PageRank benefit for a page by

another page linking to it is reduced.



The Random Surfer Model

In their publications, Lawrence Page and Sergey Brin give a very simple intuitive justification for

the PageRank algorithm. They consider PageRank as a model of user behaviour, where a

surfer clicks on links at random with no regard towards content.

The random surfer visits a web page with a certain probability which derives from the page's

PageRank. The probability that the random surfer clicks on one link is solely given by the

number of links on that page. This is why one page's PageRank is not completely passed on to

a page it links to, but is devided by the number of links on the page.

So, the probability for the random surfer reaching one page is the sum of probabilities for the

random surfer following links to this page. Now, this probability is reduced by the damping factor

d. The justification within the Random Surfer Model, therefore, is that the surfer does not click

on an infinite number of links, but gets bored sometimes and jumps to another page at random.



The probability for the random surfer not stopping to click on links is given by the damping factor

d, which is, depending on the degree of probability therefore, set between 0 and 1. The higher d

is, the more likely will the random surfer keep clicking links. Since the surfer jumps to another

page at random after he stopped clicking links, the probability therefore is implemented as a

constant (1-d) into the algorithm. Regardless of inbound links, the probability for the random

surfer jumping to a page is always (1-d), so a page has always a minimum PageRank.



A Different Notation of the PageRank Algorithm

Lawrence Page and Sergey Brin have published two different versions of their PageRank

algorithm in different papers. In the second version of the algorithm, the PageRank of page A is

given as

PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where N is the total number of all pages on the web. The second version of the algorithm,

indeed, does not differ fundamentally from the first one. Regarding the Random Surfer Model,

the second version's PageRank of a page is the actual probability for a surfer reaching that

page after clicking on many links. The PageRanks then form a probability distribution over web

pages, so the sum of all pages' PageRanks will be one.



Contrary, in the first version of the algorithm the probability for the random surfer reaching a

page is weighted by the total number of web pages. So, in this version PageRank is an

expected value for the random surfer visiting a page, when he restarts this procedure as often

as the web has pages. If the web had 100 pages and a page had a PageRank value of 2, the

random surfer would reach that page in an average twice if he restarts 100 times.



As mentioned above, the two versions of the algorithm do not differ fundamentally from each

other. A PageRank which has been calculated by using the second version of the algorithm has

to be multiplied by the total number of web pages to get the according PageRank that would

have been caculated by using the first version. Even Page and Brin mixed up the two algorithm

versions in their most popular paper "The Anatomy of a Large-Scale Hypertextual Web Search

Engine", where they claim the first version of the algorithm to form a probability distribution over

web pages with the sum of all pages' PageRanks being one.



In the following, we will use the first version of the algorithm. The reason is that PageRank

calculations by means of this algorithm are easier to compute, because we can disregard the

total number of web pages.



The Characteristics of PageRank

The characteristics of PageRank shall be illustrated by a small example.

We regard a small web consisting of three pages A, B and C, whereby page A links to the

pages B and C, page B links to page C and page C links to page A. According to Page and Brin,

the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5.

The exact value of the damping factor d admittedly has effects on PageRank, but it does not

influence the fundamental principles of PageRank. So, we get the following equations for the

PageRank calculation:

PR(A) = 0.5 + 0.5 PR(C)

PR(B) = 0.5 + 0.5 (PR(A) / 2)

PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

These equations can easily be solved. We get the following PageRank values for the single

pages:

PR(A) = 14/13 = 1.07692308

PR(B) = 10/13 = 0.76923077

PR(C) = 15/13 = 1.15384615

It is obvious that the sum of all pages' PageRanks is 3 and thus equals the total number of web

pages. As shown above this is not a specific result for our simple example.



For our simple three-page example it is easy to solve the according equation system to

determine PageRank values. In practice, the web consists of billions of documents and it is not

possible to find a solution by inspection.



The Iterative Computation of PageRank





Because of the size of the actual web, the Google search engine uses an approximative,

iterative computation of PageRank values. This means that each page is assigned an initial

starting value and the PageRanks of all pages are then calculated in several computation circles

based on the equations determined by the PageRank algorithm. The iterative calculation shall

again be illustrated by our three-page example, whereby each page is assigned a starting

PageRank value of 1.

Iteration PR(A) PR(B) PR(C)

0111

1 1 0.75 1.125

2 1.0625 0.765625 1.1484375

3 1.07421875 0.76855469 1.15283203

4 1.07641602 0.76910400 1.15365601

5 1.07682800 0.76920700 1.15381050

6 1.07690525 0.76922631 1.15383947

7 1.07691973 0.76922993 1.15384490

8 1.07692245 0.76923061 1.15384592

9 1.07692296 0.76923074 1.15384611

10 1.07692305 0.76923076 1.15384615

11 1.07692307 0.76923077 1.15384615

12 1.07692308 0.76923077 1.15384615

We see that we get a good approximation of the real PageRank values after only a few

iterations. According to publications of Lawrence Page and Sergey Brin, about 100 iterations

are necessary to get a good approximation of the PageRank values of the whole web.

Also, by means of the iterative calculation, the sum of all pages' PageRanks still converges to

the total number of web pages. So the average PageRank of a web page is 1. The minimum

PageRank of a page is given by (1-d). Therefore, there is a maximum PageRank for a page

which is given by dN+(1-d), where N is total number of web pages. This maximum can

theoretically occur, if all web pages solely link to one page, and this page also solely links to

itself





The Implementation of PageRank in the Google Search Engine

Regarding the implementation of PageRank, first of all, it is important how PageRank is

integrated into the general ranking of web pages by the Google search engine. The proceedings

have been described by Lawrencec Page and Sergey Brin in several publications. Initially, the

ranking of web pages by the Google search engine was determined by three factors:



Page specific factors

Anchor text of inbound links

PageRank



Page specific factors are, besides the body text, for instance the content of the title tag or the

URL of the document. It is more than likely that since the publications of Page and Brin more

factors have joined the ranking methods of the Google search engine. But this shall not be of

interest here.

In order to provide search results, Google computes an IR score out of page specific factors and

the anchor text of inbound links of a page, which is weighted by position and accentuation of the

search term within the document. This way the relevance of a document for a query is

determined. The IR-score is then combined with PageRank as an indicator for the general

importance of the page. To combine the IR score with PageRank the two values are

multiplicated. It is obvious that they cannot be added, since otherwise pages with a very high

PageRank would rank high in search results even if the page is not related to the search query.



Especially for queries consisting of two or more search terms, there is a far bigger influence of

the content related ranking criteria, whereas the impact of PageRank is mainly visible for

unspecific single word queries. If webmasters target search phrases of two or more words it is

possible for them to achieve better rankings than pages with high PageRank by means of

classical search engine optimisation.



If pages are optimised for highly competitive search terms, it is essential for good rankings to

have a high PageRank, even if a page is well optimised in terms of classical search engine

optimisation. The reason therefore is that the increase of IR score deminishes the more often

the keyword occurs within the document or the anchor texts of inbound links to avoid spam by

extensive keyword repetition. Thereby, the potentialities of classical search engine optimisation

are limited and PageRank becomes the decisive factor in highly competitive areas.





The PageRank Display of the Google Toolbar

PageRank became widely known by the PageRank display of the Google Toolbar. The Google

Toolbar is a browser plug-in for Microsoft Internet Explorer which can be downloaded from the

Google web site. The Google Toolbar provides some features for searching Google more

comfortably.

The Google Toolbar displays PageRank on a scale from 0 to 10. First of all, the PageRank of an

actually visited page can be estimated by the width of the green bar within the display. If the

user holds his mouse over the display, the Toolbar also shows the PageRank value.



Caution: The PageRank display is one of the advanced features of the Google Toolbar. And if

those advanced features are enabled, Google collects usage data. Additionally, the Toolbar is

self-updating and the user is not informed about updates. So, Google has access to the user's

hard drive.

If we take into account that PageRank can theoretically have a maximum value of up to dN+(1-

d), where N is the total number of web pages and d is usually set to 0.85, PageRank has to be

scaled for the display on the Google Toolbar. It is generally assumed that the scalation is not

linearly but logarithmically. At a damping factor of 0.85 and, therefore, a minimum PageRank of

0.15 and at an assumed logaritmical basis of 6 we get a scalation as follows:



Toolbar PageRank Real PageRank

0/10 0.15 - 0.9

1/10 0.9 - 5.4

2/10 5.4 - 32.4

3/10 32.4 - 194.4

4/10 194.4 - 1,166.4

5/10 1,166.4 - 6,998.4

6/10 6,998.4 - 41,990.4

7/10 41,990.4 - 251,942.4

8/10 251,942.4 - 1,511,654.4

9/10 1,511,654.4 - 9,069,926.4

10/10 9,069,926.4 - 0.85 × N + 0.15





It is uncertain if in fact a logarithmical scalation in a strictly mathematical sense takes place.

There is likely a manual scalation which follows a logarithmical scheme, so that Google has

control over the number of pages within the single Toolbar PageRank ranges. The logarithmical

basis for this scheme should be between 6 and 7, which can for instance be rudimentary

deduced from the number of inbound links of pages with a high Toolbar PageRank from pages

with a Toolbar PageRank higher than 4, which are shown by Googe using the link command.

The Toolbar's PageRank Files





Even webmasters who do not want to use the Google Toolbar or the Internet Explorer

permanently for security and privacy concerns have the possibility to check the PageRank

values of their pages. Google submits PageRank values in simple text files to the Toolbar. In

former times, this happened via XML. The switch to text files occured in August 2002.

The PageRank files can be requested directly from the domain www.google.com. Basically, the

URLs for those files look like follows (without line breaks):

http://www.google.com/search?

client=navclient-auto&

ch=0123456789&

features=Rank&

q=info:http://www.domain.com/

There is only one line of text in the PageRank files. The last cipher in this line is PageRank.



The parameters incorporated in the above shown URL are inevitable for the display of the

PageRank files in a browser. The value "navclient-auto" for the parameter "client" identifies the

Toolbar. Via the parameter "q" the URL is submitted. The value "Rank" for the parameter

"features" determines that the PageRank files are requested. If it is omitted, Google's servers

still transmit XML files. The parameter "ch" transfers a checksum for the URL to Google,

whereby this checksum can only change when the Toolbar version is updated by Google.



Thus, it is necessary to install the Toolbar at least once to find out about the checksum of one's

URLs. To track the communication between the Toolbar and Google, often the use of packet

sniffers, local proxies an similar tools is suggested. But this is not necessarily needed, since the

PageRank files are cached by the Internet Explorer. So, the checksums can simply been found

out by having a look at the folder Temporary Internet Files. Knowing the checksums of your

URLs, you can view the PageRank files in your browser and you do not have to accept Google's

36 years lasting cookies.



Since the PageRank files are kept in the browser cache and, thus, are clearly visible, and as

long as requests are not automated, watching the PageRank files in a browser should not be a

violation of Google's Terms of Service. However, you should be cautious. The Toolbar submits

its own User-Agent to Google. It is:

Mozilla/4.0 (compatible; GoogleToolbar 1.1.60-deleon; OS SE 4.10)

1.1.60-deleon is a Toolbar version which may of course change. OS is the operating system

that you have installed. So, Google is able to identify requests by browsers, if they do not go out

via a proxy and if the User-Agent is not modified accordingly.



Taking a look at IE's cache, one will normally notice that the PageRank files are not requested

from the domain www.google.com but from IP addresses like 216.239.33.102. Additionally, the

PageRank files' URLs often contain a parameter "failedip" that is set to values like

"216.239.35.102;1111" (Its function is not absolutely clear). The IP addresses are each related

to one of Google's seven data centers and the reason for the Toolbar querying IP-addresses is

most likely to control the PageRank display in a better way, especially in times of the "Google

Dance".



The PageRank Display at the Google Directory

Webmasters who do not want to check the PageRank files that are used by the toolbar have

another possibility to receive information about the PageRank of their sites by means of the

Google Directory (directory.google.com).



The Google Directory is a dump of the Open Directory Project (dmoz.org), which shows the

PageRank for listed documents similarly to the Google Toolbar display scaled and by means of

a green bar. In contrast to the Toolbar, the scale is from 1 to 7. The exact value is not displayed,

but it can be determined by the divided bar respectively the width of the single graphics in the

source code of the page if one is not sure by looking at the bar.

By comparing the Toolbar PageRank of a document with its Directory PageRank, a more exact

estimation of a pages PageRank can be deduced, if the page is listed with the ODP. This

connection was mentioned first by Chris Raimondi (www.searchnerd.com/pagerank).



Especially for pages with a Toolbar PageRank of 5 or 6, one can appraise if the page is on the

upper or the lower end of its Toolbar scale. It shall be noted that for the comparison the Toolbar

PageRank of 0 was not taken into account. It can easily be verified that this is appropriate by

looking at pages with a Toolbar PageRank of 3. However, it has to be considered that for a

verification pages of the Google Directory respectively the ODP with a Toolbar PageRank of 4

or lower have to be chosen, since otherwise no pages linked from there with a Toolbar

PageRank of 3 will be found





The Effect of Inbound Links

It has already been shown that each additional inbound link for a web page always increases

that page's PageRank. Taking a look at the PageRank algorithm, which is given by

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

one may assume that an additional inbound link from page X increases the PageRank of page

A by

d × PR(X) / C(X)

where PR(X) is the PageRank of page X and C(X) is the total number of its outbound links. But

page A usually links to other pages itself. Thus, these pages get a PageRank benefit also. If

these pages link back to page A, page A will have an even higher PageRank benefit from its

additional inbound link.



The single effects of additional inbound links shall be illustrated by an example.

We regard a website consisting of four pages A, B, C and D which are linked to each other in

circle. Without external inbound links to one of these pages, each of them obviously has a

PageRank of 1. We now add a page X to our example, for which we presume a constant

Pagerank PR(X) of 10. Further, page X links to page A by its only outbound link. Setting the

damping factor d to 0.5, we get the following equations for the PageRank values of the single

pages of our site:

PR(A) = 0.5 + 0.5 (PR(X) + PR(D)) = 5.5 + 0.5 PR(D)

PR(B) = 0.5 + 0.5 PR(A)

PR(C) = 0.5 + 0.5 PR(B)

PR(D) = 0.5 + 0.5 PR(C)

Since the total number of outbound links for each page is one, the outbound links do not need to

be considered in the equations. Solving them gives us the following PageRank values:

PR(A) = 19/3 = 6.33

PR(B) = 11/3 = 3.67

PR(C) = 7/3 = 2.33

PR(D) = 5/3 = 1.67

We see that the initial effect of the additional inbound link of page A, which was given by

d × PR(X) / C(X) = 0,5 × 10 / 1 = 5

is passed on by the links on our site.



The Influence of the Damping Factor

The degree of PageRank propagation from one page to another by a link is primarily determined

by the damping factor d. If we set d to 0.75 we get the following equations for our above

example:

PR(A) = 0.25 + 0.75 (PR(X) + PR(D)) = 7.75 + 0.75 PR(D)

PR(B) = 0.25 + 0.75 PR(A)

PR(C) = 0.25 + 0.75 PR(B)

PR(D) = 0.25 + 0.75 PR(C)

Solving these equations gives us the following PageRank values:

PR(A) = 419/35 = 11.97

PR(B) = 323/35 = 9.23

PR(C) = 251/35 = 7.17

PR(D) = 197/35 = 5.63

First of all, we see that there is a significantly higher initial effect of additional inbound link for

page A which is given by

d × PR(X) / C(X) = 0.75 × 10 / 1 = 7.5



This initial effect is then propagated even stronger by the links on our site. In this way, the

PageRank of page A is almost twice as high at a damping factor of 0.75 than it is at a damping

factor of 0.5. At a damping factor of 0.5 the PageRank of page A is almost four times superior to

the PageRank of page D, while at a damping factor of 0.75 it is only a little more than twice as

high. So, the higher the damping factor, the larger is the effect of an additional inbound link for

the PageRank of the page that receives the link and the more evenly distributes PageRank over

the other pages of a site.



The Actual Effect of Additional Inbound Links

At a damping factor of 0.5, the accumulated PageRank of all pages of our site is given by

PR(A) + PR(B) + PR(C) + PR(D) = 14

Hence, by a page with a PageRank of 10 linking to one page of our example site by its only

outbound link, the accumulated PageRank of all pages of the site is increased by 10. (Before

adding the link, each page has had a PageRank of 1.) At a damping factor of 0.75 the

accumulated PageRank of all pages of the site is given by

PR(A) + PR(B) + PR(C) + PR(D) = 34

This time the accumulated PageRank increases by 30. The accumulated PageRank of all pages

of a site always increases by

(d / (1-d)) × (PR(X) / C(X))

where X is a page additionally linking to one page of the site, PR(X) is its PageRank and C(X)

its number of outbound links. The formula presented above is only valid, if the additional link

points to a page within a closed system of pages, as, for instance, a website without outbound

links to other sites. As far as the website has links pointing to external pages, the surplus for the

site itself diminishes accordingly, because a part of the additional PageRank is propagated to

external pages.



The justification of the above formula is given by Raph Levien and it is based on the Random

Surfer Model. The walk length of the random surfer is an exponential distribution with a mean of

(d/(1-d)). When the random surfer follows a link to a closed system of web pages, he visits on

average (d/(1-d)) pages within that closed system. So, this much more PageRank of the linking

page - weighted by the number of its outbound links - is distributed to the closed system.

For the actual PageRank calculations at Google, Lawrence Page und Sergey Brin claim to

usually set the damping factor d to 0.85. Thereby, the boost for a closed system of web pages

by an additional link from page X is given by

(0.85 / 0.15) × (PR(X) / C(X)) = 5.67 × (PR(X) / C(X))

So, inbound links have a far larger effect than one may assume.



The PageRank-1 Rule

Users of the Google Toolbar often notice that pages with a certain Toolbar PageRank have an

inbound link from a page with a Toolbar PageRank which is higher by one. Some take this

observation to doubt the validity of the PageRank algorithm presented here for the actual

ranking methods of the Google search engine. It shall be shown, however, that the PageRank-1

rule complies with the PageRank algorithm.



Basically, the PageRank-1 rule proves the fundamental principle of PageRank. Web pages are

important themselves if other important web pages link to them. It is not necessary for a page to

have many inbound links to rank well. A single link from a high ranking page is sufficient.



To show the actual consistance of the PageRank-1 rule with the PageRank algorithm several

factors have to be taken into consideration. First of all, the toolbar PageRank is a logarithmically

scaled version of real PageRank values. If the PageRank value of one page is one higher than

the PageRank value of another page in terms of Toolbar PageRank, than its real PageRank can

at least be higher by an amount which equals the logarithmical basis for the scalation of Toolbar

PageRank. If the logarithmical basis for the scalation is 6 and the toolbar PageRank of a linking

Page is 5, then the real PageRank of the page which receives the link can be at least 6 times

smaller to make that page still get a toolbar PageRank of 4.



However, the number of outbound links on the linking page thwarts the effect of the

logarithmical basis, because the PageRank propagation from one page to another is devided by

the number of outbound links on the linking page. But it has already been shown that the

PageRank benefit by a link is higher than PageRank algorithm's term d(PR(Ti)/C(Ti)) pretends.

The reason is that the PageRank benefit for one page is further distributed to other pages within

the site. If those pages link back as it usualy happens, the PageRank benefit for the page which

initially received the link is accordingly higher. If we assume that at a high damping factor the

logarithmical basis for PageRank scalation is 6 and a page receives a PageRank benefit which

is twice as high as the PageRank of the linking page devided by the number of its outbound

links, the linking page could have at least 12 outbound links so that the Toolbar PageRank of

the page receiving the link is still at most one lower than the toolbar PageRank of the linking

page.



A number of 12 outbound links admittedly seems relatively small. But normally, if a page has an

external inbound link, this is not the only one for that page. Most likely other pages link to that

page and propagate PageRank to it. And if there are examples where a page receives a single

link from another page and the PageRanks of both pages comply the PageRank-1 rule although

the linking page has many outbound links, this is first of all an indication for the linking page's

toolbar PageRank being at the upper end of its scale. The linking page could be a "high" 5 and

the page receiving the link could be a "low" 4. In this way, the linking page could have up to 72

outbound links. This number rises accordingly if we assume a higher logarithmical basis for the

scalation of Toolbar PageRank





The Effect of Outbound Links

Since PageRank is based on the linking structure of the whole web, it is inescapable that if the

inbound links of a page influence its PageRank, its outbound links do also have some impact.

To illustrate the effects of outbound links, we take a look at a simple example.

We regard a web consisting of to websites, each having two web pages. One site consists of

pages A and B, the other constists of pages C and D. Initially, both pages of each site solely link

to each other. It is obvious that each page then has a PageRank of one. Now we add a link

which points from page A to page C. At a damping factor of 0.75, we therefore get the following

equations for the single pages' PageRank values:



PR(A) = 0.25 + 0.75 PR(B)

PR(B) = 0.25 + 0.375 PR(A)

PR(C) = 0.25 + 0.75 PR(D) + 0.375 PR(A)

PR(D) = 0.25 + 0.75 PR(C)

Solving the equations gives us the following PageRank values for the first site:

PR(A) = 14/23

PR(B) = 11/23

We therefore get an accumulated PageRank of 25/23 for the first site. The PageRank values of

the second site are given by

PR(C) = 35/23

PR(D) = 32/23

So, the accumulated PageRank of the second site is 67/23. The total PageRank for both sites is

92/23 = 4. Hence, adding a link has no effect on the total PageRank of the web. Additionally, the

PageRank benefit for one site equals the PageRank loss of the other.





The Actual Effect of Outbound Links

As it has already been shown, the PageRank benefit for a closed system of web pages by an

additional inbound link is given by

(d / (1-d)) × (PR(X) / C(X)),



where X is the linking page, PR(X) is its PageRank and C(X) is the number of its outbound links.

Hence, this value also represents the PageRank loss of a formerly closed system of web pages,

when a page X within this system of pages now points by a link to an external page.

The validity of the above formula requires that the page which receives the link from the

formerly closed system of pages does not link back to that system, since it otherwise gains back

some of the lost PageRank. Of course, this effect may also occur when not the page that

receives the link from the formerly closed system of pages links back directly, but another page

which has an inbound link from that page. Indeed, this effect may be disregarded because of

the damping factor, if there are enough other web pages in-between the link-recursion. The

validity of the formula also requires that the linking site has no other external outbound links. If it

has other external outbound links, the loss of PageRank of the regarded site diminishes and the

pages already receiving a link from that page lose PageRank accordingly.



Even if the actual PageRank values for the pages of an existing web site were known, it would

not be possible to calculate to which extend an added outbound link diminishes the PageRank

loss of the site, since the above presented formula regards the status after adding the link.





Intuitive Justification of the Effect of Outbound Links

The intuitive justification for the loss of PageRank by an additional external outbound link

according to the Random Surfer Modell is that by adding an external outbound link to one page

the surfer will less likely follow an internal link on that page. So, the probability for the surfer

reaching other pages within a site diminishes. If those other pages of the site have links back to

the page to which the external outbound link has been added, also this page's PageRank will

deplete.



We can conclude that external outbound links diminish the totalized PageRank of a site and

probably also the PageRank of each single page of a site. But, since links between web sites

are the fundament of PageRank and indespensable for its functioning, there is the possibility

that outbound links have positive effects within other parts of Google's ranking criteria. Lastly,

relevant outbound links do constitute the quality of a web page and a webmaster who points to

other pages integrates their content in some way into his own site.



Dangling Links

An important aspect of outbound links is the lack of them on web pages. When a web page has

no outbound links, its PageRank cannot be distributed to other pages. Lawrence Page and

Sergey Brin characterise links to those pages as dangling links.

The effect of dangling links shall be illustrated by a small example website. We take a look at a

site consisting of three pages A, B and C. In our example, the pages A and B link to each other.

Additionally, page A links to page C. Page C itself has no outbound links to other pages. At a

damping factor of 0.75, we get the following equations for the single pages' PageRank values:



PR(A) = 0.25 + 0.75 PR(B)

PR(B) = 0.25 + 0.375 PR(A)

PR(C) = 0.25 + 0.375 PR(A)

Solving the equations gives us the following PageRank values:

PR(A) = 14/23

PR(B) = 11/23

PR(C) = 11/23



So, the accumulated PageRank of all three pages is 36/23 which is just over half the value that

we could have expected if page A had links to one of the other pages. According to Page and

Brin, the number of dangling links in Google's index is fairly high. A reason therefore is that

many linked pages are not indexed by Google, for example because indexing is disallowed by a

robots.txt file. Additionally, Google meanwhile indexes several file types and not HTML only.

PDF or Word files do not really have outbound links and, hence, dangling links could have major

impacts on PageRank.



In order to prevent PageRank from the negative effects of dangling links, pages wihout

outbound links have to be removed from the database until the PageRank values are computed.

According to Page and Brin, the number of outbound links on pages with dangling links is

thereby normalised. As shown in our illustration, removing one page can cause new dangling

links and, hence, removing pages has to be an iterative process. After the PageRank calculation

is finished, PageRank can be assigned to the formerly removed pages based on the PageRank

algorithm. Therefore, as many iterations are needed as for removing the pages. Regarding our

illustration, page C could be processed before page B. At that point, page B has no PageRank

yet and, so, page C will not receive any either. Then, page B receives PageRank from page A

and during the second iteration, also page C gets its PageRank.

Regarding our example website for dangling links, removing page C from the database results

in page A and B each having a PageRank of 1. After the calculations, page C is assigned a

PageRank of 0.25 + 0.375 PR(A) = 0.625. So, the accumulated PageRank does not equal the

number of pages, but at least all pages which have outbound links are not harmed from the

danging links problem.



By removing dangling links from the database, they do not have any negative effects on the

PageRank of the rest of the web. Since PDF files are dangling links, links to PDF files do not

diminish the PageRank of the linking page or site. So, PDF files can be a good means of search

engine optimisation for Google





The Effect of the Number of Pages

Since the accumulated PageRank of all pages of the web equals the total number of web

pages, it follows directly that an additional web page increases the added up PageRank for all

pages of the web by one. But far more interesting than the effect on the added up PageRank of

the web is the impact of additional pages on the PageRank of actual websites.



To illustrate the effects of addional web pages, we take a look at a hierachically structured web

site consisting of three pages A, B and C, which are joined by an additional page D on the

hierarchically lower level of the site. The site has no outbound links. A link from page X which

has no other outbound links and a PageRank of 10 points to page A. At a damping factor d of

0.75, the equations for the single pages' PageRank values before adding page D are given by

PR(A) = 0.25 + 0.75 (10 + PR(B) + PR(C))

PR(B) = PR(C) = 0.25 + 0.75 (PR(A) / 2)

Solving the equations gives us the follwing PageRank values:

PR(A) = 260/14

PR(B) = 101/14

PR(C) = 101/14

After adding page D, the equations for the pages' PageRank values are given by

PR(A) = 0.25 + 0.75 (10 + PR(B) + PR(C) + PR(D))

PR(B) = PR(C) = PR(D) = 0.25 + 0.75 (PR(A) / 3)

Solving these equations gives us the follwing PageRank values:

PR(A) = 266/14

PR(B) = 70/14

PR(C) = 70/14

PR(D) = 70/14



As to be expected since our example site has no outbound links, after adding page D, the

accumulated PageRank of all pages increases by one from 33 to 34. Further, the PageRank of

page A rises marginally. In contrast, the PageRank of pages B and C depletes substantially.



The Reduction of PageRank by Additional Pages

By adding pages to a hierarchically structured websites, the consequences for the already

existing pages are nonuniform. The consequences for websites with a different structure shall

be shown by another example.



We take a look at a website constisting of three pages A, B and C which are linked to each

other in circle. The pages are then joined by page D which fits into the circular linking structure.

The regarded site has no outbound links. Again, a link from page X which has no other

outbound links and a PageRank of 10 points to page A. At a damping factor d of 0.75, the

equations for the single pages' PageRank values before adding page D are given by

PR(A) = 0.25 + 0.75 (10 + PR(C))

PR(B) = 0.25 + 0.75 × PR(A)

PR(C) = 0.25 + 0.75 × PR(B)

Solving the equations gives us the follwing PageRank values:

PR(A) = 517/37 = 13.97

PR(B) = 397/37 = 10.73

PR(C) = 307/37 = 8.30

After adding page D, the equations for the pages' PageRank values are given by

PR(A) = 0.25 + 0.75 (10 + PR(D))

PR(B) = 0.25 + 0.75 × PR(A)

PR(C) = 0.25 + 0.75 × PR(B)

PR(D) = 0.25 + 0.75 × PR(C)

Solving these equations gives us the follwing PageRank values:

PR(A) = 419/35 = 11.97

PR(B) = 323/35 = 9.23

PR(C) = 251/35 = 7.17

PR(D) = 197/35 = 5.63



Again, after adding page D, the accumulated PageRank of all pages increases by one from 33

to 34. But now, any of the pages which already existed before page D was added lose

PageRank. The more uniform PageRank is distributed by the links within a site, the more likely

will this effect occur.

Since adding pages to a site often reduces PageRank for already existing pages, it becomes

obvious that the PageRank algorithm tends to privilege smaller web sites. Indeed, bigger web

sites can counterbalance this effect by being more attractive for other webmasters to link to

them, simply because they have more content.



None the less, it is also possible to increase the PageRank of existing pages by additional

pages. Therefore, it has to be considered that as few PageRank as possible is distributed to

these additional pages





The Distribution of PageRank Regarding Search Engine Optimisation

Up to this point, it has been described how the number of pages and the number of inbound and

outbound links, respectively, influence PageRank. Here, it will mainly be discussed in how far

PageRank can be affected for the purpose of search engine optimisation by a website's internal

linking structure.

In most cases, websites are hierachically structured to a certain extend, as it is illustrated in our

example of a web site consisting of the pages A, B and C. Normally, the root page is withal

optimised for the most important search phrase. In our example, the optimised page A has an

external inbound link from page X which has no other outbound links and a PageRank of 10.

The pages B and C each receive a link from page A and link back to it. If we set the damping

factor d to 0.5 the equations for the single pages' PageRank values are given by



PR(A) = 0.5 + 0.5 (10 + PR(B) + PR (C))

PR(B) = 0.5 + 0.5 (PR(A) / 2)

PR(C) = 0.5 + 0.5 (PR(A) / 2)

Solving the equations gives us the follwing PageRank values:

PR(A) = 8

PR(B) = 2.5

PR(C) = 2.5



It is generally not advisable to solely work on the root page of a site for the purpose of search

engine optimisation. Indeed, it is, in most cases, more reasonable to optimise each page of a

site for different search phrases.



We now assume that the root page of our example website provides satisfactory results for its

search phrase, but the other pages of the site do not, and therefore we modify the linking

structure of the website. We add links from page B to page C and vice versa to our formerly

hierarchically structured example site. Again, page A has an external inbound link from page X

which has no other outbound links and a PageRank of 10. At a damping factor d of 0.5, the

equations for the single pages' PageRank values are given by



PR(A) = 0.5 + 0.5 (10 + PR(B) / 2 + PR(C) / 2)

PR(B) = 0.5 + 0.5 (PR(A) / 2 + PR(C) / 2)

PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B) / 2)

Solving the equations gives us the follwing PageRank values:

PR(A) = 7

PR(B) = 3

PR(C) = 3



The result of adding internal links is an increase of the PageRank values of pages B and C, so

that they likely will rise in search engine result pages for their targeted keywords. On the other

hand, of course, page A will likely rank lower because of its diminished PageRank.

Generally spoken, PageRank will distribute for the purpose of search engine optimisation more

equally among the pages of a site, the more the hierarchically lower pages are interlinked.



Well Directed PageRank Distribution by Concentration of Outbound Links

It has already been demonstrated that external outbound links tend to have negative effects on

the PageRank of a website's web pages. Here, it shall be illustrated how this effect can be

reduced for the purpose of search engine optimisation by the systematic arrangement of

external outbound links.

We take a look at another hierarchically structured example site consisting of the pages A, B, C

and D. Page A has links to the pages B, C and D. Besides a link back to page A, each of the

pages B, C and D has one external outbound link. None of those external pages which receive

links from the pages B, C and D link back to our example site. If we assume a damping factor d

of 0.5, the equations for the calculation of the single pages' PageRank values are given by



PR(A) = 0.5 + 0.5 (PR(B) / 2 + PR(C) / 2 + PR(D) / 2)

PR(B) = PR(C) = PR(D) = 0.5 + 0.5 (PR(A) / 3)

Solving the equations gives us the follwing PageRank values:

PR(A) = 1

PR(B) = 2/3

PR(C) = 2/3

PR(D) = 2/3



Now, we modify our example site in a way that page D has all three external outbound links

while pages B and C have no more external outbound links. Besides this, the general conditions

of our example stay the same as above. None of the external pages which receive a link from

pages D link back to our example site. If we, again, assume a damping factor d of 0.5, the

equations for the calculations of the single pages' PageRank values are given by

PR(A) = 0.5 + 0.5 (PR(B) + PR(C) + PR(D) / 4)

PR(B) = PR(C) = PR(D) = 0.5 + 0.5 (PR(A) / 3)



Solving these equations gives us the follwing PageRank values:

PR(A) = 17/13

PR(B) = 28/39

PR(C) = 28/39

PR(D) = 28/39



As a result of our modifications, we see that the PageRank values for each single page of our

site have increased. Regarding search engine optimisation, it is therefore advisable to

concentrate external outbound links on as few pages as possible, as long as it does not lessen

a site's usabilty.





Link Exchanges for the purpose of Search Engine Optimisation

For the purpose of search engine optimisation, many webmasters exchange links with others to

increase link popularity. As it has already been shown, adding links within closed systems of

web pages has no effects on the accumulated PageRank of those pages. So, it is questionable

if link exchanges have positive consequences in terms of PageRank at all.



To show the effects of link exchanges, we take a look at an an example of two hierarchically

structured websites consisting of pages A, B and C and D, E and F, respectively. Within the first

site, page A links to pages B and C and those link back to page A. The second site is structured

accordingly, so that the PageRank values for its pages do not have to be computed explicitly. At

a damping factor d of 0.5, the equations for the single pages' PageRank values are given by

PR(A) = 0.5 + 0.5 (PR(B) + PR(C))

PR(B) = PR(C) = 0.5 + 0.5 (PR(A) / 2)

Solving the equations gives us the follwing PageRank values for the first site

PR(A) = 4/3

PR(B) = 5/6

PR(C) = 5/6

and accordingly for the second site

PR(D) = 4/3

PR(E) = 5/6

PR(F) = 5/6

Now, two pages of our example sites start a link exchange. Page A links to page D and vice

versa. If we leave the general conditions of our example the same as above and, again, set the

damping factor d to 0.5, the equations for the calculations of the single pages' PageRank values

are given by

PR(A) = 0.5 + 0.5 (PR(B) + PR(C) + PR(D) / 3)

PR(B) = PR(C) = 0.5 + 0.5 (PR(A) / 3)

PR(D) = 0.5 + 0.5 (PR(E) + PR(F) + PR(A) / 3)

PR(E) = PR(F) = 0.5 + 0.5 (PR(D) / 3)

Solving these equations gives us the follwing PageRank values:

PR(A) = 3/2

PR(B) = 3/4

PR(C) = 3/4

PR(D) = 3/2

PR(E) = 3/4

PR(F) = 3/4



We see that the link exchange makes pages A and D benefit in terms of PageRank while all

other pages lose PageRank. Regarding search engine optimisation, this means that the exactly

opposite effect compared to interlinking hierachically lower pages internally takes place. A link

exchange is thus advisable, if one page (e.g. the root page of a site) shall be optimised for one

important key phrase.



A basic premise for the positive effects of a link exchange is that both involved pages propagate

a similar amount of PageRank to each other. If one of the involved pages has a significantly

higher PageRank or fewer outbound links, it is likely that all of its site's pages lose PageRank.

Here, an important influencing factor is the size of a site. The more pages a web site has, the

more PageRank from an inbound link is distributed to other pages of the site, regardless of the

number of outbound links on the page that is involved in the link exchange. This way, the page

involved in a link exchange itself benefits lesser from the link exchange and cannot propagate

as much PageRank to the other page involved in the link exchange. All the influencing factors

should be weighted up against each other bevor one trades links.



Finally, it shall be noted that it is possible that all pages of a site benefit from a link exchange in

terms of PageRank, whereby also the other site taking part in the link exchange does not lose

PageRank. This may occur, when the page involved in the link exchange already has a certain

number of external outbound links which don't link back to that site. In this case, less PageRank

is lost by the already existing outbound links.





The Yahoo Bonus and its Impact on Search Engine Optimization

Many experts in search engine optimization assume that certain websites obtain a special

PageRank evaluation from the Google search engine which needs a manual intervention and

does not derive from the PageRank algorithm directly. Mostly, the directories Yahoo and Open

Directory Project (dmoz.org) are considered to get this special treatment. In the context of

search engine optimization, this assumption would have the consequence that an entry into the

above mentioned directories had a big impact on a site's PageRank.



An approach that is often mentioned when talking about influencing the PageRank of certain

websites is to assign higher starting values to them before the iterative computation of

PageRank begins. Such proceeding shall be reviewed by looking at a simple example web

consisting of two pages, whereby each of these pages solely links to the other. We assign an

initial PageRank of 10 to one page and a PageRank of 1 to the other. The damping factor d is

set to 0.1, because the lower d is, the faster the PageRank values converge during the

iterations. So, we get the following equations for the computation of the pages' PageRank

values:



PR(A) = 0.9 + 0.1 PR(B)

PR(B) = 0.9 + 0.1 PR(A)

During the iterations, we get the following PageRank values:



Iteration PR (A) PR (B)

0 1 10

1 1.9 1.09

2 1.009 1.0009

3 1.00009 1.000009



It is obvious that despite assigning different starting values to the two pages each of the

PageRank values converges to 1, just as it would have happened if no initial values were

assigned. Hence, starting values have no effect on PageRank if a sufficient number of iterations

takes place. Indeed, if the computation is performed with only few iterations, the starting values

would influence PageRank. But in this case, we have to consider that in our example the

PageRank relation between the two pages reverses after the first iteration. However, it shall be

noted that for our computation the actual PageRank values within one iteration have been used

and not the ones from the previous iteration. If those values would have been used, the

PageRank relation had alternated after each iteration.



Modification of the PageRank Algorithm

If assigning special starting values at the begin of the PageRank calculations has no effect on

the results of the computation, this does not mean that it is not possible to influence the

PageRank of websites or web pages by an intervention in the PageRank algorithm. Lawrence

Page, for instance, describes a method for a special evaluation of web pages in his PageRank

patent specifications (United States Patent 6,285,999). The starting point for his consideration is

that the random surfer of the Random Surfer Model may get bored and stop following links with

a constant probabilty, but when he restarts, he won't take a random jump to any page of the

web but will rather jump to certain web pages with a higher probability than to others. This

behaviour is closer to the behaviour of a real user, who would more likely use, for instance,

directories like Yahoo or ODP as a starting point for surfing.

If a special evaluation of certain web pages shall take place, the original PageRank algorithm

has to be modified. With another expected value implemented, the algorithm is given as follows:



PR(A) = E(A) (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))



Here, (1-d) is the probability for the random surfer no longer following links. E(A) is the

probability for the random surfer going to page A after he has stopped following links, weighted

by the number of web pages. So, E is another expected value whose average over all pages is

1. In this way, the average of the PageRank values of all pages of the web will continue to

converge to 1. Thus, the PageRank values do not vaccilate because of the special evaluation of

web pages and the impact of PageRank on the general ranking of web pages remains stable.



In our example, we set the probability for the random surfer going to page A after he has

stopped following links to 0.1. The probability for him going to page B is set to 0.9. Since our

web consists of two pages E(A) equals 0.2 and E(B) equals 1.8. At a damping factor d of 0.5 we

get the following equations for the calculation of the single pages' PageRank values:



PR(A) = 0.2 × 0.5 + 0.5 × PR(B)

PR(B) = 1.8 × 0.5 + 0.5 × PR(A)

If we solve these equations we get the following PageRank values:

PR(A) = 11/15

PR(B) = 19/15



The sum of the PageRank values remains 2. The higher probability for the random surfer

jumping to page B is reflected by its higher PageRank. Indeed, the uniform interlinking between

both pages prevents our example pages' PageRank values from a more significant impact of

our intervention.

So, it is possible to implement the special evaluation of certain web pages into the PageRank

algorithm without having to change it fundamentally. It is questionable, indeed, what criteria is

used for the evaluation. Lawrence Page suggests explicitly the utilization of real usage data in

his PageRank patent specifications. Google, meanwhile, collects usage date by means of the

Google Toolbar. And Google would not even need as much data, as if the whole ranking was

solely based on usage data. A limited sample would be sufficient to determine the 1,000 or

10,000 most important pages on the web. The PageRank algorithm can then fill the holes in

usage data and is thereby able to deliver a more accurate picture of the web.

Of course, all statements regarding the influence of real usage data on PageRank are pure

speculation. Even if there is a special evaluation of certain web pages at all will in the end, stay

a secret of the people at Google.





Nonetheless, Assigning Starting Values?

Although, assigning special starting values to pages at the begin of PageRank calculations has

no effect on PageRank values it can, nonetheless, be reasonable.



We take a look at our example web consisting of the pages A, B and C, whereby page A links to

the pages B and C, page B links to page C and page C links to page A. In this case, the

damping factor d is set to 0.75. So, we get the following equations for the iterative computation

of the single pages' PageRank values:



PR(A) = 0.25 + 0.75 PR(C)

PR(B) = 0.25 + 0.75 (PR(A) / 2)

PR(C) = 0.25 + 0.75 (PR(A) / 2 + PR(B))

Basically, it is not necessary to assign starting values to the single pages before the

computation begins. They simply start with a value of 0 and we get the following PageRank

values during the iterations:



Iteration PR(A) PR(B) PR(C)

0000

1 0.25 0.34375 0.60156

2 0.70117 0.51294 0.89764

3 0.92323 0.59621 1.04337

4 1.03253 0.63720 1.11510

5 1.08632 0.65737 1.15040

6 1.11280 0.66730 1.16777

7 1.12583 0.67219 1.17633

8 1.13224 0.67459 1.18054

9 1.13540 0.67578 1.18261

10 1.13696 0.67636 1.18363

11 1.13772 0.67665 1.18413

12 1.13810 0.67679 1.18438

13 1.13828 0.67686 1.18450

14 1.13837 0.67689 1.18456

15 1.13842 0.67691 1.18459

16 1.13844 0.67692 1.18460

17 1.13845 0.67692 1.18461

18 1.13846 0.67692 1.18461

19 1.13846 0.67692 1.18461

20 1.13846 0.67692 1.18461

21 1.13846 0.67692 1.18461

22 1.13846 0.67692 1.18462



If we assign 1 to each page before the computation starts, we get the following PageRank

values during the iterations:



Iteration PR(A) PR(B) PR(C)

0111

1 1 0.625 1.09375

2 1.07031 0.65137 1.13989

3 1.10492 0.66434 1.16260

4 1.12195 0.67073 1.17378

5 1.13034 0.67388 1.17928

6 1.13446 0.67542 1.18199

7 1.13649 0.67618 1.18332

8 1.13749 0.67656 1.18398

9 1.13798 0.67674 1.18430

10 1.13823 0.67684 1.18446

11 1.13835 0.67688 1.18454

12 1.13840 0.67690 1.18458

13 1.13843 0.67691 1.18460

14 1.13845 0.67692 1.18461

15 1.13845 0.67692 1.18461

16 1.13846 0.67692 1.18461

17 1.13846 0.67692 1.18461

18 1.13846 0.67692 1.18461

19 1.13846 0.67692 1.18462



If we now assign a starting value to each page, which is closer to its effective PageRank (1.1 for

page A, 0.7 for page B and 1.2 for page C), we get the following results:



Iteration PR(A) PR(B) PR(C)

0 1.1 0.7 1.2

1 1.15 0.68125 1.19219

2 1.14414 0.67905 1.18834

3 1.14126 0.67797 1.18645

4 1.13984 0.67744 1.18552

5 1.13914 0.67718 1.18506

6 1.13879 0.67705 1.18483

7 1.13863 0.67698 1.18472

8 1.13854 0.67695 1.18467

9 1.13850 0.67694 1.18464

10 1.13848 0.67693 1.18463

11 1.13847 0.67693 1.18462

12 1.13847 0.67692 1.18462

13 1.13846 0.67692 1.18462



So, the closer the assigned starting values are to the effective results we would get by solving

the equations, the faster do the PageRank values converge in the iterative computation. Less

iterations are needed, which can be useful for providing more up to date search results,

especially regarding the growth rate of the web. Starting point for an accurate presumption of

the actual PageRank distribution may be the PageRank values of a former PageRank

calculation. All the pages which are new in the index could get an initial PageRank of 1, which

will then be a lot closer to the effective PageRank value after the first few iterations.





Additional Factors Influencing PageRank

It has been widely discussed if additional criteria beyond the link structure of the web have been

implemented in the PageRank algorithm since the scientific work on PageRank has been

published by Lawrence Page and Sergey Brin. Lawrence Page himself outlines the following

potential influencing factors in his patent specifications for PageRank:





Visibility of a link

Position of a link within a document

Distance between web pages

Importance of a linking page

Up-to-dateness of a linking page

First of all, the implementation of additional criteria in PageRank would result in a better

approximation of human usage regarding the Random Surfer Model. Considering the visibility of

a link and its position within a document implies that a user does not click on links completely at

haphazard, but rather follows links which are highly and immediately visible regardless of their

anchor text. The other criteria would give Google more flexibility in determing in how far an

inbound link of a page should be considered important, than the methods which have been

described so far.

Whether or not the above mentioned factors are actually implemented in PageRank can not be

proved empirically and shall not be discussed here. It shall rather be illustrated in which way

additional influencing factors can be implemented in the PageRank algorithm and which options

the Google search engine thereby gets in terms of influencing PageRank values.



Modification of the PageRank Algorithm

To implement additional factors in PageRank, the original PageRank algorithm has again to be

modified. Since we have to assume that PageRank calculations are still based on numerous

iterations and for the purpose of short computation times, we have to consider to keep the

number of database queries during the iterations as small as possible. Therefore, the following

modification of the PageRank algorithm shall be assumed:

PR(A) = (1-d) + d (PR(T1)×L(T1,A) + ... + PR(Tn)×L(Tn,A))

Here, L(Ti,A) represents the evaluation of a link which points from page Ti to page A. L(Ti,A)

withal replaces the PageRank weighting of page Ti by the number of outbound links on page Ti

which was given by 1/C(Ti). L(Ti,A) may consist of several factors, each of them having to be

determined only once and then being merged to one value before the iterative PageRank

calculation begins. So, the number of database queries during the iterations stays the same,

although, admittedly, a much larger database has to be queried at each step in comparison to

the computation by use of the original algorithm, since now there is an evaluation of each link

instead of an evaluation of pages (by the number of their outbound links).



Different Evaluation of Links within a Document

Two of the criteria for the evaluation of links mentioned by Lawrence Page in his PageRank

patent specifications are the visibilty of a link and its position within a document. Regarding the

Random Surfer Model, those criteria reflect the probability for the random surfer clicking on a

link on a specific web page. In the original PageRank algorithm, this probability is given by the

term (1/C(Ti)), whereby the probability is equal for each link on one page.

Assigning different probabilities to each link on a page can, for instance, be realized as follows:

We take a look at a web consisting of three pages A, B anc C, where each of these pages has

outbound links to both of the other pages. Links are weighted by two evaluation criteria X and Y.

X represents the visibility of a link. X equals 1 if a link is not particularly emphasized, and 2 if the

link is, for instance, bold or italic. Y represents the position of a link within a document. Y equals

1 if the link is on the lower half of the page, and 3 if the link is on the upper half of the page. If

we assume a multiplicative correlation between X and Y, the links in our example are evaluated

as follows:

X(A,B) × Y(A,B) = 1 × 3 = 3

X(A,C) × Y(A,C) = 1 × 1 = 1

X(B,A) × Y(B,A) = 2 × 3 = 6

X(B,C) × Y(B,C) = 2 × 1 = 2

X(C,A) × Y(C,A) = 2 × 3 = 6

X(C,B) × Y(C,B) = 2 × 1 = 2

For the purpose of determinig the single factors L, the evaluated links must not simply be

weighted by the number of outbound links on one page, but in fact by the total of evaluated links

on the page. Thereby, we get the following weighting quotients Z(Ti) for the single pages Ti:

Z(A) = X(A,B) × Y(A,B) + X(A,C) × Y(A,C) = 4

Z(B) = X(B,A) × Y(B,A) + X(B,C) × Y(B,C) = 8

Z(C) = X(C,A) × Y(C,A) + X(C,B) × Y(C,B) = 8

The evaluation factors L(T1,T2) for a link which is pointing from page T1 to page T2 are hence

given by

L(T1,T2) = X(T1,T2) × Y(T1,T2) / Z(T1)

Their values regarding our example are as follows:

L(A,B) = 0.75

L(A,C) = 0.25

L(B,A) = 0.75

L(B,C) = 0.25

L(C,A) = 0.75

L(C,B) = 0.25

At a damping factor d of 0.5, we get the following equations for the calculation of PageRank

values:

PR(A) = 0.5 + 0.5 (0.75 PR(B) + 0.75 PR(C))

PR(B) = 0.5 + 0.5 (0.75 PR(A) + 0.25 PR(C))

PR(C) = 0.5 + 0.5 (0.25 PR(A) + 0.25 PR(B))

Solving these equations gives us the follwing PageRank values for our example:

PR(A) = 819/693

PR(B) = 721/693

PR(C) = 539/693

First of all, we see that page A has the highest PageRank of all three pages. This is caused by

page A receiving the relatively higher evaluated link from page B as well as from page C.



Furthermore, we see that even by the evaluation of single links, the sum of the PageRank

values of all pages equals 3 (2079/693) and thereby the total number of pages. So, the

PageRank values computed by our modified PageRank algorithm can be used for the general

ranking of web pages by Google without any normalisation being needed.



Different Evaluation of Links by Page Specific Criteria

Besides the unequal evaluation of links within a document, Lawrence Page mentions the

possibility of evaluating links according to criteria which are based upon the linking page. At first

glance, this does not seem necessary since it is the main principle of PageRank to rank pages

the higher, the more high ranking pages link to them. But, at the time of their scientific work on

PageRank, Page and Brin have already recognized that their algorithm is vulnerable to artificial

inflation of PageRank.



An artificial influence on PageRank might be exerted by webmasters who generate a multitude

of web pages whose links distribute PageRank in a way that single pages within that system

receive a special importance. Those pages can have a high PageRank without being linked to

from other pages with high PageRank. So, not only the concept of PageRank is undermined,

but also the search engine's index is spammed with an innumerable amount of web pages

which were solely created to influence PageRank.



In his patent specifications for PageRank, Lawrence Page presents the evaluation of links by

the distance between pages as a means to avoid the artificial inflation of PageRank, because

the bigger the distance between two pages, the less likely has one webmaster control over both.

A criterium for the distance between two pages may be if they are on the same domain or not.

In this way, internal links would be weighted less than external links. In the end, any general

measure of the distance between links can be used to determine such a weighting. This

comprehends if pages are on the same server or not and also the geographical distance

between servers.



As another indicator for the importance of a document, Lawrence Page mentions the up-to-

dateness of the documents which link to it. This argument considers that the information on a

page is less likely outdated, the more pages which have been modified recently link to it. In

contrast, the original PageRank concept, just like any method of measuring link popularity,

favours older documents which gained their inbound links in the course of their existence and

have at a higher probability been modified less recently than new documents. Basically, recently

modified documents may be given a higher evaluation by weighting the factor (1-d). In this way,

both those recently modified documents and the pages they link to receive a higher PageRank.

But, if a page has been modified recently, is not necessarily an indicator for the importance of

the information presented on it. So, as suggested by Lawrence Page, it is advisable not to

favour recently modified pages but only their outbound links.



Finally, Page mentions the importance of the web location of a page as an indicator of the

importance of its outbound links. As an example for an important web location he names the

root page of a domain, but, in the end, Google could exert influence on PageRank absolutely

arbitrarily.

To implement the evaluation of the linking page into PageRank, the evaluation factor of the

modified algorithm must consist of several single factors. For a link that points from page Ti to

page A, it can be given as follows:

L(Ti,A) = K(Ti,A) × K1(Ti) × ... × Km(Ti)

where K(Ti,A) is the above presented weighting of a single link within a page by its visibility or

position. Additionally, an evaluation of page Ti by m criteria which are represented by the factors

Kj(Ti) takes place.



To implement the evaluation of the linking pages, not only the algorithm but also the

proceedings of PageRank calculation have to be modified. This shall be illustrated by an

example.



We take a look at a web consisting of three pages A, B and C, whereby page A links to the

pages B and C, page B links to page C and page C links to page A. The outbound links of one

page are evaluated equally, so there is no weighting by visibilty or position. But now, the pages

are evaluated by one criterium. In this way, an inbound link from page C shall be considered

four times as important as an inbound link from one of the other pages. After weighting by the

number of pages, we get the following evaluation factors:

K(A) = 0.5

K(B) = 0.5

K(C) = 2

At a damping factor d of 0.5, the equations for the computation of the PageRank values are

given by

PR(A) = 0.5 + 0.5 × 2 PR(C)

PR(B) = 0.5 + 0.5 × 0.5 × 0.5 PR(A)

PR(C) = 0.5 + 0.5 (0.5 PR(B) + 0.5 × 0.5 PR(A))

Solving the equations gives us the follwing PageRank values:

PR(A) = 4/3

PR(B) = 2/3

PR(C) = 5/6

At the current modifications of the PageRank algorithm, the accumulated PageRank of all pages

no longer equals the number of pages. The reason therefore is that the weighting of the page

evaluation by the number of pages was not appropriate. To determine the proper weighting, the

web's linking structure would have to be anticipated, which is not possible in case of the actual

WWW. Therefore, the PageRank calculated by an evaluation of linking pages has to be

normalized if there shall not be any unfounded effects on the general ranking of pages by

Google. Within the iterative calculation, a normalization would have to take place after each

iteration to minimize unintentional distortions.



In the case of a small web, the evaluation of pages often causes severe distortions. In the case

of the actual WWW, these distortions should normally equalise by the number of pages. Indeed,

it is to be expected that the evaluation of the distance between pages will cause distortions on

PageRank, since pages with many inbound links surely tend to be linked to from different

geographical regions. But such effects can be anticipated by experience from previous

calculation periods, so that a normalisation would only have to be marginal.



In either case, implementing additional factors in PageRank is possible. Indeed, the

computation of PageRank values would take more time





Theme-based PageRank

For many years now, the topic- or theme-based homogeneity of websites has been dicussed as

a possible ranking criterion of search engines. There are various theoretical approaches for the

implementation of themes in search engine algorithms which all have in common that web

pages are no longer ranked only based on their own content, but also based on the content of

other web pages. For example, the content of all pages of one website may take influence on

the ranking of one single page of this website. On the other hand, it is also conceivable that one

page's ranking is based on the content of those pages which link to it or which it links to itself.



The potential implementation of a theme-based ranking in the Google search engine is

discussed controversially. In search engine optimization forums and on websites on this topic

we can over and over again find advice that inbound links from sites with a similar theme to our

own have a larger influence on PageRank than links from unrelated sites. This hypothesis shall

be discussed here. Therefore, we first of all take a look at two relatively new approaches for the

integration of themes in the PageRank technique: on the one hand the "intelligent surfer" by

Matthew Richardson and Pedro Domingos and on the other hand the Topic-Sensitive

PageRank by Taher Haveliwala. Subsequently, we take a look at the possibility of using content

analyses in order to compare the text of web pages, which can be a basis for weighting links

within the PageRank technique.



The "Intelligent Surfer" by Richardson and Domingos

Matthew Richardson and Pedro Domingos resort to the Random Surfer Model in order to

explain their approach for the implementation of themes in the PageRank technique. Instead of

a surfer who follws links completely at random, they suggest a more intelligent surfer who, on

the one hand, only follows links which are related to an original search query and, on the other

hand, also after "getting bored" only jumps to a page which relates to the original query.



So, to Richardson and Domingos' "intelligent surfer" only pages are relevant that contain the

search term of an initial query. But since the Random Surfer Model does nothing but illustrate

the PageRank technique, the question is how an "intelligent" behaviour of the Random Surfer

influences PageRank. The answer is that for every term occuring on the web a separate

PageRank calculation has to be conducted and each calculation is solely based on links

between pages which contain that term.



Computing PageRank this way causes some problems. They especially appear for search

terms that do not occur so often on the web. To make it into the PageRank calculations for a

specific search term, that term has not only to appear on someone's page, but also on the

pages that link to it. So, the search results would often be based on small subsets of the web

and may omit relevant sites. In addition, using such small subsets of the web, the algorithms are

more vulnerable to spam by automatically generating numerous pages.



Additionally, there are serious problems regarding scalability. Richardson and Domingos

estimate the memory and computing time requirements for several 100,000 terms 100-200

times higher compared to the original PageRank calculations. Regarding the large number of

small subsets of the web, these numbers appear to be realistic.



The higher memory requirements should not be so much of a problem because Richardson and

Domingos correctly state that the term specific PageRank values constitute only a fraction of the

data volume of Google's inverse index. However, the computing time requirements are indeed a

large problem. If we assume just five hours for a conventional PageRank calculation, then this

would last about 3 weeks based on Richardson and Domingos' model, which makes it

unsuitable for actual employment.



Taher Haveliwala's Topic-Sensitive PageRank

The approach of Taher Havilewala seems to be more practical for actual employment. Just like

Richardson and Domingos, also Havilewala suggests the computation of different PageRanks

for different topics. However, the Topic-Sensitive PageRank does not consist of hundreds of

thousands of PageRanks for different terms, but of a few PageRanks for different topics. Topic-

Sensitive PageRank is based on the link structure of the whole web, whereby the topic

sensitivity implies that there is a different weighting for each topic.



The basic principle of Haveliwala's approach has already been described in our section on the

"Yahoo-Bonus", where we have discussed the possibility to assign a particular imporance to

certain web pages. In the words of the Random Surfer Model, this is realized by increasing the

probability for the Random Surfer jumping to a page after "getting bored". Via links, this manual

intervention in the PageRank technique has an influence on the PageRank of each page on the

web. More precisely, we have reached taking influence on PageRank by implementing another

value E in the PageRank algorithm:

PR(A) = E(A) (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))



Haveliwala's Topic-Sensitive-PageRank goes one step further. Instead of assigning a

universally higher value to a website or a web page, Haveliwala differentiates on the basis of

different topics. For each of these topics, he identifies other authority pages. On the basis of this

evaluation, different PageRanks are calculated - each separately, but for the entire web.

For his experiments on Topic-Sensitive PageRank, Haveliwala has chosen the 16 top-level

categories of the Open Directory Project both for the identification of topics and for the

intervention in PageRank. More precisely, Haveliwala assigns a higher value E to the pages of

those ODP categories for which he calculates PageRank. If, for example, he calculates the

PageRank for the topic health, all the ODP pages in the health category receive a relatively

higher value E and they pass this value in the form of PageRank on to the pages which are

linked from there. Of course, this PageRank is passed on to other pages and, if we assume that

health-related websites tend to link more often to other websites within that topic, pages on the

topic health generally receive a higher PageRank.



Haveliwala confirms the incompleteness of choosing the Open Directory Project in order to

identify topics, which for example results in a high degree of dependence on ODP editors and in

a rather rough subdivision into topics. But, as Haveliwala states, his method shows good results

and it can surely be improved without big effort.



However, one crucial point in Haveliwala's work on Topic-Sensitive-PageRank is the

identification of the user's preferences. Having a Topic-Specific PageRank is useless as long as

we do not know in which topics an actual user is interested. In the end, search results must be

based on the PageRank that matches the user's preferences best. The Topic-Sensitive

PageRank can only be used if these are known.



Indeed, Haveliwala does supply some practicable approaches for the identification of user

preferences. He describes, for example, the search in context by highlighting terms on a web

page. In this way, the content of that web page could be an indicator for waht the user is looking

for. At this point, we want to note the potential of the Google Toolbar. The Toolbar submits data

regarding search terms and pages that a user has visited to Google. This data can be used to

create user profiles which can then be a basis for the identification of the user's preferences.

However, even without using such techniques, it is imaginable that a user simply chooses the

topic he is interested in before he does a query.



The Weighting of Links Based on Content Analyses

That it is possible to weight single links within the PageRank technique has been shown on the

previous page. The thought behind weighting links based on content analyses is to avoid the

corrumption of PageRank. By weighting links this way, it is theoretically possible to diminish the

influence of links between thematically unrelated page, which have been set for the sole

purpose of boosting PageRank of one page. Indeed, it is questionable if it is possible to realize

such weighting based on content analyses.



The fundamentals of content analyses are based on Gerard Salton's work in the 1960s and

1970s. In his vector space model of information retrieval, documents are modeled as vectors

which are built upon terms and their weighting within the document. These term vectors allow

comparisons between the content of documents by, for instance, calculating the cosine

measure (the inner product) of the vectors. In its basic form, the vector space model has some

weaknesses. For instance, often the assumption that if and in how far the same words appear in

two documents is an indicator for their similarity is criticized. However, numerous enhancements

have been developed that solve most of the problems of the vector space model.



One person who excelled at publications which are based on Salton's vector space model is

Krishna Bharat. This is interesting because Bharat meanwhile is a member of Google's staff

and, particularly, because he is deemded to be the developer of "Google News"

(news.google.com). Google News is a service that crawls news websites, evaluates articles and

then provides them categorized and grouped in different subjects on the Google New website.

According to Google, all these procedures are completely automated. Therefore, other criteria

like, for example, the time when an article is published, are taken into account, but if there is no

manual intervention, the clustering of articles is most certainly only possible, if the contents of

the articles are actually compared to each other. The questions is: How can this be realized?



In their publication on a term vector database, Raymie Stata, Krishna Bharat and Farzin

Maghoul describe how the contents of web pages can be compared based on term vectors and,

particularly, they describe how some of the problems with the vector space model can be

solved. Firstly, not all terms in documents are suitable for content analsysis. Very frequent terms

provide only little discrimination across vectors and, so, the most frequent third of all terms is

eliminated from the database. Infrequent terms, on the other hand, do not provide a good basis

for measuring similarity. Such terms are, for example, misspellings. They appear only on few

pages which are likely unrelated in terms of their theme, but because they are so infrequent, the

term vectors of the pages appear to be closely related. Hence, also the least frequent third of all

terms is eliminated from the database.



Even if only one third of all terms is included in the term vectors, this selection is still not very

efficient. Stata, Bharat and Maghoul perform another filtering, so that each term vector is based

on a maximum of 50 terms. But these are not the 50 most frequent terms on a page. They

weight a term by deviding the number of times it appears on a page by the number of times it

appears on all pages, and those 50 terms with the highest weight are included in the term vector

of a page. This selection actually allows a real differentiation between the content of pages.



The methods described above are standards for the vector space model. If, for example, the

inner product of two term vectors is rather high, the contents of the according pages tend to be

similar. This may allow content comparisons in many areas, but it is doubtful if it is a good basis

for weighting links within the PageRank technique. Most of all, synonyms and terms that

describe similar things can not be identified. Indeed, there are algorithms for word stemming

which work good for the english language, but in other languages word stemming is much more

complicated. Different languages are a general problem. Unless, for instance, brand names or

loan words are used, texts in different languages normally do not contain the same terms. And if

they do, these terms normally have a completely different meaning, so that comparing content

in different languages is not possible. However, Stata, Bharat and Maghoul provide a method of

resolution for these problems.



Stata, Bharat und Maghoul present a concrete application for their Term Vector Database by

classifying pages thematically. Bharat has also published on this issue together with Monika

Henzinger, presently Google's Research Director, and they called it "topic distillation". Topic

distillation is based on calculating so-called topic vectors. Topic vectors are term vectors, but

they do not only include terms of one page but rather the terms of many pages which are on the

same topic. So, in order to create topic vectors, they have to know a certain amount of web

pages which are on several pre-defined topics. To achieve this, they resort to web directories.



For their application, Stata, Bharat und Maghoul have crawled about 30,000 links within each of

the then 12 main categories of Yahoo to create topic vectors which include about 10,000 terms

each. Then, in order to identify the topic of any other web page, they matched the according

term vector with all the topic vectors which were created from the Yahoo crawl. The topic of a

web page derived from the topic vector which matched the term vector of the web page best.

That such a classification of web pages works can again be observed by the means of Google

News. Google News does not only merge articles to one news topic, but also arranges them to

the categories World, U.S., Business, Sci/Tech, Sports, Entertainment and Health. As long as

this categorization is not based on the structure of the website where the articles come from

(which is unlikely), the actual topic of an article has in fact to be computed.



At the time he published on term vectors, Krishna Bharat did not work on PageRank but rather

on Kleinberg's algorithm, so that he was more interested in filtering off-topic links than in

weighting links. But from classifying pages to weighting links based on content comparisons,

there is only a small step. Instead of matching the term vectors of two pages, it is much more

efficient to match the topics of two pages. We can, for instance, create a "topic affinity vector"

for each page based on the degree of affinity of the page's term vector and all the topic vectors.

The better the topic affinity vectors of two pages match, the more likely are they on the same

topic and the higher should a link between them be weighted.



Using topic vectors has one big advantage over comparing term vectors directly: A topic vector

can include terms in different languages by being based on, for instance, the links on different

national Yahoo versions. Deviant site structures of the national versions can most certainly be

adapted manually. Even better may be using the ODP because the structure of the sub-

categories of the "World" category is based on the main OPD structure. In this way, measuring

topic similarities between pages in different languages can be realized, so that a really useful

weighting of links based on text analyses appears to be possible.



Is there an Actual Implementation of Themes in PageRank?

That both the approach of Havelivala and the approach of Richardson and Domingos are not

utilized by Google is obvious. One would notice it using Google. However, a weighting of links

based on text analyses would not be apparent immediately. It has been shown that it is

theoretically possible. But it is doubtful that it is actually implemented.



We do not want to claim that we have shown the only way of weighting links on the basis of text

analyses. Indeed, there are certainly dozens of others. However, the approach that we provided

here is based on publications of important members of Google's staff and, thus, we want to rest

a critical evaluation on it.



Like always, when talking about PageRank, there is the question if our approach is sufficienly

scalable. On the one hand, it causes additional memory requirements. After all, Stata, Bharat

and Maghoul describe the system architecture of a term vector database which is different from

Google's inverse index, since it maps from page ids to terms and, so, can hardly be integrated

in the existing architecture. At the actual size of Google's index, the additional memory

requirements should be several hundred GB to a few TB. However, this should not be so much

of a problem since Google's index is most certainly several times bigger. In fact, the time

requirements for building the database and for computing the weigtings appear to be the critical

part.

Building a term verctor database should be approximately as time-consuming as building an

inverse index. Of course, many procecces can probably be used for building both but if, for

instance, the weighting of terms in the term vectors has to differ from the weighting of terms in

the inverse index, the time requirements remain substantial. If we assume that, like in our

approach, content analyses are based on computing the inner products of topic affinity vectors

which have to be calculated by matching term vectors and topic vectors, this process should be

approximately as time-consuming as computing PageRank. Moreover, we have to consider that

the PageRank calculations themselves beome more complicated by weighting links.

So, the additional time requirements are definitely not negligible. This is why we have to ask

ourselves if weighting links based on text analyses is useful at all. Links between thematically

unrelated page, which have been set for the sole purpose of boosting PageRank of one page,

may be annoying, but most certainly they are only a small fraction of all links. Additionally, the

web itself is completely inhomogeneous. Google, Yahoo or the ODP do not owe their high

PageRank solely to links from other search engines or directories. A huge part of the links on

the web are simply not set for the purpose of showing visitors ways to more thematically related

information. Indeed, the motivation for placing links is manifold. Moreover, the problably most

popular websites are completely inhomogeneous in terms of theme. Think about portals like

Yahoo or news websites which contain articles that cover almost any subject of life. A strong

weighting of links as it has been described here could influence those website's PageRanks

significantly.



If the PageRank technique shall not become totally futile, a weighting of links can only take

place to a small extent. This, of course, raises the question if the efforts it requires are

justifiable. After all, there are certainly other ways to eliminate spam which often comes to the

top of search results through thematically unrelated and probably bought links.





PR0 - Google's PageRank 0 Penalty

By the end of 2001, the Google search engine introduced a new kind of penalty for websites

that use questionable search engine optimization tactics: A PageRank of 0. In search engine

optimization forums it is called PR0 and this term shall also be used here. Characteristically for

PR0 is that all or at least a lot of pages of a website show a PageRank of 0 in the Google

Toolbar, even if they do have high quality inbound links. Those pages are not completely

removed from the index but they are always at the end of search results and, thus, they are

hardly to be found.

A PageRank of 0 does not always mean a penalty. Sometimes, websites which seam to be

penalized simply lack inbound links with an sufficiently high PageRank. But if pages of a website

which have formerly been placed well in search results, suddenly show the dreaded white

PageRank bar, and if there have not been any substantial changes regarding the inbound links

of that website, this means - according to the prevailing opinion - certainly a penalty by Google.



We can do nothing but speculate about the causes for PR0 because Google representatives

rarely publish new information on Google's algorithms. But, non the less, we want to give a

theoretical approach for the way PR0 may work because of its serious effects on search engine

optimization.



The Background of PR0

Spam has always been one of the biggest problems that search engines had to deal with. When

spam is detected by search engines, the usual proceeding is the banishment of those pages,

websites, domains or even IP addresses from the index. But, removing websites manually from

the index always means a large assignment of personnel. This causes costs and definitely runs

contrary to Google's scalability goals. So, it appears to be necessary to filter spam

automatically.



Filtering spam automatically carries the risk of penalizing innocent webmasters and, hence, the

filters have to react rather sensibly on potential spam. But then, a lot of spam can pass the

filters and some additional measures may be necessary. In order to filter spam effectively, it

might be useful to take a look at links.

That Google uses link analysis in order to detect spam has been confirmed more or less clearly

in WebmasterWorld's Google News Forum by a Google employee who posts as "GoogleGuy".

Over and over again, he advises webmasters to avoid "linking to bad neighbourhoods". In the

following, we want to specify the "linking to bad neighbourhoods" and, to become more

precisely, we want to discuss how an identification of spam can be realized by the analysis of

link structures. In particular, it shall be shown how entire networks of spam pages, which may

even be located on a lot of different domains, can be detected.

BadRank as the Opposite of PageRank

The theoretical approach for PR0 as it is presented here was initially brought up by Raph Levien

(www.advogato.org/person/raph). We want to introduce a technique that - just like PageRank -

analyzes link structures, but, that unlike PageRank does not determine the general importance

of a web page but rather measures its negative characteristics. For the sake of simplicity this

technique shall be called "BadRank".



BadRank is in priciple based on "linking to bad neighbourhoods". If one page links to another

page with a high BadRank, the first page gets a high BadRank itself through this link. The

similarities to PageRank are obvious. The difference is that BadRank is not based on the

evaluation of inbound links of a web page but on its outbound links. In this sense, BadRank

represents a reversion of PageRank. In a direct adaptation of the PageRank algorithm,

BadRank would be given by the following formula:

BR(A) = E(A) (1-d) + d (BR(T1)/C(T1) + ... + BR(Tn)/C(Tn))

where

BR(A) is the BadRank of page A,

BR(Ti) is the BadRank of pages Ti which are outbound links of page A,

C(Ti) is here the number of inbound links of page Ti and

d is the again necessary damping factor.



In the previously discussed modifications of the PageRank algorithm, E(A) represented the

special evaluation of certain web pages. Regarding the BadRank algorithm, this value reflects if

a page was detected by a spam filter or not. Without the value E(A), the BadRank algorithm

would be useless because it was nothing but another analysis of link structures which would not

take any further criteria into account.



By means of the BadRank algorithm, first of all, spam pages can be evaluated. A filter assigns a

numeric value E(A) to them, which can, for example, be based on the degree of spamming or

maybe even better on their PageRank. Thereby, again, the sum of all E(A) has to equal the total

number of web pages. In the course of an iterative computation, BadRank is not only transfered

to pages which link to spam pages. In fact, BadRank is able to identify regions of the web where

spam tends to occur relatively often, just as PageRank identifies regions of the web which are of

general importance.

Of course, BadRank and PageRank have significant differences, especially, because of using

outbound and inbound links, respectively. Our example shows a simple, hierarchically

structured website that reflects common link structures pretty well. Each page links to every

page which is on a higher hierachical level and on its branch of the website's tree structure.

Each page links to pages which are arranged hierarchically directly below them and,

additionally, pages on the same branch and the same hierarchical level link to each other.



The following table shows the distribution of inbound and outbound links for the hierarchical

levels of such a site.



Level inbound Links outbound Links

062

144

223



As to be expected, regarding inbound links, a hierarchical gradation from the index page

downwards takes place. In contrast, we find the highest number of outbound links on the

website's mid-level. We can see similar results, when we add another level of pages to our

website while the above described linking rules stay the same.



Level inbound Links outbound Links

0 14 2

184

245

324



Again, there is a concentration of outbound links on the website's mid-level. But most of all, the

outbound links are much more evenly distributed than the inbound links.

If we assign a value of 100 to the index page's E(A) in our original example, while all other

values E equal 1 and if the damping factor d is 0.85, we get the following BadRank values:



Page BadRank

A 22.39

B/C 17.39

D/E/F/G 12.21



First of all, we see that the BadRank distributes from the index page among all other pages of

the website. The combination of PageRank and BadRank will be discussed in detail below, but,

no matter how the combination will be realized, it is obvious that both can neutralize each other

very well. After all, we can assume that also the page's PageRank decreases, the lower the

hierarchy level is, so that a PR0 can easily be achieved for all pages.

If we now assume that the hierarchically inferior page G links to a page X with a constant

BadRank BR(X)=10, whereby the link from page G is the only inbound link for page X, and if all

values E for our example website equal 1, we get, at a damping factor d of 0.85, the following

values:



Page BadRank

A 4.82

B 7.50

C 14.50

D 4.22

E 4.22

F 11.22

G 17.18



In this case, we see that the distribution of the BadRank is less homogeneous than in the first

scenario. Non the less, a distribution of BadRank among all pages of the website takes place.

Indeed, the relatively low BadRank of the index page A is remarkable. It could be a problem to

neutralize its PageRank which should be higher compared to the rest of the pages. This effect is

not really desirable but it reflects the experiences of numerous webmasters. Quite often, we can

see the phenomenom that all pages except for the index page of a website show a PR0 in the

Google Toolbar, whereby the index page often has a Toolbar PageRank between 2 and 4.

Therefore, we can probably assume that this special variant of PR0 is not caused by the

detection of the according website by a spam filter, but the site rather received a penalty for

"linking to bad neighbourhoods". Indeed, it is also possible that this variant of PR0 occurs when

only hierarchical inferior pages of a website get trapped in a spam filter.





The Combination of PageRank and BadRank to PR0

If we assume that BadRank exists in the form presented here, there is now the question in

which way BadRank and PageRank can be combined, in order to penalize as much spammers

as possible while at the same time penalizing as few innocent webmasters as possible.

Intuitively, implementing BadRank directly in the actual PageRank computations seems to make

sense. For instance, it is possible to calculate BadRank first and, then, divide a page's

PageRank through its BadRank each time in the course of the iterative calculation of PageRank.

This would have the advantage, that a page with a high BadRank could pass on just a little

PageRank or none at all to the pages it links to. After all, one can argue that if one page links to

a suspect page, all the other links on that page may also be suspect.



Indeed, such a direct connection between PageRank and BadRank is very risky. Most of all, the

actual influence of BadRank on PageRank cannot be estimated in advance. It is to be

considered that we would create a lot of pages which cannot pass on PageRank to the pages

they link to. In fact, these pages are dangling links, and as it has been discussed in the section

on outbound links, it is absolutely necessary to avoid dangling links while computing PageRank.



So, it would be advisable to have separate iterative calculations for PageRank and BadRank.

Combining them afterwards can, for instance, be based on simple arithmetical operations. In

principle, a subtraction would have the desirable consequence that relatively small BadRank

values can hardly have a large influence on relatively high PageRank values. But, there would

certainly be a problem to achieve PR0 for a large number of pages by using the subtraction. We

would rather see a PageRank devaluation for many pages.



Achieving the effects that we know as PR0 seems easier to be realized by dividing PageRank

through BadRank. But this would imply that BadRank receives an extremely high importance.

However, since the average BadRank equals 1, a big part of BadRank values is smaller than 1

and, so, a normalization is necessary. Probably, normalizing and scaling BadRank to values

between 0 and 1 so that "good" pages have values close to 1, and "bad" pages have values

close to 0 and, subsequently, multiplying these values with PageRank would supply the best

results.

A very effective and easy to realize alternative would probably be a simple stepped evaluation

of PageRank and BadRank. It would be reasonable that if BadRank exceeds a certain value it

will always lead to a PR0. The same could happen when the relation of PageRank to BadRank

is below a certain value. Additionally, it would make sense that if BadRank and/or the relation of

BadRank to PageRank is below a certain value, BadRank takes no influence at all.



Only if none of these cases occurs, an actual combination of PageRank and BadRank - for

instance by dividing PageRank through BadRank - would be necessary. In this way, all

unwanted effects could be avoided.



A Critical View on BadRank and PR0

How Google would realize the combination of PageRank and BadRank is of rather minor

importance. Indeed, a separate computation and a subsequent combination of both has the

consequence that it may not be possible to see the actual effect of a high BadRank by looking

at the Toolbar. If a page has a high PageRank in the original sense, the influence of its

BadRank can be negligible. But if another page links to it, this could have quite serious

consequences.



An even bigger problem is the direct reversion of the PageRank algorithm as we have

presented it here: Just as an additional inbound for one page can do nothing but increasing this

page's PageRank, an additional outbound link can only increase its BadRank. This is because

of the addition of BadRank values in the BadRank formula. So, it does not matter how many

"good" outbound links a page has - one link to a spam page can be enough to lead to a PR0.



Indeed, this problem may appear in exceptional cases only. By our direct reversion of the

PageRank algorithm, the BadRank of a page is divided by its inbound links and single links to

pages with high BadRank transfer only a part of that BadRank in each case. Google's Matt

Cutts' remark on this issue is: "If someone accidentally does a link to a bad site, that may not

hurt them, but if they do twenty, that's a problem." (searchenginewatch.com/sereport/02/11-

searchking.html)



However, as long as all links are weighted uniformly within the BadRank computation, there is

another problem. If two pages differ widely in PageRank and both have a link to the same page

with a high BadRank, this may lead to the page with the higher PageRank suffering far less from

the transferred BadRank than the page with the low PageRank. We have to hope that Google

knows how to deal with such problems. Nevertheless it shall be noted that, regarding the

procedure presented here, outbound links can do nothing but harm.



Of course, all statements regarding how PR0 works are pure speculation. But in principle, the

analysis of link structures similarly to the PageRank technique should be the way how only

Google understands to deal with spam.



Related docs
Other docs by Shoaib Ahmed
Google
Views: 2  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!