Articles :: Google's Patent - Information Retrieval Based
on Historical Data
This report has been prepared to help SEOs understand the concepts and practical applications
contained in Google's US Patent Application #20050071741 - Information Retrieval Based on
Historical Data . My own advice and interpretation is offered throughout this paper - please
conduct your own research before acting on the recommendations.
Sections in this Report:
I. Overview of the 5 Most Critical Concepts from this Paper
Google's Concept of "Document Inception"
How Changing Content can Affect Rankings
Spam Detection & Punishment
What Google is Attempting to Measure
The Impact of this Patent
II. Analysis and Interpretation of 63 Patent Components
History Data (1)
Inception Date (4)
Frequency of Document Changes over Time (6)
Amount of Changes over Time (3)
Click-Through Rate Data (2)
Document Association to Search Terms (1)
Queries that Remain the Same but have New Meanings over Time
Staleness of Documents (3)
Link Behavior (4)
Freshness of Links (4)
Anchor Text Changes over Time (1)
Content Changes in a Document compared to Linking Anchor Text
Freshness of Anchor Text (2)
Traffic Characteristics of Site/Page (2)
User Behavior (2)
Domain Related Information (3)
Prior Rankings Data (4)
User Maintained Data (3)
Growth Profiles of Anchor Text (1)
Linkage of Independent Peers (1)
Document Topics (1)
Identifying Relevant Documents (1)
Plurality of History Data (1)
History Component (1)
Ranking of Linked Documents (10)
III. Documentation on Description Elements
Document Inception Date
Domain Related Information
User Maintained/Generated Data
Unique Words, Bigrams, Phrases in Anchor Text
Linkage of Independent Peers
IV. List of Additional Coverage & Resources
Overview of the 5 Most Critical Concepts from this Paper
These 5 concepts are what I believe to be the most ground-breaking and important for search
engine optimization professionals to understand in order to best conduct their work.
1. Google's Concept of "Document Inception"
The date of "document inception", which can refer to either a website as a whole or a single page,
is used in many different areas by Google. This data can come from the registration info, the date
Google first found a link to the site/page, or the date it first found the site/page itself. Google
will be using this data to rank documents and establish credibility and relevance.
2. How Changing Content can Affect Rankings
Changing content over time has a huge impact on Google's measures, according to this patent.
They use changes to determine the "freshness" or "staleness" of websites and pages, and how that
data impacts the value of the links on the page as well as its rankings. They'll also measure large,
"real" content changes vs. superficial changes and rank based on that data.
Google also says that for some types of queries, particular results are more valuable: stale
results may be desirable for information that doesn't need updating, fresh content is good for
results that require it, and seasonal results may pop up or down in the rankings based on the time
of year.
3. Spam Detection & Punishment
Google is employing many new systems of spam detection and prevention according to the
patent. These include:
Watching for sites that rise in the rankings too quickly
Watching for registration information, IP addresses, name servers, hosts, etc. that are on
their "bad list"
Growth of off-topic links
Speed of link gain
Percentage of similar anchor text
Topic/Subject shifts or additions
4. What Google is Attempting to Measure
Google wants to measure or is attempting to actively measure each of the following:
o Registration date
o Length of renewal (10 years, 5 years, 1 year, etc)
o Addresses and Names of admin & technical contacts
o DNS Records
o Address of Name Servers
o Hosting Location & Company
o Stability of this data
Information on User Behavior Online
o CTR (Click-Through Rate) of individual results in the SERPs
o Length of time spent on a given site/page
Data contained on your computer
o Favorites/Bookmarks List
o Cache & Temp Files
o Frequency of visits to particular sites/pages (history)
5. The Impact of this Patent
I believe that this patent will help to verify most of the theories surrounding Google's rankings.
There has been speculation over the past 18-24 months on nearly every subject covered in this
patent at the major SEO forums, but this will serve as verification.
Although it is long, I urge every SEO/Webmaster to read this page completely. I have attempted
to make the information legible and readable, and only pulled out parts that are important to the
active practice of SEO (which was almost 2/3 of the document, surprisingly). If you have any
questions or corrections on this summary, please send me an email.
Analysis & Interpretation of the 63 Patent Components
History Data
1. Documents may be scored in Google's rankings based on "one or more types of history data".
Inception Date
2. The "inception date" (read: registration date) may be considered as a scoring factor (I assume
that older will be considered better, but this is not spelled out).
3. Google may determine how old each of the pages on a given website is and then determine the
average age of pages on the website as a whole. The difference between a specific page's age and
the average age of all documents on the site will be used in the ranking score.
4. The score for a website may include the amount of time since "document inception" - i.e. how
old the website is.
5. One methodology of discovering site age might include when Google first "discovered" (read:
spidered) the site, when Google first finds a link to the site, and when the site contains a
"predetermined number of pages". I interpret this to mean that Google has some kind of
threshold for site size (number of pages) that when reached, triggers a scoring effect (probably
Frequency of Document Changes over Time
6. Google's scoring will (according to the patent) be based on "determining a frequency at which
the content changes over time".
7. The "frequency at which the content changes" will be determined by the average time between
changes, the number of changes over a particular time period, and the rate of change of one time
period vs. the rate of change for another time period. So, if you are updating your website every
day, then switch to updating once a week, your scoring in the historical measurements at Google
8. Scoring will also include how much of the site has changed over a given time period (new
pages, changes, etc.).
9. The scoring based on changes (described in #8) will be determined by the number of new
pages within a time period, the ratio of new pages vs. old pages and the total "percentage of the
content of the document that has changed during a timed period."
10. The scoring of changes (from #8) will be based on the "perceived importance of the portions"
that have been changed. The score will also take into account the changes as compared to the
weighting(s) of each of the different pages of the site - i.e. if important pages change, it will have
a different impact than if unimportant pages changed. My guess is that importance is mostly
determined by links (both internal and external) that point to a given page. So if your contact
page changes, it's not a big deal, but if your home page changes, that's a bigger deal.
11. The scoring for a "plurality of documents" - many pages in a given website - includes
determining the last date of change for each page, determining the average date of change, and
scoring the documents based on, "at least in part", the difference between a specific page's
change and the average document's change. So, if one page had new information added, it would
be scored differently than the other pages, while if all the pages changed together (maybe a new
date, or new link or copyright in the footer, etc.), they would all be equal (since their date of
change compared to the average is the same).
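As a rough illustration of #11, the comparison between each page's last change date and the site-wide average can be sketched in Python. The function name and the use of day offsets are assumptions made for illustration; the patent does not say how the difference is computed or weighted.

```python
from datetime import date


def change_date_offsets(last_changed: dict[str, date]) -> dict[str, float]:
    """Return, for each page, how many days its last change sits after (+)
    or before (-) the average last-change date across the site."""
    if not last_changed:
        raise ValueError("need at least one page")
    # Convert dates to day numbers so they can be averaged.
    ordinals = {url: d.toordinal() for url, d in last_changed.items()}
    mean = sum(ordinals.values()) / len(ordinals)
    return {url: day - mean for url, day in ordinals.items()}


# Hypothetical site: one page received new information, the others did not.
site = {
    "/home": date(2005, 1, 1),
    "/contact": date(2005, 1, 1),
    "/news": date(2005, 1, 31),
}
for url, offset in change_date_offsets(site).items():
    print(url, offset)
```

If every page changes on the same day (a footer update, for instance), every offset is zero, which matches the observation above that such pages would all be treated equally.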
Amount of Changes over Time
12. Google's score may also include a measure of the amount of content which changes over time
on the given website.
13. The "amount of content changes" from #12 will be determined by the ratio of new pages vs.
the total number of pages on the site, and the percentage of content change over a given time
period.
14. The "changes over a given time" from #13 will be scored based on "weighting different
portions of the content differently based on a perceived importance" - once again, I read this as
internal and external links to a page - the more links, the more "perceived importance".
Click-Through Rate Data
15. The "history data" from #1 could include information on "how often the document is selected
when the document is included in a set of search results". This is literally tracking clickthroughs
and rewarding those sites with higher CTR - just like AdWords does. Google will be scoring
based on the "extent to which the document is selected over time... when included in a set of
search results". We always assumed this to be true, but this is the first hard evidence I've seen
directly from the horse's mouth.
16. Google may assign a "higher score" when the document is selected more often. No-brainer.
Document Association to Search Terms
17. Google might be scoring based on "determining whether a document (that has been showing
up in the search results) is associated with the search terms".
Queries that Remain the Same but have New Meanings over Time
18. Google (according to the patent) calculates whether the "information relating to queries"
remains the same or changes and scores documents based on this. For example, prior to
September 11, the phrase 9-11 would not be associated with terrorism; afterwards, it would be.
Google will score documents based on the changes in the results for a given query to keep up
with the times.
Staleness of Documents
19. The "staleness of documents" might be calculated as part of Google's scoring.
20. Google may also determine whether "stale documents" are preferable for certain types of
queries (those that don't change over time, or for which a specific, single answer is what's
21. The "favorability" of stale documents may be determined by how often they are clicked on in
the search results (over other documents). I relate this to a Wikipedia article on the nature of
volcanoes - it doesn't need too much updating and will be a good relevant source for a long time
for the query - "nature of volcanoes".
Link Behavior
22. History data scores might also consider the "behavior of links over time".
23. The appearance and disappearance of links figure into the scoring for link behavior (from
#22).
24. The appearance/disappearance of links are dated by Google and used in the scoring.
25. The link appearances/disappearances are monitored and Google measures "how many links...
appear or disappear during a time period, and whether there is a trend" toward more links or
fewer links. The temporal (time-based) nature of groups of links will be scored by Google.
Freshness of Links
26. Google may use the "freshness of links" and assign weights to links based on freshness.
27. The "freshness" of a link (from #26) is calculated by the date of appearance of that link, the
date of any changes in the link or anchor text, the date which the page and site that the link is
from appeared and the date of the links to that linking page. So, if you have a new blog entry that
points to a new site, the freshness will be super-fresh, since the page is new, the link to the page
is new, the blog page that links to it is new, and the link to your blog entry on your own site is
new (that's a lot of new, hence it's super-fresh).
28. The weight of a link also takes into account how "trusted" the site is, how authoritative the
page with the link on it is and how "fresh" the page & site containing the link are.
29. The scoring also takes into account the "age distribution associated with the links based on
the ages of the links". Google will take into account the age of the links to your page, and the
time periods over which you got the links, i.e. lots of new links, a wide distribution over time,
most links from a long time ago, etc.
Anchor Text Changes over Time
30. Google may also calculate changes in anchor text over time and use this data to score. My
guess is that anchor text doesn't change very often, but they're certainly free to measure it.
Content Changes in a Document compared to Linking Anchor Text
31. Google might also measure if the content of a document changes, but the anchor text remains
the same, or vice-versa. They're trying to protect against the anchor text "bait and switch" that
makes a document look relevant to the anchor text, then replaces it with something else.
Freshness of Anchor Text
32. Freshness of anchor text can be considered.
33. Freshness of anchor text is calculated by "date of appearance", "date of change", and the
dates of change and appearance of the page the link is on.
Traffic Characteristics of Site/Page
34. Traffic characteristics associated with a page/site may be taken into account in scoring.
35. The traffic pattern will have associated analysis that might feed into Google's score. So
Google must be measuring traffic to a site/page and determining if, over time, it increases,
decreases, etc. - they're seeking trends on which to base scoring.
User Behavior
36. User behavior regarding a particular page/site may figure into the scoring.
37. Google says that user behavior (from #36) is basically just the percentage of the time users
click on a site/page when it is listed in the search results pages, along with the amount of time
that users spend "accessing the document". I guess we all need to keep up the amount of time
people spend on our sites.
Domain Related Information
38. The scoring might also include the sites associated with a given site and the "domain-related"
information. This is defined in greater detail below.
39. Associated sites (from #38) are measured in terms of "legitimacy", which I interpret to mean
non-spam, different owner, etc. Google says, specifically "scoring the document based... on
whether the domain associated with the document is legitimate."
40. The "expiration date of the domain", the "domain name server record" and the "name server
associated with the domain" are all parts of how Google will establish the legitimacy of an
Prior Rankings Data
41. History data scores could also take into account "information relating to a prior ranking".
This means Google will be storing information about previous rankings for a site and using them
to base scores on.
42. Google may also calculate where in the previous rankings the site was and how it moved
around as pieces to figure into the scoring data.
43. In reference to #41, Google is using seasonal, "burstiness" and changes in scores over time as
metrics to calculate the prior rankings scoring. So if a site is particularly relevant for "gifts for
girlfriend" around Valentine's Day, but not as much for the same query at Christmas, Google will
record this information and rank accordingly.
44. Google could also, with regard to #41, record "spikes in the rank" of site/pages in the search
User Maintained Data
45. "User maintained data" may also be recorded and monitored for the rankings scores.
46. "User maintained data" includes favorites lists, bookmarks, temp files and cache files of
monitored users. I'm not sure how they could obtain this data without installing "Google
Spyware" - perhaps in the form of desktop search or the Google toolbar.
47. Monitoring the rate at which a site/page "is added to or removed from user generated data"
may be used in the scoring.
Growth Profiles of Anchor Text
48. Scores might include "growth profiles of anchor text" - Google could monitor the use of
anchor text in large groups and where/when they point to different sites & pages.
Linkage of Independent Peers
49. Information "relating to linkage of independent peers" might be added to scoring by
"determining the growth in a number of independent peers that include the document". Google
will basically be monitoring sites that are not in your subject category and how they link to you
(I assumed they meant non-related subject peers, but they actually mean off-topic sites; see -
Linkage of Independent Peers, below).
Document Topics
50. "Document topics" may be included in the scoring; this includes using "topic extraction". I
assume this is determined by Google's text mining and analysis of the actual words on the page.
Identifying Relevant Documents
51. Relevance of documents to a given search query may be part of the scoring system. This is
just Google's way of saying that documents about "pink dogs" will be part of those analyzed by
the ranking algorithm when a user queries "pink dogs".
Plurality of History Data
52. Google might also use "means for obtaining a plurality of types of history data associated
with the document" to score sites/pages. This just means that they will use a methodology that
groups all of the bits of historical information into the rankings together to determine scoring.
History Component
53. "History data" can be measured by Google and used in the rankings. I'm not sure to what
they're referring here - the entire quote is: "A system for scoring a document, comprising: a
history component configured to obtain one or more types of history data associated with a
document; and a ranking component configured to: generate a score for the document based, at
least in part, on the one or more types of history data."
Ranking of Linked Documents
54. Google may be measuring the documents you link to and scoring based "on a decaying
function of the age of the linkage data". So, fresher links vs. stale links will be taken into account
(although whether there is a positive or negative effect associated with this is unknown).
55. For #54, Google says the "linkage data includes at least one link." So, they won't be
measuring linkage data for pages with no links.
56. For #54, Google may include the anchor text in the linkage data.
57. For #54, Google says the "linkage data includes a rank based... on links and anchor text
provided by one or more linking documents." Google is simply saying that linkage data includes the
anchor text and other info about the links coming to a page.
58. Google can use the "longevity of the linkage data" and determine from that an adjustment of
the rankings based on the changes, stability & age of the linkage data. They explain below how
they score this.
59. Google will be "penalizing the ranking if the longevity indicates a short life for the linkage
data and boosting the ranking if the longevity indicates a long life for the linkage data." Google
is, in effect, explaining a little of what we call "sandboxing" - they're saying that the older a link
is, the more value it has, while new links have relatively lower value. This doesn't completely
explain the effect, as many sites rank well quickly, etc. - but, it is an explanation for the
60. Google can adjust scoring by penalizing for linking documents they consider "stale" over a
period of time and boost scoring if the content is frequently updated. So, it's better to be linked to
on a page that frequently updates its content.
61. "Link churn" may be measured (explained in #62) and scoring adjusted based on this.
62. "Link churn" is "computed as a function of an extent to which one or more links provided by
the document changes over time". Once again, Google is referring to the changes in where links
point, their anchor text, etc. on a given page. More changes means more "link churn".
63. "Link churn" might create a penalization if it is above a certain threshold. So, if your links
are changing all the time, the link will not be as valuable. This would shut down the methods
used by the popular "Traffic Power/1p" spam company.
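To make #61-#63 more concrete, here is a minimal Python sketch of one way "link churn" could be quantified. The patent only says churn is a function of how much a document's links change over time; the specific measure used here (share of link/anchor pairs that differ between two snapshots) and the threshold value are assumptions for illustration only.

```python
Link = tuple[str, str]  # (target URL, anchor text)

CHURN_THRESHOLD = 0.5  # hypothetical cutoff, not a value from the patent


def link_churn(before: set[Link], after: set[Link]) -> float:
    """Share of links that were added, removed, or re-anchored between
    two snapshots of the same page (0.0 = unchanged, 1.0 = all replaced)."""
    union = before | after
    if not union:
        return 0.0
    return len(before ^ after) / len(union)


def exceeds_churn_threshold(before: set[Link], after: set[Link],
                            threshold: float = CHURN_THRESHOLD) -> bool:
    """True when churn is above the (assumed) penalty threshold."""
    return link_churn(before, after) > threshold


# Hypothetical snapshots of one page's outbound links.
last_month = {("a.example", "widgets"), ("b.example", "gadgets"), ("c.example", "tools")}
this_month = {("a.example", "widgets"), ("b.example", "gadgets"), ("d.example", "toys")}
print(link_churn(last_month, this_month))
```

A page whose outbound links are swapped out wholesale between snapshots would score close to 1.0 under this measure, which is the rotating-link pattern that #63 describes as a candidate for penalization.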
Background of the Invention:
This is designed for IR (Information Retrieval) Systems and specifically to the methods used to
generate search results.
Description of Related Art:
This information is largely irrelevant, but one important quote is: "There are several factors that
may affect the quality of the results generated by a search engine. For example, some web site
producers use spamming techniques to artificially inflate their rank. Also, "stale" documents (i.e.,
those documents that have not been updated for a period of time and, thus, contain stale data)
may be ranked higher than "fresher" documents (i.e., those documents that have been more
recently updated and, thus, contain more recent data). In some particular contexts, the higher
ranking stale documents degrade the search results. Thus, there remains a need to improve the
quality of results generated by search engines."
Summary of the Invention:
Google says "history data associated with the documents" may be used to score them in the
search results. The invention provides a "method for scoring a document" and it "may include
determining the age of linkage data associated with a linked document and ranking the linked
document based on a decaying function of the age of the linkage data."
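The patent text quoted above does not say which decaying function is used. As a sketch only, the following Python example assumes an exponential decay with a configurable half-life; the function names, the half-life default, and the idea of summing per-link weights are all illustrative assumptions, not details from the patent.

```python
def decayed_link_weight(base_weight: float, age_days: float,
                        half_life_days: float = 365.0) -> float:
    """Weight of a link after applying an (assumed) exponential decay:
    the weight halves every `half_life_days`."""
    if age_days < 0 or half_life_days <= 0:
        raise ValueError("age must be >= 0 and half-life must be > 0")
    return base_weight * 0.5 ** (age_days / half_life_days)


def linked_document_score(links: list[tuple[float, float]]) -> float:
    """Sum the decayed weights of (base_weight, age_days) pairs."""
    return sum(decayed_link_weight(weight, age) for weight, age in links)


# Hypothetical document with one brand-new link and one year-old link.
print(linked_document_score([(1.0, 0), (1.0, 365)]))
```

Whether older linkage data ends up helping or hurting a document depends on how such a function is combined with the longevity adjustments the patent also describes, so the direction of the effect here should not be read as Google's.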
Brief Description of the Drawings:
The drawings are all exceptionally simple charts showing the process for examination. A PDF with
the charts at the bottom is available at http://files.bighosting.net/tr19070.pdf
Exemplary History Data:
This is the canonical and expository section of the patent description. It contains examples and
explanations of many of the most important parts of this study, including detailed descriptions
for many of the 63 components.
Document Inception Date
Google notes that the "date" label is used broadly and may include many time & date
measurements. Google describes several of the techniques used to obtain an "inception date" and
mentions that some techniques are "biased" because they can be influenced by a 3rd party.
The first technique used is when Google learns of or indexes the document - either by finding a
link to the site/page, or following it. A second technique uses the registration date of the URL or
the first time it was referenced in a "news article, newsgroup, mailing list" or combination of
these types of documents.
The patent mentions that Google assumes that a "fairly recent inception date will not have a
significant number of links from other documents." However, they say that the document's
rankings can be adjusted accordingly based on how well it is doing in terms of links with
consideration for its age.
Google is also wary of spam and uses the following example (which is already being quoted
around the web):
"Consider the example of a document with an inception date of yesterday that is referenced by
10 back links. This document may be scored higher by (Google) than a document with an
inception date of 10 years ago that is referenced by 100 back links because the rate of link
growth for the former is relatively higher than the latter. While a spiky rate of growth in the
number of back links may be a factor used by (Google) to score documents, it may also signal an
attempt to spam search engine 125. Accordingly, in this situation, (Google) may actually lower
the score of a document(s) to reduce the effect of spamming."
Google might also use the date of inception as a method for measuring the "rate at which links to
the document are created". They say that "this rate can then be used to score the document, for
example, giving more weight to documents to which links are generated more often."
The patent goes so far as to provide a formula for link-based score modification:
H = L / log(F + 2)
where:
H = history-adjusted link score
L = link score given to the document, which can be derived using any known link scoring
technique that assigns a score to a document based on links to/from the document
F = elapsed time measured from the inception date associated with the document (or a window
within this period).
The result of this formula would be that on the day of inception (F = 0), L will be divided by
log(2) = 0.301 (using a base-10 logarithm) - the equivalent of multiplying L by 3.32. After 10
days (or any other unit of time), the formula will divide L by log(12) = 1.079, making H smaller
and smaller as time goes on.
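The divisors mentioned above (0.301 at inception and 1.079 after 10 units of time) are consistent with H = L / log(F + 2) using a base-10 logarithm. A minimal Python sketch under that assumption shows how the adjusted score shrinks as a document ages; the function name is hypothetical.

```python
import math


def history_adjusted_link_score(link_score: float, elapsed: float) -> float:
    """H = L / log10(F + 2), where F is the time elapsed since inception.

    The base-10 logarithm is inferred from the 0.301 and 1.079 divisors
    quoted in the text above.
    """
    if elapsed < 0:
        raise ValueError("elapsed time must be non-negative")
    return link_score / math.log10(elapsed + 2)


# Same raw link score L = 1.0 at increasing document ages.
for elapsed in (0, 10, 100, 1000):
    print(elapsed, round(history_adjusted_link_score(1.0, elapsed), 3))
```

Because the logarithm grows slowly, most of the adjustment happens early: the score drops sharply over the first few units of time and much more gradually after that.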
The patent then suggests that "for some queries, older documents may be more favorable than
newer ones" and that, as a result, Google may "adjust the score of a document based on the
difference (in age) from the average age of the result set". This would push certain pages up or
down in the rankings depending on their age and the age of their competition.
Google says that a "document's content changes over time may be used to generate/alter a score
associated with that document." They again offer a formula for calculating this:
U = f(UF, UA)
where:
U = update score
f = a function, such as a sum or weighted sum
UF = update frequency score that represents how often a document (or page) is updated
UA = update amount score that represents how much the document (or page) has changed over
time
Google notes that UA can also be determined as:
The number of "new" or unique pages associated with a document over a period of time
The ratio of the number of new or unique pages associated with a document over a period
of time versus the total number of pages associated with that document
The amount that the document is updated over one or more periods of time (e.g., n % of a
document's visible content may change over a period t (e.g., last m months)), which
might be an average value
The amount that the document (or page) has changed in one or more periods of time (e.g.,
within the last x days)
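As a sketch of how an update frequency score (UF) and an update amount score (UA) might be combined, the Python example below takes f to be a weighted sum and UA to be the ratio of new pages to total pages, two of the options listed above. The weights, the 0-to-1 scale for UF, and the function names are assumptions for illustration.

```python
def update_amount(new_pages: int, total_pages: int) -> float:
    """UA as the ratio of new pages to all pages associated with a document."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return new_pages / total_pages


def update_score(uf: float, ua: float,
                 w_freq: float = 0.5, w_amount: float = 0.5) -> float:
    """Combine UF and UA with f chosen as a weighted sum.

    The equal default weights are arbitrary placeholders.
    """
    return w_freq * uf + w_amount * ua


# Hypothetical site: 5 new pages out of 20, with an assumed UF of 0.8.
ua = update_amount(new_pages=5, total_pages=20)
print(update_score(uf=0.8, ua=ua))
```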
UA could also be determined with different pieces of the content weighted differently, helping to
eliminate changes that are cosmetic or insubstantial. Google mentions:
They also identify some important areas where content changes might necessitate greater weight:
Anchor text of forward links
Google also mentions the use of trend analysis in the changes of a site/page by comparing an
acceleration or deceleration of the rate of change (amount of new content, etc.). Google notes
that maintaining all of this information may be too intensive for practical data storage and
proposes measuring only large changes and storing "term vectors" only or "a small portion" of a
page "determined to be important".
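The trend analysis described above (is the rate of change speeding up or slowing down?) can be sketched by counting changes in two consecutive time windows and comparing them. This Python example is only an illustration; the window length and function names are assumptions, and the patent does not prescribe this particular comparison.

```python
from datetime import date, timedelta


def changes_in_window(change_dates: list[date], end: date, days: int) -> int:
    """Count changes in the half-open window (end - days, end]."""
    start = end - timedelta(days=days)
    return sum(1 for d in change_dates if start < d <= end)


def change_rate_trend(change_dates: list[date], today: date,
                      window_days: int = 30) -> int:
    """Changes in the most recent window minus changes in the window
    before it. Positive = accelerating, negative = decelerating."""
    recent = changes_in_window(change_dates, today, window_days)
    previous = changes_in_window(
        change_dates, today - timedelta(days=window_days), window_days)
    return recent - previous


# Hypothetical site that was updated daily in May, then weekly in June.
history = [date(2005, 5, d) for d in range(2, 32)]
history += [date(2005, 6, d) for d in (7, 14, 21, 28)]
print(change_rate_trend(history, today=date(2005, 6, 30)))
```

A site that drops from daily to weekly updates produces a strongly negative trend here, which is the kind of deceleration the patent suggests could be measured.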
The patent notes that Google may, on occasion, prefer stale documents for certain types of
queries. They may also create an average age of change and adjust the scoring for documents
based on their relation to the average (if more stale or more fresh content is desired).
This technique describes several phenomena that can influence rankings:
Clicks on a site/page in the SERPs can be used to rank it higher or lower - those clicked
more often, move higher in the rankings (so make sure your title & description are good)
If a particular search term is increasingly associated with particular subjects, the pages on
those subjects would rank higher for that query. For example, the meaning of the word
"soap" was increasingly associated with Simple Object Access Protocol, rather than a
cleansing agent, so pages on those subjects rose in the results.
The number of search results for a particular term is measured to check for "hot topics" or
"breaking news" to help Google follow or become aware of trends. An example might be
the recent Tsunami in East Asia, where thousands of pages popped up overnight on the
Google also measures search queries whose answers or relevance changes over time.
They use the example of "World Series Champion" which would be different after each
"Staleness" can be a deciding factor in the rankings. Google will use user clicks and
traffic to decide if "stale" results are relevant for a particular query or not and rank
accordingly. Google says it measures "staleness" by:
o Creation Date
o Anchor Growth
o Content Changes
o Forward/Back link growth
Link Based Criteria
Google can measure various linking based factors including:
The dates new links appear to a site/page
Dates that links or pages linking to a site/page disappeared
The time-varying behavior of links to a page and any possible "trends" that are indicated
by this, i.e. is the site gaining links overall or losing them? A downward trend might
indicate "staleness", while an upward trend would indicate "freshness".
Google may check the number of new links to a document over a given time period
compared to the new links the document has received since it was first found. They'll also
use the "oldest age of the most recent y% of links compared to the age of the first link
Google gives an example in the patent of two websites that were both found 100 days
o Site #1 - 10% of the links were found less than 10 days ago
o Site #2 - 0% of the links were found less than 10 days ago
o This data might be used to "predict if a particular distribution signifies a
particular type of site (e.g., a site that is no longer updated, increasing or
decreasing in popularity, superceded, etc.)"
Freshness weights assigned to a link can also be used to rank sites/pages. Several factors
can influence link freshness:
o Date of appearance
o Date of change of anchor text
o Date of change of the page the link is on
o Date of appearance of page the link is on
o Google says they theorize that a page that is updated (significantly) while the link
remains the same is a good indicator of a "relevant and good" link.
Other weights for links include:
o How trusted the links are (they specifically mention government documents as
being assigned higher trust)
o How authoritative the websites and pages linking to the page are
o Freshness of the page/site - they mention the Yahoo! homepage as one where
links frequently appear and disappear.
o The "sum of the weight of the links" pointing to a page/site may be used to raise
or lower the scoring in the rankings. Google will measure the freshness of the
page based on the freshness of the links to it and the freshness of the pages which
the links are on.
o Age distribution over time will also be measured, i.e. a site/page will be compared
against all of its links over time and when it received them.
Google may use link date appearance to "detect spam", "where owners of documents or
their colleagues create links to their own document for the purpose of boosting the score
assigned by a search engine". Google says that legitimate sites/pages "attract back links
slowly" and that a "large spike in the quantity of back links" may signal either a "topical
phenomenon" or "attempts to spam a search engine."
o Google gives the example of the CDC website after the outbreak of SARS as an
example of a "topical phenomenon".
o Google gives 3 examples of link spam techniques - "exchanging links",
"purchasing links" or "gaining links from documents without editorial discretion
on making links".
o Google also gives examples of "documents that give links without editorial
discretion" - including guest books, referrer logs and "free for all pages that let
anyone add a link to a document."
A decrease over time in the number of links a document has can be used to indicate
irrelevance, and Google notes that it will discount the links from these "stale" documents.
The "dynamic-ness" of links will also be measured and scored, based on how consistently
links are given to a particular page. They use the example of "featured link" of the day
and note that they'll use a page score based on the pages that link to the page, "for all
versions of the documents within a window of time."
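The two-site example in the list above (one site with 10% of its links found in the last 10 days, the other with 0%) amounts to summarizing the age distribution of a document's links. A minimal Python sketch of that summary follows; the link ages are made-up illustration data and the function name is hypothetical.

```python
def recent_link_fraction(link_ages_days: list[float],
                         window_days: float = 10) -> float:
    """Fraction of links that were first seen less than `window_days` ago."""
    if not link_ages_days:
        return 0.0
    recent = sum(1 for age in link_ages_days if age < window_days)
    return recent / len(link_ages_days)


# Hypothetical link ages (in days) for two sites that were both found 100 days ago.
site_1 = [5, 20, 30, 40, 50, 60, 70, 80, 90, 95]    # one link is under 10 days old
site_2 = [20, 30, 40, 50, 60, 70, 80, 90, 95, 99]   # no link is under 10 days old

print(recent_link_fraction(site_1))
print(recent_link_fraction(site_2))
```

A fraction near zero would fit a site that is no longer attracting links, while a very high fraction on an old site could point to either a topical phenomenon or the kind of link spike the patent treats as a possible spam signal.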
Google can use anchor text measurements to determine ranking scores:
Anchor text changes over time might be used to indicate "an update or change of focus"
on a site/page.
Anchor text that is no longer relevant or on-topic with the site/page it links to may be
tracked and discounted if necessary. Large document changes will result in Google
checking the anchor text to see if the subject matter is still the same as the anchor text.
Freshness of anchor text can be calculated. It can be determined by:
o Date of appearance/change of the anchor text
o Date of appearance/change of the linked to page
o Date of appearance/change of the page with the link on it
o Google notes that the date of appearance/change of the page with the link on it
makes the link and anchor text more "relevant and good"
Google can measure traffic levels to a page/site as part of their ranking scores.
A "large reduction in traffic may indicate that a document may be stale"
Google may compare the average traffic for a page/site over the past "j days" (as an
example j=30) to the average traffic over the last year to see if the page/site is still as
relevant for the query.
Google might also use seasonality to help determine if a particular site is more/less
relevant for a query during specific times of the year.
Google is going to measure "advertising traffic" for websites:
o "The extent to and rate at which advertisements are presented or updated by a
given document over time"
o The "quality of the advertisers". They note that referrers like Amazon.com will be
given more trust and weight than a "pornographic site's" advertisements.
o The "click-through rate" of the traffic referrals from the pages the ads are on.
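The comparison of recent traffic to yearly traffic can be sketched as follows. The function name traffic_ratio and the use of a simple mean are assumptions for illustration; only the "j days" idea (with j=30 as the example) comes from the patent. Seasonality is not handled here, so a seasonal site would need its ratio compared against the same period in earlier years.

```python
from statistics import mean

def traffic_ratio(daily_visits, j=30):
    """Ratio of average daily traffic over the last `j` days to the
    average over the last year (or all available days, if fewer).

    A ratio well below 1.0 hints that the page may be going stale.
    """
    if len(daily_visits) < j:
        raise ValueError("need at least j days of traffic data")
    last_year = daily_visits[-365:]
    yearly_avg = mean(last_year)
    if yearly_avg == 0:
        return 0.0  # no traffic at all, so nothing to compare against
    return mean(daily_visits[-j:]) / yearly_avg

# Made-up data: 335 steady days followed by a sharp 30-day drop.
visits = [100] * 335 + [20] * 30
print(round(traffic_ratio(visits), 2))
```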
Google may be measuring "aggregate user behavior". This can include:
The "number of times that a document is selected from a set of search results"
The "amount of time one or more users spend accessing the document"
The relative "amount of time" compared to an average that users spend on a particular
page:
o Google uses an example of a swimming schedule page that users typically spent
30 seconds accessing, but have recently spent "a few seconds" accessing.
o Google says this can be an indication for them that the page "contains an outdated
swimming schedule" and they will push down its rank.
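The swimming schedule example can be expressed as a small check. This is a sketch under assumptions: the function name looks_outdated and the min_ratio cutoff are hypothetical, and the patent does not say how large a drop in access time is needed before a page is treated as outdated.

```python
def looks_outdated(historical_seconds, recent_seconds, min_ratio=0.5):
    """Return True when the recent average time spent on a page falls
    below `min_ratio` times the historical average.

    `min_ratio` is an illustrative cutoff, not a value from the patent.
    """
    if not historical_seconds or not recent_seconds:
        raise ValueError("both samples must be non-empty")
    historical_avg = sum(historical_seconds) / len(historical_seconds)
    recent_avg = sum(recent_seconds) / len(recent_seconds)
    return recent_avg < min_ratio * historical_avg

# Users typically spent about 30 seconds, but recently only a few.
print(looks_outdated([30, 28, 32, 30], [3, 4, 2]))
```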
Information associated with a domain can be used by Google to score sites in the rankings. They
mention specific types of "information relating to how a document is hosted within a computer
network (e.g., the Internet, an intranet, etc.)" including:
Doorway and "throwaway" domains - Google says they will use "information regarding
the legitimacy of the domains"
Valuable domains, according to Google, "are often paid for several years in advance",
while the throwaway domains "rarely are used for more than a year."
The DNS records will also be checked to determine legitimacy:
o Who registered the domain
o Admin & technical addresses and contacts
o Address of name servers
o Stability of data (and host company) vs. high number of changes
Google claims they will use "a list of known-bad contact information, name servers,
and/or IP addresses" to predict whether a spammer is running the domain.
Google will also use information regarding a specific name server in similar ways -
o "A "good" name server may have a mix of different domains from different
registrars and have a history of hosting those domains, while a "bad" name server
might host mainly pornography or doorway domains, domains with commercial
words (a common indicator of spam), or primarily bulk domains from a single
registrar, or might be brand new"
Google can measure the history of where a site ranked over time and data associated with this.
Some specifics include:
A site that "jumps in rankings across many queries might be a topical document or it
could signal an attempt to spam search engine"
The "quantity or rate that a document moves in rankings over a period of time might be
used to influence future scores"
Sites can be weighted according to their position in the results, where the top result
receives a higher score and the lower sites receive progressively lower scores. Google
uses the equation [((N+1)-SLOT)/N]:
o Where N=the number of search results measured and SLOT equals the ranking
position of the measured site
o In this equation, the 1st result receives a score of 1.0 and the last result receives a
score close to 0.
Google could check "commercial queries" specifically and documents that gained X% in
the rankings "may be flagged or the percentage growth in ranking may be used" to
determine if the "likelihood of spam is higher".
Google may also monitor:
o "The rate at which (a site/page) is selected as a search result over time"
o Seasonality - fluctuations based on the time of month or year
o Burstiness - Sudden gains or losses in clicks
o Other patterns in CTR
The rate of change in scores can be measured over time to see if a search term is getting
more/less competitive and additional attention is needed.
Google "may monitor the ranks of documents over time to detect sudden spikes in the
ranks". This could indicate, according to the patent, "either a topical phenomenon (e.g., a
hot topic) or an attempt to spam search engine"
Google may use preventative measures against spam by:
o "Employing hysteresis to allow a rank to grow at a certain rate" - hysteresis in this
instance probably means a drag that slows the growth rate. The term
has dozens of distinct definitions.
o Limiting the "maximum threshold of growth over a predefined window of time"
for a given site/page.
o Google will also "consider mentions of the document in news articles, discussion
groups, etc. on the theory that spam documents will not be mentioned"
Certain types of sites/pages (Google specifically mentions "government documents, web
directories (e.g., Yahoo), and documents that have shown a relatively steady and high
rank over time") may be immune to the "spike" tracking and penalization
Google may also "consider significant drops in ranks of documents as an indication that
these documents are "out of favor" or outdated"
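Two of the ideas in this list, weighting a result by its position and limiting how fast a rank score may grow, can be sketched together. The weighting assumes the form (N + 1 - SLOT) / N, which matches the endpoints described above (1.0 for the top result, close to 0 for the last). The growth cap is only one possible reading of a "maximum threshold of growth"; the function names and the max_growth value are hypothetical.

```python
def position_weight(slot, n):
    """Weight for the result at 1-based position `slot` among `n` results.

    Assumed form: (n + 1 - slot) / n, so slot 1 scores 1.0 and
    slot n scores 1/n.
    """
    if not 1 <= slot <= n:
        raise ValueError("slot must be between 1 and n")
    return (n + 1 - slot) / n

def cap_growth(previous_score, new_score, max_growth=0.2):
    """Limit a score increase to `max_growth` (a fraction of the
    previous score) per measurement window. Decreases pass through.
    """
    ceiling = previous_score * (1 + max_growth)
    return min(new_score, ceiling)

# Weights for the top, middle, and last of 30 results.
for slot in (1, 15, 30):
    print(slot, round(position_weight(slot, 30), 3))

# A sudden jump from 0.5 to 0.9 is held to the allowed growth.
print(round(cap_growth(0.5, 0.9), 3))
```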
User Maintained/Generated Data
Google wants to measure many different types of aggregate data that users keep on their
computers about their web visits and experiences, including:
Bookmarks & Favorites lists in the browser
o They want to obtain this data either via a "browse assistant" - like the toolbar or
desktop search, or
o Directly via the browser itself - I predict they are developing their own Google
o Google will use this data over time to predict how valuable a particular site or
page is
Google also wants to document additions and removals from favorites & bookmarks over
time to help predict the value of a site/page
Google will also measure how often users access the site/page from their browser to see
if it is still relevant, or just a leftover ("outdated" or "unpopular")
The "temp or cache files associated with users could be monitored" by Google to identify
their visiting patterns on the web and determine whether there is "an upward or
downward trend in interest" in a given site/page.
Unique Words, Bigrams, Phrases in Anchor Text
Google intends to measure the profile of how anchor text appears over time to a particular
site/page to watch for spam. They note that "naturally developed web graphs typically involve
independent decisions. Synthetically generated web graphs, which are usually indicative of an
intent to spam, are based on coordinated decisions". The difference in patterns can be measured
and put to use to block spam.
Google notes that the "spikiness" of "anchor words/bigrams/phrases" is a prime measurement.
They note that spam typically shows "the addition of a large number of identical anchors from
Linkage of Independent Peers
Google can also use link data from "independent peers (e.g., unrelated documents)" to check for
spam. They say that a "sudden growth in the number of independent peers... with a large number
of links... may indicate a potentially synthetic web graph, which is an indicator of an attempt to
spam." Google notes that this "indication may be strengthened if the growth corresponds to
anchor text that is unusually coherent or discordant" and that they can discount the value of these
links either by a "fixed amount" or a "multiplicative factor" - this would give an additional
penalty just for having these links.
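The difference between discounting by a "fixed amount" and by a "multiplicative factor" is easy to see in code. This is a minimal sketch: the function names, the idea of a single numeric link value, and the floor at zero are assumptions, since the patent does not describe how link values are represented.

```python
def discount_fixed(link_value, amount):
    """Subtract a fixed amount from a link's value, never going below zero."""
    return max(0.0, link_value - amount)

def discount_multiplicative(link_value, factor):
    """Scale a link's value by a factor between 0 and 1."""
    if not 0 <= factor <= 1:
        raise ValueError("factor must be between 0 and 1")
    return link_value * factor

# Compare both discounts on a strong link and a weak link.
for value in (1.0, 0.2):
    print(value, discount_fixed(value, 0.3), discount_multiplicative(value, 0.5))
```

A fixed amount can wipe out a weak link entirely, while a multiplicative factor reduces every suspicious link in proportion to its original value.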
Topic extraction can be performed by Google through the following methods:
A set of unique low frequency words
The goal is to "monitor the topic(s) of a document over time and use this information for scoring
Google notes that "a spike in the number of topics could indicate spam" or that significant
document topic changes may indicate that the website "has changed owners and previous
document indicators, such as score, anchor text, etc., are no longer reliable." Google says that "if
one or more of these situations are detected, (they) may reduce the relative score of such
documents and/or the links, anchor text, or other data" from the website.
List of Additional Coverage & Resources
1. The patent from US Patent and Trademark Office - US Patent #20050071741 -
Information retrieval based on historical data
2. From SEOChat Forums - Information Retrieval Based on Historical Data - Sandbox
Explanation, Aging Delay?
3. From Threadwatch - Google's War on SEO - Documented
4. From SearchEngineWatch Forums - Does New Google Patent Validate Sandbox Theory?
5. From HighRankings Forum - New Google Patent, Must Read
6. From SERoundtable - Sandbox Explained by Google? "Information retrieval based on
7. From Search Science (Xan Porter) - New Google patent proves "sandbox" exists