CIS150 – Web Design
Research Paper 1 – Web Search Engines
How do search engines work
Your favorite search engine
Fine-tuning your search
Do not change the format of this document!
Enter your name and start typing under the horizontal line.
2 pages minimum! You may incorporate small images in your text.
Student Name: Alexander Novikov
The Internet contains hundreds of millions of web pages. While this is what makes it so great, it also creates
the problem of finding the page that holds the desired information. Web search engines were created to solve that problem.
A search engine operates in the following order:
1. Web crawling.
2. Indexing.
3. Searching.
Search engines work by storing information about many web pages. To find that information, engines use
automated software robots called spiders. In the process known as web crawling, the spiders build a list of the words
contained on each page. A spider usually starts on a popular website, indexes the words on its pages, and then
follows every link found within the site. In this way spiders quickly travel from page to page and spread across wider areas
of the Web. The contents of each page are then analyzed to determine how it should be indexed. Words occurring in
titles, headings, and other positions of relative importance are noted for special consideration.
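The crawling step described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any real spider is implemented: the three pages in FAKE_WEB and their example URLs are made up so that the sketch runs without a network connection, and the names LinkAndWordParser and crawl are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" (URL -> HTML) so the sketch runs offline.
FAKE_WEB = {
    "http://a.example/": '<title>Start</title><a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<h1>Spiders</h1><a href="http://c.example/">c</a>',
    "http://c.example/": '<p>crawl the web</p><a href="http://a.example/">a</a>',
}


class LinkAndWordParser(HTMLParser):
    """Collects the outgoing links and the visible words of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(start_url, fetch=FAKE_WEB.get):
    """Breadth-first crawl; returns {url: list of words found on that page}."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched: skip it
            continue
        parser = LinkAndWordParser()
        parser.feed(html)
        parser.close()
        pages[url] = parser.words
        for link in parser.links:
            if link not in seen:  # follow each link only once
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl("http://a.example/") visits all three sample pages, because each one is reachable by following links from the starting page.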
Among the positions that receive special consideration, meta tags are one of the most important.
The <meta> tag provides metadata about the HTML document. Metadata is information about data. It is not
displayed on the page, but it is machine-parsable. Meta elements are typically used to specify the page description,
keywords, author of the document, date last modified, and other metadata. The <meta> tag always goes inside the head
element.
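As an illustration of how a program could read this metadata, the sketch below uses Python's standard html.parser module. The sample page, the class name MetaParser, and the function extract_meta are made up for this example.

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_meta(html):
    """Return a dictionary of the named meta tags in an HTML document."""
    parser = MetaParser()
    parser.feed(html)
    parser.close()
    return parser.meta


# Hypothetical sample page used only for this illustration.
SAMPLE_PAGE = """<html><head>
<title>Jaguar care</title>
<meta name="description" content="Caring for big cats">
<meta name="keywords" content="jaguar, animal, cat">
<meta name="author" content="A. Student">
</head><body><p>The metadata is not shown here.</p></body></html>"""
```

For the sample page, extract_meta(SAMPLE_PAGE) returns the description, keywords, and author, none of which appear in the visible body text.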
Meta tags allow the owner of a page to specify the keywords by which the page should be indexed. This is very helpful
when words on a page have a double meaning: the meta tags can tell the search engine
which of the meanings is correct. But if a search engine relies too much on meta tags, it opens the door to an abuse
called keyword stuffing. A dishonest page owner can put popular words that have
nothing to do with the actual contents of the page into its meta tags to draw attention to the website. To avoid this problem, the spider must correlate
the meta tags with the information on the page and ignore meta tags that don’t match the words on the page. As the technologies for
detecting keyword stuffing have become more and more intelligent, dishonest webmasters have found another way to get artificial
attention to their pages: creating artificial links. This targets the many search engines that use link analysis as a criterion for
ranking pages.
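The idea of correlating meta tags with the page contents can be sketched as follows. This is a deliberately simple assumption about how such a check might work, not the method of any real search engine: it handles only single-word keywords, and the function name trusted_keywords is hypothetical.

```python
import re


def trusted_keywords(meta_keywords, body_text):
    """Keep only the meta keywords that also occur in the visible page text.

    meta_keywords is the comma-separated content of a keywords meta tag.
    Simplification: only single-word keywords are matched.
    """
    body_words = set(re.findall(r"[a-z0-9]+", body_text.lower()))
    kept = []
    for keyword in meta_keywords.split(","):
        keyword = keyword.strip().lower()
        if keyword and keyword in body_words:
            kept.append(keyword)
    return kept
```

With the keywords "jaguar, lottery, cat" and a page whose text is "The jaguar is a large cat.", the stuffed keyword "lottery" is dropped and only "jaguar" and "cat" are kept.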
The process of web crawling never ends, because the Web is constantly changing. The spider therefore
periodically comes back to sites to check for any information that has changed. Once the spider has found information on a
web page, the search engine must store it in a way that makes it useful. The data about web
pages is stored in an index database. But what exactly should be stored? Most search engines store more than just a word
and the URL where it was found, because producing a ranked listing that puts the most useful information at the top
of the list requires more criteria than that. Millions of pages could contain the same word or phrase, but some pages may
be more relevant, popular, or authoritative than others. How a search engine decides which pages are the best matches, and in what
order the results should be shown, varies widely from one engine to another. An engine might store the number of times
the word is found on a page. It might assign a weight to each entry, the value of which increases if the word was found in a title,
heading, or meta tag. Almost every search engine has its own formula for calculating the weight. That’s why the same
search query might give very different results in different engines.
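A weighting scheme of this kind might look like the sketch below. The weight values in FIELD_WEIGHTS are invented for illustration, since every engine uses its own formula, and the function name word_weights is hypothetical.

```python
from collections import Counter

# Hypothetical weights: the only assumption taken from the text is that
# title, heading, and meta-tag occurrences count for more than body text.
FIELD_WEIGHTS = {"title": 5, "heading": 3, "meta": 2, "body": 1}


def word_weights(fields):
    """fields maps a field name to its text; returns {word: weighted count}."""
    scores = Counter()
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 1)
        for word in text.lower().split():
            scores[word] += weight
    return scores
```

For a page with the title "Search engines", the heading "How search works", and the body "a search engine stores words", the word "search" scores 5 + 3 + 1 = 9, while "engine", which appears only in the body, scores 1.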
The data must be encoded so that a large amount of information can be stored in compact form. After the data is
encoded, it should be indexed. The main purpose of an index is to make information accessible as quickly as possible. There
are a few ways to index data, but one of the most effective is to build a hash table, in which a formula is applied to attach a
numerical value to each word. Hashing reduces the average time needed to find an entry. The hash table contains the hashed
number and a pointer to the actual data. The combination of indexing with effective storage makes it possible to get results
quickly even if the search query is complicated.
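The hash table idea can be illustrated with a small Python class. This is a sketch under stated assumptions: the CRC-32 checksum from the standard zlib module stands in for the "formula" that turns a word into a number, the table size of 64 is arbitrary, and the class name WordIndex is hypothetical.

```python
import zlib


class WordIndex:
    """Tiny hash table with chaining: each bucket holds (word, page URLs) entries."""

    def __init__(self, size=64):
        self.buckets = [[] for _ in range(size)]

    def _slot(self, word):
        # The hashed number decides which bucket the word lives in.
        return zlib.crc32(word.encode("utf-8")) % len(self.buckets)

    def add(self, word, url):
        bucket = self.buckets[self._slot(word)]
        for entry_word, urls in bucket:
            if entry_word == word:
                urls.add(url)
                return
        bucket.append((word, {url}))

    def lookup(self, word):
        # Only one bucket is scanned instead of the whole table.
        for entry_word, urls in self.buckets[self._slot(word)]:
            if entry_word == word:
                return urls
        return set()
```

A lookup computes the bucket number directly from the word, so the search only has to check the few entries that share that bucket. Python's built-in dict works on the same principle.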
When a user enters a query into a search engine, the engine looks through its index and presents a listing of the
best-matching web pages according to its criteria, usually in the form of a short summary containing the title and sometimes part of
the text.
My favorite web search engine is Google Search. It was created in 1997 by Larry Page and Sergey Brin as a
research project at Stanford University. Now it’s the most popular web search engine in the world, with a share of almost
83% of the web search market. And it did not get there for nothing. In my experience, in 9 cases out of 10 Google Search gave
me more precise and relevant results than all the other search engines I tried. The algorithm called PageRank is
responsible for that precision.
PageRank is a patented Google algorithm that helps rank the web pages that match a given search query. Previous
keyword-based methods of ranking search results, used by many search engines that were once more popular than
Google, would rank pages by how often the search terms occurred in a page, or by how strongly associated the
search terms were within each resulting page. The PageRank algorithm instead analyzes human-generated links,
assuming that web pages linked from many important pages are themselves likely to be important. The algorithm
computes a recursive score for pages, based on the weighted sum of the PageRanks of the pages linking to them.
PageRank is thought to correlate well with human concepts of importance. In addition to PageRank, Google has over the
years added many other secret criteria for determining the ranking of pages on result lists, reported to be over 200
different indicators. The specifics are kept secret to keep spammers at bay and to help Google maintain an edge
over its competitors globally.
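The recursive score can be approximated by repeating a simple update many times. The sketch below is a textbook-style simplification, not Google's production algorithm: the damping factor of 0.85 and the fixed number of iterations are assumptions of this example, and the function name pagerank is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.

    Returns {page: score}; the scores add up to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank
```

For example, if pages "a", "b", and "d" all link to "c", and "c" links only to "a", then "c" ends up with the highest score, and "a" outranks "b" and "d" because its single incoming link comes from an important page.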
Google’s spider, called Googlebot, gives the indexer the full text of the pages it finds. These pages are
stored in Google’s index database that contains billions of web pages. This index is sorted alphabetically by search term,
with each index entry storing a list of documents in which the term appears and the location within the text where it
occurs. This data structure allows rapid access to documents that contain user query terms.
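A data structure of this kind is often called an inverted index. The sketch below builds a very small one; it is only an illustration of the idea, with made-up document names, and the function name build_index is hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """documents maps a document id to its text.

    Returns {term: {doc_id: [word positions]}} with the terms in
    alphabetical order.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(sorted(index.items()))
```

For the two documents "web search engine" and "search the web", the entry for "web" records position 0 in the first document and position 2 in the second, so a query for "web" finds both documents without rereading their text.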
To improve search performance, Google ignores (doesn’t index) common words called stop words (such as the,
is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to
narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple
spaces, as well as converting all letters to lowercase, to improve Google’s performance. Google also applies machine-
learning techniques to improve its performance automatically by learning relationships and associations within the stored
data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google
closely guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and performance, and to
outwit the latest devious techniques used by spammers.
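The cleanup steps described above (lowercasing, dropping punctuation and extra spaces, and removing stop words) can be sketched like this. The stop-word list contains only the examples named in the text, since Google's real list is not public, and dropping every single character is a simplification of "certain single digits and single letters". The function name normalize is hypothetical.

```python
import re

# Illustrative stop-word list taken from the examples in the text.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}


def normalize(text):
    """Lowercase the text, drop punctuation and extra spaces, then remove
    stop words and single characters."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]
```

For example, normalize("How  is the Web   indexed?") keeps only the words "web" and "indexed".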
Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives
more priority to pages that have search terms near each other and in the same order as the query. Google can also match
multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can
restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, or in links to the
page. These options are offered by Google’s Advanced Search form and by its search operators (advanced operators).
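Phrase matching can be illustrated with a short sketch that looks for the query words next to each other and in the same order. It scans the text directly for simplicity, whereas a real engine would use the word positions stored in its index. The function name find_phrase and the sample documents are hypothetical.

```python
def find_phrase(documents, phrase):
    """Return the ids of documents that contain the words of `phrase`
    next to each other and in the same order."""
    wanted = phrase.lower().split()
    if not wanted:
        return []
    matches = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        for start in range(len(words) - len(wanted) + 1):
            if words[start:start + len(wanted)] == wanted:
                matches.append(doc_id)
                break  # one occurrence is enough for this document
    return matches
```

A document containing "search the web fast" does not match the phrase "web search", even though it contains both words, because they are not adjacent and in the same order.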
Besides its main feature of searching text, Google has a lot of other features. One of the newest
additions is the ability to search by image by dragging an image into the search bar. Maps are available by typing
an address into the search bar. The weather can be checked by typing weather and the name of a city. Currency rates and a calculator
are also available.
To fine-tune your search you can use any of more than 15 special search options. Some of Google’s operators
and query options:
1) OR – search for either one of several values.
2) “-” – search while excluding a word.
3) “” – double quotation marks to search for an exact word or phrase.
4) "*" – wildcard operator to match any words between other specific words.
5) define: – to get the dictionary definition of a word.
6) site: – to restrict the search to a specific domain.
7) allintitle: – Only the page titles are searched (not the remaining text on each webpage).
8) intitle: – Prefix to search in a webpage title, such as "intitle:google search", which will list pages with the word "google" in the title
and the word "search" anywhere (no space after "intitle:").
9) allinurl: – Only the page URL address lines are searched (not the text inside each webpage).
10) inurl: – Prefix for each word to be found in the URL; other words are matched anywhere.
11) cache: – Highlights the search-words within the cached document.
12) link: – The prefix "link:" will list webpages that have links to the specified webpage, such as "link:www.google.com"
lists webpages linking to the Google homepage.
13) related: – The prefix "related:" will list webpages that are "similar" to a specified web page.
14) info: – The prefix "info:" will display some background information about one specified webpage. Typically, the info
is the first text (160 bytes, about 23 words) contained in the page, displayed in the style of a results entry (for just
the 1 page as matching the search).
15) filetype: – results will only show files of the desired type.
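Several of these operators can be combined in one query. The sketch below shows one way a program might assemble such a query string and encode it for a URL. The function names build_query and search_url are hypothetical, and the "https://www.google.com/search?q=" address form is an assumption of this example.

```python
from urllib.parse import urlencode


def build_query(terms, exact=None, exclude=(), site=None, filetype=None):
    """Combine plain terms with a few of the operators listed above."""
    parts = list(terms)
    if exact:
        parts.append(f'"{exact}"')  # exact word or phrase
    parts.extend(f"-{word}" for word in exclude)  # excluded words
    if site:
        parts.append(f"site:{site}")  # restrict to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict to one file type
    return " ".join(parts)


def search_url(query):
    # Assumed URL form; urlencode escapes spaces, quotes, and colons.
    return "https://www.google.com/search?" + urlencode({"q": query})
```

For example, build_query(["jaguar"], exact="big cat", exclude=["car"], site="example.com", filetype="pdf") produces the query jaguar "big cat" -car site:example.com filetype:pdf.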