
Web Design


Shared by: yaoyufang
posted:
12/16/2011
CIS150 – Web Design



Research Paper 1 – Web Search Engines



• How do search engines work

• Your favorite search engine

• Fine-tuning your search



Do not change the format of this document!

Enter your name and start typing under the horizontal line.

2 pages minimum! You may incorporate small images in your text.



Student Name: Alexander Novikov

The Internet contains hundreds of millions of web pages. While this is what makes it so valuable, it also makes
it hard to find the page that holds the desired information. Web search engines were created to solve that problem.



A search engine operates in the following order:

1. Web crawling.

2. Indexing.

3. Searching.



Search engines work by storing information about many web pages. To gather that information, engines use
automated software robots called spiders. In the process known as web crawling, a spider builds a list of the words
contained on each page. A spider usually starts at a popular website, indexes the words on its pages, and then
follows every link found within the site. In this way spiders quickly spread across ever wider areas of the Web.
The contents of each page are then analyzed to determine how the page should be indexed. Words occurring in
titles, headings, and other positions of relative importance are noted for special consideration.
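
The crawling process described above can be illustrated with a small sketch. It is only a toy under stated assumptions: the three pages in PAGES, their URLs, and the names PageParser and crawl are made up for this example, and the "web" is a dictionary in memory rather than the real Internet. The spider starts at one page, records the words it sees, and follows links breadth-first.

```python
from collections import deque
from html.parser import HTMLParser
import re

# Hypothetical three-page "web"; URLs and contents are invented for illustration.
PAGES = {
    "http://a.example/": ('<title>Start</title>'
                          '<a href="http://b.example/">b</a> '
                          '<a href="http://c.example/">c</a> hello web'),
    "http://b.example/": ('<h1>Spiders</h1>'
                          '<a href="http://a.example/">back</a> spiders crawl'),
    "http://c.example/": '<p>Last page</p>',
}


class PageParser(HTMLParser):
    """Collects the outgoing links and the visible words of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def crawl(start_url, pages):
    """Breadth-first crawl; returns {url: list of words} in visiting order."""
    seen = {start_url}
    queue = deque([start_url])
    word_lists = {}
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # link points outside the toy web
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        word_lists[url] = parser.words
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return word_lists
```

Calling crawl("http://a.example/", PAGES) visits all three pages once, even though page b links back to page a. A real spider would also fetch pages over the network and respect each site's crawling rules, which this sketch leaves out.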



Among the most important of these positions are meta tags.



The <meta> tag provides metadata about the HTML document. Metadata is information about data. Metadata is not
displayed on the page, but it is machine-parsable. Meta elements are typically used to specify the page description,
keywords, the author of the document, the date it was last modified, and other metadata. The <meta> tag always goes
inside the <head> element.
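
As a rough illustration of how a program might read this metadata, the sketch below uses Python's standard html.parser module. The sample page, the class name MetaCollector, and the function read_meta are assumptions made for this example only; they do not describe how any particular search engine works.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            if name is not None:
                self.meta[name.lower()] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


# Hypothetical page used only for illustration.
SAMPLE = """<html><head>
<meta name="description" content="Notes on web search engines">
<meta name="keywords" content="search, spider, index">
<meta name="author" content="A. Student">
</head><body><p>Visible text.</p></body></html>"""


def read_meta(html):
    """Return a dictionary of the metadata declared in the page head."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.meta
```

For the sample page, read_meta(SAMPLE) returns the description, keywords, and author entries, while the visible paragraph in the body is ignored.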

Meta tags allow the owner of a page to specify the keywords under which the page should be indexed. This is very
helpful when words on a page have more than one meaning; in that case the meta tags can tell the search engine
which meaning is intended. But if a search engine relies too heavily on meta tags, it becomes open to an abuse
called keyword stuffing. A dishonest page owner can fill the meta tags with popular words that have nothing to do
with the actual contents of the page in order to draw attention to his website. To avoid this problem, the spider must
compare the meta tags with the text of the page and ignore meta tags that do not match the words on the page. As the
techniques for detecting keyword stuffing have grown more sophisticated, dishonest webmasters have found another way
to attract artificial attention to their pages: creating artificial links. This targets the many search engines that
use link analysis as a criterion for ranking pages.
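
The check against keyword stuffing described above can be sketched in a few lines. This is a deliberately simple assumption about how such a filter might look (the function name trusted_keywords is invented here); real engines are reported to use far more elaborate signals.

```python
import re


def trusted_keywords(meta_keywords, page_text):
    """Keep only the meta keywords whose words also occur in the page text.

    meta_keywords is the comma-separated content of a keywords meta tag.
    """
    page_words = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    kept = []
    for keyword in meta_keywords.split(","):
        keyword = keyword.strip().lower()
        # A multi-word keyword is kept only if every word appears on the page.
        if keyword and all(part in page_words for part in keyword.split()):
            kept.append(keyword)
    return kept
```

For example, with the keywords "guitar, lessons, free money, lottery" and the page text "Guitar lessons for beginners", only "guitar" and "lessons" survive; the unrelated keywords are ignored.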



The process of web crawling never ends, because the Web is constantly changing. The spider therefore
periodically returns to each site to check for information that has changed. Once the spider has found information on a
web page, the search engine must store it in a way that makes it useful. The data about web
pages is stored in an index database. But what exactly should be stored? Most search engines store more than just the word
and the URL where the word was found, because producing a ranked listing that puts the most useful information at the top
requires more criteria than that. Millions of pages could contain the same word or phrase, but some pages may
be more relevant, popular, or authoritative. How a search engine decides which pages are the best matches, and in what
order the results should be shown, varies widely from one engine to another. An engine might store the number of times
a word is found on a page. It might assign a weight to each entry, with the value increasing if the word was found in a title,
heading, or meta tag. Almost every search engine has its own formula for calculating this weight. That is why the same
search query might give very different results in different engines.
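
To make the idea of weighted entries concrete, here is a minimal sketch. The numbers in FIELD_WEIGHTS are invented for illustration, since real engines keep their formulas private, and the function name term_weights is likewise an assumption of this example.

```python
import re
from collections import Counter

# Hypothetical weights: a word in the title counts five times as much
# as the same word in the body. Real formulas are not public.
FIELD_WEIGHTS = {"title": 5.0, "heading": 3.0, "meta": 2.0, "body": 1.0}


def term_weights(fields):
    """fields maps a field name to its text; returns {word: weighted count}."""
    weights = Counter()
    for field, text in fields.items():
        factor = FIELD_WEIGHTS.get(field, 1.0)  # unknown fields count as body
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            weights[word] += factor
    return weights
```

With these made-up weights, a page whose title is "Search engines", whose heading is "How search works", and whose body is "A search engine stores words" gives the word "search" a weight of 9.0 (5 + 3 + 1), while "words" gets only 1.0. Changing the numbers changes the ranking, which is one reason different engines return different results.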



The data must be encoded so that a large amount of information can be stored in compact form. After the data is
encoded, it should be indexed. The main purpose of an index is to make information accessible as quickly as possible. There
are a few ways to index data, but one of the most effective is to build a hash table, in which a formula is applied to attach a
numerical value to each word. Hashing reduces the average time needed to find an entry. The hash table contains the hashed
number and a pointer to the actual data. The combination of indexing with efficient storage makes it possible to get results
quickly even if the search query is complicated.
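
A small sketch can show how hashing attaches a number to each word and uses it to find an entry quickly. The hash formula, the table size of 64, and the names word_hash and HashIndex are assumptions chosen for this example; Python's built-in dict is itself a hash table and would normally be used instead.

```python
def word_hash(word, table_size):
    """Toy polynomial hash: turns a word into a bucket number."""
    value = 0
    for ch in word:
        value = (value * 31 + ord(ch)) % table_size
    return value


class HashIndex:
    """Hash table that maps each word to the list of URLs containing it."""

    def __init__(self, table_size=64):
        self.table_size = table_size
        self.buckets = [[] for _ in range(table_size)]

    def add(self, word, url):
        bucket = self.buckets[word_hash(word, self.table_size)]
        for entry_word, urls in bucket:
            if entry_word == word:
                if url not in urls:
                    urls.append(url)
                return
        # First time this word is seen (or it collided with another word).
        bucket.append((word, [url]))

    def lookup(self, word):
        bucket = self.buckets[word_hash(word, self.table_size)]
        for entry_word, urls in bucket:
            if entry_word == word:
                return urls
        return []
```

A lookup computes the hash once and then inspects only one bucket, instead of scanning every stored word. Words that happen to share a bucket are told apart by comparing the stored word itself.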










When a user enters a query into a search engine, the engine looks through its index and presents a listing of the
best-matching web pages according to its criteria, usually in the form of short summaries containing the title and
sometimes part of the text.



My favorite web search engine is Google Search. It was created in 1997 by Larry Page and Sergey Brin as a
research project at Stanford University. It is now the most popular web search engine in the world, with a share of almost
83% of the web search market. It did not earn that position for nothing. In my experience, in 9 cases out of 10 Google Search gave
me more precise and relevant results than any other search engine I tried. The algorithm called PageRank is
responsible for that precision.



PageRank is a patented Google algorithm that helps rank the web pages that match a given search query. Earlier
keyword-based ranking methods, used by many search engines that were once more popular than
Google, ranked pages by how often the search terms occurred in a page or by how strongly the
search terms were associated within each resulting page. The PageRank algorithm instead analyzes human-generated links,
assuming that web pages linked from many important pages are themselves likely to be important. The algorithm
computes a recursive score for each page, based on the weighted sum of the PageRanks of the pages linking to it.
PageRank is thought to correlate well with human concepts of importance. In addition to PageRank, Google has over the
years added many other criteria for determining the ranking of pages on result lists, reported to number over 200
different indicators. The specifics are kept secret to keep spammers at bay and to help Google maintain an edge
over its competitors globally.
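
The recursive score can be approximated by repeating a simple update until the numbers settle. The sketch below is a simplified textbook-style version, not Google's actual implementation: the damping value of 0.85, the fixed number of iterations, and the way pages without outgoing links are handled are all assumptions of this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.

    Returns {page: score}; the scores add up to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on in equal shares to its targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

For a tiny web where pages "a" and "b" both link to "c", and "c" links back to "a", the page "c" ends up with the highest score because two pages point to it, "a" comes second because it is linked from the important page "c", and "b" comes last because nothing links to it.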



Google’s spider, called Googlebot, gives the indexer the full text of the pages it finds. These pages are
stored in Google’s index database, which contains billions of web pages. This index is sorted alphabetically by search term,
with each index entry storing a list of the documents in which the term appears and the locations within the text where it
occurs. This data structure allows rapid access to documents that contain the user’s query terms.
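
An index of this kind, where each term points to its documents and positions, can be sketched as follows. The function names build_index and search and the two sample documents in the usage note are assumptions of this example; the real index is of course far larger and more compact.

```python
import re
from collections import defaultdict


def build_index(documents):
    """documents maps a document id to its text.

    Returns {term: {doc_id: [positions of the term in that document]}}.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        terms = re.findall(r"[a-z0-9]+", text.lower())
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], {}))
    for term in terms[1:]:
        result &= set(index.get(term, {}))
    return sorted(result)
```

Given the documents {"d1": "web search engines index the web", "d2": "spiders crawl the web"}, the entry for "web" records positions 0 and 5 in d1 and position 3 in d2. Calling sorted(index) lists the terms alphabetically, and search(index, "search web") returns only d1 without reading either document again.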



To improve search performance, Google ignores (doesn’t index) common words called stop words (such as the,

is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to

narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple

spaces, as well as converting all letters to lowercase, to improve Google’s performance. Google also applies machine-

learning techniques to improve its performance automatically by learning relationships and associations within the stored

data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google

closely guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and performance, and to

outwit the latest devious techniques used by spammers.
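
The clean-up steps described above (lowercasing, dropping punctuation and extra spaces, removing stop words) can be sketched like this. The stop-word list contains only the examples named in the text, and dropping every single character is a simplification of "certain single digits and single letters"; both are assumptions of this sketch.

```python
import re

# A small illustrative stop-word list taken from the examples in the text.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}


def normalize(text):
    """Lowercase the text, drop punctuation and extra spaces,
    and remove stop words and single characters."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]
```

For example, normalize("How  is the Web indexed, or why?") keeps only "web" and "indexed", so much less data has to be stored and compared.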



Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives
more priority to pages that have the search terms near each other and in the same order as the query. Google can also match
multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can
restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, or in links to the
page. These options are offered by Google’s Advanced Search form and by its search operators (advanced operators).
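
One simple way to reward pages whose search terms are close together and in query order is sketched below. The scoring rule (number of terms divided by the length of the shortest in-order span found) and the name proximity_score are inventions of this example; Google's actual formula is not public.

```python
import re


def proximity_score(query, text):
    """Score between 0 and 1: 1.0 when the query terms appear as an exact
    adjacent phrase, lower when they are spread out, and 0.0 when the terms
    never occur in query order."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not terms:
        return 0.0
    best = 0.0
    starts = [i for i, w in enumerate(words) if w == terms[0]]
    for start in starts:
        pos = start
        found_all = True
        for term in terms[1:]:
            try:
                # Next term must appear somewhere after the previous one.
                pos = words.index(term, pos + 1)
            except ValueError:
                found_all = False
                break
        if found_all:
            span = pos - start + 1
            best = max(best, len(terms) / span)
    return best
```

With the query "web search", the text "a web search engine" scores 1.0, the text "web pages are easy to search" scores lower because the terms are far apart, and "search the web" scores 0.0 because the terms appear in the opposite order.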



Besides its main feature of searching text, Google has many other features. One of the newest
additions is the ability to search by image by dragging an image into the search bar. Maps are available by typing
an address into the search bar. The weather can be checked by typing "weather" and the name of a city. Currency rates and
a calculator are also available.



To fine-tune your search, you can use any of more than 15 special search options. Some of Google’s operators
and query options:

1) OR – searches for either one of several values.

2) “-” – excludes a word from the search.

3) “” – double quotation marks search for the exact word or phrase.

4) "*" – wildcard operator that matches any words between other specific words.

5) define: – returns the dictionary definition of a word.

6) site: – restricts the search to a specific domain.

7) allintitle: – searches only the page titles (not the remaining text on each webpage).

8) intitle: – prefix for a word that must appear in the webpage title; for example, "intitle:google search" lists pages with the word "google" in the title and the word "search" anywhere (no space after "intitle:").

9) allinurl: – searches only the page URLs (not the text inside each webpage).

10) inurl: – prefix for each word that must be found in the URL; other words are matched anywhere.

11) cache: – highlights the search words within the cached document.



12) link: – The prefix "link:" will list webpages that have links to the specified webpage, such as "link:www.google.com"

lists webpages linking to the Google homepage.



13) related: – The prefix "related:" will list webpages that are "similar" to a specified web page.










14) info: – The prefix "info:" will display some background information about one specified webpage. Typically, the info
is the first text (160 bytes, about 23 words) contained in the page, displayed in the style of a results entry (for just
the one page matching the search).



15) filetype: – shows only files of the specified type.
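
Several of the operators listed above can be combined in one query. The sketch below only builds the query text and a search URL; the function names build_query and search_url are invented for this example, and the URL pattern with the q parameter is an assumption about how the query is passed, not an official programming interface.

```python
from urllib.parse import urlencode


def build_query(words=(), phrase=None, exclude=(), site=None, filetype=None):
    """Combine a few of the operators from the list into one query string."""
    parts = list(words)
    if phrase:
        parts.append(f'"{phrase}"')  # exact phrase in double quotation marks
    parts.extend(f"-{word}" for word in exclude)  # excluded words
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


def search_url(query):
    """Assumed URL pattern: the query text is sent in the q parameter."""
    return "https://www.google.com/search?" + urlencode({"q": query})
```

For example, build_query(words=["jaguar"], phrase="top speed", exclude=["car"], site="example.com", filetype="pdf") produces the query: jaguar "top speed" -car site:example.com filetype:pdf. Note that there is no space between an operator such as "site:" and its value.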











