Baidu_ Google search engine principles by fdjerue7eeu


Baidu, Google search engine principles
Section I Search engine principles
1, Basic concepts
   The Chinese Wikipedia explains: a (web) search engine is a system that automatically
collects information from the Internet, organizes it in a certain way, and then makes it
available for users to query.
   The English Wikipedia explains: Web search engines provide an interface to search for
information on the World Wide Web. Information may consist of web pages, images and other
types of files.
2, Classification
   According to their working principles, search engines fall into two basic categories:
full-text search engines (Full-Text Search Engine) and category directories.
   A category directory builds its database of sites from manually collected data, for
example the directories of Yahoo and, in China, Sohu, Sina and Netease. In addition, some
navigation sites can also be counted as category directories in the original sense, such as
"at home" (
   A full-text search engine collects web pages automatically by following hyperlinks. It
relies on hyperlinks and HTML code to analyze page content, and builds an index for user
queries according to ordering rules designed in advance.
   The distinction can be summarized in one sentence: a category directory is a site index
created manually, while full-text search is a page index created automatically. (Some
people often compare search engines with database retrieval; this is in fact a mistake.)
3, How full-text search works
   A full-text search engine generally has three parts: information collection, indexing
and searching. In more detail, it consists of five components: the searcher (crawler), the
analyzer (parser), the indexer, the retriever and the user interface.
   (1) Information collection (web crawling): information collection is done by the
searcher and the analyzer together. The search engine uses an automatic program called a
web crawler (crawler), spider or robot to follow the hyperlinks on each page.
   To explain further: "robots" are in fact Web-based programs. They collect HTML pages by
requesting them from web sites, traverse the entire Web space within a specified range,
move continuously from one page to another and from one site to another, and add the
collected pages to the page database. Whenever a "robot" meets a new page, it must search
all the links inside it. So in theory, if a suitable initial page set is established for
the "robot", it can start from that set, traverse all the links, and capture the pages of
the whole Web space.
   There are now many open source web crawlers available.
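
   The traversal described above can be sketched as a breadth-first walk over links. This
is only a minimal sketch, not any real robot's implementation: the LINKS table, the example
host names and the crawl function are hypothetical, and an in-memory dictionary stands in
for actual HTTP requests.

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the pages it links to.
LINKS = {
    "a.example/index": ["a.example/news", "b.example/index"],
    "a.example/news": ["a.example/index"],
    "b.example/index": ["b.example/about"],
    "b.example/about": [],
    "c.example/island": [],  # no page links here, so it is never reached
}


def crawl(seeds, fetch_links):
    """Breadth-first traversal from the seed pages; returns pages in visit order."""
    seen = set(seeds)
    queue = deque(seeds)
    collected = []
    while queue:
        page = queue.popleft()
        collected.append(page)  # stands in for "add to the web page database"
        for target in fetch_links(page):
            if target not in seen:  # follow every newly found link exactly once
                seen.add(target)
                queue.append(target)
    return collected


pages = crawl(["a.example/index"], lambda page: LINKS.get(page, []))
print(pages)
```

   Only pages reachable from the initial page set are collected, which is why the choice of
the initial set matters.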
   Key point 1: the core of the analysis is HTML, so rigorous, well-structured, readable
HTML with few errors is more likely to be analyzed and collected by robots. For example, a
page may have a malformed <body> tag or lack the closing </body></html>. It displays
without problems in a browser, but it is quite likely to be refused by the collector.
Likewise, a hyperlink such as ../../***.htm may not be recognized by the spider. This is one
of the reasons for promoting web standards: pages produced according to the standards are
easier for search engines to retrieve and include.
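
   To illustrate what a robot has to do with hyperlinks, here is a small sketch using
Python's standard html.parser and urllib.parse modules. The LinkExtractor class, the sample
HTML and the example.com addresses are assumptions made for illustration. It shows that a
relative link such as ../../docs/intro.htm needs an extra resolution step against the page
address, a step that a simpler robot might get wrong.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the address of the page itself.
                self.links.append(urljoin(self.base_url, href))


html = (
    '<html><body><a href="../../docs/intro.htm">Intro</a> '
    '<a href="/about.htm">About</a></body></html>'
)
extractor = LinkExtractor("http://www.example.com/a/b/page.htm")
extractor.feed(html)
print(extractor.links)
```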
   Key point 2: search robots keep a dedicated library of links. When the same hyperlink is
crawled again, the robot automatically compares the content and size of the old and new
pages, and does not collect the page again if they are identical. So the worry some people
have about whether a modified page can still be included is unnecessary.
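
   The comparison of old and new pages might look like the sketch below. It assumes the
link library stores a size and a content hash per address; the source only says that
content and size are compared, so the use of SHA-256 and the names link_library and
needs_collection are illustrative choices.

```python
import hashlib

# Hypothetical link library: address -> (size in bytes, content digest).
link_library = {}


def fingerprint(content: bytes):
    """Size plus hash of the page body, used to detect changes."""
    return len(content), hashlib.sha256(content).hexdigest()


def needs_collection(url: str, content: bytes) -> bool:
    """True if the page is new or changed; records the new fingerprint."""
    new = fingerprint(content)
    if link_library.get(url) == new:
        return False  # identical to the stored copy, skip it
    link_library[url] = new
    return True


first = needs_collection("http://www.example.com/", b"<html>v1</html>")
unchanged = needs_collection("http://www.example.com/", b"<html>v1</html>")
modified = needs_collection("http://www.example.com/", b"<html>v2</html>")
print(first, unchanged, modified)
```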
   (2) Indexing: the process by which a search engine organizes information is called
"indexing". A search engine not only collects and stores information, it also arranges it
according to certain rules. The index can be kept in a common large database such as ORACLE
or Sybase, or stored in a custom file format. Indexing is the more complex part of search,
involving page structure analysis, word segmentation, sorting and other techniques. A good
index can greatly improve retrieval speed.
   Key point 1: although search engines now support incremental indexing, building the
index still takes considerable time, so search engines update their indexes periodically.
Even after the crawler has visited, there will be a certain delay before we can find the
page by searching.
   Key point 2: the index is an important mark that distinguishes a good search engine from
a bad one.
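
   One common way to organize collected pages for fast lookup is an inverted index, which
maps each word to the pages containing it. The sketch below is an illustration of that
general technique, not a description of any particular engine's index format. The tokenize
and build_index helpers and the sample documents are hypothetical, and the simple regular
expression stands in for real word segmentation, which is much harder for Chinese text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Very rough segmentation: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Inverted index: word -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


docs = {
    1: "Baidu search engine principles",
    2: "Google search ranking",
    3: "Open source web crawler",
}
index = build_index(docs)
print(sorted(index["search"]))
```

   A query then only has to look up its words in the index instead of scanning every page,
which is where the gain in retrieval speed comes from.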
   (3) Searching: the user sends a query to the search engine, and the search engine
accepts the query and returns information to the user. Some systems compute and assess
relevance before returning results and sort by relevance, with highly relevant pages first
and less relevant pages last. Other systems have already computed a rank for each web page
before the user queries (Page Rank, introduced later in this text), and return results with
high page rank pages first and low page rank pages last.
   Key point 1: different search engines have different ordering rules, so the same
keywords are ranked differently on different search engines.
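
   As a rough illustration of sorting by relevance, the sketch below scores each page by how
often the query words occur in it. Real engines use far more elaborate ordering rules; the
term-count score, the search function and the sample pages are assumptions made only for
this example.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, docs):
    """Returns ids of matching documents, most relevant first."""
    terms = tokenize(query)
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)  # occurrences of query words
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


docs = {
    "p1": "search engine index and search engine ranking",
    "p2": "page rank of a web page",
    "p3": "a search robot collects web pages",
}
results = search("search engine", docs)
print(results)
```

   Changing the scoring rule changes the order of the results, which is why the same
keywords rank differently on different engines.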

Section II How the Baidu search engine works
   The Baidu search I know: through my work, I have been fortunate to use Baidu's "know
all" search engine for companies (that department has since been laid off, mainly because
Baidu's strategy began to move closer to Google's: it no longer sells search engines
separately and has turned to search services). According to Baidu's sales staff, the core
of "know all" is the same as that of Baidu's main search, possibly just a slightly older
version, so I have reason to believe the two work in similar ways. Below are some brief
notes and points to watch:
   1, On the update frequency of sites
   Baidu search can set the frequency and time of site updates. Large sites are generally
updated very frequently, with a dedicated crawler assigned to track them. Baidu is also
relatively diligent, though: small and medium sites are usually updated daily. So if you
want your site to be updated faster, it is best to have a link to it in a large category
directory (e.g. Yahoo, Sina, Netease), or a hyperlink to it on a related site that Baidu
covers, or to place your site inside some large site.
   2, On the depth of collection
   Baidu search can define a collection depth, so Baidu will not necessarily retrieve the
entire content of your site. It may index only the home page, particularly for small sites.
   3, On collecting sites that are often unreachable
   Baidu specifically checks whether a site is reachable. Once it finds that a site is
down, especially a small site, Baidu automatically stops sending crawlers to it, so
choosing a good server and keeping the site reachable 24 hours a day is very important.
   4, On sites that change IP address
   Baidu search can work from a domain name or an IP address. A domain name is
automatically resolved to the corresponding IP address, which causes two problems. First, if
your site shares an IP address with other sites and one of them is punished by Baidu, your
site will be implicated. Second, if you change your IP address, Baidu will find that your
domain name no longer corresponds to the previous IP address and will likewise refuse to
send crawlers to your site. So do not change your IP address casually, and use an exclusive
IP if possible. Stability is very important for a site.
   5, On the collection of static and dynamic sites
   Many people worry that pages of the asp?id= type are hard to collect while html pages
are easy to collect. In fact it is not that bad: most search engines now support collecting
and retrieving dynamic sites, including sites that require access permission, so there is
no need to worry that search engines will not recognize your dynamic site. Baidu search's
support for dynamic pages can be customized. Even so, generate static pages if you can.
Meanwhile, most search engines still can do nothing with script jumps (JS), frames, Flash
hyperlinks, or dynamic pages whose addresses contain illegal characters.
   6, On the disappearance of index entries
   As mentioned earlier, a search index has to be built. In a good search engine the index
is generally a set of text files rather than a database, so deleting a single record from
the index is not easy. Baidu, for example, needs special tools to delete an index record
manually. According to Baidu staff, Baidu has a group of people dedicated to this: they
take complaints and delete records by hand. Of course, all the index entries under a rule
can also be removed directly, that is, all the index entries under a site can be deleted.
There is one more mechanism (not verified): expired pages and cheating pages (mainly pages
whose title, keywords and content do not match) are deleted when the index is rebuilt.
   7, On de-duplication
   Baidu's de-duplication is not as good as Google's. It mainly checks the title and the
source address: as long as these are not the same, it will not automatically de-duplicate,
so there is no need to worry that collected content will quickly be punished by search
engines for being similar. This differs from Google; many pages with the same title are
also included.
   In addition, do not imagine that search engines are very intelligent. They basically
follow certain rules and formulas, so to avoid being punished by search engines, you only
need to avoid breaking these rules.
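
   A de-duplication rule that looks only at the title and the source address could be
sketched as follows. This is a guess at the shape of such a rule, based only on the
description above; the deduplicate function, the field names and the sample pages are
hypothetical.

```python
def deduplicate(pages):
    """Keeps the first page seen for each (title, source address) pair."""
    seen = set()
    kept = []
    for page in pages:
        key = (page["title"].strip().lower(), page["source"])
        if key not in seen:
            seen.add(key)
            kept.append(page)
    return kept


pages = [
    {"title": "Search engine principles", "source": "http://a.example/1.htm"},
    # Same title but a different source address: treated as a different page.
    {"title": "Search engine principles", "source": "http://b.example/1.htm"},
    # Same title and same source address: treated as a duplicate.
    {"title": "Search engine principles", "source": "http://a.example/1.htm"},
]
kept = deduplicate(pages)
print(len(kept))
```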

Section III Google search ranking technology
   As far as search goes, Google is stronger than Baidu. The main reason is that Google is
more impartial, while Baidu involves many human factors (which also suits China's national
conditions). Google's impartiality comes from its ranking technology, Page Rank.
   Many people know that Page Rank is a quality grade for a site, and that a higher value
indicates a better site. In fact, Page Rank is computed by a dedicated formula. When we
search for keywords on Google, pages with a higher Page Rank are placed nearer the front.
The formula involves no manual intervention, so it is fair.
   The original idea of Page Rank came from the management of academic papers. We know that
every paper ends with a list of references; if an article is cited many times in different
papers, we can consider it an excellent article.
   Similarly, put simply, PageRank makes an objective evaluation of the importance of a web
page. PageRank does not just count the number of direct links; it interprets a link from
page A to page B as a vote cast by page A for page B. PageRank then assesses the importance
of page B from the number of votes it receives. In addition, PageRank assesses the
importance of each voting page, because votes from some pages are considered more valuable,
so the pages they link to gain a higher value.
   The Page Rank formula is omitted here; the main factors that affect Page Rank are:
   1, the number of hyperlinks pointing to your site (how often your site is cited by
others). The larger the number, the more important your site. Put simply, it is whether
other sites link to or recommend your site;
   2, the importance of the sites linking to you: if a good quality site has a hyperlink to
your site, that shows your site is good too;
   3, page-specific factors, including the content of the page, the title and the URL, that
is, the keywords of the page and their positions.
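
   The voting idea behind the first two factors can be illustrated with the commonly
published form of PageRank, computed by repeated iteration. This is a sketch under stated
assumptions, not Google's production ranking: the damping factor 0.85 is the value usually
quoted in the literature, the handling of pages without outgoing links is one common
convention, and the four-page link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to (all must be keys)."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: B is linked to by A, C and D.
links = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["B"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

   Page B, which receives the most links, comes out on top, and page C ranks above A
because its single vote comes from the important page B.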

Section IV How a new site should deal with search engines
   The following is a summary of the analysis above:
1, Why search engines do not include your site. The possibilities are as follows (not
absolute; it depends on each site's situation):
   (1) it is an island page that no link points to: no page already included has a
hyperlink to your site, so search engines cannot find you;
   (2) the nature and file type of the pages (such as Flash, JS jumps, some dynamic pages,
frames, etc.) cannot be recognized by search engines;
   (3) the server hosting your site has been punished by search engines, which then do not
include content on the same IP;
   (4) the server IP address was changed recently, and search engines take time to
recognize the new address;
   (5) the server is unstable, goes down frequently, or cannot withstand the load of
crawler visits;
   (6) the page code is poor and the search engine cannot analyze the page correctly.
Please at least learn the basic syntax of HTML; using XHTML is recommended;
   (7) the site uses the robots (robots.txt) protocol to block search engines from crawling
it;
   (8) pages cheat with keywords: the page keywords seriously mismatch the content, or some
keyword's density is too high;
   (9) the pages contain illegal content;
   (10) the same site contains a large number of pages with the same title, or page titles
have no real meaning;
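
   For cause (7), you can check what a robots.txt file allows with Python's standard
urllib.robotparser module. The robots.txt content and the example.com addresses below are
made up for illustration; the sketch parses the rules from a string, so it makes no network
request.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks Baiduspider entirely, others only from /private/.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

baidu_home = parser.can_fetch("Baiduspider", "http://www.example.com/index.html")
google_home = parser.can_fetch("Googlebot", "http://www.example.com/index.html")
google_private = parser.can_fetch("Googlebot", "http://www.example.com/private/a.html")
print(baidu_home, google_home, google_private)
```

   If the first value is False for your own site, the robots.txt file itself is keeping the
crawler out.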
2, What a new site should do (for reference)
   (1) exchange links with excellent sites;
   (2) submit your site widely to the directory listings of major sites;
   (3) post often in good quality forums. Make your posts worthwhile, preferably not mere
replies, and leave your site address in them;
   (4) apply for a blog on a major site (Sina, Netease, CSDN) and promote your site in the
blog;
   (5) use a good site-building program, preferably one that can generate static pages and
generate keywords automatically;
   (6) pay attention to the title of each page and to the <head> region. Keep the keywords
as consistent as possible across these locations, which search engines index easily. Also
pay attention to the beginning of the article: as far as possible, open with something that
works like an abstract (you can learn from the style of NetEase articles).
   For example, for the article "Setting up an internal instant messaging service solution
based on open source jabber (XMPP)":
   Title section: <title>Setting up an internal instant messaging service solution based on
open source jabber (XMPP) - fertilizer Flanders (expendable) column - CSDNBlog</title>
   Keywords section: <meta name="keywords" content="... installation, ...">
   Article description: <meta name="description" content="... the famous instant messaging
server, which is free open source software that lets users set up their own instant
messaging server; it can be used on the Internet or on a LAN.
   XMPP (Extensible Messaging and Presence Protocol) is a protocol based on Extensible
Markup Language (XML), used for instant messaging (IM) and online presence detection. It
enables quasi-real-time operation between servers. This protocol may eventually allow
Internet users to send instant messages to anyone else on the Internet, even if their
operating systems and browsers differ. XMPP technology comes from Jabber; it is in fact
Jabber's core protocol, so XMPP is sometimes mistakenly called the Jabber protocol. Jabber
is an IM application based on the XMPP protocol; besides Jabber, XMPP supports many other
applications.
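
   Whether the title, keywords and description of a page agree can be checked with a short
script. The sketch below uses Python's standard html.parser module; the HeadInfo class, the
sample page and its keyword list are hypothetical and only loosely modelled on the example
above.

```python
from html.parser import HTMLParser


class HeadInfo(HTMLParser):
    """Collects the <title> text and the named <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page head, for illustration only.
page = """<html><head>
<title>Setting up an instant messaging service with Jabber (XMPP)</title>
<meta name="keywords" content="Jabber, XMPP, installation">
<meta name="description" content="Jabber is an open source instant messaging server.">
</head><body></body></html>"""

info = HeadInfo()
info.feed(page)
keywords = [k.strip() for k in info.meta.get("keywords", "").split(",") if k.strip()]
# Keywords that never appear in the title are candidates for a mismatch.
missing = [k for k in keywords if k.lower() not in info.title.lower()]
print(info.title)
print(missing)
```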
