Baidu and Google Search Engine Principles

Section 1: Search engine principles

1. Basic concepts

The Chinese Wikipedia defines a (web) search engine as a system that automatically collects information from the Internet, organizes it, and makes it available for users to query.

The English Wikipedia says: "Web search engines provide an interface to search for information on the World Wide Web. Information may consist of web pages, images and other types of files."

2. Classification

By working principle, search engines fall into two basic categories: full-text search engines and category directories.

A category directory is a database of sites compiled by hand, such as Yahoo and, in China, the directories of Sohu, Sina, and NetEase. Some navigation sites, such as hao123 (http://www.hao123.com/), also belong to this category.

A full-text search engine automatically follows hyperlinks from page to page, analyzes page content from the hyperlinks and the HTML code, and builds an index according to predefined rules for users to query.

The distinction in one sentence: a category directory is a site index built by hand, while full-text search is a page index built automatically. (People often compare search engines with database queries; that comparison is wrong.)
3. How full-text search works

A full-text search engine generally has three stages: information collection, indexing, and searching. In more detail, it consists of five components: the searcher (crawler), the analyzer (parser), the indexer, the retriever, and the user interface.

(1) Information collection (web crawling): The searcher and the analyzer do this together. The search engine uses an automated program called a web crawler, spider, or robot to follow the hyperlinks on pages.

In more detail, a "robot" is a web-based program that requests HTML pages from web sites and collects them. It traverses the specified range of the web, moving continuously from one page to another and from one site to another, and adds the pages it collects to a page database. When a robot meets a new page, it must follow all the links in it. So in theory, given a suitable initial set of pages, a robot that starts from that set and traverses every link can capture the whole web.

Many open-source web crawlers are available; you can find them in open-source communities.

Key point 1: The core of the analysis is HTML, so code that is strict, well structured, readable, and free of errors is more likely to be analyzed and collected by robots. For example, a page with an unclosed <body tag, or without the closing </body></html>, may display without problems but is likely to be rejected for inclusion. A hyperlink such as ../../***.htm may also fail to be recognized by the spider. This is one reason to promote web standards: pages built to the standards are easier for search engines to retrieve and include.
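The traversal described under (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real search engine's robot is built: the `fetch` callback, the `SITE` dictionary, and the `example.test` URLs are all hypothetical, and the pages are held in memory so the sketch runs without network access. It starts from an initial page set, extracts every link, resolves relative links such as `../two.htm` against the current page, and adds each collected page to a page database.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first traversal starting from seed_urls.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the HTML of a page, or None if the page is unavailable.
    """
    page_db = {}                 # url -> html, the "page database"
    queue = deque(seed_urls)
    seen = set(seed_urls)        # links already queued, to avoid revisits
    while queue and len(page_db) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        page_db[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return page_db


# Hypothetical three-page site, kept in memory for the example.
SITE = {
    "http://example.test/index.htm":
        '<html><body><a href="a/one.htm">one</a></body></html>',
    "http://example.test/a/one.htm":
        '<html><body><a href="../two.htm">two</a></body></html>',
    "http://example.test/two.htm":
        '<html><body><a href="index.htm">home</a></body></html>',
}

pages = crawl(["http://example.test/index.htm"], SITE.get)
```

Starting from the single seed page, the crawl reaches all three pages because each one is linked from a page already collected. A page that nothing links to would never be found, which is the "island" problem discussed later in this article.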
Key point 2: The robot keeps a dedicated library of links. When it meets a hyperlink it has seen before, it compares the content and size of the old and new pages, and it does not collect the page again if they are the same. So the worry that a modified page will not be picked up is unnecessary.

(2) Indexing: The process by which a search engine organizes information is called "indexing". A search engine does not just collect and store information; it also arranges it according to certain rules. The index can be kept in a general-purpose large database such as Oracle or Sybase, or stored in a custom file format. Indexing is the more complex part of search. It involves page structure analysis, word segmentation, and sorting, and a good index greatly improves retrieval speed.

Key point 1: Although search engines support incremental indexing, building an index still takes time, and search engines update their indexes periodically. So even after the crawler has visited, there is a delay before the page can be found in search.

Key point 2: The quality of the index is an important measure of how good a search engine is.

(3) Searching: The user sends a query to the search engine, and the search engine returns the matching information. Some systems compute and evaluate the relevance of the results before returning them and sort by relevance, with the most relevant first and the least relevant last. Other systems compute a rank for every page before any query arrives (PageRank, introduced later in this article) and return results with higher-ranked pages first and lower-ranked pages last.

Key point 1: Different search engines use different ranking rules, so the same keywords rank differently on different search engines.
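To make the indexing and searching stages concrete, here is a minimal sketch of an inverted index with a simple relevance sort. It rests on several assumptions that are not from this article: the `PAGES` sample data is hypothetical, the tokenizer only splits ASCII words (real engines need proper word segmentation, especially for Chinese), and "relevance" is reduced to counting how often the query words occur on a page.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Naive segmentation: lowercase ASCII words and numbers only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> {url: occurrences of word on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """Return the URLs that contain every query word, most relevant first."""
    terms = tokenize(query)
    if not terms:
        return []
    # Keep only pages that contain all of the query words.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    # Score = total occurrences of the query words; ties are broken by URL.
    scored = [(sum(index[t][url] for t in terms), url) for url in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


# Hypothetical page texts, as they might look after crawling and parsing.
PAGES = {
    "p1": "the search engine index builds the index",
    "p2": "a crawler feeds the search engine index",
    "p3": "the web crawler is also called a robot",
}

index = build_index(PAGES)
results = search(index, "index")
```

The index is built once, before any query arrives, which is why lookups are fast and also why a newly crawled page only becomes searchable after the index has been updated.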
Section 2: How the Baidu search engine works

What I know about Baidu search: through my work I have had the chance to use Baidu's enterprise search engine product, rendered here as "Know All". (That department has since been cut, mainly because Baidu's strategy began to move closer to Google's: the search engine is no longer sold separately, and Baidu has turned to search services.) According to Baidu's sales staff, the core of "Know All" is the same as that of Baidu's main web search, possibly in a slightly older version, so I have reason to believe the two work in similar ways. Here are some brief notes and points to watch.

1. Update frequency

Baidu search can set the frequency and time of updates per site. Large sites are generally updated very frequently and get dedicated crawlers to track them. Baidu is fairly diligent, though, and small and medium sites are usually updated daily as well. So if you want your site to be updated faster, it helps to have a link in a large directory (for example Yahoo, Sina, or NetEase), a hyperlink to your site on a related site that Baidu already covers, or part of your site hosted on a large site, such as a blog on a large site.

2. Collection depth

Baidu search can define a collection depth, so Baidu will not necessarily retrieve the entire content of your site. It may index only the home page, particularly for small sites.

3. Sites that are often unreachable

Baidu has a dedicated check for whether a site is reachable. Once it finds a site unreachable, particularly a small site, Baidu automatically stops sending crawlers to it. So it is very important to choose a good server and keep the site reachable 24 hours a day.
4. Changing a site's IP address

Baidu search can work by domain name or by IP address; a domain name is automatically resolved to its IP address. This causes two problems. First, if your site shares an IP address with other sites and one of them is penalized by Baidu, your site is implicated too. Second, if you change your IP address, Baidu finds that your domain name no longer matches the previous IP address and refuses to send crawlers to your site. The advice is therefore not to change IP addresses casually and, if possible, to use a dedicated IP address. Keeping the site stable is very important.

5. Static and dynamic pages

Many people worry that pages of the form asp?id= are hard to collect while html pages are easy. In fact it is not that bad. Most search engines now support collecting and retrieving dynamic sites, including sites that require a login, so there is no need to worry that search engines will not recognize your dynamic site. Baidu's support for dynamic pages can also be customized. Still, generate static pages if you can. Note also that most search engines still cannot handle script redirects (JS), frames, Flash hyperlinks, or dynamic pages that contain illegal characters.

6. Disappearing index entries

As mentioned earlier, a search index has to be built. In a good search engine the index is generally a set of text files rather than a database, so deleting a single record from the index is not easy. Baidu, for example, needs special tools to delete an index record by hand. According to Baidu staff, Baidu has a group of people dedicated to exactly this: receiving complaints and deleting records manually. It is of course also possible to delete all index entries that match a rule, for example everything indexed under one site.
There is also a mechanism (not verified) by which expired pages and cheating pages (mainly pages whose title, keywords, and content do not match) are deleted when the index is rebuilt.

7. Deduplication

Baidu's deduplication is not as good as Google's. It mainly checks the title and the source address; as long as these differ, pages are not automatically treated as duplicates. So there is no need to worry that collected content which resembles other content will quickly be penalized. Unlike Google, Baidu includes many pages that share the same title.

A final note: do not imagine search engines to be too intelligent. They basically follow fixed rules and formulas, and to avoid being penalized you only need to stay clear of those rules.

Section 3: Google's search ranking technology

In search, Google is stronger than Baidu, mainly because Google is fairer, while Baidu involves many human factors (which also fits China's national conditions). Google's fairness comes from its ranking technique, PageRank.

Many people know PageRank as a grade of a site's quality: the higher the value, the better the site. In fact, PageRank is computed by a specific formula, and when we search for keywords on Google, pages with a higher PageRank tend to appear closer to the front. The formula involves no manual intervention, which is why it is fair.

The original idea of PageRank came from the way academic papers are managed. Every paper ends with a list of references, and if a paper is cited many times by different papers, it can be considered an excellent paper.

Similarly, put simply, PageRank makes an objective evaluation of the importance of a web page. PageRank does not simply count direct links; it interprets a link from page A to page B as page A casting a vote for page B. PageRank then assesses the importance of page B from the number of votes it receives.
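The voting idea can be illustrated with a small iterative computation. The sketch below uses the commonly published textbook form of PageRank, in which each page divides its own score evenly among the pages it links to, so a vote from an important page counts for more. It is not Google's production formula; the damping factor of 0.85, the number of iterations, and the four-page `LINKS` graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a PageRank-style score for each page.

    links maps each page to the list of pages it links to (its "votes").
    The damping value and iteration count are illustrative assumptions.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}      # start with equal scores
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it votes for.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A, C and D all link to B; B links back to A.
LINKS = {"A": ["B"], "B": ["A"], "C": ["B"], "D": ["B"]}

ranks = pagerank(LINKS)
ordered = sorted(ranks, key=ranks.get, reverse=True)
```

In this graph B collects the most votes and scores highest. A receives only one vote, but it comes from the high-scoring B, so A ends up well above C and D, which nobody links to.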
In addition, PageRank assesses the importance of each page that casts a vote, because votes from some pages are considered more valuable, so the pages they link to gain more value.

The PageRank formula is omitted here. The main factors that affect PageRank are:

1. The number of hyperlinks pointing to your site (how often your site is cited by others). The larger the number, the more important your site. Put plainly, this is whether other sites link to or recommend your site.
2. The importance of the sites that link to you. If a high-quality site has a hyperlink to your site, that indicates your site is good.
3. Page-specific factors: the content of the page, the title, the URL, and so on, that is, the keywords and where they appear on the page.

Section 4: What a new site should do about search

The following summarizes the analysis above.

1. Why search engines do not include your site. The possible reasons are listed below (they are not absolute; every situation differs).

(1) The site is an island with no inbound links: no page already included by the search engine links to your site, so the search engine cannot find you.
(2) The nature of the pages or the file types (such as Flash, JS redirects, some dynamic pages, frames) means search engines cannot recognize them.
(3) The server hosting your site has been penalized by the search engine, so content from the same IP address is not included.
(4) The server's IP address changed recently, and the search engine needs time to crawl the site again.
(5) The server is unstable, goes down often, or cannot bear the load of the crawler.
(6) The page code is poor and the search engine cannot parse the page correctly. Learn at least the basic syntax of HTML; XHTML is recommended.
(7) The site uses the robots protocol (robots.txt) to block search engines from crawling it.
(8) The pages cheat with keywords: the keywords seriously mismatch the content, or the keyword density is too high.
(9) The pages contain illegal content.
(10) The same site has many pages with the same title, or pages whose titles carry no real meaning.

2. What a new site should do (for reference)

(1) Exchange links with excellent sites.
(2) Submit the site to the directories of the major sites.
(3) Post more on good-quality forums. Posts should have substance, preferably not mere replies, and should leave your site address.
(4) Apply for blogs on large sites (Sina, NetEase, CSDN) and promote your site in them.
(5) Use a good site-building program, preferably one that can generate static pages and generate keywords automatically.
(6) Pay attention to the title of each page and to the <head> region, and keep the keywords consistent across these locations, which search engines index easily. Pay attention to the opening of each article as well, and make it work like an abstract where possible (NetEase's article style is a good model).
For example, take the article "An internal instant messaging service solution based on open-source Jabber (XMPP)".

Title: <title>An internal instant messaging service solution based on open-source Jabber (XMPP) - fertilizer Flanders (expendable) column - CSDNBlog</title>

Keywords: <meta name="keywords" content="... installation, ">

Description: <meta name="description" content="... the famous instant messaging server. It is free open-source software that lets users set up their own instant messaging server, and it can be used on the Internet or on a LAN.">

XMPP (Extensible Messaging and Presence Protocol) is a protocol based on Extensible Markup Language (XML), used for instant messaging (IM) and online presence detection. It facilitates near-real-time operation between servers. The protocol may eventually allow Internet users to send instant messages to anyone else on the Internet, even across different operating systems and browsers. XMPP technology comes from Jabber; it is in fact Jabber's core protocol, so XMPP is sometimes mistakenly called the Jabber protocol. Jabber is an IM application based on XMPP; besides Jabber, XMPP supports many other applications.
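The advice to keep the title and the <head> keywords consistent can be checked mechanically. The sketch below is one possible way to do it, using Python's standard `html.parser`; the `missing_keywords` helper and the `PAGE` sample are hypothetical, and the sample keywords are invented for illustration rather than taken from the page quoted above.

```python
from html.parser import HTMLParser


class HeadInfo(HTMLParser):
    """Collects the <title> text and the named <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def missing_keywords(html):
    """Return the declared meta keywords that do not appear in the title."""
    info = HeadInfo()
    info.feed(html)
    declared = info.meta.get("keywords", "")
    keywords = [k.strip() for k in declared.split(",") if k.strip()]
    title = info.title.lower()
    return [k for k in keywords if k.lower() not in title]


# Hypothetical page head; the keywords are made up for this example.
PAGE = (
    "<html><head>"
    "<title>Jabber (XMPP) instant messaging service setup</title>"
    '<meta name="keywords" content="Jabber, XMPP, installation">'
    '<meta name="description" content="Setting up an internal IM server.">'
    "</head><body></body></html>"
)

missing = missing_keywords(PAGE)
```

Here "installation" is declared as a keyword but never appears in the title, so it is reported. Whether that matters for ranking depends on the search engine; the check only flags the mismatch.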