					                  CS5286 Search Engine Technology and Algorithms




Lecture 2:
Web (Social) Structure Mining


   Prof. Xiaotie Deng

 Department of Computer Science


                    CS5286 Search Engine Technology and Algorithms/Xiaotie Deng




        Outline

   HTTP
   Web Structure
   Web Mining
       FAN
       Page Rank
       Community
   Physical Structure of Internet
   Fighting Spam Web Pages






HTTP

   Hypertext Transfer Protocol
   Agent, Proxy, Cache
   MIME
   HTML, XML








         HTTP :: Hypertext Transfer Protocol

   The Hypertext Transfer Protocol (HTTP/1.0)
       An application-level protocol
            with the lightness and speed necessary for distributed,
             collaborative, hypermedia information systems.
       A generic, stateless, object-oriented protocol
            which can be used for many tasks such as name servers and
             distributed object management systems, through extension of its
             request methods (commands).
       An important feature: the typing of data representation allows
        systems to be built independently of the data being
        transferred
       Based on request/response paradigm
       In use by the WWW global information initiative since 1990.
   http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1945.txt





     HTTP :: User Agent, Proxy, Cache
   User Agent:
       The client that initiates a request.
       E.g., browsers, editors, spiders (web-traversing robots)
   Proxy:
       An intermediary program which acts as both a server and a client for
        the purpose of making requests on behalf of other clients
       Interpret or rewrite a request message before forwarding it
       Often used as client-side portals through network firewalls and as
        helper applications for handling requests via protocols not
        implemented by user agent.
   Cache:
       local store of response messages





        HTTP :: MIME

   Multipurpose Internet Mail Extensions
   The MIME type of a URL specifies the type of data to be
    transferred
      E.g., HTML files, images, sounds, Java classes

      see: http://www.faqs.org/rfcs/rfc1521.html

   RFC 1521 defines the major media types used by HTTP
   In response to a request, the server first sends back
    some information about the requested object, including
    its MIME type, before sending back the actual content
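
As a rough illustration of the last point, the sketch below splits a raw HTTP response into status line, headers and body, then reads the MIME type from the Content-Type header. The sample response and the names parse_response and mime_type are made up for this example; no network access is involved.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into (status_code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line, e.g. "HTTP/1.0 200 OK"
    version, code, *_ = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


def mime_type(headers):
    """Return the media type without parameters such as charset."""
    return headers.get("content-type", "").split(";")[0].strip().lower()


# Hypothetical response, as a server might send it.
SAMPLE = (b"HTTP/1.0 200 OK\r\n"
          b"Content-Type: text/html; charset=iso-8859-1\r\n"
          b"Content-Length: 13\r\n"
          b"\r\n"
          b"<html></html>")

status, headers, body = parse_response(SAMPLE)
print(status, mime_type(headers), len(body))
```

The headers arrive before the body, so a client can decide how to handle the content (render it, save it, skip it) from the MIME type alone.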





         HTTP :: HTML

   Hypertext Markup Language
   HTML is the lingua franca for publishing on the World
    Wide Web. Having gone through several stages of
    evolution, today's HTML has a wide range of features
    reflecting the needs of a very diverse and international
    community wishing to make information available on
    the Web.
       See: http://www.w3.org/MarkUp/
       A tutorial on HTML:
            http://www.cwru.edu/help/introHTML/toc.html





    Web Structure
   A Digraph
   Special Features:
       Huge
       Unknown
       Dynamic
   Directed Graph Functions
       Back Link
       Shortest Path
       Cliques




        Web Structure :: Digraph
   Nodes: web pages (URLs)
   Directed edges: hyperlinks from one web page to
    another
   Content of a node: the content of its
    associated web page
   Dynamic nature of the digraph:
       For some nodes, there are outgoing edges which we don’t
        know yet.
            Nodes that have not yet been processed
            New edges (hyperlinks) may have been added to some nodes
       For all the nodes, there are some incoming edges which we do
        not yet know.
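
One possible way to hold such a partially known digraph is sketched below. The class name WebDigraph and its methods are hypothetical; the point is only that a node can be known by its URL before its outgoing edges have been seen, and that the list of incoming edges is always provisional.

```python
class WebDigraph:
    """Directed graph of URLs; edges of unexplored nodes are unknown."""

    def __init__(self):
        self.out_links = {}      # url -> set of urls it links to
        self.explored = set()    # urls whose page has been fetched

    def add_node(self, url):
        self.out_links.setdefault(url, set())

    def record_page(self, url, links):
        """Store the hyperlinks found on a fetched page."""
        self.add_node(url)
        self.explored.add(url)
        for target in links:
            self.add_node(target)          # known by URL only, for now
            self.out_links[url].add(target)

    def frontier(self):
        """Nodes we know exist but have not processed yet."""
        return set(self.out_links) - self.explored

    def known_in_links(self, url):
        """Incoming edges discovered so far (possibly incomplete)."""
        return {u for u, outs in self.out_links.items() if url in outs}


g = WebDigraph()
g.record_page("v0", ["v1", "v2"])
g.record_page("v2", ["v5"])
print(sorted(g.frontier()))      # nodes still waiting to be explored
```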




     Web Structure :: Digraph :: Map
   To construct it, one needs
       a spider to automatically collect URLs
       a graph class to store information for nodes(URLs) and links
        (hyperlinks)
   The whole digraph (URLs,HYPERLINKs) is huge:
       162,128,493 hosts (2002 Jul data)

       One may need graph algorithms with secondary memories
       Google uses thousands of workstations.
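
The core step of a spider, extracting hyperlinks from a fetched page, could look roughly like the sketch below. It uses Python's standard html.parser and urllib.parse modules; the class name LinkExtractor, the example URLs and the HTML snippet are assumptions for illustration, and a real spider would also need fetching, politeness rules and duplicate detection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<p><a href="/a.html">A</a> <a href="http://other.example/b">B</a></p>'
print(extract_links("http://site.example/index.html", page))
```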








Web Structure :: Digraph :: Map :: Illustration

   [Figure: an ordinary digraph H with 7 vertices (V0 to V6) and 12 edges]





     Web Structure :: Digraph :: Partial Map
   Partial map may be enough for some purposes
       e.g., we are often interested in a small portion of the whole
        Internet
   A partial map may be constructed within the memory
    space for an ordinary PC. Therefore it may allow fast
    performance.

   However, we may not be able to collect all the information
    that is necessary for our purpose
       e.g., back links to a URL





      Web Structure :: Digraph ::
      Partial Map :: Illustration

   [Figure: a partially unknown digraph H with 7 vertices (V0 to V6) and 12 weighted edges]

Node V5 has not yet been explored: we know that it exists (by its URL), but
we do not yet know its outgoing edges.




Web Structure :: Features :: Huge :: 1








Web Structure :: Features :: Huge :: 2








        Web Structure :: Features :: Unknown

   Hyperlinks pointing from each web page to other web
    pages form a virtual directed graph
   Hyperlinks are added and deleted at will by individual
    web page authors
       Web pages may not know their incoming hyperlinks
       The digraph is dynamic
       Central control of the hyperlinks is not possible








       Web Structure :: Features :: Dynamic


   Search engines can only map a fraction of the whole
    web space
   Even if we can manage the size of the digraph, its
    dynamic nature requires constant update of the map
   There are web pages that are not documented by any
    search engine.








         Web Structure :: Functions
   Back_Link (the_url):
        find out all the urls that point to the_url
   Shortest_Path (url1, url2):
        return the shortest path from url1 to url2
   Maximal_Clique (url):
        return a maximal clique that contains url
   In_Degree (url):
        return the number of links that point to the url.
        This represents, to some extent, its popularity: the more a web page is
         pointed to by other web pages, the more web authors are interested in it.

   Related data structures, algorithms and websites
        http://www.cs.cityu.edu.hk/~deng/2468.html
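
A minimal sketch of three of these functions on an in-memory map is given below. The dictionary LINKS is a made-up example, and the functions only see links that are already in the map, so on a partial map the back-link and in-degree results are lower bounds. Maximal_Clique is omitted here.

```python
from collections import deque

# Hypothetical link structure: url -> list of urls it points to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def back_link(links, the_url):
    """All urls in the known map that point to the_url."""
    return sorted(u for u, outs in links.items() if the_url in outs)


def in_degree(links, url):
    return len(back_link(links, url))


def shortest_path(links, url1, url2):
    """Breadth-first search; returns a list of urls, or None if unreachable."""
    parent = {url1: None}
    queue = deque([url1])
    while queue:
        u = queue.popleft()
        if u == url2:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in links.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


print(back_link(LINKS, "c"), in_degree(LINKS, "c"))
print(shortest_path(LINKS, "b", "a"))
```

Breadth-first search is enough here because every hyperlink counts as one step; weighted links would call for Dijkstra's algorithm instead.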







     Web Structure :: Functions :: Back Link
   Hyperlinks on the web point forward only.
   A web page may not know the hyperlinks that point
    to it
       authors of web pages can freely link to other documents in
        cyberspace without consent from their authors
   Back links may be of value
       in scientific literature, the SCI (Science Citation Index) is an
        important index for judging the academic value of an
        academic article
            www.isinet.com
   It is not easy to find all the back links



         Web Structure :: Functions ::
         Back Link :: Discovery

   Provided in Advanced Search Features of Several
    Search Engines
       On Google, search by typing
            link: url
       as your keyword
       Homework: find out how to retrieve back links from other
        search engines
   Connectivity Server: fast linkage information
       http://cui.unige.ch/tcs/cours/algoweb/2002/articles/art_keller_vincent.pdf
   Surfing the Web Backwards
       http://www8.org/w8-papers/5b-hypertext-media/surfing/surfing.html
       http://www.iwebtool.com/backlink_checker




     Web Structure :: Functions ::
     Back Link :: Information Via HTTP
   Section 10 (Header Field Definitions) of HTTP/1.0
       Subsection 10.13, Referer:
            Referer request-header field allows the client to specify, for the
             server’s benefit, the address (URI) of the resource from which the
             Request-URI was obtained.
            This allows a server to generate lists of back-links to resources
             for interest, logging, optimized caching, etc.
            It also allows obsolete or mistyped links to be traced for
             maintenance
            The Referer field must not be sent if the Request-URI was
             obtained from a source that does not have its own URI, such as
             input from the user keyboard.
            For details, see
                  http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1945.txt
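
To make the back-link use of Referer concrete, the sketch below groups hypothetical access-log records by requested URI. The record format (a pair of request URI and Referer value, with None when the header was absent) and the sample entries are assumptions; real server logs would need parsing first.

```python
from collections import defaultdict


def back_links_from_log(records):
    """records: iterable of (request_uri, referer or None) pairs."""
    back = defaultdict(set)
    for request_uri, referer in records:
        if referer:                      # header is absent for typed-in URLs
            back[request_uri].add(referer)
    return back


# Hypothetical access-log entries.
LOG = [
    ("/index.html", "http://fan1.example/links.html"),
    ("/index.html", None),
    ("/index.html", "http://fan2.example/"),
    ("/paper.pdf", "http://fan1.example/links.html"),
]

for uri, referers in sorted(back_links_from_log(LOG).items()):
    print(uri, sorted(referers))
```

Such a list only covers links that visitors actually followed, so it is a sample of the back links rather than the complete set.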





    Web Mining

   Overview
   FAN
   Page Rank
   Community







        Web Mining :: Overview

   Some information is embedded in the digraph
       Usually, the hyperlinks from a web page to other web pages are
        chosen because they are important and contain useful related
        information in the view of the web page author
       e.g., football fans may all have links pointing to their favorite
        football teams
   Some basic technology tools for web structure mining:
       a spider
       a graph class
       some relevant graph algorithms
       a back link retrieval function





      Web Mining :: FAN :: Overview

   Fans of a web page often put a link to the
    web page.
   This is usually done manually, after a user has
    accessed the web page and looked at its
    content.








Web Mining :: FAN :: Illustration

   [Figure: fans of a web page; several fan pages each link to the same web page]






     Web Mining :: FAN :: Indicator of Popularity


   The more fans a web page has, the more
    popular it is.
   The SCI (Science Citation Index), for example, is a
    well-established method to rate the importance
    of a research paper published in international
    journals.
       It is somewhat controversial as a measure of importance,
        since some important work may not be popular
       But it is a good indicator of influence




        Web Mining :: FAN ::
        Indicator of Popularity :: An Objection

   Some of the more popular web pages are so
    well known that people may not link to them from
    their web pages
   On the assumption that some web pages are
    more important than others, how do we compare
       a web page linked to by important web pages
       with another linked to by less important web pages?








     Web Mining :: Page Rank :: Factors

   Two factors influence the rank of a web
    page:
       The rank of the web pages pointing to it
            The higher the better
       The number of outgoing links in the web
        pages pointing to it
            The fewer the better








     Web Mining :: Page Rank :: Definitions

   Web pages are ranked according to their Page Ranks,
    calculated as follows:
       Assume page A has pages T1...Tn which point to it (i.e., back links
        or citations).
       Choose a parameter d that is a damping factor which can be set
        between 0 and 1 (usually set d to 0.85)
       C(A) is defined as the number of links going out of page A. The Page
        Rank of a page A is given as follows:

   PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn) / C(Tn))








     Web Mining :: Page Rank :: Calculation

   Notice that the definition of PR(A) is cyclic,
       i.e., ranks of web pages are used to calculate the ranks of
        web pages.
   However, Page Rank or PR(A) can be calculated using
    a simple iterative algorithm, and corresponds to the
    principal eigenvector of the normalized link matrix of
    the web.
   It is reported that a Page Rank for 26 million web
    pages can be computed in a few hours on a medium
    size workstation.
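
The simple iterative algorithm mentioned above can be sketched as follows, assuming the whole link map fits in memory and every page has at least one outgoing link. The function name page_rank, the fixed iteration count and the three-page example are illustrative choices, not the method used for the 26-million-page computation.

```python
def page_rank(links, d=0.85, iterations=50):
    """links: page -> list of pages it points to (every page is a key)."""
    pr = {page: 1.0 for page in links}
    out_degree = {page: len(outs) for page, outs in links.items()}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Sum PR(T)/C(T) over all pages T that point to this page.
            incoming = (pr[t] / out_degree[t]
                        for t, outs in links.items() if page in outs)
            new_pr[page] = (1 - d) + d * sum(incoming)
        pr = new_pr      # all pages are updated from the previous round
    return pr


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
ranks = page_rank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print({page: round(value, 3) for page, value in ranks.items()})
```

Scanning all pages to find the in-links of each page keeps the sketch short but costs quadratic time per round; a practical version would precompute the back links once.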




Web Mining :: Page Rank ::
Calculation :: Example 1 :: Part 1
   [Figure: example digraph on three pages a, b and c, with links a→b, a→c, b→c and c→a]






    Web Mining :: Page Rank ::
    Calculation :: Example 1 :: Part 2

   Start with PR(a)=1, PR(b)=1, PR(c)=1
       Apply PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
   For simplicity, set d=1 (no damping), and recall that C() is the out-degree
   After the first iteration, we have
       PR(a)=1, PR(b)=1/2, PR(c)=3/2
   After the second iteration, we have
       PR(a)=3/2, PR(b)=1/2, PR(c)=1
   Subsequent iterations:
       a:1 b:3/4 c:5/4
       a:5/4 b:1/2 c:5/4
   In the limit
       PR(a)=6/5, PR(b)=3/5, PR(c)=6/5
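
The numbers in this example can be checked with exact fractions. The sketch below assumes the link structure implied by the update equations of the example (a→b, a→c, b→c, c→a) and uses d = 1, which is what those updates correspond to; the names LINKS and step are made up.

```python
from fractions import Fraction

LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # a->b, a->c, b->c, c->a


def step(pr):
    """One synchronous update with d = 1 (no damping)."""
    return {
        page: sum((pr[t] / len(outs)
                   for t, outs in LINKS.items() if page in outs),
                  Fraction(0))
        for page in LINKS
    }


pr = {page: Fraction(1) for page in LINKS}
history = []
for _ in range(4):
    pr = step(pr)
    history.append(dict(pr))
    print({p: str(v) for p, v in pr.items()})

# The limit must satisfy the update equations exactly.
limit = {"a": Fraction(6, 5), "b": Fraction(3, 5), "c": Fraction(6, 5)}
assert step(limit) == limit
```

With d = 1 the total rank stays at 3 in every round, which is why the limit values sum to 3.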



    Web Mining :: Page Rank ::
    Calculation :: Example 1 :: Part 3

   Current values: PR(a)=1, PR(b)=1, PR(c)=1
   Out-degrees: C(a)=2, C(b)=1, C(c)=1

UPDATE:
PR(a) = PR(c)/C(c) = 1
PR(b) = PR(a)/C(a) = 1/2
PR(c) = PR(a)/C(a) + PR(b)/C(b) = 1/2 + 1 = 3/2





    Web Mining :: Page Rank ::
    Calculation :: Example 1 :: Part 4

   Current values: PR(a)=1, PR(b)=1/2, PR(c)=3/2
   Out-degrees: C(a)=2, C(b)=1, C(c)=1

UPDATE:
PR(a) = PR(c)/C(c) = 3/2
PR(b) = PR(a)/C(a) = 1/2
PR(c) = PR(a)/C(a) + PR(b)/C(b) = 1/2 + 1/2 = 1





    Web Mining :: Page Rank ::
    Calculation :: Example 1 :: Part 5

   Current values: PR(a)=3/2, PR(b)=1/2, PR(c)=1
   Out-degrees: C(a)=2, C(b)=1, C(c)=1

UPDATE:
PR(a) = PR(c)/C(c) = 1
PR(b) = PR(a)/C(a) = 3/4
PR(c) = PR(a)/C(a) + PR(b)/C(b) = 3/4 + 1/2 = 5/4





       Web Mining :: Page Rank ::
       Bringing Order to the Web

   Brin and Page built maps containing as many as 518 million hyperlinks.
   These maps allow rapid calculation of a web page's "Page Rank",
    an objective measure of its citation importance that corresponds
    well with people's subjective idea of importance.
   For most popular subjects, a simple text matching search that is
    restricted to web page titles performs admirably when Page Rank
    prioritizes the results (demo available at http://google.stanford.edu/).
   For the type of full text searches in the main Google system,
    Page Rank also helps a great deal.
   As reported in “The Anatomy of a Large-Scale Hypertextual Web
    Search Engine”, by Sergey Brin and Lawrence Page





      Web Mining :: Community :: Overview

   Do Densely Connected Web Pages Represent
    Communities?
   Inferring Web Communities from Link Topology
      http://citeseer.nj.nec.com/kleinberg98inferring.html

   Efficient Identification of Web Communities
      http://citeseer.nj.nec.com/flake00efficient.html

   Friends and Neighbors on the Web
      http://www.parc.xerox.com/istl/groups/iea/papers/web10/fnn.pdf





Web Mining :: Community :: Problem Description


    Given some web pages, e.g., on sports

    Problem: find a community of
     related pages.

    Web Community
        a collection of related web pages (centers)
        a set of web pages that link (in either direction) to more
         web pages in the community than to pages outside the
         community


     Web Mining :: Community ::
     Idea 1 -- Complete Sub-Graphs


   There is a group of URLs such that
       each URL has a link to every other URL in
        the group
   This is evidence that the author of each
    web page is interested in every other
    web page in the group
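
Checking whether a given group of URLs forms such a complete sub-graph is easy; it is finding a large one that is hard. A small sketch of the check, on a made-up link map:

```python
from itertools import permutations


def is_complete_subgraph(links, group):
    """True if every URL in group links to every other URL in group."""
    return all(v in links.get(u, ()) for u, v in permutations(group, 2))


# Hypothetical link map: url -> set of urls it points to.
LINKS = {
    "p": {"q", "r"},
    "q": {"p", "r"},
    "r": {"p", "q", "x"},
    "x": {"p"},
}

print(is_complete_subgraph(LINKS, ["p", "q", "r"]))
print(is_complete_subgraph(LINKS, ["p", "r", "x"]))
```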




        Web Mining :: Community ::
        Idea 2 -- Complete Bipartite Sub-Graphs


   Complete Bipartite graph:
       two groups of nodes, U and V
       for each node u in U and each node v in V
            there is an edge from u to v
   References
       D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web
        communities from link topology. In Proc. 9th ACM Conference
        on Hypertext and Hypermedia, 1998.
       T. Murata. Finding related web pages based on connectivity
        information from a search engine. The 10th WWW Conference


    Web Mining :: Community ::
    Idea 2 :: Steps



   Step 1: Search for fans using a search engine

   Step 2: Adding a new URL to centers

   Step 3: Identify web community





    Web Mining :: Community :: Idea 2 ::
    Steps :: 1/3 Search for fans using a search engine


   Use input URLs as initial centers

   Find URLs that refer to all the centers
    by back-link search from the centers

   A fixed number of high-ranking URLs are
    selected as fans


Web Mining :: Community :: Idea 2 ::
Steps :: 2/3 Adding a new URL to centers

   Download the fans’ HTML files from the Internet

   Extract the hyperlinks in the HTML files

   Sort the hyperlinks in order of frequency

   Add the top-ranking hyperlink to the centers

   Delete fans that do not refer to all the centers


     Web Mining :: Community :: Idea 2 ::
     Steps :: 3/3 Identify Web Community



   Repeat the previous steps until only a few fans are left

   The acquired centers are regarded as a WEB
    COMMUNITY
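
The three steps can be sketched on an in-memory link map as below. This is an illustration under several assumptions, not the published procedure: the dictionary stands in for both the back-link search and the downloaded HTML files, "high-ranking" is replaced by alphabetical order, and the loop stops just before a new center would leave fewer than min_fans fans. All names and parameters are hypothetical.

```python
from collections import Counter


def find_community(links, seeds, max_fans=10, min_fans=2):
    """links: url -> set of urls it points to (stands in for the web)."""
    centers = list(seeds)
    # Step 1: fans are pages that refer to all the initial centers.
    fans = [u for u, outs in links.items() if all(c in outs for c in centers)]
    fans = sorted(fans)[:max_fans]     # placeholder for "high-ranking"
    while len(fans) >= min_fans:
        # Step 2: count hyperlinks on the fans' pages, ignoring centers.
        counts = Counter(v for f in fans for v in links[f]
                         if v not in centers)
        if not counts:
            break
        # Most frequent link; ties broken alphabetically for repeatability.
        best = min(counts, key=lambda v: (-counts[v], v))
        remaining = [f for f in fans if best in links[f]]
        if len(remaining) < min_fans:
            break                      # Step 3: too few fans would be left
        centers.append(best)
        fans = remaining
    return centers, fans


# Hypothetical data: three fan pages and one unrelated page.
WEB = {
    "fan1": {"c1", "c2", "c3"},
    "fan2": {"c1", "c2", "c3", "x"},
    "fan3": {"c1", "c2"},
    "other": {"x"},
    "c1": set(), "c2": set(), "c3": set(), "x": set(),
}

print(find_community(WEB, ["c1"]))
```

The fans and centers that remain form a complete bipartite sub-graph: every remaining fan links to every center.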






  Web Mining :: Community ::
  Idea 2 :: Result




   [Figure: fans on the left link to the web community centers on the right.
    Web community centers: pages that many web pages link to]




        Web Mining :: Community ::
        Idea 2 :: Drawbacks

   A maximum clique is difficult to find (an NP-hard problem)
   The rough idea of closely linked URLs is right, but
    requiring a completely connected subgraph may not be: it is
    often the case that some links are missing
       fans may not have hyperlinks to centers that were created
        after their own web pages







        Web Mining :: Community ::
        Idea 3 – Minimum Cut Paradigm

   A minimum cut of a digraph (V,A) is a partition of the
    node set V into two subsets U and W such that the
    number of edges from U to W is minimized.
       It captures the notion that U and W are NOT closely linked.
       Therefore, nodes in U are more closely related to each other
        than to nodes in W.



     Web Mining :: Community ::
     Idea 3 :: General Approach

   Find a minimum cut using a maximum flow algorithm
   If the minimum cut is sufficiently large, keep the graph
    intact and report its nodes as a web community
   Else
       remove the edges of the minimum cut to
        split the digraph into two connected components
       repeat on each of the two connected components








   [Figure: digraph on nodes a, b, c, d, e, f, g, h, i, j]




   [Figure: digraph on nodes a, b, c, d, e, f, g, h, j]






   [Figure: digraph on nodes a, b, c, d, e, f, g, j]





    Web Mining :: Community :: Idea 4

   A heuristic implementation of the minimum
    cut paradigm for web communities
        Gary William Flake, Steve Lawrence, and C. Lee Giles.
         Efficient Identification of Web Communities.
         Proceedings of the 6th International Conference
         on Knowledge Discovery and Data Mining (ACM
         SIGKDD 2000), pp. 150-160, August 2000, Boston,
         USA
   Methodology: Maximum Flow and Minimal
    Cuts



           Web Mining :: Community :: Idea 4 ::
           Methodology :: Page 1

              Maximum Flow: Digraph G = (V, E),
               capacity function c(u, v), source s ∈ V,
               sink t ∈ V
              Problem: find the maximum flow
               from s to t, obeying all capacity
               constraints.

              [Figure: an example network with edge capacities, showing a flow from s to t]



Web Mining :: Community :: Idea 4 ::
Methodology :: Page 2

   Cut: a set of edges whose removal
    separates s from t
   Minimal cut: a cut with minimal total weight

   [Figure: a cut of weight 6 between s and t, made up of edges with weights 1, 1, 3 and 1]



Web Mining :: Community :: Idea 4 ::
Methodology :: Page 3

   Max-flow min-cut theorem:

          the value of the maximum flow from s to t
    equals the weight of the minimum cut that
    separates s and t.
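
The theorem can be checked on a small network with the sketch below, which uses the Edmonds-Karp method (augmenting along shortest paths found by breadth-first search) as one standard polynomial-time maximum flow algorithm. The function name, the edge-dictionary input format and the example capacities are assumptions for illustration.

```python
from collections import defaultdict, deque


def max_flow_min_cut(capacity, s, t):
    """capacity: {(u, v): c}. Returns (flow value, source side, cut edges)."""
    residual = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)            # reverse arc for the residual graph

    def bfs():
        """Nodes reachable from s in the residual graph, with parents."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if t not in parent:
            break
        # Find the bottleneck along the augmenting path, then push flow.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        flow += push

    source_side = set(parent)    # nodes still reachable from s
    cut = [(u, v) for (u, v) in capacity
           if u in source_side and v not in source_side]
    return flow, source_side, cut


# Hypothetical network.
CAP = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
       ("a", "t"): 1, ("b", "t"): 3}
value, side, cut_edges = max_flow_min_cut(CAP, "s", "t")
print(value, sorted(side), sorted(cut_edges))
```

Once no augmenting path is left, the vertices still reachable from s in the residual graph form the source side of a minimum cut, which is exactly the set that the community theorem reports.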






Web Mining :: Community :: Idea 4 ::
Foundation Theorem – Community To Minimum Cut

   Theorem: A community, C, can be identified by
    calculating the s-t minimum cut of G with s and t
    being used as the source and sink, provided both s#
    and t# exceed the cut set size. After the cut,
    vertices that are reachable from s are in the
    community.

   s#: number of edges between s and the vertices in (C - s)
   t#: number of edges between t and the vertices in (V - C - t)

   Note: compute the minimum cut by computing the
    maximum flow (polynomial-time algorithms exist)
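
   The side condition of the theorem is easy to check once a candidate community is known. The following is a minimal sketch, assuming an undirected graph given as a list of unit-weight edges without self-loops; the function name `community_condition` and the example graph are hypothetical.

```python
def community_condition(edges, community, s, t):
    """Count s#, t#, and the cut size for a candidate community.

    edges: iterable of undirected (u, v) pairs, each of weight one.
    community: set of vertices containing s but not t.
    Returns (s_count, t_count, cut_size, condition_holds).
    """
    community = set(community)
    s_count = t_count = cut_size = 0
    for u, v in edges:
        if u == v:
            continue  # self-loops are ignored (assumption)
        u_in, v_in = u in community, v in community
        if u_in != v_in:
            cut_size += 1          # edge crosses the cut
        elif u_in and s in (u, v):
            s_count += 1           # edge between s and the rest of C
        elif not u_in and t in (u, v):
            t_count += 1           # edge between t and the rest of V - C
    holds = s_count > cut_size and t_count > cut_size
    return s_count, t_count, cut_size, holds


# Hypothetical example: community {s, a, b}, one edge (a, x) crossing the cut.
edges = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "x"),
         ("t", "x"), ("t", "y"), ("x", "y")]
print(community_condition(edges, {"s", "a", "b"}, "s", "t"))
# (2, 2, 1, True)
```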



     Web Mining :: Community :: Idea 4 ::
     Algorithm :: Initial Graph

   (a): A virtual source s, linking to all of
   (b): the k input seed web pages. Find all pages (c) that link to the
    seed set (back_link from AltaVista) or are linked from it (extracted
    from the seeds' HTML files);
   Download the HTML files of the pages in (c) and collect all their
    outbound links (d), which link to
   (e): a virtual sink t.
   Each link is an edge
   Edges between vertices in (b) and (c) are bidirectional; all others
    are one-way
   The capacity of edges from (a) to (b) and from (d) to (e) is
    sufficiently large; all others have capacity one
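
   The layered construction can be sketched in code. This is a minimal sketch under stated assumptions: the link data arrives as two dictionaries (`back_links` and `out_links`, both hypothetical names), "sufficiently large" is modelled as infinity, and links inside a single layer (seed to seed, or between two (c) pages) are left out because the slide does not say how to treat them.

```python
LARGE = float("inf")  # stands in for "sufficiently large" (assumption)


def build_initial_graph(seeds, back_links, out_links, source="s", sink="t"):
    """Build the capacity map of the initial graph.

    seeds:      iterable of seed pages, layer (b).
    back_links: page -> pages that link to it.
    out_links:  page -> pages it links to.
    Returns (capacity, layer_c, layer_d).
    """
    seeds = set(seeds)
    capacity = {}

    # (a) -> (b): virtual source to every seed.
    for seed in seeds:
        capacity[(source, seed)] = LARGE

    # (b) <-> (c): pages linking to or from a seed, bidirectional, capacity one.
    layer_c = set()
    for seed in seeds:
        linked = set(back_links.get(seed, ())) | set(out_links.get(seed, ()))
        for page in linked - seeds:
            layer_c.add(page)
            capacity[(seed, page)] = 1
            capacity[(page, seed)] = 1

    # (c) -> (d): outbound links to pages in neither (b) nor (c), one-way.
    layer_d = set()
    for page in layer_c:
        for target in out_links.get(page, ()):
            if target not in seeds and target not in layer_c:
                layer_d.add(target)
                capacity[(page, target)] = 1

    # (d) -> (e): every outer page to the virtual sink.
    for page in layer_d:
        capacity[(page, sink)] = LARGE

    return capacity, layer_c, layer_d


# Hypothetical link data for one seed page "a".
capacity, layer_c, layer_d = build_initial_graph(
    seeds=["a"],
    back_links={"a": ["b"]},
    out_links={"a": ["c"], "b": ["a", "x"], "c": ["y", "b"]},
)
print(sorted(layer_c), sorted(layer_d))  # ['b', 'c'] ['x', 'y']
```

   The resulting capacity map has the same shape a maximum flow routine would take as input, with "s" and "t" as source and sink.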





Web Mining :: Community :: Idea 4 ::
Algorithm :: Initial Graph :: Illustration




   [Figure: five layers of vertices, from left to right:]
   (a)         (b)         (c)                 (d)                     (e)


(a): virtual source vertex; (b): seed web sites; (c): web sites that link to
or from the seeds; (d): references to sites in neither (b) nor (c); (e): virtual sink




         Web Mining :: Community :: Idea 4 ::
         Algorithm :: Details
   Procedure focused-crawl (graph G = (V, E); vertices s, t ∈ V)
   While the number of iterations is less than desired do:
       Perform maximum flow analysis of G, yielding community C.
       Identify the non-seed vertex, v* ∈ C, with the highest in-degree relative to G.
       for all v ∈ C with in-degree equal to that of v*,
           Add v to the seed set
           Add edge (s, v) to E with infinite capacity
       end for
       Identify the non-seed vertex, u*, with the highest out-degree relative to G.
       for all u ∈ C with out-degree equal to that of u*,
           Add u to the seed set
           Add edge (s, u) to E with infinite capacity
       end for
       Re-crawl so that G uses all seeds
       Let G reflect new information from the crawl
   End while
   End procedure
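
   The seed-expansion step inside the loop can be sketched as follows. This is an illustration only: the function name `expand_seeds` and the degree dictionaries are hypothetical, and the maximum flow analysis and the re-crawl are assumed to happen elsewhere.

```python
def expand_seeds(community, seeds, in_degree, out_degree, source="s"):
    """One seed-expansion step of the focused crawl.

    community:  vertices on the source side of the minimum cut.
    seeds:      current seed set.
    in_degree, out_degree: vertex -> degree relative to G.
    Returns (updated seed set, new infinite-capacity edges from the source).
    """
    seeds = set(seeds)
    new_edges = {}
    # First promote by in-degree, then by out-degree, as in the procedure.
    for degree in (in_degree, out_degree):
        candidates = [v for v in community if v not in seeds and v != source]
        if not candidates:
            break
        best = max(degree.get(v, 0) for v in candidates)
        for v in candidates:
            if degree.get(v, 0) == best:
                seeds.add(v)
                new_edges[(source, v)] = float("inf")
    return seeds, new_edges


# Hypothetical community and degrees.
community = {"s", "a", "b", "c", "d", "e"}
in_degree = {"b": 5, "c": 5, "d": 1, "e": 0}
out_degree = {"b": 2, "c": 9, "d": 4, "e": 1}
seeds, new_edges = expand_seeds(community, {"a"}, in_degree, out_degree)
print(sorted(seeds))  # ['a', 'b', 'c', 'd']
```

   Here b and c tie for the highest in-degree and both become seeds; d is then the non-seed vertex with the highest out-degree, so it is added as well, while e stays out.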




Internet Physical Structures


   Overview

   Routing








        Internet Physical Structures :: Overview

   The hyperlink structure is a social one in the sense
    that the links are chosen by web page authors according
    to their own preferences
   The physical structure also exhibits interesting
    properties that are useful for efficient routing
    algorithms.
   Interested readers may refer to:
       H. Burch and B. Cheswick. Mapping the Internet. In IEEE Computer,
        April 1999.
       P. Francis, S. Jamin, C. Jin, Y. Jin, D. Raz, Y. Shavitt, and L. Zhang.
        IDMaps: A Global Internet Host Distance Estimation Service. In IEEE/ACM
        Transactions on Networking, October 2001.




        Internet Physical Structures :: Routing


   The Internet's link structure follows a power law.
   This may not be the best topology for routing
    access requests
       It scales poorly, according to:
            http://100x100network.org/papers/akella-podc2003.pdf

   Could it be optimized?






        Fighting Spam Web Pages
   In back_link, we saw a lot of junk links.
   How do we eliminate junk web page
    information?
       Access patterns?
       User feedback?
   Junk page designers could counter your
    strategy; how do we handle that?





        Conclusion

   HTTP
   Web Structure
   Web Mining
       FAN
       Page Rank
       Community
   Physical Structure of Internet
   Fighting Spam Web Pages



				