					Building an Intelligent Web:
            Theory and Practice

                  Pawan Lingras
              Saint Mary’s University
                Rajendra Akerkar
  American University of Armenia and SIBER, India
[Figure: chapters of the book recommended for each discipline]
Computer Science
  Research: Complete Book
  Graduate: Information Retrieval (Chapters 1, 2, 3, 7 and 8); Web Mining (Chapters 4-8)
Mathematics and Statistics
  Research: Chapters 1-8, excluding shaded portion related to implementation
  Graduate: Chapters 2, 4-8, excluding shaded portion related to implementation
Management
  Chapters 1-8, excluding shaded portion related to mathematics and implementation
Information Retrieval
Create a list of words -> Remove stop words -> Stem words -> Calculate frequency of each stemmed word

Figure 2.1 Transforming text document to a weighted list of keywords
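The four steps in Figure 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's; neither is taken from the book.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "are", "as", "for", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keyword_frequencies(text):
    words = re.findall(r"[a-z]+", text.lower())        # 1. create a list of words
    words = [w for w in words if w not in STOP_WORDS]  # 2. remove stop words
    stems = [crude_stem(w) for w in words]             # 3. stem words
    return Counter(stems)                              # 4. frequency of each stemmed word


sample = ("Data mining refers to a family of techniques used to detect "
          "interesting nuggets in data.")
freqs = keyword_frequencies(sample)
print(freqs.most_common(3))
```

The resulting counts are raw frequencies; a weighting scheme would be applied on top of them to obtain the weighted keyword list.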
Data Mining has emerged as one of the most exciting and dynamic
fields in computing science. The driving force for data mining is
the presence of petabyte-scale online archives that potentially
contain valuable bits of information hidden in them. Commercial
enterprises have been quick to recognize the value of this
concept; consequently, within the span of a few years, the
software market itself for data mining is expected to be in excess
of $10 billion. Data mining refers to a family of techniques used
to detect interesting nuggets of relationships/knowledge in data.
While the theoretical underpinnings of the field have been around
for quite some time (in the form of pattern recognition,
statistics, data analysis and machine learning), the practice and
use of these techniques have been largely ad-hoc. With the
availability of large databases to store, manage and assimilate
data, the new thrust of data mining lies at the intersection of
database systems, artificial intelligence and algorithms that
efficiently analyze data. The distributed nature of several
databases, their size and the high complexity of many techniques
present interesting computational challenges.
[Plot: precision (vertical axis) against recall (horizontal axis), both scaled from 0 to 1]

Figure 2.43 Relationship between precision and recall
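To make the two measures concrete: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The sketch below computes both for growing prefixes of a ranked result list. The document identifiers and relevance judgements are hypothetical, chosen only to show precision typically falling as recall rises.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked result list and relevance judgements.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
relevant = {"d1", "d2", "d4", "d7"}

for k in (2, 4, 8):
    p, r = precision_recall(ranking[:k], relevant)
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
```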
Semantic Web
The layer language model
    (Berners-Lee, 2001; Broekstra et al., 2001)
<h1>Student Service Centre</h1>
Welcome to the home page of the Student Service Centre.
The centre is located in the main building of the University.
You may visit us for assistance during working days.
<h2>Office hours</h2>
Mon to Thu 8am - 6pm<br>
Fri 8am - 2pm<p>
But note that the centre is not open during the weeks of the
<a href=". . .">State Of Origin</a>.



            Figure 3.2 Example of a Web page of a Student Service Centre
<organization>
  <serviceOffered>Admission</serviceOffered>
  <organizationName>Student Service Centre</organizationName>
  <staff>
    <director>John Roth</director>
    <secretary>Penny Brenner</secretary>
  </staff>
</organization>




            Figure 3.3 Example of a Web page of a Student Service Centre
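Unlike the HTML page, the XML version labels each piece of information, so a program can pick out individual facts by name. A minimal sketch using Python's standard `xml.etree.ElementTree` module; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

xml_text = """
<organization>
  <serviceOffered>Admission</serviceOffered>
  <organizationName>Student Service Centre</organizationName>
  <staff>
    <director>John Roth</director>
    <secretary>Penny Brenner</secretary>
  </staff>
</organization>
"""

org = ET.fromstring(xml_text)

# Element names act as machine-readable labels for the content.
name = org.findtext("organizationName")
director = org.findtext("staff/director")
staff_roles = {member.tag: member.text for member in org.find("staff")}

print(name)
print(director)
print(staff_roles)
```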
Figure 3.4 Representing classes and instances (Noy et al., 2001)
[Figure: tree representation of the college XML document. The slides show the same tree three times, including once each under the titles "Queries 1 and 2" and "Queries 3 and 4".]

root
  college
    lecturer
      @name: Edward Bunker
      course
        @title: Algorithms
      course
        @title: Computational Algebra
    lecturer
      @name: Daniela Frost
      course
        @title: Nonlinear Analysis
    lecturer
      @name: Sam Hoofer
      course
        @title: Discrete Structures
      course
        @title: Modern Algebra
      course
        @title: Nonlinear Analysis
    location: Innsbruck

Queries 1 and 2
Queries 3 and 4
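The college tree can be written as an XML document and queried with path expressions. The slides do not reproduce the text of Queries 1 to 4, so the two queries below are illustrative assumptions, not the book's queries. They use the limited XPath subset supported by Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# The tree from the figure, transcribed as XML (attributes shown as @name / @title).
college_xml = """
<college>
  <lecturer name="Edward Bunker">
    <course title="Algorithms"/>
    <course title="Computational Algebra"/>
  </lecturer>
  <lecturer name="Daniela Frost">
    <course title="Nonlinear Analysis"/>
  </lecturer>
  <lecturer name="Sam Hoofer">
    <course title="Discrete Structures"/>
    <course title="Modern Algebra"/>
    <course title="Nonlinear Analysis"/>
  </lecturer>
  <location>Innsbruck</location>
</college>
"""
college = ET.fromstring(college_xml)

# Illustrative query A: titles of all courses taught by Sam Hoofer.
hoofer_courses = [
    course.get("title")
    for course in college.findall("lecturer[@name='Sam Hoofer']/course")
]

# Illustrative query B: names of lecturers who teach "Nonlinear Analysis".
nonlinear_lecturers = [
    lecturer.get("name")
    for lecturer in college.findall("lecturer")
    if lecturer.find("course[@title='Nonlinear Analysis']") is not None
]

print(hoofer_courses)
print(nonlinear_lecturers)
```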
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:title>
      Building an Intelligent Web: Theory and Practice
    </dc:title>
    <dc:creator> Rajendra Akerkar and Pawan Lingras </dc:creator>
  </rdf:Description>
</rdf:RDF>




                             Figure 3.26 Fragment of RDF
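Each property element inside an `rdf:Description` is one statement: a (subject, predicate, object) triple. The sketch below reads the fragment with Python's standard XML parser and lists its triples. It handles only this simple shape of RDF/XML and is not a general RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

rdf_text = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:title>Building an Intelligent Web: Theory and Practice</dc:title>
    <dc:creator>Rajendra Akerkar and Pawan Lingras</dc:creator>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(rdf_text)

triples = []
for description in root.findall(f"{{{RDF_NS}}}Description"):
    subject = description.get(f"{{{RDF_NS}}}about")  # "" refers to the document itself
    for prop in description:
        # ElementTree expands prefixes, so prop.tag is "{namespace}localname".
        triples.append((subject, prop.tag, prop.text.strip()))

for triple in triples:
    print(triple)
```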
An RDF model for automobiles
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:my="http://www.myvehicle.com/vehicle-schema/">

  <rdfs:Class rdf:about="#Vehicle"/>

  <rdfs:Class rdf:about="#Car">
    <rdfs:subClassOf rdf:resource="#Vehicle"/>
  </rdfs:Class>

  <rdf:Property rdf:about="#name">
    <rdfs:domain rdf:resource="#Vehicle"/>
  </rdf:Property>

  <rdf:Description rdf:about="#Ford">
    <rdf:type rdf:resource="#Car"/>
    <my:name>Ford Icon</my:name>
  </rdf:Description>

  <my:Truck rdf:about="#Mitsubishi">
    <my:name>Mitsubishi</my:name>
    <my:carry rdf:resource="#Mitsubishi"/>
  </my:Truck>
</rdf:RDF>




                  Figure 3.29 RDF/XML file for the automobile example
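One use of the schema statements in this file is simple inference: because Car is declared a subclass of Vehicle, anything typed as a Car is also a Vehicle. The sketch below shows that single rule over statements transcribed by hand from the file, with identifiers shortened to their fragment names. It is an assumption-laden illustration, not a full RDF Schema reasoner: it follows only `rdfs:subClassOf` and ignores `rdfs:domain`.

```python
# Statements transcribed by hand from the RDF/XML file above.
triples = {
    ("Car", "rdfs:subClassOf", "Vehicle"),
    ("Ford", "rdf:type", "Car"),
    ("Mitsubishi", "rdf:type", "Truck"),  # from the typed node <my:Truck ...>
}


def superclasses(cls, triples):
    """All classes reachable from cls by following rdfs:subClassOf."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                stack.append(o)
    return found


def types_of(resource, triples):
    """Declared types of a resource plus the types implied by subclassing."""
    direct = {o for s, p, o in triples if s == resource and p == "rdf:type"}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return direct | inferred


print(sorted(types_of("Ford", triples)))
print(sorted(types_of("Mitsubishi", triples)))
```

The file declares no superclass for Truck, so nothing beyond the declared type is inferred for the Mitsubishi resource.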
<?xml version="1.0"?>
<topicMap id="tmrf"
          xmlns       = 'http://www.topicmaps.org/xtm/1.0/'
          xmlns:xlink = 'http://www.w3.org/1999/xlink'>
<!--
       The map contains information about Technomathematics Research Foundation.
       We can include comment and narrative here…
-->
.... here my topics and my associations go ...
</topicMap>




Figure 3.30 A Topic Map document
(Adapted from http://topicmaps.bond.edu.au/docs/6/1)
Classification and Association
           Data Preparation

•   Database Theory
•   SQL
•   Data Transformation
•   http://www.ecn.purdue.edu/KDDCUP/data/
                Classification
• Find a rule, a formula, or black box classifier for
  organizing data into classes.
   – Classify clients requesting loans into categories
     based on the likelihood of repayment
   – Classify customers into Big or Moderate Spenders
     based on what they buy
   – Classify the customers into loyal, semi-loyal,
     infrequent based on the products they buy
• The classifier is developed from the data in the
  training set
• The reliability of the classifier is evaluated using
  the test set of data
              Classification
• ID3 Algorithm
  – Numerical Illustration
  – Application to a Small E-commerce Dataset
• C4.5 for Experimentation
• Other approaches
  – Neural Networks
  – Fuzzy Classification
  – Rough Set Theory
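ID3 builds a decision tree by repeatedly splitting on the attribute with the highest information gain, that is, the largest reduction in the entropy of the class labels. The sketch below shows only that attribute-selection step. The four records are hypothetical and are not the e-commerce dataset used in the book.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical customer records, for illustration only.
rows = [
    {"visits": "many", "member": "yes", "spender": "big"},
    {"visits": "many", "member": "no", "spender": "big"},
    {"visits": "few", "member": "yes", "spender": "moderate"},
    {"visits": "few", "member": "no", "spender": "moderate"},
]

# ID3 would split first on the attribute with the highest gain.
best = max(["visits", "member"], key=lambda a: information_gain(rows, a, "spender"))
print(best)
```

Here `visits` separates the two classes perfectly while `member` tells us nothing, so `visits` would become the root of the tree.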
                 Association
• Market basket analysis
  – determine which things go together
• Transactions might reveal that
  – customers who buy bananas also buy candles
  – cheese and pickled onions seem to occur frequently
    in a shopping cart
• Information can be used for
  – arranging a physical shop or structuring the Web site
  – targeted advertising campaigns
             Association

• Apriori Algorithm
• Demonstration for an E-commerce
  Application
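The Apriori algorithm finds frequent itemsets level by level: it counts single items, keeps those that meet a minimum support, joins the survivors into larger candidates, and prunes any candidate with an infrequent subset. A compact sketch follows. The baskets are hypothetical, and `min_support` is taken here to be an absolute count rather than a percentage.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"cheese", "pickled onions", "bread"},
    {"cheese", "pickled onions"},
    {"bananas", "candles"},
    {"cheese", "pickled onions", "candles"},
]
result = apriori(baskets, min_support=2)
for itemset, count in result.items():
    print(sorted(itemset), count)
```

Association rules are then read off the frequent itemsets, for example by dividing the support of {cheese, pickled onions} by the support of {cheese} to obtain the confidence of "cheese implies pickled onions".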
Clustering
• Breaks a large database into different
  subgroups or clusters
• Unlike classification, there are no
  predefined classes
• The clusters are put together on the basis
  of similarity to each other
• The data miners determine whether the
  clusters offer any useful insight
[Figure: scatter plot with both axes running from 0 to 5]
            Statistical Methods

•   k-means
    – Numerical Example
    – Implementation
      •   Data Preparation
      •   Clustering
•   Other Methods
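The k-means method alternates two steps until nothing changes: assign each object to its nearest centroid, then move each centroid to the mean of the objects assigned to it. A small pure-Python sketch follows. The six points are hypothetical, and seeding the centroids with the first k points is an assumption made to keep the run deterministic; practical implementations usually pick starting centroids at random.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster 2-D (or n-D) points; returns (centroids, clusters)."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 5), (5, 4)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```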
Neural Network Based Approaches


• Kohonen Self-Organising Maps
  – Numerical Demonstration
  – Application to Web Data Collection
• Other Neural Network Based Approaches
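A Kohonen map is trained by presenting each input, finding the unit whose weight vector is closest (the best matching unit), and pulling that unit and its neighbours on the map towards the input. The sketch below uses a one-dimensional map for brevity. The linear decay of learning rate and neighbourhood radius, the Gaussian neighbourhood function, the parameter values and the data are all assumptions made for illustration, not the book's numerical demonstration.

```python
import math
import random


def train_som(data, n_units, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D self-organising map; returns one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1 - epoch / epochs  # decays from 1 towards 0
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for x in data:
            # Best matching unit: the unit whose weights are closest to x.
            bmu = min(range(n_units), key=lambda i: math.dist(x, weights[i]))
            for i in range(n_units):
                # Units near the BMU on the map are pulled more strongly.
                influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                weights[i] = [
                    w + lr * influence * (xi - w) for w, xi in zip(weights[i], x)
                ]
    return weights


def best_unit(x, weights):
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))


# Hypothetical inputs scaled to the unit square.
data = [(0.1, 0.1), (0.15, 0.2), (0.9, 0.85), (0.8, 0.9)]
weights = train_som(data, n_units=3)
print([best_unit(x, weights) for x in data])
```

After training, inputs that map to the same unit can be treated as one cluster.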
Clustering of customers
[Figure: taxonomy of Web mining]
Web Mining
  Web Content Mining
    Web Page Content Mining
    Search Result Mining
  Web Structure Mining
  Web Usage Mining
    General Access Pattern Tracking
    Customized Usage Tracking
Web Usage Mining
High level web usage mining process
       (Srivastava et al., 2000)
Applications of web usage mining
 (Romanko, 2006; Srivastava et al., 2000)
140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] "GET /s.htm HTTP/1.0" 200 2267


140.14.7.18 - raj [06/Sep/2001:11:23:53 -0300] "POST /s.cgi HTTP/1.0" 200 499
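The first step in web usage mining is usually to turn raw log lines such as these into structured records. The sketch below parses the two entries with a regular expression. The field names (`host`, `user`, `path` and so on) are labels chosen for this illustration, and the pattern covers only the layout shown in the two lines above.

```python
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


lines = [
    '140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] "GET /s.htm HTTP/1.0" 200 2267',
    '140.14.7.18 - raj [06/Sep/2001:11:23:53 -0300] "POST /s.cgi HTTP/1.0" 200 499',
]
entries = [parse_line(line) for line in lines]
for entry in entries:
    print(entry["user"], entry["method"], entry["path"], entry["size"])
```

Records in this form can then be grouped by user or host into sessions before clustering, classification or association mining is applied.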
Clustering exercise
Classification exercise

                  Channel                    Recall   Precision
                  Finance                    44.3%    98.27%
                  Health                     52.3%    89.66%
                  Market                     49.1%    83.34%
                  News                       44.1%    89.27%
                  Shopping                   31.5%    91.31%
                  Specials                   60.2%    92.86%
                  Sport                      50.0%    91.93%
                  Surveys                    21.9%    92.66%
                  Theatre                    54.8%    94.63%

Table 6.8 Precision and recall for predicting user’s interest in channels
                           (Baglioni, et al., 2003)
Association exercise


News            Minimum   Maximum      Mean   Standard
Section        Requests  Requests  Requests  Deviation
Science               1        97    2.3034     2.8184
Culture               1       208    3.7878     5.9742
Sports                1       318    5.6985    10.8360
Economics             1       258    3.9335     7.2341
International         1       208    3.3823     5.5540
Local Lisbon          1       460    5.6883    11.5650
Local Port            1       256    7.5984    13.2351
Politics              1       208    3.3577     5.4101
Society               1       367    4.2673     7.9853
Education             1        90    2.6496     3.2909
Table 6.9 Summary statistics of requests to the Publico on-line newspaper
                        (Batista and Silva, 2002)
       The association mining showed strong associations between the following pairs:

   Politics and Society

   Politics and International News

   Politics and Sports

   Society and International News

   Society and Local Lisbon

   Society and Sports

   Society and Culture

   Sports and International News
Sequence Pattern Analysis of
        Web Logs
Web Content Mining
           Data Collection

•   Web Crawlers
•   Public Domain Web Crawlers
•   An Implementation of a Web Crawler
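At its core a web crawler keeps a queue of URLs to visit, fetches each page, extracts its links and adds the unseen ones to the queue. The sketch below shows that loop as a breadth-first traversal. To stay runnable without network access, it fetches pages from a hypothetical in-memory dictionary (`PAGES`, with made-up `example.org` URLs); a real crawler would replace `fetch` with an HTTP request and add politeness rules such as robots.txt handling and rate limiting.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "web" used in place of real HTTP requests.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a> <a href="/c.html">C</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
    "http://example.org/c.html": "no links here",
}
visited = crawl("http://example.org/", PAGES.get)
print(visited)
```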
Architecture of a search engine
        (Romanko, 2006)
Other topics in Web Content Mining
•   Search Engines
    – How to prepare for and setup a search
      engine
    – Types and listings of search engines
      (freeware, remote hosting services,
      commercial)
•   Multimedia Information Retrieval
Web Structure Mining
0/10:    The site or page is probably new.
3/10:    The site is perhaps new, small in size and has very little or no worthwhile
         arriving links. The page gets very little traffic.
5/10:    The site has a fair amount of worthwhile arriving links and traffic volume. The
         site might be larger in size and gets a good amount of steady traffic with some
         return visitors.
8/10:    The site has many arriving links, probably from other high PageRank pages.
         The site perhaps contains a lot of information and has a higher traffic flow and
         return visitor rate.
10/10:   The Web site is large, popular and has an extremely high number of links
         pointing to it.
http://www.iprcom.com/papers/pagerank/
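The 0 to 10 values above are coarse summaries; the underlying idea of PageRank is that a page is important if important pages link to it. The sketch below computes scores by repeated redistribution over a small link graph. It uses the commonly described normalised form, in which the scores sum to 1, with a damping factor of 0.85. The graph, the parameter values and the even spreading of rank from pages without outgoing links are assumptions for illustration; this is not how any search engine's production ranking or the 0 to 10 scale is actually derived.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page divides its damped rank equally among its out-links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web; page C has the most arriving links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page D has no arriving links, so it keeps only the baseline share, while C, which is linked from three pages, ends up with the highest score.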
Index quality for different search engines
         (Henzinger, et al., 1999)
Index quality per page for different search engines

              (Henzinger, et al., 1999)
                    Page                         Freq.      Freq.    Rank
                                                 Walk2      Walk1    Walk1

www.microsoft.com/                                  3172      1600        1
www.microsoft.com/windows/ie/default.htm            2064      1045        3
www.netscape.com/                                   1991       876        6
www.microsoft.com/ie/                               1982      1017        4
www.microsoft.com/windows/ie/download/              1915       943        5
www.microsoft.com/windows/ie/download/all.htm       1696       830        7
www.adobe.com/prodindex/acrobat/readstep.html       1634       780        8
home.netscape.com/                                  1581       695       10
www.linkexchange.com/                               1574       763        9
www.yahoo.com/                                      1527      1132        2

     Table 8.2 Most frequently visited pages (Henzinger, et al., 1999)
           Site               Frequency       Frequency        Rank
                                Walk 2          Walk 1         Walk 1

www.microsoft.com                    32452          16917                1
home.netscape.com                    23329          11084                2
www.adobe.com                        10884           5539                3
www.amazon.com                       10146           5182                4
www.netscape.com                      4862           2307               10
excite.netscape.com                   4714           2372                9
www.real.com                          4494           2777                5
www.lycos.com                         4448           2645                6
www.zdnet.com                         4038           2562                8
www.linkexchange.com                  3738           1940               12
www.yahoo.com                         3461           2595                7

    Table 8.3 Most frequently visited hosts (Henzinger, et al., 1999)

				