

Information Extraction from the WWW using Machine Learning Techniques

Lee McCluskey
Department of Informatics, University of Huddersfield
General: The WWW is a virtually limitless mass of information
  aimed mainly at human consumption. It is desirable to
  make this information generally available for use by
  computer programs in order to provide higher levels of
  service to people.
This supports the new area of “Semantic Technologies” –
  apparently the new “billion dollar” market.
   NOW: Desktop + Client-Server Technologies
   COMING: Distributed Intelligent Services
Specific: This work is related to a Knowledge Transfer
  Partnership just starting with a local company called View
  Based Systems.

Overview of Talk

- We will investigate Information Extraction: this is the process of
  extracting “meaningful” data from raw or semi-structured text
- We will investigate techniques from ‘similarity-based’ Machine
  Learning to learn/extract meaning from traditional web page content
- Also, Information Agents: these are programs that can retrieve
  information from web sites using database-like queries and can
  integrate info from web sites to solve complex queries
    Information Extraction from the WWW – WHY?

Problem: You’re on ebay and you want a toilet cistern & wash basin that have a 
   combined width of under 90cm
Solution: waste all Sunday afternoon going through 673 entries for “toilet” looking 
   for widths and cross checking with 923 entries for wash basin! 

- Need a universally-recognised query language
- Need to avoid the problems of identity (!) with universally-
  accessible vocabularies
- Need to be able to reason with acquired knowledge

Information Extraction from the WWW

Our (KTP) interest:
- extract data from the www related to a “theme” or subculture
  eg bee-keeping, role-playing games, Northern Soul music...
- populate and maintain a central database with this information

    Information Extraction from The Web
- Information extraction is the process of extracting “meaningful”
  data from raw or semi-structured text
- IE tasks form a spectrum:

  EASIER: “Feature Extraction” – extract a particular piece of data
  from a semi- or unstructured document and give it an XML markup,
  eg extract an address from an html web page.

  HARDER: “Natural Language Understanding” – take raw (English) text
  from a web page and turn it into some logic representing its meaning.
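As a toy illustration of the “easier” end of the spectrum (our example, not one from the talk), the sketch below pulls a location field out of a semi-structured HTML fragment and emits it with an XML markup. The fragment, the regular expression and the tag name are all assumptions made up for illustration.

```python
import re

# Illustrative "feature extraction": find a labelled field in
# semi-structured HTML and wrap it in an XML element.
html = "<li> location: Hebden Bridge <li> agent-phone: 01422 843222"

# Capture the text after the "location:" label, up to the next tag.
m = re.search(r"location:\s*([^<]+)", html)
if m:
    print("<location>%s</location>" % m.group(1).strip())
    # prints <location>Hebden Bridge</location>
```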
Information Extraction from The Web

  STRUCTURED DATA:

  tom    664    blue   BSc
  bill   345    grey   PhD
  dave   123    red    MSc
  sue    555    red    BA
Information Extraction
- The Web’s HTML content makes it difficult to retrieve and
  integrate data from multiple sources.
- An agent can use a wrapper to extract the information from a
  collection of similar-looking Web pages.
- A wrapper ≈ a grammar of the data in the web site + code to
  utilize the grammar
- This is similar to turning the HTML => XML + grammar (DTD)
Example of Automated Extraction

Source: HTML

  <h1> Residential Housing </h1>
  <ul>House For Sale
     <li> location: Hebden Bridge
     <li> agent-phone: 01422 843222
     <li> listed-price: £350,000
     <li> comments: Bijou residence on the edge of this popular little town...
  </ul>
  <hr>
  <ul> House For Sale
  ...
  </ul>
  ...

Destination: XML (via the wrapper)

  <residential>
    <house>
      <location>
        <city> Hebden Bridge </city>
        <county> West Yorkshire </county>
        <country> UK </country>
      </location>
      <agent-phone> 01422 843222 </agent-phone>
      <listed-price> £350,000 </listed-price>
      <comments> Bijou residence on the edge of this popular little town... </comments>
    </house>
    ...
  </residential>

NB: XML + schema + recognised names
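A hand-written wrapper for this kind of page format might look like the Python sketch below (our illustration, not code from the talk): it encodes the “grammar” of the listing, each “<li> field: value” line, plus code that walks the page and emits XML. The county/country values are hard-coded here, whereas the slide assumes the wrapper can supply them from recognised names.

```python
import re

# Example page in the format shown above (assumed input).
html = """<h1> Residential Housing </h1>
<ul>House For Sale
  <li> location: Hebden Bridge
  <li> agent-phone: 01422 843222
  <li> listed-price: £350,000
  <li> comments: Bijou residence on the edge of this popular little town...
</ul>"""

def wrap(page):
    # The "grammar": every field is a "<li> name: value" line.
    fields = dict(re.findall(r"<li>\s*([\w-]+):\s*(.+)", page))
    out = ["<residential>", " <house>"]
    # Hard-coded enrichment of location (an assumption for this sketch).
    out += ["  <location>",
            "   <city> %s </city>" % fields["location"],
            "   <county> West Yorkshire </county>",
            "   <country> UK </country>",
            "  </location>"]
    for tag in ("agent-phone", "listed-price", "comments"):
        out.append("  <%s> %s </%s>" % (tag, fields[tag], tag))
    out += [" </house>", "</residential>"]
    return "\n".join(out)

print(wrap(html))
```

The obvious weakness, which the rest of the talk addresses, is that this code is specific to one page layout.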
Information Extraction
How can we create wrappers to ‘extract meaningful data’ from 
   the current Web?

?? Write a wrapper to extract data …. BUT would have to write
   a tool for every type of data / every type of webpage eg a
   C program to process every eBay page on toilets and
   output widths.

No - This is far too specific!

?? Write a tool to learn wrappers by inducing the format of
   web pages and/or particular fields.

.. this is more general and maintainable
Using ‘Rule Induction’ to learn wrappers for html
- The user is given or acquires ‘typical examples’ of the web
  pages containing the content to be extracted
- The user points out the fields to be learned to the agent
- The agent builds up a characterization of the formats from the
  examples and transforms this into a wrapper in the form of a
  set of rules
- The wrapper is used by the agent to recognize and extract data
  from similar web pages
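The learning step above can be sketched, under assumptions, in the spirit of Kushmerick-style LR wrappers: induce a left delimiter (the common suffix of the text before each user-marked field) and a right delimiter (the common prefix of the text after it). This is a minimal illustration, not anyone's production induction algorithm.

```python
def common_suffix(strings):
    # Shrink the first string from the left until all strings end with it.
    s = strings[0]
    while s and not all(t.endswith(s) for t in strings):
        s = s[1:]
    return s

def common_prefix(strings):
    # Shrink the first string from the right until all strings start with it.
    s = strings[0]
    while s and not all(t.startswith(s) for t in strings):
        s = s[:-1]
    return s

def induce_wrapper(labelled):
    """labelled: list of (page, marked_field_value) pairs.
    Uses the first occurrence of the value in each page."""
    lefts, rights = [], []
    for page, value in labelled:
        i = page.index(value)
        lefts.append(page[:i])
        rights.append(page[i + len(value):])
    return common_suffix(lefts), common_prefix(rights)

def apply_wrapper(wrapper, page):
    left, right = wrapper
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]

# Toy training pages with the target field marked (made-up data).
examples = [
    ("<li> listed-price: £350,000 </li>", "£350,000"),
    ("<li> listed-price: £99,500 </li>", "£99,500"),
]
w = induce_wrapper(examples)
print(apply_wrapper(w, "<li> listed-price: £475,000 </li>"))
# prints £475,000
```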
Rule Induction is an area of Machine Learning:

  Machine Learning
    Symbolic Learning
      Learning by Observation
      Learning from Examples
        Rule Induction
      Explanation-Based Learning
    Sub-symbolic Learning
      Neural Networks
Rule Induction from Examples
Roughly, the algorithm is as follows:

Input: a (large) number of +ve instances (examples) 
    of concept C
+ (possibly) a number of –ve instances of C

Output: a characterization H of the examples forming 
   the rule
                 H => C
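A minimal illustration of this input/output shape is a Find-S-style specific-to-general pass over attribute tuples (the toy data and attribute names below are made up for the example):

```python
def covers(h, ex):
    # H covers an instance if every attribute matches or is a wildcard.
    return all(hv in ("?", v) for hv, v in zip(h, ex))

def induce(positives, negatives):
    # Start from the first positive (most specific hypothesis) and
    # generalise each clashing attribute to the wildcard '?'.
    h = list(positives[0])
    for ex in positives[1:]:
        for i, v in enumerate(ex):
            if h[i] != v:
                h[i] = "?"
    # Negatives are used only to check consistency of H.
    assert not any(covers(h, n) for n in negatives), "H covers a negative"
    return h

# Toy concept C: "pages that contain a price field".
positives = [("ebay", "table", "price"), ("ebay", "list", "price")]
negatives = [("blog", "list", "none")]
print(induce(positives, negatives))
# prints ['ebay', '?', 'price']
```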

Actual IE Example: University of Southern California’s
Info Sciences Institute (ISI)’s “Information agent”
 SPECIFIC PROBLEM: travel planning using the Web as an
      information source. There are a huge number of travel
      sites, with different types of information:
 - hotel and flight information
 - airports that are closest to your destination
 - directions to your hotel
 - weather in the destination city ...ETC

    Information Agents are capable of retrieving and 
    integrating info from web sites to solve complex 
    queries or tasks eg “book my travel for my 
    business trip next week” 

 See the Heracles project.
      Heracles’ Stalker inductive algorithm

This generates wrappers – in this case rules that identify the
start and end of an item within a web page. It uses rule
induction from user-labelled examples.

 Training examples

Stalker is given examples of ‘items’ it has to learn the wrapper for –
    eg examples of the item (or concept) “area code” of a telephone number:

E1: 513 Pixco, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515
E2: 90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570
E3: 523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293
E4: 403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008

Stalker learns wrappers that detect the begin/end patterns of fields so
    that they can be used to ‘mine’ data in unseen web pages
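For illustration only (not Stalker's actual induced rule), one rule consistent with E1–E4 is: skip to the landmark “Phone:”, then take the first three-digit token after it (this covers both the “1-<b> ddd </b>” and the “( ddd )” layouts). The sketch below applies that rule in Python.

```python
import re

# The four training examples from the slide.
EXAMPLES = [
    "513 Pixco, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515",
    "90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570",
    "523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293",
    "403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008",
]

def extract_area_code(page):
    # SkipTo("Phone:"), then the first run of exactly three digits.
    m = re.search(r"Phone:.*?(\d{3})", page)
    return m.group(1) if m else None

for e in EXAMPLES:
    print(extract_area_code(e))
# prints 800, 818, 888, 310 on successive lines
```

Real Stalker induces such landmark rules automatically, as disjunctions when a single rule does not cover all examples; here the rule was derived by hand.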

    Problems with Wrapper Induction
ISI report some success with their
  travel Information Agent, and its IE
  process, BUT:
- Wrapper brittleness – website format may change –
  maintenance is costly
- Background knowledge (token hierarchy) is not used
- Unsupervised wrapper induction would be preferable
-   Information Extraction is the process of extracting
    “meaningful” data from raw or semi-structured text
-   Wrappers are programs (rules) which are attached
    to web pages to extract data
-   Machine Learning techniques can be used to
    create wrappers
-   There are still many problems with these methods
    – especially in the learning and maintaining of wrappers

    Extra Reading
- “Learning to Extract Symbolic Knowledge from the World Wide Web”.
  M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell,
  K. Nigam and S. Slattery. AAAI-98, 1998.
- “Hierarchical Wrapper Induction for Semi-structured Information
  Sources”. Ion Muslea, Steven Minton, Craig A. Knoblock. Kluwer, 1999.
- See Kushmerick references – apparently he invented wrapper induction
    Related Legal/ Ethical/ Professional/
    Methodological Issues

- Is it legal and/or ethical to automatically ‘harvest’ data from
  the www and re-use or sell it? In what cases is it acceptable?
- How does one automate checking the veracity of www content?
- Will website owners conceal their data if the practice
  becomes widespread?
- Future: do we really want distributed web intelligence?

