Information Extraction from the WWW using Machine Learning Techniques



Lee McCluskey, Dept of Informatics
email: lee@hud.ac.uk




             Department of Informatics, University of Huddersfield
 Motivation
General: The WWW is a virtually limitless mass of information aimed mainly at human consumption. It is desirable to make this information generally available for use by computer programs, in order to provide higher levels of service to people.
This supports the new area of "Semantic Technologies" – apparently the new "billion dollar" market.
  • NOW: Desktop + Client-Server Technologies
  • COMING: Distributed Intelligent Services
Specific: This work is related to a Knowledge Transfer
  Partnership just starting with a local company called View
  Based Systems.

Overview of Talk

• We will investigate Information Extraction: This is the process of extracting "meaningful" data from raw or semi-structured text
• We will investigate techniques from 'similarity-based' Machine Learning to learn/extract meaning from traditional web page content
• Also, Information Agents: These are programs that can retrieve information from web sites using database-like queries and can integrate info from web sites to solve complex queries
    Information Extraction from the WWW – WHY?

Problem: You're on eBay and you want a toilet cistern & wash basin that have a
   combined width of under 90cm
Solution: waste all Sunday afternoon going through 673 entries for "toilet" looking
   for widths and cross-checking with 923 entries for wash basin!


• Need a universally-recognised query language
• Need to avoid the problems of identity (!) with universally-accessible vocabularies
• Need to be able to reason with acquired knowledge
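Once the listings are available as structured data, the Sunday afternoon above collapses into a single database-like query. A minimal sketch (the item lists, field names, and widths here are all hypothetical, standing in for what a wrapper might extract from eBay pages):

```python
# Hypothetical structured listings, as a wrapper might extract them from eBay pages.
toilets = [{"item": "cistern A", "width_cm": 38}, {"item": "cistern B", "width_cm": 45}]
basins  = [{"item": "basin X",   "width_cm": 50}, {"item": "basin Y",   "width_cm": 48}]

# Database-like query: all cistern/basin pairs with combined width under 90cm.
pairs = [(t["item"], b["item"])
         for t in toilets for b in basins
         if t["width_cm"] + b["width_cm"] < 90]
# pairs == [("cistern A", "basin X"), ("cistern A", "basin Y")]
```

The hard part, of course, is not the query but getting the raw pages into this structured form in the first place.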



Information Extraction from the WWW – WHY?

Our (KTP) interest –
extract data from www related to a “theme” or
  subculture eg bee-keeping, role playing
  games, Northern Soul music..
We want to populate and maintain a central
  database with this information …


    Information Extraction from The Web
   Information extraction is the process of extracting
    “meaningful” data from raw or semi-structured text
   IE tasks form a spectrum ..
    EASIER: "Feature Extraction" – extract a particular piece of data from a semi- or unstructured document and give it an XML markup, eg extract an address from an html web page.

    HARDER: "Natural Language Understanding" – take raw (English) text from a web page and turn it into some logic representing its meaning.
Information Extraction from The Web

    WEB PAGES  ==(WRAPPERS)==>  STRUCTURED DATA

                                tom    664    blue   BSc
                                bill   345    grey   PhD
                                dave   123    red    MSc
                                sue    555    red    BA
    Information Extraction
• The Web's HTML content makes it difficult to retrieve and integrate data from multiple sources.
• An agent can use a wrapper to extract the information from a collection of similar-looking Web pages.
• The wrapper ~ a grammar of the data in the web site + code to utilize the grammar
• This is similar to turning the HTML => XML + grammar (DTD)
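In the simplest case the wrapper's "grammar" can be a single regular expression describing one record, plus code that applies it to a page. A toy sketch using the structured data shown earlier (the field names are assumptions for illustration):

```python
import re

# Hypothetical "grammar" for one record on a similarly-formatted page:
# a name, an id number, a colour and a degree, separated by whitespace.
RECORD = re.compile(r"(?P<name>\w+)\s+(?P<id>\d+)\s+(?P<colour>\w+)\s+(?P<degree>\w+)")

def wrap(page_text):
    """Apply the grammar to a page, yielding one dict per record found."""
    return [m.groupdict() for m in RECORD.finditer(page_text)]

rows = wrap("tom 664 blue BSc\nbill 345 grey PhD")
# rows[0] == {"name": "tom", "id": "664", "colour": "blue", "degree": "BSc"}
```

Real wrappers are rarely this clean; the point is only that "grammar + code to utilize the grammar" need not be heavyweight.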

Example of Automated Extraction
Source: HTML

<h1> Residential Housing </h1>
<ul>House For Sale
   <li> location: Hebden Bridge
   <li> agent-phone: 01422 843222
   <li> listed-price: £350,000
   <li> comments: Bijou residence on the edge of this popular little town...
</ul>
<hr>
<ul> House For Sale
...
</ul>
...

==wrapper==> Destination: XML

<residential>
  <house>
    <location>
      <city> Hebden Bridge </city>
      <county> West Yorkshire </county>
      <country> UK </country>
    </location>
    <agent-phone> 01422 843222 </agent-phone>
    <listed-price> £350,000 </listed-price>
    <comments> Bijou residence on the edge of this popular little town... </comments>
  </house>
  ...
</residential>

NB: XML + schema + recognised names
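A hand-written wrapper for this particular listing format might look as follows. This is a toy sketch, not the tool used for the example above: it assumes each field sits on its own '<li> field: value' line and simply re-emits it as an XML element (it does not attempt the city/county/country split, which needs extra knowledge):

```python
import re

# Hypothetical wrapper for the "House For Sale" listing format:
# each '<li> field: value' line becomes one XML element.
FIELD = re.compile(r"<li>\s*(?P<tag>[\w-]+):\s*(?P<value>.+)")

def wrap_listing(html):
    """Turn one house listing in HTML into a <house> XML fragment."""
    parts = ["<house>"]
    for m in FIELD.finditer(html):
        parts.append(f"  <{m.group('tag')}>{m.group('value').strip()}</{m.group('tag')}>")
    parts.append("</house>")
    return "\n".join(parts)

html = ("<ul>House For Sale\n"
        "<li> location: Hebden Bridge\n"
        "<li> listed-price: £350,000\n"
        "</ul>")
xml = wrap_listing(html)
```

Note how brittle this is: the regex encodes one site's layout, which is exactly why writing wrappers by hand does not scale.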
Information Extraction
How can we create wrappers to 'extract meaningful data' from the current Web?

?? Write a wrapper to extract data …. BUT we would have to write a tool for every type of data / every type of webpage, eg a C program to process every eBay page on toilets and output widths.

No - This is far too specific!

?? Write a tool to learn wrappers by inducing the format of
   web pages and/or particular fields.

.. this is more general and maintainable
Using 'Rule Induction' to learn wrappers for html pages
• The user is given or acquires 'typical examples' of the web pages containing the content to be learned
• The user points out fields to be learned to the agent
• The agent builds up a characterization of the formats from the examples and transforms this into a wrapper in the form of a set of rules
• The wrapper is used by the agent to recognize and extract data from similar web pages
Rule Induction is an area of Machine
Learning
Machine Learning
  Symbolic Learning
    Similarity-Based Learning
      Learning by Observation
      Learning from Examples
        Rule Induction
    Explanation-Based Learning
  Sub-symbolic Learning
    Neural Networks
    Genetic Approaches
Rule Induction from Examples
Roughly, the algorithm is as follows:

Input: a (large) number of +ve instances (examples)
    of concept C
+ (possibly) a number of –ve instances of C

Output: a characterization H of the examples forming
   the rule
                 H => C

Actual IE Example: University of Southern California's Info Sciences Institute (ISI)'s "Information agent"
 SPECIFIC PROBLEM: travel planning using the Web as an information source. There are a huge number of travel sites, with different types of information:
 - hotel and flight information,
 - airports that are closest to your destination,
 - directions to your hotel
 - weather in the destination city …ETC

    Information Agents are capable of retrieving and
    integrating info from web sites to solve complex
    queries or tasks eg “book my travel for my
    business trip next week”

 See the Heracles project (http://www.isi.edu/info-agents/)
      Heracles’ Stalker inductive algorithm

This generates wrappers – in this
case rules that identify the start and
end of an item within a web page.
It uses
•EXAMPLES
•A HIERARCHICAL MODEL
(ONTOLOGY) OF WHAT TO
EXPECT IN A WEB PAGE



 Example of training examples

Stalker is given examples of the 'items' it has to learn the wrapper for – eg examples of the item (or concept) "area code" of a tel no:

E1: 513 Pixco, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515
E2: 90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570
E3: 523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293
E4: 403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008

Stalker learns wrappers that detect the begin/end patterns of fields so that they can be used to 'mine' data in unseen web pages



    Problems with Wrapper Induction
ISI report some success with their travel Information Agent, and its IE process, BUT:
• Wrapper Brittleness – website format may change – maintenance is costly
• Background knowledge (token hierarchy) not strong
• Unsupervised Wrapper induction would be better
    Summary
-   Information Extraction is the process of extracting
    “meaningful” data from raw or semi-structured text
-   Wrappers are programs (rules) which are attached
    to web pages to extract data
-   Machine Learning techniques can be used to
    create wrappers
-   There are still many problems with these methods
    – especially in the learning and maintaining of
    wrappers


    Extra Reading
• http://www.isi.edu/info-agents/
• Learning to Extract Symbolic Knowledge from the World Wide Web. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. AAAI-98, January 1998.
• "Hierarchical Wrapper Induction for Semi-structured Information Sources". Ion Muslea, Steven Minton, Craig A. Knoblock. Kluwer, 1999.
• See Kushmerick's references – apparently he invented wrapper induction
    Related Legal/ Ethical/ Professional/
    Methodological Issues

• Is it legal and/or ethical to automatically 'harvest' data from the www and re-use or sell it? In what cases is it illegal?
• How does one automate checking the veracity of www data?
• Will website owners conceal their data if the practice becomes widespread?
• Future: do we really want distributed web intelligence?



				