

                         Hidden Markov Models
                  Applied to Information Extraction

   Part I: Concept
       HMM Tutorial

   Part II: Sample Application
       AutoBib: web information extraction

                            Larry Reeve
                   INFO629: Artificial Intelligence
                       Dr. Weber, Fall 2004
                    Part I: Concept
                   HMM Motivation
   The real world has structures and processes which
    have (or produce) observable outputs

     Usually sequential (process unfolds over time)
     Cannot see the event producing the output
           Example: speech signals

   Problem: how to construct a model of the
    structure or process given only observations
               HMM Background

   Basic theory developed and published in 1960s and 70s

   No widespread understanding and application until late 80s

   Why?
       Theory was published in mathematics journals not widely read by
        practicing engineers

       Insufficient tutorial material for readers to understand and apply concepts
                            HMM Uses
   Uses
       Speech recognition
            Recognizing spoken words and phrases

       Text processing
            Parsing raw records into structured records

       Bioinformatics
            Protein sequence prediction

       Financial
            Stock market forecasts (price pattern prediction)
            Comparison shopping services
                       HMM Overview
   Machine learning method
   Makes use of state machines

   Based on probabilistic models

   Useful in problems having sequential steps

   Can only observe output from states, not the
    states themselves
        Example: speech recognition
             Observe: acoustic signals
             Hidden States: phonemes
                (distinctive sounds of a language)
    Observable Markov Model Example
   Weather
       Once each day the weather is observed
            State 1: rain
            State 2: cloudy
            State 3: sunny

       Each state corresponds to a physical
        observable event

       State transition matrix:

                      Rainy   Cloudy   Sunny
            Rainy      0.4     0.3      0.3
            Cloudy     0.2     0.6      0.2
            Sunny      0.1     0.1      0.8

       What is the probability the weather for
        the next 7 days will be:
            sun, sun, rain, rain, sun, cloudy, sun?
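For an observable Markov model, this probability is just a product of transition probabilities. A minimal sketch, conditioning on today's weather being sunny (an assumption needed to make the question well-posed):

```python
# Transition matrix from the weather example (rows: today, columns: tomorrow).
P = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(states, today):
    """P(observing `states` on the following days | today's weather)."""
    prob, prev = 1.0, today
    for s in states:
        prob *= P[prev][s]
        prev = s
    return prob

# The slide's 7-day question, given that today is sunny:
week = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]
print(sequence_probability(week, "sunny"))
```

The product 0.8 x 0.8 x 0.1 x 0.4 x 0.3 x 0.1 x 0.2 gives roughly 1.5 x 10^-4.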
Hidden Markov Model Example
   Coin toss:
     Heads, tails sequence with 2 coins
     You are in a room, with a wall
     Person behind wall flips coin, tells result
          Coin selection and toss is hidden
           Cannot observe events, only the output (heads, tails) from
            behind the wall
       Problem is then to build a model to explain
        observed sequence of heads and tails
                   HMM Components

   A set of states (x's)
   A set of possible output symbols (y's)
   A state transition matrix (a's)
        probability of making transition from
         one state to the next

   Output emission matrix (b‟s)
        probability of emitting/observing a
         symbol at a particular state

   Initial probability vector
        probability of starting at a particular state
        Not shown, sometimes assumed to be 1
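These components can be written down concretely for the two-coin example above. A minimal sketch; all probability values are illustrative assumptions, not taken from the slides:

```python
import random

# The HMM components made concrete for the two-coin example.
states  = ["coin1", "coin2"]                # hidden states (x's)
symbols = ["H", "T"]                        # output symbols (y's)

# State transition matrix (a's): P(next state | current state)
A = {"coin1": {"coin1": 0.7, "coin2": 0.3},
     "coin2": {"coin1": 0.4, "coin2": 0.6}}

# Output emission matrix (b's): P(symbol | state); coin2 is biased toward heads
B = {"coin1": {"H": 0.5, "T": 0.5},
     "coin2": {"H": 0.8, "T": 0.2}}

# Initial probability vector
pi = {"coin1": 0.5, "coin2": 0.5}

def sample(n, seed=0):
    """Generate n tosses; the coin choices themselves stay hidden."""
    rng = random.Random(seed)
    state = rng.choices(states, weights=[pi[s] for s in states])[0]
    out = []
    for _ in range(n):
        out.append(rng.choices(symbols, weights=[B[state][y] for y in symbols])[0])
        state = rng.choices(states, weights=[A[state][s] for s in states])[0]
    return out

print("".join(sample(10)))
```

Only the H/T string is visible to an observer, which is exactly the situation the wall creates.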
            Common HMM Types
   Ergodic (fully connected):
       Every state of model can be reached in
        a single step from every other state of
        the model

   Bakis (left-right):
       As time increases, states proceed from
        left to right
          HMM Core Problems
   Three problems must be solved for HMMs to
    be useful in real-world applications

         1) Evaluation

         2) Decoding

         3) Learning
        HMM Evaluation Problem
   Purpose: score how well a given model matches
    a given observation sequence

       Example (Speech recognition):
            Assume HMMs (models) have been built for words
              ‘home’ and ‘work’.

            Given a speech signal, evaluation can determine the
             probability each model represents the utterance
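Evaluation is solved efficiently by the forward algorithm, which sums over all possible hidden paths. A sketch using the two-coin example; the model parameters are illustrative assumptions:

```python
# Forward algorithm: P(observation sequence | model).
A  = {"coin1": {"coin1": 0.7, "coin2": 0.3},
      "coin2": {"coin1": 0.4, "coin2": 0.6}}
B  = {"coin1": {"H": 0.5, "T": 0.5},
      "coin2": {"H": 0.8, "T": 0.2}}
pi = {"coin1": 0.5, "coin2": 0.5}

def forward(obs):
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: pi[s] * B[s][obs[0]] for s in pi}
    for y in obs[1:]:
        alpha = {s: sum(alpha[r] * A[r][s] for r in alpha) * B[s][y]
                 for s in pi}
    return sum(alpha.values())

print(forward(["H", "T", "H"]))
```

In the speech example, one would run `forward` once per word model and pick the model scoring the utterance highest.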
         HMM Decoding Problem
   Given a model and a set of observations, what
    are the hidden states most likely to have
    generated the observations?

       Useful to learn about internal model structure,
        determine state statistics, and so forth
          HMM Learning Problem
   Goal is to learn HMM parameters (training)
       State transition probabilities
       Observation probabilities at each state

   Training is crucial:
       it allows model parameters to adapt optimally to training data
        observed from real-world phenomena

   No known closed-form method for obtaining optimal parameters
    from data – only iterative approximations (e.g., Baum-Welch)
   Can be a bottleneck in HMM usage
        HMM Concept Summary
   Build models representing the hidden states of a
    process or structure using only observations

   Use the models to evaluate probability that a model
    represents a particular observation sequence

   Use the evaluation information in an application to
    recognize speech, parse addresses, and perform many
    other tasks
                  Part II: Application
                   AutoBib System
   Provide a uniform view of several computer science
    bibliographic web data sources

   An automated web information extraction system that
    requires little human input
       Web pages designed differently from site-to-site
       IE requires training samples

   HMMs used to parse unstructured bibliographic
    records into a structured format (an NLP task)
Web Information Extraction
 Converting Raw Records
1) Provide seed database of structured records
2) Extract raw records from relevant Web pages
3) Match structured records to raw records
     To build training samples
4) Train HMM-based parser
5) Parse unmatched raw records into structured records
6) Merge new structured records into database
AutoBib Architecture
                        Step 1 - Seeding

   Provide seed database of structured records
       Take small collection of BibTeX format records and
        insert into database

       Cleaning step normalizes record fields
            Examples:
                  “Proc.” → “Proceedings”
                  “Jan” → “January”

       Manual step, executed once only
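The cleaning step can be sketched as a simple token-substitution table; the mapping below contains only the slide's two examples, and a real table would be much larger:

```python
# Normalization table for record fields (illustrative subset).
NORMALIZE = {"Proc.": "Proceedings", "Jan": "January"}

def clean(field_value):
    """Replace each known abbreviation with its normalized form."""
    return " ".join(NORMALIZE.get(tok, tok) for tok in field_value.split())

print(clean("Proc. of the 8th IDEAS Symposium, Jan 2004"))
```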
     Step 2 – Extract Raw Records
   Extract raw records from relevant Web pages
       User specifies
          Web pages to extract from
          How to follow ‘next page’ links for multiple pages

       Raw records are extracted
            Uses record-boundary discovery techniques
                 Subtree of Interest = largest subtree of HTML tags
                 Record separators = frequent HTML tags
Tokenized Records

   (Replace all HTML tags with ^)
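The tag-replacement step can be sketched with a single regular expression; the sample record below is invented for illustration:

```python
import re

# Tokenization: every HTML tag becomes the single symbol ^, so the parser
# sees tag positions without tag-specific detail.
def tokenize(raw_html):
    return re.sub(r"<[^>]+>", "^", raw_html)

print(tokenize("<li><b>J. Geng</b>, Automatic Extraction, <i>IDEAS</i>, 2004</li>"))
```

Collapsing all tags to one symbol keeps the HMM's output alphabet small while preserving the structural cues tags provide.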
               Step 3 - Matching
    Match raw records R to structured records S

    Apply 4 tests (heuristic-based)
    1)   Match at least author in R to an author in S
    2)   S.year must appear in R
    3)   If S.pages exists, R must contain it
     4)   S.title is ‘approximately contained’ in R
            Levenshtein edit distance – approximate string match
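Test 4 can be sketched with the standard dynamic-programming edit distance; the sliding-window containment check and its threshold are illustrative assumptions:

```python
# Levenshtein edit distance via the classic two-row dynamic program.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def approx_contained(title, raw, max_dist=2):
    """Is `title` approximately contained in `raw`? (threshold is assumed)"""
    n = len(title)
    return any(levenshtein(title, raw[i:i + n]) <= max_dist
               for i in range(max(1, len(raw) - n + 1)))

print(approx_contained("Automatic Extraction", "...Automatc Extraction of..."))
```

The fuzziness matters because raw web records often contain typos or formatting differences that would defeat exact substring matching.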
             Step 4 – Parser Training
   Train HMM-based parser
       For each pair of R and S that match, annotate tokens
        in raw record with field names

       Annotated raw records are fed into HMM parser in
        order to learn:
          State transition probabilities
          Symbol probabilities at each state
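Because the training records are annotated, this learning step reduces to counting and normalizing. A minimal sketch with an invented one-record training set and assumed field names:

```python
from collections import Counter

# Supervised HMM training: with annotated tokens, transition probabilities
# are just normalized counts. The tiny training set below is illustrative.
annotated = [
    [("J.", "author"), ("Geng", "author"), (",", "delim"),
     ("Automatic", "title"), ("Extraction", "title"), (",", "delim"),
     ("2004", "year")],
]

trans, emit = Counter(), Counter()
for record in annotated:
    for (tok, state), (_, nxt) in zip(record, record[1:]):
        trans[(state, nxt)] += 1          # state transition counts
    for tok, state in record:
        emit[(state, tok)] += 1           # symbol emission counts

def transition_prob(s, t):
    """P(next state = t | current state = s), estimated from counts."""
    total = sum(c for (a, _), c in trans.items() if a == s)
    return trans[(s, t)] / total if total else 0.0

print(transition_prob("author", "delim"))
```

A real implementation would also smooth these counts so unseen transitions and symbols do not get zero probability.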
        Parser Training, continued
   Key consideration is HMM structure for navigating
    record fields (fields, delimiters)
       Special states
            start, end
       Normal states
            author, title, year, etc.

   Best structure found:
       Have multiple delimiter and tag states,
       one for each normal state
            Example: author-delimiter, author-tag
    Sample HMM
           (Method 3)

Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg
            Step 5 - Conversion
   Parse unmatched raw records into structured records
    using HMM parser

   Matched raw records can be directly converted
    without parsing because they were annotated in
    matching step
              Step 6 - Merging
   Merge new structured records into database

   Initial seed database has now grown

   New records will be used for improved
    matching on the next run
   Success rate:
        (# of tokens labeled by HMM) / (# of tokens labeled by person)

   DBLP: 98.9%
       Computer Science Bibliography

   CSWD: 93.4%
       CompuScience WWW-Database
HMM Advantages / Disadvantages
   Advantages
       Effective
       Can handle variations in record structure
            Optional fields
            Varying field ordering

   Disadvantages
       Requires training using annotated data
            Not completely automatic
            May require manual markup
            Size of training data may be an issue
                       Other methods
   Wrappers
       Specification of areas of interest on Web page
       Hand-crafted

   Wrapper induction
       Requires manual training
       Not always accommodating to changing structure
       Syntax-based; no semantic labeling
    Application to Other Domains
   E-Commerce
       Comparison shopping sites
          Extract product/pricing information from many sites
          Convert information into structured format and store

          Provide interface to look up product information and
           then display pricing information gathered from many sites

       Saves users time
            Rather than navigating to and searching many sites, users
             can consult a single site
                         References
   Concept:
       Rabiner, L. R. (1989). A Tutorial on Hidden Markov
        Models and Selected Applications in Speech
        Recognition. Proceedings of the IEEE, 77(2), 257-285.

   Application:
       Geng, J. and Yang, J. (2004). Automatic Extraction
        of Bibliographic Information on the Web. Proceedings
        of the 8th International Database Engineering and
        Applications Symposium (IDEAS’04), 193-204.
