# Hidden Markov Models Applied to Information Extraction
**Part I: Concept.** HMM tutorial

**Part II: Sample Application.** AutoBib: web information extraction

Larry Reeve
INFO629: Artificial Intelligence
Dr. Weber, Fall 2004
## Part I: Concept

### HMM Motivation

The real world contains structures and processes that have (or produce) observable outputs:

- Usually sequential: the process unfolds over time.
- The events producing the output cannot be seen directly. Example: speech signals.

Problem: how to construct a model of the structure or process, given only its observations.
### HMM Background

- The basic theory was developed and published in the 1960s and 70s.
- There was no widespread understanding or application until the late 1980s.
- Why? The theory was published in mathematics journals that were not widely read by practicing engineers, and there was insufficient tutorial material for readers to understand and apply the concepts.
### HMM Uses

- Speech recognition: recognizing spoken words and phrases
- Text processing: parsing raw records into structured records
- Bioinformatics: protein sequence prediction
- Financial: stock market forecasts (price pattern prediction)
- Comparison shopping services
### HMM Overview

- A machine learning method that makes use of state machines
- Based on probabilistic models
- Useful in problems having sequential steps
- Only the output of the states can be observed, not the states themselves
  - Example: speech recognition
  - Observed: acoustic signals
  - Hidden states: phonemes (the distinctive sounds of a language)
### Observable Markov Model Example

Weather: once each day, the weather is observed.

- State 1: rainy
- State 2: cloudy
- State 3: sunny

State transition matrix:

|            | Rainy | Cloudy | Sunny |
|------------|-------|--------|-------|
| **Rainy**  | 0.4   | 0.3    | 0.3   |
| **Cloudy** | 0.2   | 0.6    | 0.2   |
| **Sunny**  | 0.1   | 0.1    | 0.8   |

Each state corresponds to a physically observable event. Question: what is the probability that the weather for the next 7 days will be sun, sun, rain, rain, sun, cloudy, sun?
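Because the states here are directly observable, the answer is simply a product of transition probabilities read off the matrix. A minimal sketch, assuming the first day is sunny with probability 1 (i.e., the initial vector puts all its mass on sunny):

```python
# Probability of an observed state sequence in the observable Markov
# weather model above. Assumption: day 1 is sunny with probability 1.
A = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

days = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]

p = 1.0  # initial probability of "sunny" assumed to be 1
for prev, cur in zip(days, days[1:]):
    p *= A[prev][cur]  # multiply one transition probability per day

print(p)  # 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 1.92e-04
```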
### Hidden Markov Model Example

Coin toss: a heads/tails sequence produced with 2 coins.

- You are in a room, with a wall between you and the tosser.
- The person behind the wall flips a coin and tells you the result.
- The coin selection and toss are hidden: you cannot observe the events, only their output (heads or tails).

The problem is then to build a model that explains the observed sequence of heads and tails.
### HMM Components

- A set of states (the x's)
- A set of possible output symbols (the y's)
- A state transition matrix (the a's): the probability of making a transition from one state to the next
- An output emission matrix (the b's): the probability of emitting/observing a symbol at a particular state
- An initial probability vector: the probability of starting in a particular state (not always shown; sometimes assumed to be 1 for a fixed start state)
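As a concrete illustration, the two-coin example from the previous slide can be written out as exactly these components. A minimal sketch; all the probability values here are invented for illustration:

```python
# Components of a hypothetical two-coin HMM (all values invented).
states  = ["coin1", "coin2"]   # hidden states (the x's)
symbols = ["H", "T"]           # observable output symbols (the y's)

# State transition matrix (the a's): P(next state | current state)
A = {
    "coin1": {"coin1": 0.7, "coin2": 0.3},
    "coin2": {"coin1": 0.4, "coin2": 0.6},
}

# Output emission matrix (the b's): P(symbol | state); coin2 is biased
B = {
    "coin1": {"H": 0.5, "T": 0.5},
    "coin2": {"H": 0.8, "T": 0.2},
}

# Initial probability vector: P(first state)
pi = {"coin1": 0.6, "coin2": 0.4}
```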
### Common HMM Types

- Ergodic (fully connected): every state of the model can be reached in a single step from every other state of the model.
- Bakis (left-right): as time increases, states proceed from left to right.
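The difference is visible directly in the shape of the transition matrix. A minimal sketch with invented 3-state values: an ergodic matrix has no zero entries, while a Bakis matrix is zero below the diagonal, so the state index never decreases:

```python
import numpy as np

# Ergodic: every entry > 0, so any state can follow any state.
A_ergodic = np.array([
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
])

# Bakis (left-right): zeros below the diagonal, so as time increases
# the process can only stay in place or move rightward.
A_bakis = np.array([
    [0.5, 0.4, 0.1],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
])

# Rows are probability distributions in both cases.
assert np.allclose(A_ergodic.sum(axis=1), 1.0)
assert np.allclose(A_bakis.sum(axis=1), 1.0)
```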
### HMM Core Problems

Three problems must be solved for HMMs to be useful in real-world applications:

1. Evaluation
2. Decoding
3. Learning
### HMM Evaluation Problem

Purpose: score how well a given model matches a given observation sequence.

Example (speech recognition): assume HMMs have been built for the words 'home' and 'work'. Given a speech signal, evaluation determines the probability that each model produced the utterance.
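The standard solution to the evaluation problem is the forward algorithm, which sums over all hidden-state paths efficiently rather than enumerating them. A minimal sketch, restating the invented two-coin model so the snippet runs on its own:

```python
# Forward algorithm: P(observation sequence | model).
states = ["coin1", "coin2"]
A  = {"coin1": {"coin1": 0.7, "coin2": 0.3}, "coin2": {"coin1": 0.4, "coin2": 0.6}}
B  = {"coin1": {"H": 0.5, "T": 0.5}, "coin2": {"H": 0.8, "T": 0.2}}
pi = {"coin1": 0.6, "coin2": 0.4}

def forward(obs):
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * A[r][s] for r in states) * B[s][o]
                 for s in states}
    return sum(alpha.values())

# Score two candidate observation sequences against the same model:
print(forward(list("HHH")), forward(list("TTT")))
```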
### HMM Decoding Problem

Given a model and a sequence of observations, which hidden states were most likely to have generated the observations?

Useful for learning about internal model structure, determining state statistics, and so forth.
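The standard solution to the decoding problem is the Viterbi algorithm, which keeps only the best path into each state at each step. A minimal sketch, with the same invented two-coin model:

```python
# Viterbi algorithm: most likely hidden-state path for an observation
# sequence, under the same invented two-coin model as above.
states = ["coin1", "coin2"]
A  = {"coin1": {"coin1": 0.7, "coin2": 0.3}, "coin2": {"coin1": 0.4, "coin2": 0.6}}
B  = {"coin1": {"H": 0.5, "T": 0.5}, "coin2": {"H": 0.8, "T": 0.2}}
pi = {"coin1": 0.6, "coin2": 0.4}

def viterbi(obs):
    # delta[s] = probability of the best path so far that ends in s
    delta = {s: pi[s] * B[s][obs[0]] for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        new_delta, new_path = {}, {}
        for s in states:
            p, best = max((delta[r] * A[r][s], r) for r in states)
            new_delta[s] = p * B[s][o]
            new_path[s] = path[best] + [s]
        delta, path = new_delta, new_path
    best = max(states, key=delta.get)
    return delta[best], path[best]

print(viterbi(list("HHTH")))  # (path probability, state sequence)
```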
### HMM Learning Problem

The goal is to learn the HMM parameters (training):

- State transition probabilities
- Observation probabilities at each state

Training is crucial: it allows optimal adaptation of the model parameters to observed training data, i.e., to real-world phenomena.

There is no known method for obtaining globally optimal parameters from data, only approximations; this can be a bottleneck in HMM usage.
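The usual approximation is the Baum-Welch algorithm, an expectation-maximization procedure that iteratively re-estimates the parameters and is guaranteed only to reach a local optimum. A minimal sketch of one re-estimation step for a single observation sequence (same invented two-coin model; numerical scaling, multiple sequences, and convergence checks omitted):

```python
# One Baum-Welch (EM) re-estimation step for a discrete HMM.
states, symbols = ["coin1", "coin2"], ["H", "T"]
A  = {"coin1": {"coin1": 0.7, "coin2": 0.3}, "coin2": {"coin1": 0.4, "coin2": 0.6}}
B  = {"coin1": {"H": 0.5, "T": 0.5}, "coin2": {"H": 0.8, "T": 0.2}}
pi = {"coin1": 0.6, "coin2": 0.4}

def forward_all(obs):
    alphas = [{s: pi[s] * B[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alphas[-1]
        alphas.append({s: sum(prev[r] * A[r][s] for r in states) * B[s][o]
                       for s in states})
    return alphas

def backward_all(obs):
    betas = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = betas[0]
        betas.insert(0, {s: sum(A[s][r] * B[r][o] * nxt[r] for r in states)
                         for s in states})
    return betas

def baum_welch_step(obs):
    alphas, betas = forward_all(obs), backward_all(obs)
    total = sum(alphas[-1][s] for s in states)       # P(obs | model)
    # gamma[t][s] = P(state s at time t | obs)
    gamma = [{s: alphas[t][s] * betas[t][s] / total for s in states}
             for t in range(len(obs))]
    # xi[t][r][s] = P(transition r -> s at time t | obs)
    xi = [{r: {s: alphas[t][r] * A[r][s] * B[s][obs[t + 1]]
                  * betas[t + 1][s] / total for s in states}
           for r in states} for t in range(len(obs) - 1)]
    new_pi = gamma[0]
    new_A = {r: {s: sum(x[r][s] for x in xi) / sum(g[r] for g in gamma[:-1])
                 for s in states} for r in states}
    new_B = {s: {y: sum(g[s] for g, o in zip(gamma, obs) if o == y)
                    / sum(g[s] for g in gamma) for y in symbols}
             for s in states}
    return new_A, new_B, new_pi

print(baum_welch_step(list("HHTHTT"))[2])  # re-estimated initial vector
```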
### HMM Concept Summary

- Build models representing the hidden states of a process or structure, using only observations.
- Use the models to evaluate the probability that a model produced a particular observation sequence.
- Use that evaluation in an application: recognizing speech, parsing addresses, and many others.
## Part II: Application

### AutoBib System

- Provides a uniform view of several computer science bibliographic web data sources.
- An automated web information extraction system that requires little human input:
  - Web pages are designed differently from site to site.
  - Information extraction normally requires training samples.
- HMMs are used to parse unstructured bibliographic records into a structured format (an NLP task).
### Web Information Extraction

### Converting Raw Records
### Approach

1. Provide a seed database of structured records.
2. Extract raw records from relevant web pages.
3. Match structured records to raw records (to build training samples).
4. Train the HMM-based parser.
5. Parse unmatched raw records into structured records.
6. Merge the new structured records into the database.
### AutoBib Architecture
### Step 1: Seeding

Provide a seed database of structured records:

- Take a small collection of BibTeX-format records and insert them into the database.
- A cleaning step normalizes record fields. Examples:
  - "Proc." → "Proceedings"
  - "Jan" → "January"
- This is a manual step, executed only once.
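A minimal sketch of the normalization idea; the lookup-table approach and anything beyond the two expansions shown on the slide are assumptions:

```python
# Token-level field normalization via a lookup table (hypothetical).
NORMALIZE = {
    "Proc.": "Proceedings",   # example from the slide
    "Jan": "January",         # example from the slide
}

def normalize_field(value: str) -> str:
    # Replace each known abbreviation token; leave other tokens alone.
    return " ".join(NORMALIZE.get(tok, tok) for tok in value.split())

print(normalize_field("Proc. IDEAS, Jan 2004"))
# Proceedings IDEAS, January 2004
```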
### Step 2: Extract Raw Records

Extract raw records from relevant web pages:

- The user specifies which web pages to extract from.
- Raw records are extracted using record-boundary discovery techniques:
  - Subtree of interest = the largest subtree of HTML tags.
  - Record separators = frequently occurring HTML tags.
### Tokenized Records

(Replace all HTML tags with ^)
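A minimal sketch of that tokenization step, assuming a simple regex is enough to stand in for the tag handling:

```python
import re

# Replace every HTML tag in a raw record with the separator symbol '^'.
def tokenize_record(html: str) -> str:
    return re.sub(r"<[^>]+>", "^", html)

raw = "<li><b>J. Geng and J. Yang</b>, <i>Automatic Extraction...</i>, 2004.</li>"
print(tokenize_record(raw))
# ^^J. Geng and J. Yang^, ^Automatic Extraction...^, 2004.^
```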
Step 3 - Matching
    Match raw records R to structured records S

    Apply 4 tests (heuristic-based)
1)   Match at least author in R to an author in S
2)   S.year must appear in R
3)   If S.pages exists, R must contain it
4)   S.title is „approximately contained‟ in R
Levenshtein edit distance – approximate string match
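A minimal sketch of test 4. The slides name Levenshtein edit distance; the semi-global variant below (which lets the match start and end anywhere in R) and the distance threshold are assumptions:

```python
# 'Approximate containment' of a title in a raw record, via a
# semi-global Levenshtein alignment. max_dist is a made-up threshold.
def approx_contains(needle: str, haystack: str, max_dist: int = 3) -> bool:
    n, m = len(needle), len(haystack)
    # prev[j]: edit distance of needle[:i] against the best substring of
    # haystack ending at j; row 0 is all zeros so a match can start anywhere.
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if needle[i - 1] == haystack[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # skip a needle char
                         cur[j - 1] + 1,     # skip a haystack char
                         prev[j - 1] + cost) # match or substitute
        prev = cur
    return min(prev) <= max_dist  # best distance over all end positions

print(approx_contains("Automatic Extraction", "..Automatc Extraction of.."))
# True (one missing character is within the threshold)
```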
### Step 4: Parser Training

Train the HMM-based parser:

- For each matching pair of R and S, annotate the tokens in the raw record with field names.
- The annotated raw records are fed to the HMM parser in order to learn:
  - State transition probabilities
  - Symbol probabilities at each state
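Because the training records arrive fully annotated, the hidden-state path is actually visible at training time, so both probability tables can be estimated by counting and normalizing. A minimal sketch with hypothetical annotated data (smoothing and the special start/end states are omitted):

```python
from collections import Counter, defaultdict

# Supervised estimation of HMM parameters from annotated records.
# The sample record below is hypothetical.
annotated = [
    [("J. Geng and J. Yang", "author"), ("^", "author-delimiter"),
     ("Automatic Extraction of Bibliographic Information", "title"),
     ("2004", "year")],
]

trans, emit = defaultdict(Counter), defaultdict(Counter)
for record in annotated:
    for (_, state), (_, nxt) in zip(record, record[1:]):
        trans[state][nxt] += 1          # count state -> state transitions
    for tok, state in record:
        emit[state][tok] += 1           # count symbol emissions per state

# Normalize the counts into probability tables.
A = {s: {t: c / sum(cs.values()) for t, c in cs.items()} for s, cs in trans.items()}
B = {s: {t: c / sum(cs.values()) for t, c in cs.items()} for s, cs in emit.items()}
```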
Parser Training, continued
   Key consideration is HMM structure for navigating
record fields (fields, delimiters)
   Special states
   start, end
   Normal states
   author, title, year, etc.

   Best structure found:
   Have multiple delimiter and tag states,
   one for each normal state
   Example: author-delimiter, author-tag
### Sample HMM (Method 3)

Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg
### Step 5: Conversion

Parse the unmatched raw records into structured records using the HMM parser; this is the decoding problem from Part I, where the most likely state sequence assigns a field name to each token.

Matched raw records can be converted directly, without parsing, because they were already annotated in the matching step.
### Step 6: Merging

Merge the new structured records into the database.

The initial seed database has now grown, and the new records will be used for improved matching on the next run.
### Evaluation

Success rate = (# of tokens labeled by the HMM) / (# of tokens labeled by a person)

- DBLP (Computer Science Bibliography): 98.9%
- CSWD (CompuScience WWW-Database): 93.4%

Effective:

- Can handle variations in record structure
  - Optional fields
  - Varying field ordering

But it requires training on annotated data:

- Not completely automatic; may require manual markup
- The size of the training data may be an issue
### Other Methods

- Wrappers: hand-crafted specifications of the areas of interest on a web page.
- Wrapper induction: requires manual training; not always accommodating to changing page structure; syntax-based, with no semantic labeling.
### Application to Other Domains

E-commerce: comparison shopping sites.

- Extract product and pricing information from many sites.
- Convert the information into a structured format and store it.
- Provide an interface to look up a product and display the pricing information gathered from many sites.

This saves users time: rather than navigating to and searching many sites, users can consult a single site.
### References

- Concept: Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.
- Application: Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS'04), 193-204.
