# Towards the Self-Annotating Web

Towards the Self-Annotating Web
Philipp Cimiano, Siegfried Handschuh, Steffen Staab

Presenter: Hieu K Le
(most slides courtesy of Philipp Cimiano)
CS598CXZ - Spring 2005 - UIUC
Outline
Introduction
The Process of PANKOW
Pattern-based categorization
Evaluation
Integration to CREAM
Related work
Conclusion
The annotation problem in 4 cartoons
The annotation problem from a scientific point of view
The annotation problem in practice
The vicious cycle
Annotating

Noun → ? → Concept
Annotating
• To annotate terms in a web page:
– Manual definition
– Learning extraction rules
Both require a lot of labor
A small Quiz

What is "Laksa"?
A. A dish
B. A city
C. A temple
D. A mountain
(Bar chart: Google hit counts for "dish", "city", "temple", and "mountain" patterns instantiated with "Laksa")
• "cities such as Laksa": 0 hits
• "dishes such as Laksa": 10 hits
• "mountains such as Laksa": 0 hits
• "temples such as Laksa": 0 hits

Google knows more than all of you together!
An example of using syntactic information plus statistics to derive semantic information.
Self-annotating
• PANKOW (Pattern-based Annotation through Knowledge On the Web)
– Unsupervised
– Pattern-based
– Within a fixed ontology
– Leverages information from the whole Web
The Self-Annotating Web
• There is a huge amount of implicit knowledge in the Web
• Make use of this implicit knowledge together with statistical information to propose formal annotations and overcome the vicious cycle: semantics ≈ syntax + statistics?
• Annotation by maximal statistical evidence
PANKOW Process

(Process diagram: for a proper noun found in a Web page, instantiate the patterns with every candidate concept from the ontology, query Google, aggregate the hit counts, and annotate the instance with the best-supported concept.)
Patterns

• HEARST1: <CONCEPT>s such as <INSTANCE>
• HEARST2: such <CONCEPT>s as <INSTANCE>
• HEARST3: <CONCEPT>s, (especially/including) <INSTANCE>
• HEARST4: <INSTANCE> (and/or) other <CONCEPT>s
• Examples:
–   dishes such as Laksa
–   such dishes as Laksa
–   dishes, especially Laksa
–   dishes, including Laksa
–   Laksa and other dishes
–   Laksa or other dishes
Patterns (Cont'd)

• DEFINITE1: the <INSTANCE> <CONCEPT>
• DEFINITE2: the <CONCEPT> <INSTANCE>

• APPOSITION: <INSTANCE>, a <CONCEPT>
• COPULA: <INSTANCE> is a <CONCEPT>

• Examples:
– the Laksa dish
– the dish Laksa
– Laksa, a dish
– Laksa is a dish
Pattern-based categorization (formally)

• For an instance i ∈ I, a concept c ∈ C, and a pattern p ∈ {HEARST1, ..., COPULA}, count(i, c, p) returns the number of Google hits of the instantiated pattern

count(i, c) := Σ_p count(i, c, p)

R := { (i, c_i) | i ∈ I, c_i := argmax_{c ∈ C} count(i, c), count(i, c_i) ≥ θ }

• E.g. count(Laksa, dish) := count(Laksa, dish, DEFINITE1) + ...
• Restrict to the best assignments above a threshold θ
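The definitions above translate directly into code. A minimal sketch, with toy hit counts standing in for real Google results:

```python
def count(i, c, patterns, hits):
    """count(i, c) := sum over patterns p of count(i, c, p)."""
    return sum(hits.get((i, c, p), 0) for p in patterns)

def categorize(i, concepts, patterns, hits, theta):
    """Pick c_i = argmax_c count(i, c); keep it only if it reaches threshold theta."""
    best = max(concepts, key=lambda c: count(i, c, patterns, hits))
    return best if count(i, best, patterns, hits) >= theta else None

# Toy hit counts for illustration only
patterns = ["HEARST1", "COPULA"]
hits = {("Laksa", "dish", "HEARST1"): 10}
print(categorize("Laksa", ["dish", "city", "temple", "mountain"],
                 patterns, hits, theta=1))  # prints "dish"
```

With a high threshold the instance is left unannotated rather than assigned a weakly supported concept, which is the precision/recall trade-off the evaluation plots explore.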
Evaluation Scenario
• Corpus: 45 texts from http://www.lonelyplanet.com/destinations

• Ontology: tourism ontology from the GETESS project
– #concepts: original – 1043; pruned – 682

• Manual annotation by two subjects:
– A: 436 instance/concept assignments
– B: 392 instance/concept assignments
– Overlap: 277 instances (Gold Standard)
– A and B used 59 different concepts
– Categorical (Kappa) agreement on the 277 instances: 63.5%
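The 63.5% agreement figure is presumably Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. A minimal sketch on toy labels (not the study's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the annotators' marginal label distributions
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["city", "dish", "dish", "temple"]
b = ["city", "dish", "city", "temple"]
print(round(cohens_kappa(a, b), 3))  # 0.636 on this toy example
```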
Examples (instance, assigned concept, aggregated Google hits; two columns)
Atlantic city 1520837       St John church 34021
Bahamas island 649166       Belgium country 33847
USA country 582275          San Juan island 31994
Connecticut state 302814    Mayotte island 31540
Caribbean sea 227279        EU country 28035
Mediterranean sea 212284    UNESCO organization 27739
Canada country 176783       Austria group 24266
Guatemala city 174439       Greece island 23021
Africa region 131063        Malawi lake 21081
Australia country 128607    Israel country 19732
France country 125863       Perth street 17880
Germany country 124421      Luxembourg city 16393
Easter island 96585         Nigeria state 15650
St Lawrence river 65095     St Croix river 14952
Commonwealth state 49692    Nakuru lake 14840
New Zealand island 40711    Kenya country 14382
Adriatic sea 39726          Benin city 14126
Netherlands country 37926   Cape Town city 13768
Results

(Plot: Precision, Recall, and F-Measure as a function of the threshold, 0–990. Best F-Measure = 28.24%, at Recall/Accuracy = 24.90%.)
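For reference, the F-Measure is the harmonic mean of precision and recall, F = 2PR/(P + R). The slide reports F = 28.24% at Recall = 24.90% but not the precision; the implied precision (about 33%) can be recovered algebraically, as the short sketch below shows:

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    return 2 * p * r / (p + r)

def precision_from_f(f, r):
    """Solve F = 2PR / (P + R) for P."""
    return 1 / (2 / f - 1 / r)

p = precision_from_f(0.2824, 0.2490)
print(round(p, 3))                     # 0.326 -> implied precision, about 33%
print(round(f_measure(p, 0.2490), 4))  # 0.2824, recovering the reported F
```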
Comparison

| System         | #Concepts | Preprocessing / Cost       | Accuracy        |
|----------------|-----------|----------------------------|-----------------|
| [MUC-7]        | 3         | various (?)                | >> 90%          |
| [Fleischman02] | 8         | n-gram extraction ($)      | 70.4%           |
| PANKOW         | 59        | none                       | 24.9%           |
| [Hahn98]-TH    | 196       | syn. & sem. analysis ($$$) | 21%             |
| [Hahn98]-CB    | 196       | syn. & sem. analysis ($$$) | 26%             |
| [Hahn98]-CB    | 196       | syn. & sem. analysis ($$$) | 31%             |
| [Alfonseca02]  | 1200      | syn. analysis ($$)         | 17.39% (strict) |
CREAM/OntoMat

(Architecture diagram: the OntoMat annotation environment. An Annotation Tool GUI with plugins for Ontology Guidance & Fact Browser and a Document Editor/Viewer talks to Document Management and an Annotation Inference Server; domain ontologies are loaded, Web pages are crawled and annotated by markup, and annotated Web pages are published back to the WWW.)

PANKOW & CREAM/OntoMat

(Diagram: PANKOW integrated into CREAM/OntoMat as a plugin.)
Results (Interactive Mode)

(Plot: Precision, Recall, and F-Measure for the Top-5 suggestions as a function of the threshold. Best F-Measure = 51.65%, at Recall/Accuracy = 49.46%.)
Current State-of-the-art
• Large-scale IE [SemTag&Seeker@WWW'03]
– only disambiguation
• Standard IE (MUC)
– needs handcrafted rules
• ML-based IE (e.g. Amilcare@{OntoMat,MnM})
– needs a hand-annotated training corpus
– does not scale to large numbers of concepts
– rule induction takes time
• KnowItAll (Etzioni et al., WWW'04)
– shallow (pattern-matching-based) approach
Conclusion
Summary:
• a new paradigm to overcome the annotation problem
• unsupervised instance categorization
• a first step towards the self-annotating Web
• difficult task: open domain, many categories
• decent precision, low recall
• very good results in interactive mode
• currently inefficient (590 Google queries per instance)

Challenges:
• contextual disambiguation
• annotating relations (currently restricted to instances)
• scalability (e.g. issue only reasonable queries to Google)
• accurate recognition of named entities (currently a POS tagger)
Thanks to…
• Philipp Cimiano (cimiano@aifb.uni-karlsruhe.de) for the slides
• The audience for listening
