Towards the Self-Annotating Web
Philipp Cimiano, Siegfried Handschuh, Steffen Staab
Presenter: Hieu K Le
(most of the slides come from Philipp Cimiano)
CS598CXZ - Spring 2005 - UIUC
Outline
Introduction
The Process of PANKOW
Pattern-based categorization
Evaluation
Integration to CREAM
Related work
Conclusion
The annotation problem in 4 cartoons
The annotation problem from a scientific point of view
The annotation problem in practice
The vicious cycle
Annotating
[Figure: Noun → Concept?]
Annotating
• To annotate terms in a web page:
– Manually defining
– Learning of extraction rules
Both require a lot of labor
A small Quiz
• What is “Laksa”?
A. A dish
B. A city
C. A temple
D. A mountain
The answer is:
From Google
[Bar chart: Google hit counts for “Laksa” by candidate concept (dish, city, temple, mountain); vertical axis from 0 to 70]
From Google
• „cities such as Laksa“ 0 hits
• „dishes such as Laksa“ 10 hits
• „mountains such as Laksa“ 0 hits
• „temples such as Laksa“ 0 hits
Google knows more than all of you together!
Example of using syntactic information +
statistics to derive semantic information
Self-annotating
• PANKOW (Pattern-based Annotation through Knowledge On the Web)
– Unsupervised
– Pattern based
– Within a fixed ontology
– Involves information from the whole Web
The Self-Annotating Web
• There is a huge amount of implicit knowledge in the Web
• Make use of this implicit knowledge together with statistical information to propose formal annotations and overcome the vicious cycle:
semantics ≈ syntax + statistics?
• Annotation by maximal statistical evidence
PANKOW Process
Patterns
• HEARST1: <CONCEPT>s such as <INSTANCE>
• HEARST2: such <CONCEPT>s as <INSTANCE>
• HEARST3: <CONCEPT>s, (especially/including) <INSTANCE>
• HEARST4: <INSTANCE> (and/or) other <CONCEPT>s
• Examples:
– dishes such as Laksa
– such dishes as Laksa
– dishes, especially Laksa
– dishes, including Laksa
– Laksa and other dishes
– Laksa or other dishes
Patterns (Cont'd)
• DEFINITE1: the <INSTANCE> <CONCEPT>
• DEFINITE2: the <CONCEPT> <INSTANCE>
• APPOSITION: <INSTANCE>, a <CONCEPT>
• COPULA: <INSTANCE> is a <CONCEPT>
• Examples:
– the Laksa dish
– the dish Laksa
– Laksa, a dish
– Laksa is a dish
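The pattern slides can be read as string templates that are filled in with an instance and a concept. The following is a minimal sketch of that idea, not the authors' implementation: the helper names `pluralize` and `instantiate_patterns` are hypothetical, and the pluralization rule is a naive assumption that will fail on irregular nouns.

```python
# Illustrative sketch: the templates mirror the pattern slides; all function
# names here are hypothetical and not taken from the PANKOW system.

def pluralize(noun: str) -> str:
    """Naive English plural (assumption); irregular nouns are not handled."""
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    return noun + "s"

# {i} = instance, {c} = concept (singular), {cs} = concept (plural).
# HEARST3 and HEARST4 each expand to two surface forms.
TEMPLATES = {
    "HEARST1": ["{cs} such as {i}"],
    "HEARST2": ["such {cs} as {i}"],
    "HEARST3": ["{cs}, especially {i}", "{cs}, including {i}"],
    "HEARST4": ["{i} and other {cs}", "{i} or other {cs}"],
    "DEFINITE1": ["the {i} {c}"],
    "DEFINITE2": ["the {c} {i}"],
    "APPOSITION": ["{i}, a {c}"],
    "COPULA": ["{i} is a {c}"],
}

def instantiate_patterns(instance: str, concept: str) -> list:
    """Return (pattern name, query phrase) pairs for one instance/concept pair."""
    plural = pluralize(concept)
    return [
        (name, template.format(i=instance, c=concept, cs=plural))
        for name, templates in TEMPLATES.items()
        for template in templates
    ]

if __name__ == "__main__":
    for name, phrase in instantiate_patterns("Laksa", "dish"):
        print(f"{name:11s} {phrase}")
```

Under these assumptions each instance/concept pair yields 10 phrases, so 59 concepts would need 59 × 10 = 590 queries per instance, which appears consistent with the figure quoted in the summary slide.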
Asking Google (more formally)
• Instance $i \in I$, concept $c \in C$, pattern $p \in \{\text{Hearst1}, \ldots, \text{Copula}\}$
• $\mathrm{count}(i, c, p)$ returns the number of Google hits of the instantiated pattern
• $\mathrm{count}(i, c) := \sum_{p} \mathrm{count}(i, c, p)$
• $R := \{(i, c_i) \mid i \in I,\ c_i := \arg\max_{c \in C} \mathrm{count}(i, c),\ \mathrm{count}(i, c_i) \geq \theta\}$
• E.g. count(Laksa, dish) := count(Laksa, dish, def1) + ...
• Restrict to the best ones beyond a threshold $\theta$
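The definitions above can be sketched in a few lines of code. This is only an illustration under stated assumptions: `hit_count` is a hypothetical callable standing in for a search-engine lookup (no real search API is used), the names `total_count` and `categorize` are invented here, and the threshold test is assumed to be "greater than or equal".

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical signature: (instance, concept, pattern) -> number of hits.
HitCount = Callable[[str, str, str], int]

def total_count(instance: str, concept: str,
                patterns: Iterable[str], hit_count: HitCount) -> int:
    """count(i, c): sum of the hit counts over all patterns."""
    return sum(hit_count(instance, concept, p) for p in patterns)

def categorize(instances: Iterable[str], concepts: List[str],
               patterns: List[str], hit_count: HitCount,
               threshold: int) -> Dict[str, str]:
    """Assign each instance its highest-count concept if it reaches the threshold."""
    result = {}
    for instance in instances:
        counts = {c: total_count(instance, c, patterns, hit_count)
                  for c in concepts}
        best = max(counts, key=counts.get)  # ties: first concept listed wins
        if counts[best] >= threshold:
            result[instance] = best
    return result

if __name__ == "__main__":
    # Toy lookup table using the hit counts from the "From Google" slide.
    toy_hits = {("Laksa", "dish", "HEARST1"): 10}

    def fake_hit_count(i: str, c: str, p: str) -> int:
        return toy_hits.get((i, c, p), 0)

    print(categorize(["Laksa"], ["city", "dish", "mountain", "temple"],
                     ["HEARST1"], fake_hit_count, threshold=1))
```

Passing the lookup as a parameter keeps the aggregation logic testable without any network access.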
Evaluation Scenario
• Corpus: 45 texts from http://www.lonelyplanet.com/destinations
• Ontology: tourism ontology from GETESS project
– #concepts: original – 1043; pruned – 682
• Manual Annotation by two subjects:
– A: 436 instance/concept assignments
– B: 392 instance/concept assignments
– Overlap: 277 instances (Gold Standard)
– A and B used 59 different concepts
– Categorial (Kappa) agreement on 277 instances: 63.5%
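For readers unfamiliar with Kappa agreement, the sketch below shows how such a score could be computed for two annotators. It assumes Cohen's kappa over the instances both annotators labeled; the slides do not say which Kappa variant was used, and the function name `cohen_kappa` and the toy labels are invented for illustration.

```python
from collections import Counter
from typing import Sequence

def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)

if __name__ == "__main__":
    # Toy labels, not taken from the evaluation data.
    annotator_a = ["city", "city", "island", "island"]
    annotator_b = ["city", "island", "island", "island"]
    print(cohen_kappa(annotator_a, annotator_b))
```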
Examples
Instance       Concept        Count
Atlantic       city           1520837
Bahamas        island         649166
USA            country        582275
Connecticut    state          302814
Caribbean      sea            227279
Mediterranean  sea            212284
Canada         country        176783
Guatemala      city           174439
Africa         region         131063
Australia      country        128607
France         country        125863
Germany        country        124421
Easter         island         96585
St Lawrence    river          65095
Commonwealth   state          49692
New Zealand    island         40711
Adriatic       sea            39726
Netherlands    country        37926
St John        church         34021
Belgium        country        33847
San Juan       island         31994
Mayotte        island         31540
EU             country        28035
UNESCO         organization   27739
Austria        group          24266
Greece         island         23021
Malawi         lake           21081
Israel         country        19732
Perth          street         17880
Luxembourg     city           16393
Nigeria        state          15650
St Croix       river          14952
Nakuru         lake           14840
Kenya          country        14382
Benin          city           14126
Cape Town      city           13768
Results
[Plot: Precision, Recall and F-Measure (vertical axis, 0 to 0.5) against threshold (horizontal axis, 0 to 990); annotated values: F=28.24%, R/Acc=24.90%]
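The three curves in this plot can be reproduced for any threshold with a short routine. The sketch below is an assumption about how the scores are defined: precision is taken over the answers the system actually gives, and recall over all gold-standard instances (which is why it can double as accuracy). The function name `evaluate` and the toy data are invented for illustration and are not the evaluation data.

```python
from typing import Dict, Tuple

def evaluate(predictions: Dict[str, Tuple[str, int]],
             gold: Dict[str, str],
             threshold: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) at a given count threshold.

    predictions maps instance -> (best concept, its count);
    gold maps instance -> correct concept.
    """
    # Only answers whose count reaches the threshold are given.
    answered = {i: c for i, (c, n) in predictions.items() if n >= threshold}
    correct = sum(1 for i, c in answered.items() if gold.get(i) == c)
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

if __name__ == "__main__":
    # Toy data invented for illustration only.
    toy_predictions = {"Laksa": ("dish", 10), "Perth": ("street", 3),
                       "Kenya": ("country", 7)}
    toy_gold = {"Laksa": "dish", "Perth": "city",
                "Kenya": "country", "Nakuru": "lake"}
    for threshold in (0, 5, 20):
        print(threshold, evaluate(toy_predictions, toy_gold, threshold))
```

Raising the threshold removes low-count answers, which tends to trade recall for precision.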
Comparison
System          # Concepts  Preprocessing / Cost         Accuracy
[MUC-7]         3           Various (?)                  >> 90%
[Fleischman02]  8           N-gram extraction ($)        70.4%
PANKOW          59          none                         24.9%
[Hahn98]-TH     196         syn. & sem. analysis ($$$)   21%
[Hahn98]-CB     196         syn. & sem. analysis ($$$)   26%
[Hahn98]-CB     196         syn. & sem. analysis ($$$)   31%
[Alfonseca02]   1200        syn. analysis ($$)           17.39% (strict)
CREAM/OntoMat
[Architecture diagram: the CREAM/OntoMat annotation environment. Labeled components: WWW, Web Pages, Document Management, Annotation Tool GUI, Ontology Guidance & Fact Browser, Document Editor / Viewer, Annotation Inference Server, Annotated Web Pages, Annotation by Markup, Domain Ontologies, PANKOW. Labeled links: plugin, annotate, crawl, query, load, extract.]
PANKOW & CREAM/OntoMat
Results (Interactive Mode)
[Plot: Precision (Top 5), Recall (Top 5) and F-Measure (Top 5) (vertical axis, 0 to 0.8) against threshold; annotated values: F=51.65%, R/Acc=49.46%]
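In interactive mode the system proposes several candidate concepts and leaves the final choice to the annotator. The sketch below assumes that "Top 5" means the five concepts with the highest counts, and that a suggestion counts as correct when the gold concept is among them; the function names and the toy counts are invented for illustration.

```python
from typing import Dict, List

def top_k_concepts(counts: Dict[str, int], k: int = 5,
                   threshold: int = 0) -> List[str]:
    """Return up to k concepts ranked by count, dropping those below threshold."""
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [concept for concept, n in ranked[:k] if n >= threshold]

def is_top_k_hit(counts: Dict[str, int], gold_concept: str,
                 k: int = 5, threshold: int = 0) -> bool:
    """True if the gold concept is among the suggestions shown to the user."""
    return gold_concept in top_k_concepts(counts, k, threshold)

if __name__ == "__main__":
    # Toy counts invented for illustration only.
    toy_counts = {"country": 50, "island": 40, "city": 30,
                  "state": 20, "region": 10, "lake": 5}
    print(top_k_concepts(toy_counts))
    print(is_top_k_hit(toy_counts, "city"))
```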
Current State-of-the-art
• Large-scale IE [SemTag&Seeker@WWW'03]
– only disambiguation
• Standard IE (MUC)
– need of handcrafted rules
• ML-based IE (e.g. Amilcare@{OntoMat,MnM})
– need of hand-annotated training corpus
– does not scale to large numbers of concepts
– rule induction takes time
• KnowItAll (Etzioni et al. WWW'04)
– shallow (pattern-matching-based) approach
Conclusion
Summary
• new paradigm to overcome the annotation problem
• unsupervised instance categorization
• first step towards the self-annotating Web
• difficult task: open domain, many categories
• decent precision, low recall
• very good results for interactive mode
• currently inefficient (590 Google queries/instance)
Challenges:
• contextual disambiguation
• annotating relations (currently restricted to instances)
• scalability (e.g. only choose reasonable queries to Google)
• accurate recognition of Named Entities (currently POS-tagger)
Thanks to…
• Philipp Cimiano (cimiano@aifb.uni-karlsruhe.de) for slides
• The audience for listening