Towards the Self-Annotating Web

Philipp Cimiano, Siegfried Handschuh, Steffen Staab

Presenter: Hieu K Le
(most slides courtesy of Philipp Cimiano)
CS598CXZ - Spring 2005 - UIUC
              Outline
Introduction
The Process of PANKOW
Pattern-based categorization
Evaluation
Integration into CREAM
Related work
Conclusion
The annotation problem in 4 cartoons
The annotation problem from a scientific point of view
The annotation problem in practice
The vicious cycle
[Diagram: annotating maps "A Noun" to "A Concept" via an unknown "?" link]
              Annotating
• To annotate terms in a web page:
  – Manually defining annotations
  – Learning extraction rules
  Both require a lot of labor
A small Quiz
• What is "Laksa"?
  A. A dish
  B. A city
  C. A temple
  D. A mountain

The answer is:
From Google: "Laksa"
[Bar chart: Google hit counts for "Laksa" by candidate concept (dish, city, temple, mountain); "dish" clearly dominates]
From Google
• "cities such as Laksa": 0 hits
• "dishes such as Laksa": 10 hits
• "mountains such as Laksa": 0 hits
• "temples such as Laksa": 0 hits

Google knows more than all of you together!
An example of using syntactic information plus statistics to derive semantic information.
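A minimal sketch of this idea in Python, assuming a hypothetical web_hit_count helper (no particular search API is implied; Google's old hit-count interface is not assumed to be available):

def web_hit_count(phrase: str) -> int:
    """Hypothetical stand-in: number of results for an exact-phrase query."""
    raise NotImplementedError("plug in a real web search API here")

def categorize(instance: str, plural_concepts: list[str]) -> str:
    # Instantiate the "such as" pattern for each candidate concept, e.g.
    # "dishes such as Laksa" vs. "cities such as Laksa", and keep the
    # concept whose phrase has the most hits.
    hits = {c: web_hit_count(f'"{c} such as {instance}"') for c in plural_concepts}
    return max(hits, key=hits.get)

# categorize("Laksa", ["dishes", "cities", "temples", "mountains"])  # -> "dishes"

Plural forms are passed in explicitly to avoid a naive (and wrong) "+s" pluralization.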
               Self-annotating
• PANKOW (Pattern-based Annotation through Knowledge On the Web)
   – Unsupervised
   – Pattern-based
   – Within a fixed ontology
   – Leverages information from the whole Web
The Self-Annotating Web
• There is a huge amount of implicit knowledge in the Web
• Make use of this implicit knowledge together with statistical information to propose formal annotations and overcome the vicious cycle:
     semantics ≈ syntax + statistics?
• Annotation by maximal statistical evidence
              Outline
Introduction
The Process of PANKOW
Pattern-based categorization
Evaluation
Integration into CREAM
Related work
Conclusion
PANKOW Process
[Process diagram: candidate proper nouns are extracted from a web page, the pattern schemas are instantiated for each ontology concept, Google hit counts are aggregated, and the best-scoring concept above a threshold is proposed as the annotation]
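Since the process diagram does not survive in this text version, here is a rough Python sketch of the pipeline under stated assumptions: extract_candidates crudely stands in for the POS-tagging step, and count for the aggregated Google hit counts defined formally later in the talk.

def extract_candidates(text: str) -> set[str]:
    # Crude stand-in for the POS-tagging step that finds proper nouns:
    # take capitalized tokens, stripped of trailing punctuation.
    return {tok.strip(".,;:") for tok in text.split() if tok[:1].isupper()}

def count(instance: str, concept: str) -> int:
    # Placeholder for count(i, c): the sum of web hit counts over all
    # instantiated patterns (see the sketches that follow).
    raise NotImplementedError

def pankow(page_text: str, concepts: list[str], threshold: int) -> dict[str, str]:
    annotations = {}
    for instance in extract_candidates(page_text):
        scores = {c: count(instance, c) for c in concepts}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:      # annotate only with enough evidence
            annotations[instance] = best
    return annotations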
              Outline
Introduction
The Process of PANKOW
Pattern-based categorization
Evaluation
Integration into CREAM
Related work
Conclusion
                          Patterns

• HEARST1: <CONCEPT>s such as <INSTANCE>
• HEARST2: such <CONCEPT>s as <INSTANCE>
• HEARST3: <CONCEPT>s, (especially/including)
  <INSTANCE>
• HEARST4: <INSTANCE> (and/or) other <CONCEPT>s

• Examples:
   –   dishes such as Laksa
   –   such dishes as Laksa
   –   dishes, especially Laksa
   –   dishes, including Laksa
   –   Laksa and other dishes
   –   Laksa or other dishes
Patterns (Cont'd)

• DEFINITE1: the <INSTANCE> <CONCEPT>
• DEFINITE2: the <CONCEPT> <INSTANCE>

• APPOSITION:<INSTANCE>, a <CONCEPT>
• COPULA: <INSTANCE> is a <CONCEPT>

• Examples:
   – the Laksa dish
   – the dish Laksa
   – Laksa, a dish
   – Laksa is a dish
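A minimal sketch of instantiating the eight pattern schemas above for one (instance, concept) pair; with the especially/including and and/or alternatives expanded this yields ten query strings. The plural form is passed explicitly because English pluralization is irregular.

def instantiate_patterns(instance: str, concept: str, plural: str) -> list[str]:
    return [
        f"{plural} such as {instance}",      # HEARST1
        f"such {plural} as {instance}",      # HEARST2
        f"{plural}, especially {instance}",  # HEARST3
        f"{plural}, including {instance}",   # HEARST3
        f"{instance} and other {plural}",    # HEARST4
        f"{instance} or other {plural}",     # HEARST4
        f"the {instance} {concept}",         # DEFINITE1
        f"the {concept} {instance}",         # DEFINITE2
        f"{instance}, a {concept}",          # APPOSITION
        f"{instance} is a {concept}",        # COPULA
    ]

# instantiate_patterns("Laksa", "dish", "dishes")
# -> ["dishes such as Laksa", "such dishes as Laksa", ...]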
Asking Google (more formally)
• For an instance i \in I, a concept c \in C, and a pattern p \in {HEARST1, ..., COPULA}, count(i, c, p) returns the number of Google hits of the instantiated pattern.

   count(i, c) := \sum_{p} count(i, c, p)

   R := \{ (i, c_i) \mid i \in I, \; c_i := \arg\max_{c \in C} count(i, c), \; count(i, c_i) \geq \theta \}

• E.g. count(Laksa, dish) := count(Laksa, dish, DEFINITE1) + ...
• Restrict to the best concepts whose aggregated counts exceed the threshold \theta
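A sketch of this scoring in Python, reusing web_hit_count and instantiate_patterns from the earlier sketches; theta is the evidence threshold:

def count_pair(instance: str, concept: str, plural: str) -> int:
    # count(i, c) := sum over all patterns p of count(i, c, p)
    return sum(web_hit_count(f'"{q}"')
               for q in instantiate_patterns(instance, concept, plural))

def categorize_all(instances, concepts, plurals, theta):
    # concepts and plurals are parallel lists of singular/plural names.
    # Returns the relation R as {instance: best concept}.
    R = {}
    for i in instances:
        scores = {c: count_pair(i, c, p) for c, p in zip(concepts, plurals)}
        best = max(scores, key=scores.get)
        if scores[best] >= theta:
            R[i] = best
    return R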
              Outline
Introduction
The Process of PANKOW
Pattern-based categorization
Evaluation
Integration into CREAM
Related work
Conclusion
        Evaluation Scenario
• Corpus: 45 texts from
  http://www.lonelyplanet.com/destinations

• Ontology: tourism ontology from GETESS project
   – #concepts: original – 1043; pruned – 682

• Manual Annotation by two subjects:
   – A: 436 instance/concept assignments
   – B: 392 instance/concept assignments
   – Overlap: 277 instances (Gold Standard)
   – A and B used 59 different concepts
   – Categorical (kappa) agreement on the 277 instances: 63.5%
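The agreement figure above is presumably Cohen's kappa for the two annotators; a minimal sketch of that computation over parallel label lists:

from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    # kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for
    # the agreement expected by chance from the marginal label frequencies.
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)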
Examples (instance, highest-scoring concept, aggregated Google hits; two columns)
Atlantic city 1520837       St John church 34021
Bahamas island 649166       Belgium country 33847
USA country 582275          San Juan island 31994
Connecticut state 302814    Mayotte island 31540
Caribbean sea 227279        EU country 28035
Mediterranean sea 212284    UNESCO organization 27739
Canada country 176783       Austria group 24266
Guatemala city 174439       Greece island 23021
Africa region 131063        Malawi lake 21081
Australia country 128607    Israel country 19732
France country 125863       Perth street 17880
Germany country 124421      Luxembourg city 16393
Easter island 96585         Nigeria state 15650
St Lawrence river 65095     St Croix river 14952
Commonwealth state 49692    Nakuru lake 14840
New Zealand island 40711    Kenya country 14382
Adriatic sea 39726          Benin city 14126
Netherlands country 37926   Cape Town city 13768
Results
[Plot: Precision, Recall, and F-measure vs. threshold from 0 to 990. Best F-measure: 28.24%; Recall/Accuracy: 24.90%.]
Comparison

System           #Concepts   Preprocessing / Cost          Accuracy
[MUC-7]          3           various (?)                   >> 90%
[Fleischman02]   8           n-gram extraction ($)         70.4%
PANKOW           59          none                          24.9%
[Hahn98]-TH      196         syn. & sem. analysis ($$$)    21%
[Hahn98]-CB      196         syn. & sem. analysis ($$$)    26%
[Hahn98]-CB      196         syn. & sem. analysis ($$$)    31%
[Alfonseca02]    1200        syn. analysis ($$)            17.39% (strict)
              Outline
Introduction
The Process of PANKOW
Pattern-based categorization
Evaluation
Integration into CREAM
Related work
Conclusion
CREAM/OntoMat
[Architecture diagram: the OntoMat annotation environment. The annotation tool GUI (ontology guidance & fact browser, document editor/viewer, annotation by markup) connects via plugins to a document management component and an annotation inference server; web pages are crawled from the WWW, annotated web pages and domain ontologies are loaded into the inference server, and PANKOW is attached for automatic annotation proposals.]

PANKOW & CREAM/OntoMat
Results (Interactive Mode)
[Plot: Precision, Recall, and F-measure (Top 5) vs. threshold from 0 to 980. Best F-measure: 51.65%; Recall/Accuracy: 49.46%.]
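In interactive mode PANKOW does not commit to the single best concept but proposes the top 5 to the annotator, which is what the (Top 5) curves measure. A minimal sketch of that ranking, given the aggregated scores for one instance:

import heapq

def suggest_top_k(scores: dict[str, int], k: int = 5) -> list[str]:
    # scores: {concept: aggregated hit count} for one instance,
    # e.g. {"city": 1520837, "state": 300, "island": 12}
    return heapq.nlargest(k, scores, key=scores.get)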
              Outline
Introduction
The Process of PANKOW
Pattern-based categorization
Evaluation
Integration into CREAM
Related work
Conclusion
Current State-of-the-art
• Large-scale IE [SemTag&Seeker@WWW'03]
  – only disambiguation
• Standard IE (MUC)
  – needs handcrafted rules
• ML-based IE (e.g. Amilcare@{OntoMat,MnM})
  – needs a hand-annotated training corpus
  – does not scale to large numbers of concepts
  – rule induction takes time
• KnowItAll (Etzioni et al., WWW'04)
  – shallow (pattern-matching-based) approach
              Outline
Introduction
The Process of PANKOW
Pattern-based categorization
Evaluation
Integration into CREAM
Related work
Conclusion
Conclusion
Summary
• a new paradigm to overcome the annotation problem
• unsupervised instance categorization
• a first step towards the self-annotating Web
• a difficult task: open domain, many categories
• decent precision, low recall
• very good results in interactive mode
• currently inefficient: 590 Google queries per instance (59 concepts × 10 pattern instantiations)

Challenges:
• contextual disambiguation
• annotating relations (currently restricted to instances)
• scalability (e.g. only issue reasonable queries to Google)
• accurate recognition of named entities (currently a POS tagger)
              Outline
Introduction
The Process of PANKOW
Pattern-based categorization
Evaluation
Integration into CREAM
Related work
Conclusion
Thanks to…
• Philipp Cimiano (cimiano@aifb.uni-karlsruhe.de) for the slides
• The audience for listening
								