   An Exploratory Study of the
   W3C Mailing List Test Collection for
   Retrieval of Emails with Pro/Con Arguments

        Yejun Wu & Douglas W. Oard
        University of Maryland, College Park
                    Ian Soboroff
            National Institute of Standards and
                    Technology


      July 27-28, 2006   CEAS, Mountain View, CA


                           Outline

•   Build the test collection
•   Evaluate the test collection (intrinsic evaluation)
•   Use the test collection (extrinsic evaluation)
•   Next steps to improve the test collection
W3C Mailing List Corpus

           w3c.org            NIST (6/2004)

     html-tidy@w3.org
     semantic-web@w3.org
     w3c-news@w3.org
     …
     w3c-rdfcore-wg@w3.org



     lists-000-9978864         Webpages
     lists-001-0094883
     …                         Unique DocIDs
     lists-003-9630221
               Parsing
     lists-000-9978864
     lists-001-0094883
                               174,311 emails
     …                         515MB
     lists-003-9630221
       IR Test Collection Design




         Query
       Formulation        Information seeking:
                          • Documents
                          • Information needs

Docs    Automatic         • Interactive process
         Search




        Interactive      Measure system:
         Selection
                         2 variations: system, user
            IR Test Collection Design


       Topic Statement (by Assessors)            Freeze user.

                  Query
                Formulation
                                                 Test Collection:
                                                 • Documents
                Automatic                        • Topic statements
Docs             Search
                                                 • Relevance judgments
                                                 • Metric
               Ranked Lists

                                   Relevance Judgments
                Evaluation
                                   (by Assessors)


  Evaluation Metric (Mean Average Precision)
DOCNO="lists-000-9978864"
RECEIVED="Sat Mar 18 08:56:28 2000"
ISORECEIVED="20000318135628"
SENT="Fri, 10 Mar 2000 13:26:29 -0500 (EST)"
ISOSENT="20000310182629"
NAME="Kerri Golden"
EMAIL="KGolden@Hynet.com"
SUBJECT="RTF Word 2000 spec?"
ID="C14D28BA032AD3118BE000104B87DDEC18A39A@solomon.hynet.com"
EXPIRES="-1"
TO="html-tidy@w3.org"

We are trying to convert Word 2000 docs to XML.
Our converter worked fine for W97 documents, but W2000 has a much different RTF
format (tables especially).
Does anyone know where I can get a hold of a spec for this version of RTF?
thanks
Kerri Golden
kgolden@hynet.com



             Topic Statement
TopicID: DS8
Query: html vs. xhtml
Narrative: A relevant message will compare the
  advantages/disadvantages of the two standards.

          Pool Top 50 Docs/Run for Relevance Judgments

          Team1 Run1              Team2 Run3               …       Team 12 Run2

1        lists-000-9978864       lists-009-8065221        ...          lists-000-7643767
2        lists-000-7643767       lists-006-2570023                     lists-012-2365001
3        lists-011-6087388       lists-000-9978864        …            lists-004-0205442
4        lists-012-1019722       …                                     lists-003-6603021
…        …                       lists-012-2365001                     …
50       lists-008-2365001       lists-005-5500248        …            lists-009-8065221




                 Average 529 emails/topic
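The pooling step above can be sketched in a few lines. This is a minimal illustration, not the organizers' actual tooling: the function name `build_pool` and the toy run contents are assumptions, and only the pool depth of 50 and the document-ID format come from the slide.

```python
def build_pool(runs, depth=50):
    """Union of the top `depth` documents from every submitted run.

    `runs` maps a run name to its ranked list of document IDs
    (hypothetical structure chosen for this sketch).
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Toy example with two short runs; real runs would hold 50+ documents each.
toy_runs = {
    "Team1 Run1": ["lists-000-9978864", "lists-000-7643767", "lists-011-6087388"],
    "Team2 Run3": ["lists-009-8065221", "lists-006-2570023", "lists-000-9978864"],
}
pool_depth_2 = build_pool(toy_runs, depth=2)
pool_depth_3 = build_pool(toy_runs, depth=3)
```

Because documents retrieved by several runs are judged only once, the pool for a topic is usually much smaller than (number of runs) x (depth).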

                                            12 teams * 3 runs = 36 runs
                                            Researchers as assessors
 Relevance Judgments
lists-000-9978864 Topic: , Pro/Con: 
lists-000-7643767 Topic: , Pro/Con: 
…
lists-008-2365001 Topic: , Pro/Con: 
                             Use of Test Collection
Measure systems of ranked retrieval

Topic DS1, System A:

 Rank   Doc#                Score   Rel?   Prec.
 1      lists-000-9743321   0.95    yes    1.00
 2      lists-000-7456300   0.91    yes    1.00
 3      lists-001-3400432   0.88
 4      lists-002-6590811   0.82
 5      lists-004-5566320   0.80    yes    0.60
 6      lists-009-1349620   0.77
 7      lists-011-0383209   0.63    yes    0.57
 8      lists-005-5201023   0.62
 9      lists-007-5610095   0.55
 10     lists-002-3204102   0.51    yes    0.50
 ------------------------------------------------
 Avg. Prec. (AP):                          0.73

Systems A and B, all topics:

 Topic   AP (A)   AP (B)   A-B
 DS1     0.73     0.50     +
 DS2     0.45     0.38     +
 DS3     0.56     0.36     +
 DS4     0.00     0.09     -
 DS5     0.13     0.10     +
 DS6     1.00     0.83     +
 DS7     0.24     0.28     -
 DS8     0.47     0.20     +
 DS9     0.53     0.41     +
 DS10    0.23     0.30     -
 -------------------------------
 MAP:    0.43     0.35     N+=7, N-=3

               Difference is not significant at the 0.05 level (two-tailed)
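The numbers on this slide can be reproduced with a short sketch. Two things here are assumptions rather than statements from the slide: the relevant ranks (1, 2, 5, 7, 10) are inferred from where precision values are shown, and the significance test is taken to be a two-tailed sign test because the slide reports N+ and N- counts. Function names are illustrative.

```python
from math import comb


def average_precision(relevant_ranks):
    """Mean of precision values at each rank where a relevant doc appears.

    Assumes every relevant document for the topic appears in the list.
    """
    ranks = sorted(relevant_ranks)
    precisions = [(i + 1) / rank for i, rank in enumerate(ranks)]
    return sum(precisions) / len(ranks)


def sign_test_two_tailed(n_plus, n_minus):
    """Exact two-tailed sign test p-value under a fair-coin null."""
    n = n_plus + n_minus
    k = max(n_plus, n_minus)
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


ap_ds1 = average_precision([1, 2, 5, 7, 10])

ap_a = [0.73, 0.45, 0.56, 0.00, 0.13, 1.00, 0.24, 0.47, 0.53, 0.23]
ap_b = [0.50, 0.38, 0.36, 0.09, 0.10, 0.83, 0.28, 0.20, 0.41, 0.30]
map_a = sum(ap_a) / len(ap_a)
map_b = sum(ap_b) / len(ap_b)

n_plus = sum(a > b for a, b in zip(ap_a, ap_b))
n_minus = sum(a < b for a, b in zip(ap_a, ap_b))
p_value = sign_test_two_tailed(n_plus, n_minus)
```

Under these assumptions the p-value is well above 0.05, which agrees with the slide's conclusion that a 7-to-3 split over ten topics is not a significant difference.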


            Emerging Topic Types
Type/Category: Method, tip, solution
• Example1:
  Query: Annotea installation
  Narrative: A relevant message will provide at least a tip on
  Annotea installation.
• Example2:
• Query: file upload http
• Narrative: A relevant message will discuss methods of doing
  file uploads using http.

                   Topic Type Analysis
Find categories amenable to pro/con classification


[Bar chart: Number of Topics in Categories (horizontal axis: number of topics, 0 to 30)]

  A: Comparisons, usefulness, relationships
  B: Methods, tips, solutions
  C: Discuss an issue
  D: Problems, impacts
  E: Definitions, functionality
  F: Reasons, design rationales
                       Measuring Agreement
           lists-000-9874732             lists-000-9874732       
           lists-001-0683001             lists-001-0683001
           lists-003-0000221              lists-003-0000221
           lists-004-8436200              lists-004-8436200       
           …                     …        …                       …
           lists-002-8833514             lists-002-8833514       

                   Judge1
               R       NR
Judge2 R        a      b       a+b
       NR       c      d       c+d
               a+c     b+d     a+b+c+d = N

Overlap:
   \text{Overlap} = \frac{a}{a+b+c}
   Range: 0 (none) to 1 (perfect)

Chance-corrected overlap, Cohen's Kappa:
   \kappa = \frac{(a+d) - \left[\frac{(a+b)(a+c)}{N} + \frac{(b+d)(c+d)}{N}\right]}{N - \left[\frac{(a+b)(a+c)}{N} + \frac{(b+d)(c+d)}{N}\right]}
   Range: -1 (inverse), 0 (chance), 1 (perfect)
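Both agreement measures can be computed directly from the four cells of the contingency table. The sketch below uses the standard definition of Cohen's kappa (observed agreement corrected by the agreement expected from the row and column totals); the function names and the example counts are hypothetical and do not come from the W3C judgments.

```python
def overlap(a, b, c):
    """Share of documents both judges marked relevant among those
    that at least one judge marked relevant."""
    return a / (a + b + c)


def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table:
    a = both relevant, b and c = the judges disagree, d = both non-relevant."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical counts for one topic.
example_overlap = overlap(20, 5, 10)
example_kappa = cohens_kappa(20, 5, 10, 65)

# Boundary cases matching the ranges listed on the slide.
perfect_kappa = cohens_kappa(10, 0, 0, 10)
inverse_kappa = cohens_kappa(0, 10, 10, 0)
```

Note that overlap ignores the `d` cell (documents both judges rejected), while kappa uses it. Kappa is undefined when the expected agreement equals 1, a case this sketch does not guard against.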

     Assessor Agreement by Category
[Figure: two bar charts of assessor agreement by topic category.
 Horizontal axis of both charts: Categories (Num of Topics): A(26), B(10), C(8), D(4), E(2), F(1), All.
 Chart 1: Kappa (vertical axis 0 to 0.8); series: pro/con, topical, 3 categ.
 Chart 2: Overlap (vertical axis 0.0 to 1.0); series: pro/con, topical.]

  Correlation between Overlap and Kappa > 0.9, significant at p<0.01


       Effect of Disagreement on Ranking
[Diagram: a ranking of five items (1-5) under the Primary Judge matched against the ranking of the same items under the Secondary Judge]

                                  Tau
    W3C, topical relevance        0.763 (significant)
    W3C, pro/con relevance        0.776 (significant)
    Typical text retrieval        "Identical" if >= 0.9

    Important difference in relevance judgment

    \text{Kendall's Tau} = 1 - \frac{\min(\text{pairwise adjacent swaps})}{N} = 1 - \frac{3}{5} = 0.4
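The worked value of 0.4 can be checked with a small sketch. The minimum number of adjacent swaps needed to turn one ranking into another equals the number of item pairs the two rankings order differently. The code uses the standard normalization, 1 - 2 * swaps / (number of pairs), which for five items gives the same result as the slide's 1 - 3/5. The two example rankings are hypothetical, chosen only so that they differ by three swaps.

```python
from itertools import combinations


def min_adjacent_swaps(reference, other):
    """Count item pairs that the two rankings order differently."""
    position = {item: idx for idx, item in enumerate(other)}
    mapped = [position[item] for item in reference]
    return sum(
        1
        for i, j in combinations(range(len(mapped)), 2)
        if mapped[i] > mapped[j]
    )


def kendall_tau(reference, other):
    """Kendall's tau for two rankings of the same items (no ties)."""
    n = len(reference)
    pairs = n * (n - 1) // 2
    return 1 - 2 * min_adjacent_swaps(reference, other) / pairs


# Hypothetical system rankings under two judges' relevance judgments.
primary = [1, 2, 3, 4, 5]
secondary = [2, 3, 1, 5, 4]
swaps = min_adjacent_swaps(primary, secondary)
tau = kendall_tau(primary, secondary)
```

Identical rankings give tau = 1 and fully reversed rankings give tau = -1.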



                       Outline

• Intrinsic evaluation
  -- topic type analysis
  -- inter-assessor agreement analysis
• Extrinsic evaluation:
  --Use W3C to evaluate a topic & pro/con system
            Experiment Design: Round Robin

Training (48 training topics; 48-fold cross-validation):
  • Each training topic: Pro/Con docs vs. Non-Pro/Con docs -> Pro/Con feature
  • Pro/Con features from all training topics -> Top N terms (N=100)

Evaluation (1 evaluation topic):
  • Evaluation topic + Top N terms -> INQUERY query
  • Search -> Ranked list
  • Evaluation against the query relevance set (relevance judgments)
  • Metric: MAP
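The round-robin design amounts to holding out one topic at a time for evaluation and training on the rest. The sketch below shows only that splitting step. The function name and the three-topic toy list are assumptions, and the feature-selection and INQUERY search steps are deliberately left out.

```python
def leave_one_topic_out(topic_ids):
    """Yield (training_topics, evaluation_topic) pairs, one per topic."""
    for held_out in topic_ids:
        training = [t for t in topic_ids if t != held_out]
        yield training, held_out


# Toy list of topic IDs; the slide's experiment uses 48 training topics per fold.
toy_topics = ["DS1", "DS2", "DS3"]
folds = list(leave_one_topic_out(toy_topics))
```

Each fold would then select pro/con terms from its training topics only, so the evaluation topic's own judgments never influence the query used to search for it.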
                 Compare Two Systems
Topic Retrieval (Baseline):
Query: 100% topic terms: Browser technology support incompatibility
MAP= 0.2743

Topic + Pro/Con Retrieval (Rocchio):
Query: 30% topic terms: Browser technology support incompatibility
       70% pro/con terms: advantage, strength, weakness …
MAP= 0.2857

4.2% relative improvement.
Sig. (p<0.05, Wilcoxon signed-rank test)
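As a quick arithmetic check, the relative improvement follows from the two MAP values above. With the rounded MAPs shown on this slide it comes out at about 4.2%, the figure given in the conclusion slide. The variable names are illustrative.

```python
baseline_map = 0.2743  # topic retrieval (baseline)
procon_map = 0.2857    # topic + pro/con retrieval (Rocchio)

# Relative improvement of the pro/con system over the baseline.
relative_improvement = (procon_map - baseline_map) / baseline_map
formatted = f"{relative_improvement:.1%}"
```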
[Figure: per-topic difference in average precision between the two systems. Horizontal axis: topic numbers. Vertical axis: Difference in Average Precision (-0.1 to 0.2). Above 0: Topic + Pro/Con System Better. Below 0: Baseline Better. Panels: All Topics, Topic Type A, Topic Type B, Topic Type C, Topic Type D, Topic Type E.]
   Effects of Topic and Topic Types
                                   Overlap                     Kappa
                               Topic   Topic Type          Topic   Topic Type
 Pro/Con Relevance Agreement           Sig. (p<0.05)               Sig. (p<0.05)
 Topical Relevance Agreement


• Two-way ANOVA
• Topic difficulty levels: 27 improved, 16 hurt, 7 unused
• Topic types: A, B, C, D, E, F


    Conclusion – Test Collection Evaluation
• Test collection generally useful
• Important differences in judgments
• Relevance judgments could be improved
• Topic type: factor of agreement of pro/con relevance
• Categories less of a pro/con nature:
  -- B (method, tip, solution) : does not lead to pro/con
  -- C (discuss an issue) : vague
• Rocchio style system: 4.2% improvement in MAP
• Major improvements in A and E
• Pro/con relevance judgments useful.

   Future Work – Better Test Collection Design
• Balance topic types:
  -- half in A.
  -- F (reason, design rationale): 1 topic.
• Study information needs and search process
• Improve the process
   --e.g., better defining topics for pro/con
• Use within-category topics for training
  -- examine the quality of training data by category
• Other classification methods: SVM, Naïve Bayes
• Separate models for detecting pros and cons.
• THANKS!
                           Pro/Con Feature Selection

                          Topic1       Topic2      …    Topic48
Pro/Con docs                20            8        …      15
Non Pro/Con docs            30            5        …      18
Topic Weight:            log(20+1)    log(5+1)     …    log(15+1)

"advantage":
TF in Pro/Con docs        TF=38+1      TF=30+1     …    TF=40+1
TF in Non Pro/Con docs    TF=10+1      TF=28+1     …    TF=10+1

Weight of "advantage":
   \log 21 \cdot \log\frac{39/20}{11/30} + \log 6 \cdot \log\frac{31/8}{29/5} + \dots + \log 16 \cdot \log\frac{41/15}{11/18}

Other candidate terms ("strength", "Microsoft", "Html", "opinion", …) are weighted the same way, giving a ranked term list:
   1     advantage
   2     strength
   3     weakness
   4     hate
   5     opinion
   …     …
   100   wow


                  Feature Selection
• Pro/con feature vector term weight

   TopicWeight_i = \log[\min(N_{pos,i} + 1, N_{neg,i} + 1)]

   log odds ratio:
   \sum_{i=1}^{48} \left( TopicWeight_i \times \log \frac{(TF_{pos,i} + 1) / N_{pos,i}}{(TF_{neg,i} + 1) / N_{neg,i}} \right)
   Pos: Pro/Con relevant documents
   Neg: Non Pro/Con relevant documents
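The term weight defined above can be sketched as follows, using the three topic columns from the "advantage" worked example (Topic1, Topic2, Topic48) as input. The function names and the tuple layout are assumptions made for this sketch, and natural logarithms are assumed because the slides do not state a base.

```python
import math


def topic_weight(n_pos, n_neg):
    """log of (1 + size of the smaller document class for the topic)."""
    return math.log(min(n_pos + 1, n_neg + 1))


def term_weight(per_topic_stats):
    """Topic-weighted sum of log odds ratios for one term.

    Each entry is (n_pos, n_neg, tf_pos, tf_neg): document counts and
    raw term frequencies in pro/con and non-pro/con relevant documents.
    """
    total = 0.0
    for n_pos, n_neg, tf_pos, tf_neg in per_topic_stats:
        odds_ratio = ((tf_pos + 1) / n_pos) / ((tf_neg + 1) / n_neg)
        total += topic_weight(n_pos, n_neg) * math.log(odds_ratio)
    return total


# Columns shown for the term "advantage" (the elided topics are omitted).
advantage_stats = [
    (20, 30, 38, 10),  # Topic1
    (8, 5, 30, 28),    # Topic2
    (15, 18, 40, 10),  # Topic48
]
advantage_weight = term_weight(advantage_stats)
```

A topic contributes a negative amount when the term is relatively more frequent in its non-pro/con documents, as Topic2 does here. The sketch assumes every topic has at least one document in each class, since it divides by both counts.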


       Rocchio-style Implementation
• Appropriate for topic and pro/con retrieval.
• Baseline classifier to test the utility of test collection
• Expanded query:
     Q_1 = \alpha Q_0 + \beta \left( \sum_{i=1}^{n_1} \frac{R_i}{n_1} - \sum_{i=1}^{n_2} \frac{S_i}{n_2} \right)

•   Q0: initial query; Q1: expanded query.
•   Ri: vectors from positive docs
•   Si: vectors from negative docs
•   α, β: parameters
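A minimal sketch of Rocchio-style query expansion over sparse term vectors follows. It assumes the common form in which the negative centroid is subtracted from the positive one and two parameters (called `alpha` and `beta` here) weight the initial query and the feedback term. The parameter symbols on the slide were lost in extraction, so these names, the 0.3/0.7 values (borrowed from the 30%/70% query mix on the comparison slide), and the toy vectors are all assumptions.

```python
from collections import defaultdict


def centroid(vectors):
    """Average a list of sparse term-weight dicts."""
    if not vectors:
        return {}
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio_expand(q0, positives, negatives, alpha, beta):
    """Q1 = alpha * Q0 + beta * (mean(positives) - mean(negatives))."""
    pos = centroid(positives)
    neg = centroid(negatives)
    terms = set(q0) | set(pos) | set(neg)
    return {
        t: alpha * q0.get(t, 0.0) + beta * (pos.get(t, 0.0) - neg.get(t, 0.0))
        for t in terms
    }


# Toy vectors for illustration only.
q0 = {"browser": 1.0}
positive_docs = [{"advantage": 2.0}, {"advantage": 1.0, "browser": 1.0}]
negative_docs = [{"install": 1.0}]
q1 = rocchio_expand(q0, positive_docs, negative_docs, alpha=0.3, beta=0.7)
```

Terms that occur mainly in negative documents end up with negative weights. A real system would typically drop those terms or clip them to zero before issuing the query.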

								