

									    Discussion Class 5


                   Discussion Classes

          Ask a member of the class to answer.
          Provide opportunity for others to comment.
    When answering:
          Stand up.
          Give your name. Make sure that the TA hears it.
          Speak clearly so that all the class can hear.
          Do not be shy at presenting partial answers.
          Differing viewpoints are welcome.
                 Question 1: Objectives

    The TREC workshop series has four goals:
    (a) Encourage research in text-based retrieval based on large test
        collections
    (b) Communication among industry, academia and government
    (c) Transfer of technology from research labs into products by
        demonstrating methodologies on real-world problems
    (d) Increase availability of appropriate evaluation techniques
    What does the ad hoc task contribute to each of these goals?

             Question 2: The TREC Corpus

    Source                             Size    # Docs      Median
                                    (Mbytes)             words/doc
    Wall Street Journal, 87-89          267     98,732        245
    Associated Press newswire, 89       254     84,678        446
    Computer Selects articles           242     75,180        200
    Federal Register, 89                260     25,960        391
    abstracts of DOE publications       184    226,087        111
    Wall Street Journal, 90-92          242     74,520        301
    Associated Press newswire, 88       237     79,919        438
    Computer Selects articles           175     56,920        182
    Federal Register, 88                209     19,860        396

    (a) What characteristics of this data are likely to impact the
        results of experiments?
    (b) Explain the statement, "Disks 1-5 were used as training
        data."
    (c) Suppose that you were designing two search engines: (i)
        for use with a library catalog, (ii) for use with a Web
        search service. How does your data differ from the
        TREC corpus?

      Question 3: TREC Topic Statement

    <num> Number: 409
    <title> legal, Pan Am, 103
    <desc> Description:
    What legal actions have resulted from the destruction of Pan Am
    Flight 103 over Lockerbie, Scotland, on December 21, 1988?
    <narr> Narrative:
    Documents describing any charges, claims, or fines presented to
    or imposed by any court or tribunal are relevant, but documents
    that discuss charges made in diplomatic jousting are not relevant.

                   A sample TREC topic statement

    (a) What is the relationship between TREC topic statements and
        queries?
    (b) Distinguish between manual and automatic methods of query
        construction.
    (c) Explain the process used by the manual methods.
    (d) Some of the results used a time limit (e.g., "limited to no
        more than 10 minutes clock time"). What was being timed?

       Question 4: Relevance Assessments

    (a) Explain the statement, "All TRECs have used the pooling
        method to assemble the relevance assessments."
    (b) How is relevance assessed?
    (c) What is the impact of some relevant documents being missed
        from the pool?
    (d) What is the problem of some relevant documents in the pool
        coming from only a single run? How serious is this?
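
A minimal sketch of the pooling idea behind Question 4, assuming each run is a ranked list of document IDs and the pool is the union of the top `depth` documents from every run. The function names `build_pool` and `unique_contributions`, the run names, and the document IDs are hypothetical, and the depth of 2 is chosen only to keep the example small; it is not the depth used in TREC.

```python
def build_pool(runs, depth):
    """Return the union of the top `depth` documents from each ranked run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def unique_contributions(runs, depth):
    """For each run, return the pooled documents that no other run contributed."""
    unique = {}
    for name, ranking in runs.items():
        others = set()
        for other_name, other_ranking in runs.items():
            if other_name != name:
                others.update(other_ranking[:depth])
        unique[name] = set(ranking[:depth]) - others
    return unique


# Hypothetical runs: each value is a ranking, best document first.
runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d2", "d5", "d1", "d6"],
    "run_c": ["d7", "d2", "d8", "d9"],
}

pool = build_pool(runs, depth=2)
unique = unique_contributions(runs, depth=2)
```

Only documents in `pool` would be shown to the assessors, so a relevant document ranked below the cutoff by every run is never judged, which is the situation part (c) asks about. The `unique` mapping lists, for each run, the pooled documents that would have gone unjudged had that run not taken part, which is one way to approach part (d).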

         Question 5: Evaluation

     What are:
     (a) The recall-precision curve?
     (b) The mean (non-interpolated) average precision?
     The report commented that, "two topics are fundamental to
     effective retrieval performance." What are they?
     How do the automatic tests differ from the manual?
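
A minimal sketch of the two measures named in Question 5, assuming binary relevance judgments and a single ranked list per topic. All function names, document IDs, and topic labels are hypothetical. Relevant documents that are never retrieved contribute a precision of zero to the average, and interpolated precision at a recall level is taken as the highest precision observed at that recall level or beyond.

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) at each rank where a relevant document appears."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolated_precision(points, level):
    """Highest precision at any recall >= level (0.0 if that recall is never reached)."""
    return max((p for r, p in points if r >= level), default=0.0)


def average_precision(ranking, relevant):
    """Non-interpolated average precision for one topic."""
    if not relevant:
        return 0.0
    total = sum(p for _, p in precision_recall_points(ranking, relevant))
    return total / len(relevant)  # unretrieved relevant documents count as 0


def mean_average_precision(topics):
    """Mean of the per-topic average precision values."""
    scores = [average_precision(ranking, relevant) for ranking, relevant in topics.values()]
    return sum(scores) / len(scores)


# Hypothetical topics: (ranked list, set of relevant documents).
topics = {
    "topic_1": (["d3", "d9", "d1", "d7", "d4"], {"d3", "d1", "d8"}),
    "topic_2": (["d2", "d5"], {"d5"}),
}

points = precision_recall_points(*topics["topic_1"])
ap_1 = average_precision(*topics["topic_1"])
map_score = mean_average_precision(topics)
```

For `topic_1` the relevant documents appear at ranks 1 and 3 and one relevant document is never retrieved, so the average precision is (1 + 2/3 + 0) / 3 = 5/9. Plotting `interpolated_precision(points, level)` over a fixed set of recall levels gives a recall-precision curve for the topic.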

               Question 6: The future

     (a) Why was TREC-8 the last year for the ad hoc task?
     (b) Does this mean that text-based information retrieval is
         now solved?

