					Lecture 8: Thesaurus Design

                      SIMS 202:
               Information Organization
                     and Retrieval

            Prof. Ray Larson & Prof. Marc Davis
                     UC Berkeley SIMS
         Tuesday and Thursday 10:30 am - 12:00 pm
                         Fall 2003
          http://www.sims.berkeley.edu/academics/courses/is202/f03/
IS 202 – FALL 2003                                             2003.09.18 - SLIDE 1
Lecture Overview
• Review
       – Types of Controlled Vocabularies
       – Name Authority Control
• Thesaurus Design and Development
       – Controlled Vocabularies for topical description
       – Thesaurus Design
       – Steps In Thesaurus Development
       – Indexing
• Discussion
Controlled Vocabularies
• Vocabulary control is the attempt to
  provide a standardized and consistent set
  of terms (such as subject headings,
  names, classifications, etc.) with the intent
  of aiding the searcher in finding
  information
• That is, it is an attempt to provide a
  consistent set of descriptions for use in (or
  as) metadata

Controlled Vocabularies
•    Names and name authorities
•    Gazetteers (geographic names)
•    Code lists (e.g., LC language codes)
•    Subject heading lists
•    Classification schemes
•    Thesauri




Name Authorities: The Problem

• Proliferation of the forms of names
       – Different names for the same person
       – Different people with the same names


• Examples
       – from Books in Print (semi-controlled but not
         consistent)
       – ERIC author index (not controlled)
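
Both problems above can be illustrated with a minimal sketch of a name authority file. The record layout, the record IDs, and the sample names are hypothetical and chosen only for illustration; real authority files carry far more detail.

```python
from collections import defaultdict

# Hypothetical authority records: one record per person, keyed by an ID.
# Dates in the authorized heading separate different people with the same name.
AUTHORITY = {
    "n001": {"authorized": "Smith, John, 1901-1975",
             "variants": ["Smith, J.", "John Smith", "Smith, John"]},
    "n002": {"authorized": "Smith, John, 1950-",
             "variants": ["Smith, J.", "John Smith", "Smith, John"]},
    "n003": {"authorized": "Twain, Mark, 1835-1910",
             "variants": ["Clemens, Samuel Langhorne", "Mark Twain"]},
}

def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())

def build_lookup(authority):
    """Map every normalized name form to the set of record IDs it may denote."""
    lookup = defaultdict(set)
    for record_id, record in authority.items():
        for form in [record["authorized"], *record["variants"]]:
            lookup[normalize(form)].add(record_id)
    return lookup

def resolve(name, lookup, authority):
    """Return the authorized headings a name form could refer to, sorted."""
    ids = lookup.get(normalize(name), set())
    return sorted(authority[i]["authorized"] for i in ids)

LOOKUP = build_lookup(AUTHORITY)
```

A variant such as "Clemens, Samuel Langhorne" resolves to one heading (different names, same person), while "John Smith" resolves to two headings (same name, different people) and needs a human decision.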



Uses of Controlled Vocabularies

• Library subject headings, classification
  and authority files
• Commercial journal indexing services and
  databases
• Yahoo, and other web classification
  schemes
• Online and manual systems within
  organizations
       – SunSolve
       – MacArthur

Indexing Languages
• An index is a systematic guide designed to
  indicate topics or features of documents in
  order to facilitate retrieval of documents or
  parts of documents
• An indexing language is the set of terms
  used in an index to represent topics or
  features of documents, and the rules for
  combining or using those terms


Types of Indexing Languages
• Uncontrolled keyword indexing
• Indexing languages
       – Controlled, but not structured
• Thesauri
       – Controlled and structured
• Classification systems
       – Controlled, structured, and coded
• Faceted classification systems

Thesauri
• A Thesaurus is a collection of selected
  vocabulary (preferred terms or descriptors)
  with links among synonymous, equivalent,
  broader, narrower and other related terms
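
As a rough illustration of this definition, a thesaurus entry can be modeled as a record holding a preferred term and its links. The field names follow the common UF/BT/NT/RT abbreviations; the classes and the sample education terms are assumptions made for this sketch, not entries copied from a published thesaurus.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """One preferred term (descriptor) with its links to other terms."""
    term: str
    used_for: set = field(default_factory=set)   # UF: synonyms and equivalents
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT

class Thesaurus:
    def __init__(self):
        self.entries = {}

    def entry(self, term):
        """Fetch the entry for a preferred term, creating it if needed."""
        return self.entries.setdefault(term, Entry(term))

    def add_synonym(self, preferred, synonym):
        self.entry(preferred).used_for.add(synonym)

    def add_broader(self, term, broader):
        """Record BT and the reciprocal NT so the links stay consistent."""
        self.entry(term).broader.add(broader)
        self.entry(broader).narrower.add(term)

    def add_related(self, a, b):
        """RT is symmetric, so store it in both directions."""
        self.entry(a).related.add(b)
        self.entry(b).related.add(a)

t = Thesaurus()
t.add_broader("Secondary schools", "Schools")
t.add_broader("Elementary schools", "Schools")
t.add_synonym("Secondary schools", "High schools")
t.add_related("Schools", "Teachers")
```

Storing the reciprocal link at the same time as the forward link is one simple way to keep BT/NT and RT references from drifting out of sync.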




Thesaurus Standards
• National and International Standards for
  Thesauri
       – ANSI/NISO Z39.19-1994 — American National
         Standard Guidelines for the Construction, Format and
         Management of Monolingual Thesauri
       – ANSI/NISO Draft Standard Z39.4-199x — American
         National Standard Guidelines for Indexes in
         Information Retrieval
       – ISO 2788 — Documentation — Guidelines for the
         establishment and development of monolingual
         thesauri
       – ISO 5964 — Documentation — Guidelines for the
         establishment and development of multilingual
         thesauri


Thesaurus Examples
• Examples
       – The ERIC Thesaurus of Descriptors
       – The Medical Subject Headings (MESH) of the
         National Library of Medicine
       – The Art and Architecture Thesaurus




ERIC Thesaurus – Entry




ERIC Thesaurus – Alphabetic




ERIC Thesaurus – KWIC Index




ERIC Thesaurus – Hierarchies




ERIC Thesaurus – Groups




ERIC Thesaurus – Online

                     http://www.ericfacility.net/extra/pub/thessearch.cfm




MESH – Entry




MESH – Alphabetic




MESH – Tree Structures




MESH – KWOC Index




MESH - Online




                     http://www.nlm.nih.gov/mesh/meshhome.html


AAT – Facets




AAT – Hierarchies (print)




AAT – Hierarchies (online)

                     http://www.getty.edu/research/tools/vocabulary/aat/




AAT – Entry (online)




Why Develop a Thesaurus?
• To provide a conceptual structure or
  "space" for a body of information
       – To make it possible to adequately describe
         the topical content of information resources at
         an appropriate level of generality or specificity
       – To provide enhanced search capabilities and
         to improve the effectiveness of searching (i.e.,
         to retrieve most of the relevant material
         without too much irrelevant material)
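
The parenthetical goal above corresponds to what retrieval evaluation usually calls recall (the share of relevant material that is retrieved) and precision (the share of retrieved material that is relevant). The slide does not name these measures, so the function below is an illustrative sketch using made-up document IDs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: 4 documents retrieved, 3 of them among the 6 relevant ones.
p, r = precision_recall({"d1", "d2", "d3", "d9"},
                        {"d1", "d2", "d3", "d4", "d5", "d6"})
```

Here precision is 3/4 and recall is 3/6: the search brought back little irrelevant material but missed half of what was relevant.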


Why Develop a Thesaurus?
• To provide vocabulary (or terminological)
  control
       – When there are several possible terms
         designating a single concept, the thesaurus
         should lead the indexer or searcher to the
         appropriate concept, regardless of the terms
         they start with
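
A minimal sketch of this kind of control is a lead-in vocabulary that maps whatever term a person starts with to the preferred term. The terms and the `to_descriptor` function are hypothetical examples, not taken from an existing thesaurus.

```python
# Hypothetical lead-in vocabulary: each non-preferred term points to the
# preferred term (descriptor) for its concept, like a USE reference.
USE = {
    "car": "automobiles",
    "cars": "automobiles",
    "motor cars": "automobiles",
    "autos": "automobiles",
}
DESCRIPTORS = {"automobiles", "bicycles"}

def to_descriptor(term):
    """Lead an indexer's or searcher's term to the preferred term, if any."""
    key = term.strip().lower()
    if key in DESCRIPTORS:
        return key
    return USE.get(key)  # None means the thesaurus has a gap here
```

Indexers and searchers go through the same mapping, so a document indexed from "motor cars" and a query for "autos" meet at the same descriptor.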




Preliminary Considerations
• What is used now?
       – Continue using an existing thesaurus?
       – Ad hoc modification of existing thesaurus?
       – Develop a new well-structured thesaurus?
• What is the scope and complexity of the
  subject field?
• What kind of retrieval objects or data will
  be dealt with?
• How exhaustive and specific is the desired
  description of objects?
Preliminary Considerations
• The scope and complexity of the field will
  provide some indication of the scope and
  complexity of the thesaurus
       – It is better to plan for a larger and more
         comprehensive system than a smaller system
         that rapidly will become inadequate as the
         database grows
• Development of a good thesaurus requires
  a major intellectual effort as well as clerical
  operations like data entry and production
  of sorted lists
Development of a Thesaurus
• Term selection
• Merging and development of concept
  classes
• Definition of broad subject fields and
  subfields
• Development of classificatory structure
• Review, testing, application, revision



Flow of Work in Thesaurus Construction
(Flowchart, based on Soergel, pp. 327-333)
• Select sources
• Assign codes
• Select terms
• Record selected terms
• Sort terms
• Merge identical terms
• Merge terms in same concept class
• Define broad subject fields
• Sort terms into broad subject fields
• Define subfields within one subject field
• Work out detailed structure of the subject field
• Select preferred terms
• All subfields of broad subject finished?
       – No: continue with the next subfield
       – Yes: go on
• All broad subjects finished?
       – No: continue with the next broad subject field
       – Yes: go on
• Improve class structure
• Print classified index and review
• Discuss with experts and users
• Select descriptors and checklist items
• Many modifications?
       – Yes: revise as needed
       – No: go on
• Assign notation
• Produce full thesaurus and check references
• Review and test


1. Term Selection
• Select sources for the collection of terms
       – Prearranged Sources
       – Open-ended Sources
• Assign codes to each source
• Selection of terms
       – For some prearranged and for all open-ended
         sources
• Enter terms into database with all
  information

1.1 Kinds of Sources
• Prearranged Sources
       – Existing descriptor lists, classification schemes,
         thesauri
              • This includes universal schemes like DDC or LCSH
       –   Nomenclatures of single disciplines
       –   Treatises on the terminology of a field
       –   Encyclopedias, lexica, dictionaries and glossaries
       –   Tables of contents of textbooks and handbooks
       –   Indexes of journals or abstracting journals
       –   Indexes of other publications in the field



1.1 Kinds of Sources
• Open-ended sources
       – Lists of search requests or interest profiles
       – Description of projects/activities to be served by the
         information retrieval system
       – Discussion with specialists in the field
       – Sample of documents in the field
              • Ask users why and how these documents relate to the field
              • Have documents indexed by experts in the field
       – Lists of titles of documents in the field
       – Abstracts and reviews of documents
       – Your own knowledge


Selection of Sources
• Prearranged sources require less effort in
  gathering the material, and may already indicate
  some relationships between terms and concepts
  and among the terms themselves
• Open-ended sources can reflect current
  terminology and may provide more complete
  coverage
• Choose a set of sources that are current, as
  complete as possible, and considered
  authoritative


Selection of Sources
• Each selected source is assigned an ID for
  tracking its use in the development of the
  thesaurus
       – Useful when making decisions about which
         terms to prefer
       – Useful for backtracking when questions arise
         (where did this come from?)
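
One possible way to track sources is sketched below: each source gets a short code, and every recorded term keeps the set of codes that supplied it. The codes, source descriptions, and record layout are hypothetical; the slide only says that each source is assigned an ID.

```python
from dataclasses import dataclass, field

# Hypothetical source codes and descriptions.
SOURCES = {
    "S1": "Existing thesaurus in the field",
    "S2": "Textbook table of contents",
    "S3": "Log of search requests",
}

@dataclass
class TermRecord:
    term: str
    source_ids: set = field(default_factory=set)

def record_term(db, term, source_id):
    """Add a term to the working database, noting which source supplied it."""
    if source_id not in SOURCES:
        raise KeyError(f"unknown source: {source_id}")
    key = term.lower()
    db.setdefault(key, TermRecord(term)).source_ids.add(source_id)

def provenance(db, term):
    """Answer 'where did this come from?' for a recorded term."""
    return sorted(db[term.lower()].source_ids)

db = {}
record_term(db, "Distance education", "S1")
record_term(db, "distance education", "S3")
record_term(db, "Distance learning", "S3")
```

A term attested by several sources (here "Distance education", from S1 and S3) may be a stronger candidate for preferred term than one seen only once, though that remains an editorial judgment.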




Selection of Terms
• Terms can be transferred directly from
  prearranged sources to the recording
  medium (cards or database)
       – Have to decide which terms and references to
         include, or to take the whole source




Selection of Terms
• In open-ended sources you read through
  the source and pick out terms (i.e. words
  and phrases) that might be useful in
  retrieval or as references to other terms
• Alternatively, use keyword and phrase
  extraction software to create lists of terms
  and select from those
• Transfer selected terms to the recording
  medium (cards or database)
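
The software route can be approximated with a very small word and phrase counter. This is a deliberately crude sketch (a short stopword list and adjacent-word phrases only), not a description of any particular extraction package; the sample text is invented.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "are", "on", "with"}

def candidate_terms(text, max_words=2):
    """Count words and adjacent-word phrases that contain no stopwords."""
    counts = Counter()
    # Split into sentences first so phrases do not span sentence boundaries.
    for sentence in re.split(r"[.!?;:]+", text.lower()):
        words = re.findall(r"[a-z]+(?:-[a-z]+)*", sentence)
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if not any(w in STOPWORDS for w in gram):
                    counts[" ".join(gram)] += 1
    return counts

sample = ("Distance education is growing. Distance education programs "
          "serve adult learners, and adult learners need flexible schedules.")
counts = candidate_terms(sample)
```

Calling `counts.most_common()` gives a frequency-ranked list of candidates; a person still decides which of them are worth recording.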

Work Form – Still relevant??




From Soergel, p. 399
2. Merging and Development of Concept Classes


• Sort Term DB into alphabetical order
• First Round
       – Merge information for identical terms, possibly
         pulling info from additional sources
• Second Round
       – Merge synonyms or terms in the same
         concept class
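
The two rounds can be sketched as follows. The raw records and the table of same-concept decisions are hypothetical; in practice the second round depends on human judgment about which terms name the same concept.

```python
from collections import defaultdict

# Hypothetical raw records: (term as found, source ID).
raw = [
    ("Teachers", "S1"), ("teachers", "S2"), ("Instructors", "S2"),
    ("Educators", "S3"), ("Pupils", "S1"), ("Students", "S3"),
]

# Hypothetical editorial decisions: which terms name the same concept.
SAME_CONCEPT = {"instructors": "teachers", "educators": "teachers",
                "pupils": "students"}

def first_round(records):
    """Sort alphabetically, then merge identical terms (ignoring case)."""
    merged = defaultdict(set)
    for term, source in sorted(records, key=lambda r: r[0].lower()):
        merged[term.lower()].add(source)
    return dict(merged)

def second_round(merged, same_concept):
    """Group terms into concept classes keyed by a provisional lead term."""
    classes = defaultdict(lambda: {"terms": set(), "sources": set()})
    for term, sources in merged.items():
        lead = same_concept.get(term, term)
        classes[lead]["terms"].add(term)
        classes[lead]["sources"] |= sources
    return dict(classes)

merged = first_round(raw)
classes = second_round(merged, SAME_CONCEPT)
```

After the second round, six raw records have collapsed into two concept classes, each carrying the pooled source information for later decisions about preferred terms.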




3. Definition of Broad Subject Fields and Subfields


• Define broad subject fields and sort terms into
  these broad fields
• Define subfields within each broad field and sort
  terms into these subfields
• Work out the detailed structure
       – Select preferred terms
       – Merge information for terms in the same concept
         class
• Repeat these steps
       – For each subfield within a broad field
       – And for each broad field
       – Until all terms have been consolidated and preferred
         terms selected

4. Development of Classificatory Structure

• Produce preliminary version of classified
  index and update the working database
• Improve classificatory structure
• Reality check
       – Produce and distribute a version of the
         classified index
       – Distribute to users/experts




5. Final Stages
•    Review
•    Testing
•    Application
•    Revision




Review
• Discuss classified index with users/experts
       – Select descriptors and checklist descriptors
• Assign notational symbols
• Produce main thesaurus and indexes
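
Assigning notation and printing a classified index might look like the sketch below. The notation scheme (a letter for the broad field, then dotted numbers) and the sample structure are assumptions made for illustration; real thesauri use many different notations.

```python
# Hypothetical classified structure: broad field -> subfield -> preferred terms.
# Order is editorial, so the code keeps the order in which items are listed.
STRUCTURE = {
    "Education": {
        "Institutions": ["Elementary schools", "Secondary schools"],
        "People": ["Students", "Teachers"],
    },
    "Technology": {
        "Media": ["Film", "Video"],
    },
}

def assign_notation(structure):
    """Yield (notation, depth, label) tuples in classified order."""
    for f, (field_name, subfields) in enumerate(structure.items()):
        field_code = chr(ord("A") + f)  # sketch only: supports up to 26 fields
        yield field_code, 0, field_name
        for s, (sub_name, terms) in enumerate(subfields.items(), start=1):
            yield f"{field_code}{s}", 1, sub_name
            for t, term in enumerate(terms, start=1):
                yield f"{field_code}{s}.{t}", 2, term

def classified_index(structure):
    """Render the classified index as indented text lines."""
    return [f"{'  ' * depth}{code} {label}"
            for code, depth, label in assign_notation(structure)]

lines = classified_index(STRUCTURE)
```

Printing `"\n".join(lines)` gives a draft classified index that can be handed to users and experts for review.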




Review (cont.)
• Check cross references and insert where
  needed
• Produce test version
• Test by indexing
• Modify as needed
• Produce production version




Testing a Thesaurus
• Assign descriptors to a sample set of NEW
  documents (use enough to get an idea of
  any gaps in the thesaurus)
• Test retrieval using sample questions and
  seeing how effectively the thesaurus maps
  to the appropriate descriptor
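
A simple version of such a test is to run sample terms through the lead-in vocabulary and see which ones fail to map. The vocabulary, the sample terms, and the `check_mapping` function are hypothetical.

```python
# Hypothetical lead-in vocabulary: entry term -> descriptor.
ENTRY_TERMS = {
    "teachers": "Teachers", "instructors": "Teachers",
    "students": "Students", "pupils": "Students",
    "distance education": "Distance education",
}

def check_mapping(queries, entry_terms):
    """Return (hit rate, unmatched terms) for a list of sample terms."""
    unmatched = [q for q in queries if q.lower() not in entry_terms]
    hit_rate = 1 - len(unmatched) / len(queries) if queries else 0.0
    return hit_rate, unmatched

# Hypothetical terms drawn from new documents and sample questions.
sample_terms = ["Instructors", "pupils", "e-learning", "Distance education"]
rate, gaps = check_mapping(sample_terms, ENTRY_TERMS)
```

The unmatched list is the useful output: each entry is either a missing lead-in term or a concept the thesaurus does not yet cover.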




The Indexing Process
• Concept identification
• Term selection (via thesaurus)
• Term assignment




Application: The Indexing Process (Manual)
(Flowchart, adapted from ISO 5963, p. 5)
• Start: examine the document and identify significant concepts
• Consider the first concept
• Does the thesaurus contain a term for the concept?
       – Yes: is it a preferred term?
              • Yes: consider the preferred term
              • No: select the preferred term
       – No: can the concept be expressed by combining terms?
              • Yes: consider each of these terms
              • No: establish a term denoting the concept; if the term is
                suitable, admit the new term into the thesaurus; if not,
                select an alternative term to represent the concept
• Consider any associated terms in the thesaurus (NT, BT)
• Would the concept be better represented by one of these terms?
       – Yes: prefer the alternative term(s)
       – No: keep the term already chosen
• Assign terms to the document
• Is there another concept?
       – Yes: consider the next concept
       – No: end
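
The main decisions in this process can be sketched as a loop over the concepts found in a document. The thesaurus contents and the combination rules below are hypothetical, and the sketch leaves out the check of associated terms (NT, BT) and the review of whether a new term is suitable, both of which need human judgment.

```python
# Hypothetical thesaurus: descriptors, plus USE references from
# non-preferred terms to descriptors.
DESCRIPTORS = {"Schools", "Libraries", "Teachers", "Automation"}
USE = {"Instructors": "Teachers", "Educators": "Teachers"}
# Hypothetical editorial rules: concepts expressed by combining descriptors.
COMBINATIONS = {"Library automation": ["Libraries", "Automation"],
                "School libraries": ["Schools", "Libraries"]}

def index_document(concepts):
    """Walk each concept through the decisions and collect assigned terms.

    Returns (assigned descriptors, concepts proposed as new terms).
    """
    assigned, proposed = [], []
    for concept in concepts:
        if concept in DESCRIPTORS:        # thesaurus has a preferred term
            terms = [concept]
        elif concept in USE:              # term found, but not preferred
            terms = [USE[concept]]
        elif concept in COMBINATIONS:     # express by combining terms
            terms = COMBINATIONS[concept]
        else:                             # candidate for a new term
            proposed.append(concept)
            continue
        for term in terms:
            if term not in assigned:
                assigned.append(term)
    return assigned, proposed

assigned, proposed = index_document(
    ["Instructors", "Library automation", "Schools", "Podcasting"])
```

Concepts that fall through every branch are collected rather than silently dropped, so they can feed the next revision of the thesaurus.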




Thesaurus Revision and Updates

• There will always be new concepts,
  products, or expressions that need to be
  added to the thesaurus
       – Set a regular schedule of reviews and
         revisions
       – Collect complaints, problems, etc. and fold
         into revision of the thesaurus




References
• Soergel, D. Indexing Languages and Thesauri:
  Construction and Maintenance. Los Angeles:
  Melville Publishing Co., 1974.
• Foskett, A.C. The Subject Approach to
  Information. London: Clive Bingley, 1982.
• Standards:
       – ANSI/NISO Z39.19-1994 — American National Standard
         Guidelines for the Construction, Format and Management of
         Monolingual Thesauri
       – ANSI/NISO Draft Standard Z39.4-199x — American National
         Standard Guidelines for Indexes in Information Retrieval
       – ISO 2788 — Documentation — Guidelines for the establishment
         and development of monolingual thesauri
       – ISO 5964 — Documentation — Guidelines for the establishment
         and development of multilingual thesauri


Simon King - Soergel
• Indexing Languages and Thesauri: Construction
  and Maintenance was published in 1974 and
  technology has clearly changed a lot since then.
  Do Soergel’s careful step-by-step instructions
  (often referencing use of index cards) still have
  much value? Are outdated instructions and
  flowcharts in fact dangerous to current thesaurus
  builders, allowing them to avoid thinking too
  much about what they’re doing while mindlessly
  following directions?


Simon King - Soergel
• Are his concerns about storage (e.g.,
  attaching too many descriptors to a
  document) outdated? How about the fact
  that indexing gets more difficult as the size
  of the indexing language increases? Can
  modern technology help us here? Is there
  an upper limit on the size of a usable
  indexing language or should we apply
  Soergel’s guidelines for eliminating terms
  that point to few documents to make index
  languages as small as possible?
Simon King - Soergel
• Do some of the algorithms/procedures he
  describes (e.g. F04.2(e) and F0.4.4,1 on
  weighting documents when determining
  frequencies of concepts/terms) still make
  sense when building more modern
  ISAR systems?




Sean Savage – House of Q.
• The House of Quality's foundation and
  starting point is a comprehensive and well-
  defined list of specific Customer Attributes
  (i.e., phrases that customers use to
  describe what's important to them in a
  product.) The CAs specify the "whats" that
  the team is aiming for and the team uses
  the CAs to determine Engineering
  Characteristics: the "hows" for reaching
  those targets.
Sean Savage – House of Q.
• The House of Quality seems to me a great
  technique for communicating about and
  improving mature products like
  automobiles and dishwashers, whose
  "whats" are stable and widely understood.
  What are we to make of this as we
  consider new and evolving technologies,
  where our biggest challenges often hinge
  upon discovery and definition of the
  "whats?"
Sean Savage – House of Q.
• With many such technologies, including
  phonecams, customers don't yet have a
  clear idea of specifically what's important
  to them, and in cases where they do state
  such opinions, those views often change
  dramatically as uses of the tools evolve
  and mature. Customers invent new uses
  for new tools, and in turn they develop
  new criteria for rating the tools.

Sean Savage – House of Q.
• Is The House of Quality irrelevant in the
  realm of very new technologies? Can we
  use cornerstones of The House to build
  more fluid team-communication tools that
  will allow more flexibility and change in the
  definition and measurement of the "whats"
  and that will clarify, rather than confuse,
  the relations between business functions
  and user satisfaction in evolving new
  products?
 Lisa de Larios-Heiman - Sano

• When designing a web page (or anything else),
  designers have to keep users' needs in mind.
  But is there a point where those needs should
  be ignored for the sake of creating clear,
  extensible designs? Is it possible to be such a
  good designer, and to have enough expertise on
  the subject matter, that the designer knows what
  the users' needs are without consulting them? For
  example, I disagree with Sano that international
  groupings of bands are irrelevant to people
  interested in music, but as a designer he does
  not recognize that as an important distinction.

Lisa de Larios-Heiman - Sano

• This article was written at least 7 years
  ago, and is very dated in its technical
  references. Is it similarly dated in its
  discussion of the web design process? Or
  have the basics of web design not
  developed as rapidly as the technology,
  either because they don't need to or
  because designers aren't responding
  quickly to the technological changes? Did
  his dated references make you take his
  writings less seriously?
Next Time
• Multimedia Information Organization and
  Retrieval
• Readings/Discussion:
       – Editing Out Video Editing (Ryan Shaw)
       – Computational Media Aesthetics: Finding
         Meaning Beautiful (Margaret Spring)
       – The Holy Grail of Content-Based Media
         Analysis (Hong Qu)
       – Applications of Video-Content Analysis and
         Retrieval (Dan Perkel)