Lecture 06: Controlled Vocabularies
Introduction


                       SIMS 202:
                Information Organization
                      and Retrieval
            Prof. Ray Larson & Prof. Marc Davis
                     UC Berkeley SIMS
         Tuesday and Thursday 10:30 am - 12:00 pm
                         Fall 2002
                 Some slides in this lecture were developed by Prof. Marti Hearst
IS 202 - FALL 2002                                                        2002.09.12 - SLIDE 1
Lecture Contents
• Review
       – Dublin Core
       – Other Metadata Systems
• Controlled Vocabularies
• Name Authority Files
       – Choice of Names
       – Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of
  Vocabularies
IS 202 - FALL 2002                   2002.09.12 - SLIDE 2
Lecture Contents
• Review
       – Metadata Systems
       – Dublin Core
• Controlled Vocabularies
• Name Authority Files
       – Choice of Names
       – Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of
  Vocabularies
IS 202 - FALL 2002                   2002.09.12 - SLIDE 3
Metadata Systems and Standards

• Naming and ID systems – URLs, ISBNs
• Bibliographic description – MARC, Dublin
  Core, TEI, etc.
• Music – SMDL
• Images and objects – CIMI, VRA core
  categories
• Numeric data – DDI, SDSM
• Geospatial data – FGDC
• Collections – EAD

IS 202 - FALL 2002                   2002.09.12 - SLIDE 4
Dublin Core
• Simple metadata for describing internet
  resources
• For “Document-Like Objects”
• 15 Elements (in base DC)




IS 202 - FALL 2002                    2002.09.12 - SLIDE 5
Dublin Core Elements
•    Title                •   Format
•    Creator              •   Resource Identifier
•    Subject              •   Source
•    Description          •   Language
•    Publisher            •   Relation
•    Other Contributors   •   Coverage
•    Date                 •   Rights Management
•    Resource Type
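
As a rough illustration, a Dublin Core description can be held as a simple mapping from element label to value. The sketch below uses the element labels given on the slides in this lecture; the helper name make_record and the sample values are hypothetical and are not part of the Dublin Core specification.

```python
# The 15 base Dublin Core elements, using the labels from these slides.
DC_ELEMENTS = [
    "TITLE", "CREATOR", "SUBJECT", "DESCRIPTION", "PUBLISHER",
    "CONTRIBUTORS", "DATE", "RESOURCE TYPE", "FORMAT", "IDENTIFIER",
    "SOURCE", "LANGUAGE", "RELATION", "COVERAGE", "RIGHTS",
]


def make_record(**fields):
    """Build a record, rejecting labels that are not Dublin Core elements."""
    record = {}
    for key, value in fields.items():
        label = key.upper().replace("_", " ")
        if label not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {label}")
        record[label] = value
    return record


# Hypothetical record describing this lecture (values are illustrative only).
lecture = make_record(
    title="Controlled Vocabularies Introduction",
    creator=["Larson, Ray", "Davis, Marc"],
    date="20020912",
    language="eng",
)
print(lecture["CREATOR"])
```

Note that nothing here constrains what goes into each element; that gap is what controlled vocabularies are meant to address.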

IS 202 - FALL 2002                         2002.09.12 - SLIDE 6
Title
• Label: TITLE

• The name given to the resource by the
  CREATOR or PUBLISHER




IS 202 - FALL 2002                   2002.09.12 - SLIDE 7
Author or Creator
• Label: CREATOR
• The person(s) or organization(s) primarily
  responsible for the intellectual content of
  the resource. For example, authors in the
  case of written documents, artists,
  photographers, or illustrators in the case of
  visual resources.



IS 202 - FALL 2002                      2002.09.12 - SLIDE 8
Subject and Keywords
• Label: SUBJECT
• The topic of the resource, or keywords or
  phrases that describe the subject or content of
  the resource. The intent of the specification of
  this element is to promote the use of controlled
  vocabularies and keywords. This element might
  well include scheme-qualified classification data
  (for example, Library of Congress Classification
  Numbers or Dewey Decimal numbers) or
  scheme-qualified controlled vocabularies (such
  as Medical Subject Headings or Art and
  Architecture Thesaurus descriptors) as well.

IS 202 - FALL 2002                          2002.09.12 - SLIDE 9
Description
• Label: DESCRIPTION
• A textual description of the content of the
  resource, including abstracts in the case of
  document-like objects or content descriptions in
  the case of visual resources. Future metadata
  collections might well include computational
  content description (spectral analysis of a visual
  resource, for example) that may not be
  embeddable in current network systems. In such
  a case this field might contain a link to such a
  description rather than the description itself.

IS 202 - FALL 2002                          2002.09.12 - SLIDE 10
Publisher
• Label: PUBLISHER
• The entity responsible for making the
  resource available in its present form,
  such as a publisher, a university
  department, or a corporate entity. The
  intent of specifying this field is to identify
  the entity that provides access to the
  resource.


IS 202 - FALL 2002                         2002.09.12 - SLIDE 11
Other Contributors
• Label: CONTRIBUTORS
• Person(s) or organization(s) in addition to
  those specified in the CREATOR element
  who have made significant intellectual
  contributions to the resource but whose
  contribution is secondary to the individuals
  or entities specified in the CREATOR
  element (for example, editors,
  transcribers, illustrators, and convenors).

IS 202 - FALL 2002                     2002.09.12 - SLIDE 12
Date
• Label: DATE
• The date the resource was made available in its
  present form. The recommended best practice is
  an 8 digit number in the form YYYYMMDD as
  defined by ANSI X3.30-1985. In this scheme, the
  date element for the day this is written would be
  19961203, or December 3, 1996. Many other
  schema are possible, but if used, they should be
  identified in an unambiguous manner.
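
A minimal sketch of the YYYYMMDD convention, using only the Python standard library. The function names to_dc_date and from_dc_date are made up for this example.

```python
from datetime import date, datetime


def to_dc_date(d: date) -> str:
    """Format a date as an 8-digit YYYYMMDD string."""
    return d.strftime("%Y%m%d")


def from_dc_date(s: str) -> date:
    """Parse an 8-digit YYYYMMDD string; raise ValueError otherwise."""
    if len(s) != 8 or not s.isdigit():
        raise ValueError(f"expected 8 digits, got {s!r}")
    return datetime.strptime(s, "%Y%m%d").date()


print(to_dc_date(date(1996, 12, 3)))  # 19961203
print(from_dc_date("19961203"))       # 1996-12-03
```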



IS 202 - FALL 2002                         2002.09.12 - SLIDE 13
Resource Type
• Label: RESOURCE TYPE
• The category of the resource, such as
  home page, novel, poem, working paper,
  preprint, technical report, essay,
  dictionary. It is expected that RESOURCE
  TYPE will be chosen from an enumerated
  list of types. One preliminary set of such
  types can be found at the following URL
  (now out of date):
     http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html

IS 202 - FALL 2002                                         2002.09.12 - SLIDE 14
Format
• Label: FORMAT
• The data representation of the resource, such as
  text/html, ASCII, Postscript file, executable application,
  or JPEG image. The intent of specifying this element is
  to provide information necessary to allow people or
  machines to make decisions about the usability of the
  encoded data (what hardware and software might be
  required to display or execute it, for example). As with
  RESOURCE TYPE, FORMAT will be assigned from
  enumerated lists such as registered Internet Media
  Types (MIME types). In principle, formats can include
  physical media such as books, serials, or other non-
  electronic media.
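
One way to assign FORMAT from an enumerated list is to look up the registered media type for a file name. This sketch relies on Python's standard mimetypes module; the helper name format_for and the file names are hypothetical.

```python
import mimetypes


def format_for(filename: str):
    """Return the Internet Media Type guessed from the file extension, or None."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


for name in ["lecture06.html", "photo.jpg"]:
    print(name, "->", format_for(name))
```

Guessing from the extension is only a convenience; a cataloger would still need to confirm that the value matches the actual encoding of the resource.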

IS 202 - FALL 2002                                   2002.09.12 - SLIDE 15
Resource Identifier
• Label: IDENTIFIER
• String or number used to uniquely identify
  the resource. Examples for networked
  resources include URLs and URNs (when
  implemented). Other globally-unique
  identifiers, such as International Standard
  Book Numbers (ISBN) or other formal
  names would also be candidates for this
  element.
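
Formal identifiers such as the ISBN usually carry a check digit, so a malformed value can be caught before it goes into the IDENTIFIER element. The following is a small sketch of the standard ISBN-10 check (weighted sum modulo 11); the function name is an assumption of this example.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Check an ISBN-10: the weighted digit sum must be divisible by 11."""
    chars = [c for c in isbn if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch in "Xx" and position == 9:
            value = 10          # 'X' stands for 10, in the last position only
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


print(is_valid_isbn10("0-306-40615-2"))  # True
print(is_valid_isbn10("0-306-40615-3"))  # False
```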

IS 202 - FALL 2002                    2002.09.12 - SLIDE 16
Source
• Label: SOURCE
• The work, either print or electronic, from
  which this resource is derived, if
  applicable. For example, an html encoding
  of a Shakespearean sonnet might identify
  the paper version of the sonnet from which
  the electronic version was transcribed.



IS 202 - FALL 2002                   2002.09.12 - SLIDE 17
Language
• Label: LANGUAGE
• Language(s) of the intellectual content of
  the resource. Where practical, the content
  of this field should coincide with the
  Z39.53 three character codes for written
  languages. See:
     http://www.sil.org/sgml/nisoLang3-1994.html




IS 202 - FALL 2002                                 2002.09.12 - SLIDE 18
Relation
• Label: RELATION
• Relationship to other resources. The intent of
  specifying this element is to provide a means to
  express relationships among resources that
  have formal relationships to others, but exist as
  discrete resources themselves. For example,
  images in a document, chapters in a book, or
  items in a collection. A formal specification of
  RELATION is currently under development.
  Users and developers should understand that
  use of this element should be currently
  considered experimental.

IS 202 - FALL 2002                          2002.09.12 - SLIDE 19
Coverage
• Label: COVERAGE
• The spatial locations and temporal
  duration characteristic of the resource.
  Formal specification of COVERAGE is
  currently under development. Users and
  developers should understand that use of
  this element should be currently
  considered experimental.


IS 202 - FALL 2002                  2002.09.12 - SLIDE 20
Rights Management
• Label: RIGHTS
• The content of this element is intended to be a
  link (a URL or other suitable URI as appropriate)
  to a copyright notice, a rights-management
  statement, or perhaps a server that would
  provide such information in a dynamic way. The
  intent of specifying this field is to allow providers
  a means to associate terms and conditions or
  copyright statements with a resource or
  collection of resources. No assumptions should
  be made by users if such a field is empty or not
  present.

IS 202 - FALL 2002                             2002.09.12 - SLIDE 21
Issues in Dublin Core
• Lack of guidance on what to put into each
  element
• How to structure or organize at the
  element level?
• How to ensure consistency across
  descriptions for the same persons, places,
  things, etc.



IS 202 - FALL 2002                   2002.09.12 - SLIDE 22
Metadata
• Structures and languages for the
  description of information resources and
  their elements (components or features)

• “Metadata is information on the
  organization of the data, the various data
  domains, and the relationship between
  them” (Baeza-Yates p. 142)


IS 202 - FALL 2002                     2002.09.12 - SLIDE 23
Metadata
• Often two main types of metadata are
  distinguished:
       – Descriptive metadata
              • Describes the information/data object and its
                properties
              • May use a variety of descriptive formats and rules
       – Topical metadata
              • Describes the topic or “aboutness” of an
                information/data object
              • May include a variety of vocabularies for
                describing subjects, topics, categories, etc.

IS 202 - FALL 2002                                          2002.09.12 - SLIDE 24
Lecture Contents
• Review
       – Metadata Systems
       – Dublin Core
• Controlled Vocabularies
• Name Authority Files
       – Choice of Names
       – Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of
  Vocabularies
IS 202 - FALL 2002                  2002.09.12 - SLIDE 25
Controlled Vocabularies
• Vocabulary control is the attempt to
  provide a standardized and consistent set
  of terms (such as subject headings,
  names, classifications, etc.) with the intent
  of aiding the searcher in finding
  information
• That is, it is an attempt to provide a
  consistent set of descriptions for use in (or
  as) metadata
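
In code, the simplest form of vocabulary control is a lookup from the many words a person might use to the one term the system has standardized on. The sketch below is illustrative only: the table PREFERRED and its entries are invented for this example and do not come from any published vocabulary.

```python
# Hypothetical lead-in vocabulary: variant term -> preferred term.
PREFERRED = {
    "automobiles": "Automobiles",
    "cars": "Automobiles",
    "autos": "Automobiles",
    "motion pictures": "Motion pictures",
    "movies": "Motion pictures",
    "films": "Motion pictures",
}


def control(term: str):
    """Return the preferred term for a variant, or None if it is uncontrolled."""
    return PREFERRED.get(term.strip().lower())


print(control("Cars"))      # Automobiles
print(control("movies"))    # Motion pictures
print(control("bicycles"))  # None
```

Both the indexer and the searcher pass their terms through the same table, which is what makes descriptions and queries meet.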

IS 202 - FALL 2002                      2002.09.12 - SLIDE 26
Controlled Vocabularies
•    Names and name authorities
•    Gazetteers (geographic names)
•    Code lists (e.g., LC language codes)
•    Subject heading lists
•    Classification schemes
•    Thesauri




IS 202 - FALL 2002                      2002.09.12 - SLIDE 27
Lecture Contents
• Review
       – Metadata Systems
       – Dublin Core
• Controlled Vocabularies
• Name Authority Files
       – Choice of Names
       – Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of
  Vocabularies
IS 202 - FALL 2002                  2002.09.12 - SLIDE 28
Names
• Remember Cutter’s objectives of
  bibliographic description?
       – To enable a person to find a document of
         which the author is known
       – To show what the library has by a given
         author
• First serves access
• Second serves collocation


IS 202 - FALL 2002                            2002.09.12 - SLIDE 29
Problems with Names
• How many names should be associated
  with a document?
• Which of these should be the “main
  entry?”
• What form should each of the names
  take?
• What references should be made from
  other possible forms of names that haven’t
  been used?

IS 202 - FALL 2002                   2002.09.12 - SLIDE 30
The Problem
• Proliferation of the forms of names
       – Different names for the same person
       – Different people with the same names


• Examples
       – from Books in Print (semi-controlled but not
         consistent)
       – ERIC author index (not controlled)



IS 202 - FALL 2002                              2002.09.12 - SLIDE 31
Goethe




                     …etc…



IS 202 - FALL 2002     2002.09.12 - SLIDE 32
John Muir




IS 202 - FALL 2002   2002.09.12 - SLIDE 33
Pauline Cochrane nee Atherton




IS 202 - FALL 2002              2002.09.12 - SLIDE 34
Pauline Cochrane nee Atherton




IS 202 - FALL 2002              2002.09.12 - SLIDE 35
Rules for Description
• AACR II and other sets of descriptive
  cataloging rules provide guidelines for:
       – Determining the number of name entries
       – Choosing a main entry
       – Deciding on the form of name to be used
       – Deciding when to make references




IS 202 - FALL 2002                            2002.09.12 - SLIDE 36
Authority Control
• Authority control is concerned with
  creation and maintenance of a set of terms
  that have been chosen as the standard
  representatives (also known as established)
  based on some set of rules
• If you have rules, why do you need to
  keep track of all of the headings? Can’t
  you just infer the headings from the rules?


IS 202 - FALL 2002                    2002.09.12 - SLIDE 37
Conditions of Authorship?
• Single person or single corporate entity
• Unknown or anonymous authors
       – Fictitiously ascribed works
• Shared responsibility
• Collections or editorially assembled works
• Works of mixed responsibility (e.g.,
  translations)
• Related works

IS 202 - FALL 2002                     2002.09.12 - SLIDE 38
Added Entries
• Personal names
       –    Collaborators
       –    Editors, compilers, writers
       –    Translators (in some cases)
       –    Illustrators (in some cases)
       –    Other persons associated with the work (such as the
            honoree in a festschrift)
• Corporate names
       – Any prominently named corporate body that has
         involvement in the work beyond publication,
         distribution, etc.

IS 202 - FALL 2002                                     2002.09.12 - SLIDE 39
Choice of Name
• AACR II says that the predominant form of
  the name used in a particular author’s
  writings should be chosen as the form of
  name

• References should be made from the
  other forms of the name



IS 202 - FALL 2002                  2002.09.12 - SLIDE 40
Form of the Name
• When names appear in multiple forms,
  one form needs to be chosen
• Criteria for choice are:
       – Fullness (e.g., full names vs. initials only)
       – Language of the name
       – Spelling (choose predominant form)
• Entry element:
       – John Smith or Smith, John?
       – Mao Zedong or Zedong, Mao? (Mao Tse
         Tung?)
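
To see why a mechanical rule cannot settle the entry element by itself, consider a deliberately naive inversion rule. The function below is a sketch written for this example, not a cataloging rule from AACR II.

```python
def naive_entry_form(name: str) -> str:
    """Invert 'Given Family' to 'Family, Given' (a deliberately naive rule)."""
    parts = name.split()
    if len(parts) < 2:
        return name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


print(naive_entry_form("John Smith"))  # Smith, John
print(naive_entry_form("Mao Zedong"))  # Zedong, Mao  (wrong: the family name is Mao)
```

The rule works for the first name and fails for the second, which is one reason established headings are recorded in an authority file rather than inferred each time.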

IS 202 - FALL 2002                                 2002.09.12 - SLIDE 41
Name Authority Files
              ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
               KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
               RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
               VST:d 08-21-91              Other Versions: earlier
               040 DLC$cDLC$dDLC$dOCoLC
               053 PR6005.R517
               100 10 Creasey, John
               400 10 Cooke, M. E.
               400 10 Cooke, Margaret,$d1908-1973
               400 10 Cooper, Henry St. John,$d1908-1973
               400 00 Credo,$d1908-1973
               400 10 Fecamps, Elise
               400 10 Gill, Patrick,$d1908-1973
               400 10 Hope, Brian,$d1908-1973
               400 10 Hughes, Colin,$d1908-1973
               400 10 Marsden, James
               400 10 Matheson, Rodney
               400 10 Ranger, Ken
               400 20 St. John, Henry,$d1908-1973
               400 10 Wilde, Jimmy
               500 10 $wnnnc$aAshe, Gordon,$d1908-1973

                      Different names for the same person
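
A record like this can drive "see" references directly. In MARC authority records the 100 field holds the established heading and each 400 field holds a variant form to refer from. The sketch below parses a few lines copied from the record above; the parsing is simplified for illustration (it assumes a tag, an indicator pair, then the name) and is not a general MARC reader.

```python
RECORD = """\
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Wilde, Jimmy
"""


def build_see_references(record_text: str) -> dict:
    """Map each variant name (400) to the established heading (100)."""
    heading = None
    variants = []
    for line in record_text.splitlines():
        tag, _indicators, value = line.split(maxsplit=2)
        # Keep only the name part, dropping subfields such as $d (dates).
        name = value.split("$")[0].rstrip(",")
        if tag == "100":
            heading = name
        elif tag == "400":
            variants.append(name)
    return {variant: heading for variant in variants}


see = build_see_references(RECORD)
print(see["Cooke, Margaret"])  # Creasey, John
print(see["Credo"])            # Creasey, John
```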



IS 202 - FALL 2002                                                          2002.09.12 - SLIDE 42
 Name Authority Files
ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
  KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
  RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
  VST:d 08-19-91
  040 OCoLC$cOCoLC
  100 10 Marric, J. J.,$d1908-1973
  500 10 $wnnnc$aCreasey, John
  663 Works by this author are entered under the name used in the item. For
     a listing of other names used by this author, search also under$bCreasey, John
  670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
  670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
  670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)


 IS 202 - FALL 2002                                            2002.09.12 - SLIDE 43
 Name Authority Files

ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
  KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
  RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
  VST:d 06-06-91               Other Versions: earlier
  040 DLC$cDLC$dDLC$dOCoLC
  100 10 Butler, William Vivian,$d1927-
  400 10 Butler, W. V.$q(William Vivian),$d1927-
  400 10 Marric, J. J.,$d1927-
  670 His The durable desperadoes, 1973.
  670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
  670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)

                      Different people writing with the same name

 IS 202 - FALL 2002                                             2002.09.12 - SLIDE 44
The Haunting of Lauran Paine
1.   Paine, Lauran.
     ALSO KNOWN AS:
      Carrel, Mark.        Batchelor, Reg.         Kelly, Ray.
      Thompson, Russ.      Beck, Harry.            Ketchum, Jack.
      Andrews, A. A.       Bedford, Kenneth.       Liggett, Hunter.
      Benton, Will.        Bosworth, Frank.        Lucas, J. K.
      Bradford, Will.      Bovee, Ruth.            Lyon, Buck.
      Bradley, Concho.     Cassidy, Claude.        Morgan, Arlene.
      Brennan, Will.       Custer, Clint.          Morgan, Valerie.
      Carter, Nevada.      Dana, Amber.            O'Connor, Clint.
      Allen, Clay.         Dana, Richard.          St. George, Arthur.
      Almonte, Rosa.       Davis, Audrey.          Sharp, Helen.
      Armour, John.        Drexler, J. F.          Thorn, Barbara.
      Cassady, Claude.     Duchesne, Antoinette.   Archer, Dennis.
      Glendenning, Donn.   Fisher, Margot.         Clark, Badger.
      Kelley, Ray.         Fleck, Betty.
      Kilgore, John.       Frost, Joni.
      Martin, Tom.         Gordon, Angela.
      Slaughter, Jim.      Gorman, Beth.
      Standish, Buck.      Hayden, Jay.
      …                    Houston, Will.
                           Howard, Troy.
                           Ingersol, Jared.
                           …
IS 202 - FALL 2002                                     2002.09.12 - SLIDE 45
Some Interesting Ones…




IS 202 - FALL 2002       2002.09.12 - SLIDE 46
Lecture Contents
• Review
       – Dublin Core
       – Other Metadata Systems
• Controlled Vocabularies
• Name Authority Files
       – Choice of Names
       – Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of
  Vocabularies
IS 202 - FALL 2002                  2002.09.12 - SLIDE 47
Structure of an IR System
(Diagram, adapted from Soergel, p. 19: Information Storage and Retrieval System)
• Search line: Interest profiles & queries → Formulating query in
  terms of descriptors → Storage of profiles → Store 1: Profiles/
  Search requests
• Storage line: Documents & data → Indexing (descriptive and
  subject) → Storage of documents → Store 2: Document
  representations
• Rules of the game = Rules for subject indexing + Thesaurus
  (which consists of Lead-In Vocabulary and Indexing Language)
• Comparison/Matching of Store 1 and Store 2 → Potentially
  relevant documents

    IS 202 - FALL 2002                                                                                   2002.09.12 - SLIDE 48
Uses of Controlled Vocabularies

• Library subject headings, classification,
  and authority files
• Commercial journal indexing services and
  databases
• Yahoo, and other web classification
  schemes
• Online and manual systems within
  organizations
       – SunSolve
       – MacArthur

IS 202 - FALL 2002                   2002.09.12 - SLIDE 49
Types of Indexing Languages
• Uncontrolled keyword indexing
• Indexing languages
       – Controlled, but not structured
• Thesauri
       – Controlled and structured
• Classification systems
       – Controlled, structured, and coded
• Faceted thesauri and classification
  systems
IS 202 - FALL 2002                           2002.09.12 - SLIDE 50
Indexing Languages
• An index is a systematic guide designed to
  indicate topics or features of documents in
  order to facilitate retrieval of documents or
  parts of documents

• An Indexing language is the set of terms
  used in an index to represent topics or
  features of documents, and the rules for
  combining or using those terms

IS 202 - FALL 2002                      2002.09.12 - SLIDE 51
Indexing Languages
• Library of Congress Subject Headings
• Yellow pages topics
• Wilson indexes (“Readers’ Guide”)




IS 202 - FALL 2002                  2002.09.12 - SLIDE 52
Thesauri
• A thesaurus is a collection of selected
  vocabulary (preferred terms or descriptors)
  with links among
       – Synonymous
       – Equivalent
       – Broader
       – Narrower, and
       – Other related terms
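
A thesaurus of this kind maps naturally onto a small record per descriptor with a list for each type of link. The sketch below is an assumption-laden illustration: the class Term, the dictionary THESAURUS, and its four entries are invented here and are not taken from any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term (descriptor) and its links to other terms."""
    label: str
    used_for: list = field(default_factory=list)  # synonymous/equivalent entry terms
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)


THESAURUS = {
    "Literature": Term("Literature", narrower=["Poetry", "Prose"]),
    "Poetry": Term("Poetry", used_for=["Verse"], broader=["Literature"],
                   narrower=["Sonnets"], related=["Prose"]),
    "Prose": Term("Prose", broader=["Literature"], related=["Poetry"]),
    "Sonnets": Term("Sonnets", broader=["Poetry"]),
}


def preferred(word: str):
    """Map an entry term to its descriptor, or None if it is not in the vocabulary."""
    for term in THESAURUS.values():
        if word == term.label or word in term.used_for:
            return term.label
    return None


def expand(label: str) -> list:
    """Return the term plus all of its narrower terms, depth first."""
    result = [label]
    for child in THESAURUS[label].narrower:
        result.extend(expand(child))
    return result


print(preferred("Verse"))    # Poetry
print(expand("Literature"))  # ['Literature', 'Poetry', 'Sonnets', 'Prose']
```

Expanding a query term to its narrower terms is one common use of the broader/narrower links at search time.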



IS 202 - FALL 2002                    2002.09.12 - SLIDE 53
Thesauri (Cont.)
• National and international standards for thesauri
       – ANSI/NISO Z39.19 -- 1994 -- American National
         Standard Guidelines for the Construction, Format and
         Management of Monolingual Thesauri
       – ANSI/NISO Draft Standard Z39.4-199x -- American
         National Standard Guidelines for Indexes in
         Information Retrieval
       – ISO 2788 -- Documentation -- Guidelines for the
         establishment and development of monolingual
         thesauri
       – ISO 5964 -- Documentation -- Guidelines for the
         establishment and development of multilingual
         thesauri

IS 202 - FALL 2002                                  2002.09.12 - SLIDE 54
Thesauri (Cont.)
• Examples:
       – The ERIC Thesaurus of Descriptors
       – The Art and Architecture Thesaurus
       – The Medical Subject Headings (MeSH) of the
         National Library of Medicine




IS 202 - FALL 2002                          2002.09.12 - SLIDE 55
Classification Systems
• A classification system is an indexing
  language often based on a broad ordering
  of topical areas
• Thesauri and classification systems both
  use this broad ordering and maintain a
  structure of broader, narrower, and related
  topics
• Classification schemes commonly use a
  coded notation for representing a topic
  and its place in relation to other terms

IS 202 - FALL 2002                    2002.09.12 - SLIDE 56
Classification Systems (Cont.)
• Examples:
       – The Library of Congress Classification
         System
       – The Dewey Decimal Classification System
       – The ACM Computing Reviews Categories
       – The American Mathematical Society
         Classification System




IS 202 - FALL 2002                          2002.09.12 - SLIDE 57
Using Controlled Vocabulary
• Start with the text of the document
• Attempt to “control” or regularize:
       – The concepts expressed within
              • mutually exclusive
              • exhaustive
       – The language used to express those concepts
              • limit the normal linguistic variations
              • regulate word order and structure of phrases
              • reduce the number of synonyms or near-synonyms
• Also, provide cross-references between
  concepts and their expression
(These slides follow Bates 88)                        Slide author: Marti Hearst

IS 202 - FALL 2002                                    2002.09.12 - SLIDE 58
Classification Schemes
• Classify possible concepts.
• Goals:
       – Completely distinct conceptual categories
         (mutually exclusive)
       – Complete coverage of conceptual categories
         (exhaustive)




                                             Slide author: Marti Hearst

IS 202 - FALL 2002                           2002.09.12 - SLIDE 59
Assigning Headings vs. Descriptors



 • Subject headings               • Descriptors
        – Assign one (or a few)     – Mix and match
          complex heading(s)
          to the document



         How would we describe recipes using
         each technique?
                                                  Slide author: Marti Hearst

IS 202 - FALL 2002                                2002.09.12 - SLIDE 60
Subject Heading vs. Descriptors

• Wilsonline
       – Athletes
       – Athletes -- Health & hygiene
       – Athletes -- Nutrition
       – Athletes -- Physical Exams
       – …
       – Athletics
       – Athletics -- Administration
       – Athletics -- Equipment -- Catalogs
       – …
       – Sports -- Accidents and Injuries
       – Sports -- Accidents and Injuries -- Prevention
• ERIC
       – Athletes
       – Athletic Coaches
       – Athletic Equipment
       – Athletic Fields
       – Athletics
       – …
       – Sports Psychology
       – Sportsmanship
                                                             Slide author: Marti Hearst

IS 202 - FALL 2002                                           2002.09.12 - SLIDE 61
Subject Headings vs. Descriptors

• Subject headings
       – Describe the contents of an entire document
       – Designed to be looked up in an alphabetical index
              • Look up document under its heading
       – Few (1-5) headings per document
• Descriptors
       – Describe one concept within a document
       – Designed to be used in Boolean searching
              • Combine to describe the desired document
       – Many (5-25) descriptors per document
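
Boolean searching over descriptors can be sketched with an inverted index: each descriptor points to the set of documents it was assigned to, and queries combine those sets. The descriptor names below appear in the ERIC example in this lecture, but the document numbers and the function names are made up for illustration.

```python
# Hypothetical inverted index: descriptor -> set of document ids.
INDEX = {
    "Athletes": {1, 2, 4},
    "Athletic Coaches": {2, 3},
    "Sports Psychology": {2, 4, 5},
}


def search_all(*descriptors):
    """Boolean AND: documents indexed under every descriptor given."""
    sets = [INDEX.get(d, set()) for d in descriptors]
    return set.intersection(*sets) if sets else set()


def search_any(*descriptors):
    """Boolean OR: documents indexed under at least one descriptor given."""
    return set().union(*(INDEX.get(d, set()) for d in descriptors))


print(sorted(search_all("Athletes", "Sports Psychology")))          # [2, 4]
print(sorted(search_any("Athletic Coaches", "Sports Psychology")))  # [2, 3, 4, 5]
```

With subject headings, by contrast, the combination is done in advance by the indexer, so the searcher looks up one precoordinated string instead of intersecting sets.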

                                              Slide author: Marti Hearst

IS 202 - FALL 2002                            2002.09.12 - SLIDE 62
Lecture Contents
• Review
       – Dublin Core
       – Other Metadata Systems
• Controlled Vocabularies
• Name Authority Files
       – Choice of Names
       – Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of
  Vocabularies
IS 202 - FALL 2002                  2002.09.12 - SLIDE 63
Hierarchical Classification
• Each category is successively broken
  down into smaller and smaller subdivisions
• No item occurs in more than one
  subdivision
• Each level divided out by a “character of
  division” (also known as a feature)
       – Example:
              • Distinguish “Literature” based on:
                     – Language
                     – Genre
                     – Time Period
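
The successive-division idea can be sketched as nested dictionaries, with one character of division per level applied in a fixed order. This is an illustration under assumptions: the value lists are taken from the literature example in this lecture, and the function names are invented.

```python
LANGUAGES = ["English", "French", "Spanish"]
GENRES = ["Prose", "Poetry", "Drama"]
PERIODS = ["16th", "17th", "18th", "19th"]


def build_hierarchy():
    """Nest the divisions in a fixed order: language, then genre, then period."""
    return {
        lang: {genre: list(PERIODS) for genre in GENRES}
        for lang in LANGUAGES
    }


def leaf_paths(tree):
    """List every leaf as a (language, genre, period) path from the root."""
    return [
        (lang, genre, period)
        for lang, genres in tree.items()
        for genre, periods in genres.items()
        for period in periods
    ]


paths = leaf_paths(build_hierarchy())
print(len(paths))  # 36
print(paths[0])    # ('English', 'Prose', '16th')
```

Every item has exactly one path, so finding all 19th-century drama regardless of language means visiting one branch under each language.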
                                                     Slide author: Marti Hearst

IS 202 - FALL 2002                                   2002.09.12 - SLIDE 64
Hierarchical Classification

                                Literature


                     English     French        Spanish     ...



      ... Prose Poetry Drama ... Prose Poetry Drama ...


                                                          ...

       16th 17th 18th 19th     16th 17th 18th 19th



                                                                Slide author: Marti Hearst

IS 202 - FALL 2002                                              2002.09.12 - SLIDE 65
Labeled Categories for Hierarchical
Classification

• LITERATURE
       – 100 English Literature
              • 110 English Prose
                     –   English Prose 16th Century
                     –   English Prose 17th Century
                     –   English Prose 18th Century
                     –   ...
              • 111 English Poetry
                     – 121 English Poetry 16th Century
                     – 122 English Poetry 17th Century
                     – ...
              • 112 English Drama
                     – 130 English Drama 16th Century
                     – …
       – 200 French Literature
                                                         Slide author: Marti Hearst

IS 202 - FALL 2002                                       2002.09.12 - SLIDE 66
Faceted Classification
• Create a separate, free-standing list for
  each characteristic or division (feature)

• Combine features to create a classification




                                       Slide author: Marti Hearst

IS 202 - FALL 2002                     2002.09.12 - SLIDE 67
Faceted Classification Along With Labeled
Categories

• A Language
       – a English
       – b French
       – c Spanish
• B Genre
       – a Prose
       – b Poetry
       – c Drama
• C Period
       – a 16th Century
       – b 17th Century
       – c 18th Century
       – d 19th Century

• Combined codes
       – Aa English Literature
       – AaBa English Prose
       – AaBaCa English Prose 16th Century
       – AbBbCd French Poetry 19th Century
       – BcCd Drama 19th Century
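
The facet lists on this slide can be turned into code almost directly: each facet is a free-standing table, and a class mark is just a sequence of (facet, value) letter pairs. The sketch below assumes two-character pairs as on the slide; the names FACETS, decode, and label are hypothetical.

```python
# Facet letter -> (facet name, {value letter: value label}), as on the slide.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(code: str):
    """Split a code such as 'AbBbCd' into (facet name, value label) pairs."""
    if len(code) % 2 != 0:
        raise ValueError(f"code must be letter pairs: {code!r}")
    pairs = []
    for i in range(0, len(code), 2):
        facet_name, values = FACETS[code[i]]
        pairs.append((facet_name, values[code[i + 1]]))
    return pairs


def label(code: str) -> str:
    """Build a readable label by joining the decoded facet values."""
    return " ".join(value for _name, value in decode(code))


print(label("AaBaCa"))  # English Prose 16th Century
print(label("AbBbCd"))  # French Poetry 19th Century
```

Because facets are independent, a code may omit any of them, which is how a class can be formed without specifying a language.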
                                               Slide author: Marti Hearst

IS 202 - FALL 2002                             2002.09.12 - SLIDE 68
Important Questions
• How to use both types of classification
  structures?

• How to look through them?

• How to use them in search?



                                      Slide author: Marti Hearst

IS 202 - FALL 2002                    2002.09.12 - SLIDE 69
Next Time
• Multimedia Information Organization and
  Retrieval (MED)

• Readings for next time (in Protected)
       – “Indexing the Content of Multimedia
         Documents” (S. W. Smoliar, L. D. Wilcox)
       – “Computational Media Aesthetics: Finding
         Meaning Beautiful” (C. Dorai, S. Venkatesh)
       – “The Holy Grail of Content-Based Media
         Analysis” (S. Chang)

IS 202 - FALL 2002                            2002.09.12 - SLIDE 70
Homework (!)
• Do Readings

• Receive and integrate feedback on
  Assignment 2 to iterate your Photo Use
  Scenario (nothing to turn in on this yet)

• Assignment 3: Photo Metadata Design
       – Due by Thursday, September 19


IS 202 - FALL 2002                       2002.09.12 - SLIDE 71

								