Module 6a: Intro to Controlled Vocabularies, Taxonomies and Classification

  IMT530: Organization of Information Resources
  Winter 2007
  Michael Crandall
                           Module 6a Outline
•     Where we are
•     Controlled vocabularies
•     Types of controlled vocabularies
•     Tagging
•     Overview of building vocabularies

                           Where We Are
• We looked at the indexing process to
  see how controlled vocabularies can be
  used to enhance access to information
    – Different methods of indexing provide
      different results
    – Need to decide on your approach based on
      an analysis of your business objectives, the
      user needs, and the domain
    – A combination of automatic and human
      indexing is often the best solution
                        Overview of Subject
• Subject analysis
   – a technique used to determine the “subject(s)” and
     disciplinary context exemplified by an object
• Subject indexing
   – a technique through which subject terms (words,
     taxonomic categories, or notation) are added to an
     object representation to describe the subject
     content of the object
• Controlled vocabularies
   – standards containing controlled subject terms
     (words, taxonomic categories, or notation) used in
     the indexing process

Controlled Vocabulary: Definition
• A controlled vocabulary is a list of terms
  (words or phrases) or codes (notation)
  used for indexing
• Almost always, controlled vocabularies
  show relationships among terms

                     Purpose of Controlled Vocabularies
• Specific Purposes
     – To provide access to content by subject, through
       providing hierarchical and associative relationships
       and synonym control for the terms used in the
     – Increase precision in retrieval and display by
       controlling homographs (words that are spelled the
       same but have different meanings)
• General Purposes
     – Assist users by conveying meaning, orientation,
       and structure in a subject area
     – Assist users by providing rich relationships among
       concepts and terms
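
A minimal sketch of synonym control and homograph control, assuming a tiny made-up vocabulary. The terms, qualifiers, and the `lookup` function are illustrative only and do not come from any real vocabulary.

```python
# Hypothetical mini-vocabulary: entry terms (synonyms and variants)
# all point to one preferred term, which is the only term used for indexing.
USE = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "automobiles": "Automobiles",
    "motor cars": "Automobiles",
}

# Homographs get a parenthetical qualifier so each meaning has its own term.
HOMOGRAPHS = {
    "mercury": ["Mercury (Planet)", "Mercury (Metal)"],
}


def lookup(entry):
    """Return the preferred term(s) for a word typed by a searcher or indexer."""
    key = entry.strip().lower()
    if key in HOMOGRAPHS:
        return HOMOGRAPHS[key]  # ambiguous: the user has to pick a meaning
    if key in USE:
        return [USE[key]]       # synonym control: many entry terms, one heading
    return []                   # not in the vocabulary
```

Calling `lookup("autos")` and `lookup("motor cars")` both lead to the same heading, which is what lets one search retrieve items that authors described with different words.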

• Proposes five different vocabularies in
  any system:
    – Authors
    – Indexers
    – Syndetic structure
    – Searchers
    – Formulated queries
• Formal tradition vs. document tradition

Types of Controlled Vocabularies

Zeng, M.L. (2005). Construction of controlled vocabularies: A primer.

•     Subject Heading List
•     Taxonomy
•     Thesauri
•     Classification Scheme
•     More terminology on Leonard Will’s site

                 Subject Heading Lists
• General list of terms (words and phrases), not limited
  by discipline or subject area
• Terms are called subject headings
• The distinction between thesauri & subject heading
  lists is largely historical (subject heading lists are
  older); there are very few subject heading lists
  because they are so expensive to maintain
• Terms are mainly subject attributes, but there are
  many exemplified attributes used in subdivisions
• Example: Library of Congress Subject Headings
  (LCSH), used in library catalogs
     – Sample terms: “France – Colonies – History – 18th century”;
       “Time and space – Juvenile fiction”; “Frogs” (notice the use
       of subdivisions, marked here by dashes; thesauri seldom use

                 Taxonomies
• List of terms (words and phrases) that may be general
  or subject/discipline/domain specific
• Terms are called taxons or (simply) terms
• Terms represent subjects, disciplines/domains, and
  exemplified attributes
• Used in digital environments only
• Examples: Microsoft Corporation intranet
  taxonomies; Yahoo taxonomy used in the Yahoo
     – Sample terms from the Yahoo taxonomy (in Yahoo, you’ll find
       these at the top of the screen as you browse through the
       directory): “Education”; “Science > Agriculture > Research >
       Government Agencies”; “Health > Nursing”; “Health >

• Thesauri (pl.) / Thesaurus (s.)
     – List of terms (words and phrases) that are usually
       limited to a specific subject or disciplinary area
     – Terms listed in a thesaurus are often called
     – Thesauri were mostly defined and developed after
       the advent of the computer and were created for
       use in a computerized environment (or with
       computers in mind)
     – Terms are usually subject (about) attributes, but
       some thesauri also contain exemplified (example
       of) attributes
     – Example: ERIC Thesaurus (education)
             • Sample terms from the ERIC Thesaurus: “School
               community relationship”; “College entrance exams”; “Age
               grade placement”
          “Classification” Schemes
• Chart of subject categories contextualized by
  a hierarchical structure
• Terms are lists of codes (notation)
• Terms are called classes and class numbers
• Classification schemes make use of
  disciplinary, subject, and (sometimes)
  exemplified attributes
• Used often to arrange physical documents;
  sometimes used in online environments

            “Classification” Example
• Examples: Dewey Decimal Classification
  (DDC); Universal Decimal Classification
  (UDC); Colon Classification
• Sample entries (DDC):
     – 510 (meaning: “Mathematics” (a discipline and a
     – 512.57 (meaning: “Mathematics / Linear,
       multilinear, multidimensional algebras / Factor
     – 362.582 (meaning: “Social problems and services
       / Problems of and services to the poor / Financial
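
A minimal sketch of how a decimal notation carries the hierarchy: shortening a class number generally gives a broader class. This is a simplification. The real DDC has places where the notation does not mirror the hierarchy exactly, and the function names here are hypothetical.

```python
def format_notation(digits):
    """Insert the decimal point after the third digit, as DDC notation does."""
    return digits if len(digits) <= 3 else digits[:3] + "." + digits[3:]


def broader_classes(notation):
    """List successively broader class numbers, nearest first."""
    digits = notation.replace(".", "")
    result = []
    while True:
        if len(digits) > 3:
            # Past the decimal point: drop the last digit.
            digits = digits[:-1]
        else:
            # Within the three-digit base: zero out the last non-zero digit.
            significant = digits.rstrip("0")
            if len(significant) <= 1:
                break  # already at a main class such as 500
            digits = significant[:-1].ljust(3, "0")
        result.append(format_notation(digits))
    return result
```

For the sample entry 512.57, this walks up through 512.5, 512, and 510 to the main class 500.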

    Four Types of Classification
• Kwasnik describes four classification systems
     –    Hierarchies
     –    Trees
     –    Paradigms
     –    Facets
• Paradigms are useful primarily for analysis of subject
  gaps and relationships in a constrained space
• Trees are a poor form of hierarchy with limited
• We’ll look at the other two in some detail over the next
  two weeks

                 Hierarchies
• Good for representation of knowledge in
  mature domains where the nature of the
  entities and relationships is well known
• You’ll see examples of these in the thesauri
  that we will look at in today’s exercise
• Require a model that describes what entities
  are included, with rules of association and
• Tend to be monolithic and cumbersome for
  large domains

                 Facets
• Actually a different approach rather than
  a different structure
    – May use hierarchies or trees as part of the
    – Originated in the work of S.R. Ranganathan
            • Proposed that any object could be viewed in
              five ways: personality, matter, energy, space
              and time (PMEST)
    – Being used more and more in modern
      information systems because of flexibility in
      meeting multiple needs
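
A minimal sketch of the faceted approach, assuming three made-up records described with the five PMEST facets. The records, the values assigned to each facet, and the `search` function are illustrative only.

```python
# Hypothetical records: each object is described once per facet.
RECORDS = [
    {"title": "Rice cultivation in Japan in the 1950s",
     "personality": "Agriculture", "matter": "Rice",
     "energy": "Cultivation", "space": "Japan", "time": "1950s"},
    {"title": "Wheat harvesting in Canada in the 1950s",
     "personality": "Agriculture", "matter": "Wheat",
     "energy": "Harvesting", "space": "Canada", "time": "1950s"},
    {"title": "Steel welding in Japan in the 1990s",
     "personality": "Engineering", "matter": "Steel",
     "energy": "Welding", "space": "Japan", "time": "1990s"},
]


def search(records, **facets):
    """Return titles of records matching every requested facet value."""
    return [
        record["title"]
        for record in records
        if all(record.get(name) == value for name, value in facets.items())
    ]
```

The same records answer `search(RECORDS, space="Japan")` and `search(RECORDS, personality="Agriculture", time="1950s")` without any single fixed hierarchy, which is the flexibility the slide refers to.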
                Collaborative Tagging
• Points out issues of “basic level” and “collective
• Tug of war between personal storage
     – Identifying qualities
     – Self reference
     – Task organizing
• and public nature of access
     –    What or who it is about
     –    What it is
     –    Who owns it
     –    Categories
• Stability emerges from imitation and shared

                               Trees vs. Tags
• Weinberger’s article postulates three types of
     – Trees (hierarchies)
     – Facets
     – Tags
• Golder/Huberman and Weinberger both point
  out that each approach can be useful in
  particular situations
     – Choosing your approach is part of the process of
       subject and domain analysis

          Steps in Constructing CVs
• Define your domain
• Gather concepts
    – From user interviews, search logs, content
      analysis, preexisting vocabularies
• Select your approach
• Extract terminology
• Control your terms
• Organize your terms
• Maintain, maintain, maintain
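
A minimal sketch of the "gather concepts" step using a search log, assuming the log is simply a list of query strings. The sample queries, the `min_count` threshold, and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical search log: one raw query string per search.
SEARCH_LOG = [
    "vacation policy",
    "Vacation Policy",
    "holiday schedule",
    "vacation  policy ",
    "expense report",
]


def candidate_terms(queries, min_count=2):
    """Count normalized queries and keep those seen at least min_count times."""
    # Normalize case and spacing so variant typings count as one candidate.
    counts = Counter(" ".join(query.lower().split()) for query in queries)
    return [(term, n) for term, n in counts.most_common() if n >= min_count]
```

Frequent queries are only candidates: they still need a decision on the preferred form, synonyms, and placement in the structure during the later steps.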

• If not, take a break!!!

                                      Exercise 6a
• Purpose is to explore some existing controlled
  vocabularies to investigate their differences
  and similarities, how useful they might be for
  subject access, and to become familiar with
  the structure of controlled vocabularies in
• Spend the next 45 minutes on Exercise 6a
• Ask questions and talk!!!
• Be sure to hand in completed work at the end
  of class for credit!!!

• We’ll start to look at ways to build
  controlled vocabularies and the rules
  associated with them
• Remember to read assignments
  BEFORE class
