The Subject Indexing Process by pengxiang


   Controlled vs. free vocabularies
   Indexing languages
   Subject indexing process
   Analysis of a document
   Indexing exercise
   Performance of an information system
   Pre-coordinate and post-coordinate
   Controlled vocabularies
   Indexing failures

   Indexing – the process whereby indexes
    and associated tools for the organization
    of knowledge are created
   Effective & efficient indexing – involves
    skill and judgment in the assignment of
   There are three stages:

   the indexer becomes conversant with the
    subject content of the document to be
    indexed
   the indexer attempts to identify the
    concepts that are represented by the
    words in the document
   the indexer must examine the
    document’s content, concentrating
    particularly on the clues offered by the
    title, the contents page, chapter headings
    and any abstracts, introduction,
   the identification of the concepts within a
    document which are worthy of indexing
    ◦ usually it is possible to identify a central theme
    ◦ to what extent should access be provided to
      secondary topics considered within a document?
      (TGM example)
    ◦ traditional approaches have sought to find an
      indexing term which is co-extensive with the
      content of the document (i.e. the scope of the
      term & the document match)

       3 Questions the Indexer must ask:
    ◦      What is the document about? (ideal = read
           entire item and pick central theme)
    ◦      Why has it been added to our collection?
    ◦      What aspects will interest our users?
       Key = no single set of “correct” terms = depends on
        audience and collection
       And – the more specialized the clientele, the more likely
        it is that the index can be tailored to their needs (i.e.
        highly specific)

   Exhaustive indexing – how many themes
    will be included?
   Specificity – always index at the level of
    specificity of the document
    ◦ E.g. an article on cultivating oranges is indexed
      under “oranges”, not “citrus fruit” (and not both
      – only assign one term for each concept)
    ◦ What if the controlled vocabulary doesn’t
      include the term “oranges”? Use the most
      specific term you can use (“citrus fruit” not
   In practice – specificity may be achieved by
    using term combinations (e.g. Canadian
    Libraries = “Canada” & “Libraries”)
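The specificity rule above can be sketched in code. This is a minimal illustration only, assuming a toy hierarchy in which each concept points to its broader term; the names `BROADER` and `most_specific_term` are hypothetical and do not come from any real thesaurus tool.

```python
# Hypothetical toy hierarchy: each concept maps to its broader concept.
BROADER = {
    "oranges": "citrus fruit",
    "citrus fruit": "fruit",
    "fruit": None,
}


def most_specific_term(concept, vocabulary, broader=BROADER):
    """Return the most specific term in `vocabulary` that covers `concept`.

    Walks up the hierarchy from the concept itself until it reaches a term
    that the controlled vocabulary contains; returns None if nothing covers it.
    """
    current = concept
    while current is not None:
        if current in vocabulary:
            return current
        current = broader.get(current)
    return None


# A vocabulary that includes "oranges" indexes the article at that level.
print(most_specific_term("oranges", {"oranges", "citrus fruit", "fruit"}))
# A vocabulary without "oranges" falls back to the next broader term.
print(most_specific_term("oranges", {"citrus fruit", "fruit"}))
```

Only one term is returned per concept, which matches the rule of assigning a single term for each concept rather than both the specific and the broader one.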

   Read the article, focusing on one
    paragraph at a time
    ◦ What is the writer saying?
    ◦ What are the concepts – or are there any?
    ◦ How does the paragraph reward the reader – are
      there important ideas here?
    ◦ Write down words/phrases for the concepts that
      come to mind
    ◦ Index the document using “natural language”

   having identified the central theme of a
    document, this theme must be
    described in terms present in the
    indexing language
       in controlled language indexing, this
        involves using the thesaurus to
        assign terms to the document
       key = select terms and relationships
        that are consistent with the “typical”
        user’s perspective on the subject (i.e.
        user warrant = so that the indexing
        system is tailored to the needs of the
        users of the index)
1.   Indexing accuracy
        Indexers have control over accuracy
2.   Indexing policy
        Outside the indexer’s control
        Major policy decision = exhaustivity –
         the terms assigned may represent the
         subject matter of the document
         completely or they may be selective
        E.g. “most items will be indexed with
         8 to 15 terms”

   Precision = ratio of useful items to total
    retrieved
        = (# of relevant records retrieved) / (# of records retrieved)
   Recall = extent to which all useful items
    are found, from the total in the database
        = (# of relevant records retrieved) / (# of relevant records in the database)
   E.g. 100 relevant records in the database
         80 records retrieved that are relevant
        200 records retrieved in total
        Recall = 80/100 = 80%
        Precision = 80/200 = 40% (i.e. lots of “junk”)
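The two ratios can be checked with a few lines of code. This is a small sketch using the figures from the example above; the function and variable names are chosen here for illustration.

```python
def precision(relevant_retrieved, total_retrieved):
    """Share of the retrieved records that are relevant."""
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved, relevant_in_database):
    """Share of the relevant records in the database that were retrieved."""
    return relevant_retrieved / relevant_in_database


# Figures from the worked example above.
relevant_in_database = 100
relevant_retrieved = 80
total_retrieved = 200

print(f"Recall = {recall(relevant_retrieved, relevant_in_database):.0%}")
print(f"Precision = {precision(relevant_retrieved, total_retrieved):.0%}")
```

Both measures share the same numerator (relevant records retrieved); only the denominator differs.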
1.       Include all topics known to be of interest
         to the users of the information service that
         are treated substantively in the document
     ◦     Ask yourself – how much information is given
           on the topic in the article? How much interest
           will users have in the topic? How much
           information already exists on the topic?

2.       Index each of these as specifically as the
         vocabulary of the system allows and the
         needs or interests of the users warrant

   Allow a searcher to combine terms in any
   the multidimensionality of the
    relationships among terms is retained
   every term assigned to a document has
    equal weight – one is no more important
    than another
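Combining terms at search time can be pictured as set intersection over an inverted index. The sketch below is only an illustration of that idea; the sample documents and the names `DOCUMENTS`, `build_inverted_index`, and `search_all` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: document id -> index terms assigned by the indexer.
DOCUMENTS = {
    "doc1": {"Canada", "Libraries"},
    "doc2": {"Canada", "Economic relations"},
    "doc3": {"Libraries", "Indexing"},
}


def build_inverted_index(documents):
    """Map each term to the set of documents it was assigned to."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search_all(index, *terms):
    """Return documents indexed under every given term (Boolean AND)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(DOCUMENTS)
# Term order does not matter, since every assigned term carries equal weight.
print(search_all(index, "Canada", "Libraries"))
print(search_all(index, "Libraries", "Canada"))
```

Because the terms are stored separately and only combined by the searcher, the same result comes back whichever term is given first.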

   Are not as flexible as postcoordinate
   The multidimensionality of the
    relationships among terms is difficult to
   Terms can only be listed in a particular
    sequence, which implies that the first
    term is more important than the others
   It is not easy to combine terms at the
    time a search is performed
       E.g. LCSH
        Mozambique – Economic Relations – South

   an authority list = indexers can only assign
    to a document terms that appear on the
    list approved by the organization for which
    they work
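An authority list amounts to a membership check on every term an indexer proposes. The following is a minimal sketch, assuming a made-up list; `AUTHORITY_LIST` and `check_terms` are hypothetical names.

```python
# Hypothetical authority list approved by the indexing organization.
AUTHORITY_LIST = {"Citrus fruit", "Libraries", "Canada", "Indexing"}


def check_terms(assigned_terms, authority_list=AUTHORITY_LIST):
    """Split assigned terms into those on the authority list and those not on it."""
    approved = [t for t in assigned_terms if t in authority_list]
    rejected = [t for t in assigned_terms if t not in authority_list]
    return approved, rejected


approved, rejected = check_terms(["Canada", "Libraries", "Bookmobiles"])
print("approved:", approved)
print("rejected:", rejected)
```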

   Moves the responsibility from the user (i.e.
    through free-text searching) to the indexer
    (through the creation/use of controlled
    vocabularies)

a)   controlled access to each concept (i.e.
     consistent representation of the term)
b)   the creation of hierarchies (broader,
     narrower terms) shows relationships
     between terms
c)   major and minor descriptors are used
     to represent the document at hand
d)   controlled access for plurals, acronyms,
e)   homonyms are controlled = same word
     relates to different concepts (e.g.
     “pitch” = music vs. baseball) = can
     control for these differences
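Points (a), (d) and (e) can be illustrated with two lookup tables: one that maps variant forms to a single preferred term, and one that splits a homonym by context. This is a sketch under assumptions; the tables, the qualifier style such as "Pitch (Music)", and the `normalize` function are all hypothetical.

```python
# Hypothetical entry vocabulary: variant forms -> preferred term.
PREFERRED = {
    "library": "Libraries",
    "libraries": "Libraries",
    "isi": "Institute for Scientific Information",
    "institute for scientific information": "Institute for Scientific Information",
}

# Hypothetical homonym table: an ambiguous word needs a context to pick a term.
HOMONYMS = {
    "pitch": {"music": "Pitch (Music)", "baseball": "Pitch (Baseball)"},
}


def normalize(word, context=None):
    """Map a word to its controlled term.

    Returns None if the word is unknown, or if it is a homonym and the
    context is missing or not recognized.
    """
    key = word.strip().lower()
    if key in HOMONYMS:
        return HOMONYMS[key].get(context)
    return PREFERRED.get(key)


print(normalize("Library"))           # singular and plural share one term
print(normalize("ISI"))               # acronym resolves to the full form
print(normalize("pitch", "music"))    # homonym resolved by context
print(normalize("pitch"))             # ambiguous without a context
```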

    The Institute for Scientific Information (ISI)
     publishes three citation indexes:
1.   the Science Citation Index
2.   the Social Sciences Citation Index
3.   the Arts and Humanities Citation Index

ISI Web of Science

   Most subject indexing = binary decision = a
    term is either assigned to a document or it
    is not

   Some indexes provide for a weighting of
    terms on a numeric scale, or use “major” or
    “minor” descriptors

     Example: Psychological research of
    computer mediated communication in

    Conceptual analysis failures:
    1. Failure to recognize a topic that is of
       potential interest to the user group served
    2. Misinterpretation of what some aspect of
       the document really deals with, leading to
       an assignment of terms that are
    Translation failures:
    1. Failure to use the most specific term
       available to represent some subject
    2. Use of a term that is inappropriate to the
       subject matter because of lack of subject
       knowledge or due to carelessness

1.   The indexer contravenes policy
2.   The indexer fails to use the vocabulary
     elements in the way they should be used
3.   The indexer fails to use a term at the
     correct level of specificity
4.   The indexer uses an obviously incorrect
     term, perhaps through a lack of subject
     knowledge
5.   The indexer omits an important term

