

      Codifying Semantic Information in Medical Questions Using Lexical
      Paul E. Pancoast
      Arthur B. Smith
       Chi-Ren Shyu
               Research Purpose
   To find a method for classifying medical
    questions that are asked by clinicians

   Hypothesis - Simply indexing by keywords isn't
    enough to
       distinguish questions with different meanings but
        similar wording, or to
       group questions with similar meanings but different
        wording
   Semantic Information – the meaning of the
   Syntactic Information – the parts of speech of
    the words (word type, sentence part)
   Medical Questions – a question asked by a
   Lexical Sources – sources of words and
   UMLS – Unified Medical Language System
   Ambitious project of the National Library of
    Medicine, begun in 1986
   Help researchers retrieve and integrate
    electronic biomedical information from a variety
    of sources
   Links over 100 controlled vocabularies
   Assigns unique identifiers to medical concepts
    and strings
   Maps the hierarchical relationships between the
    medical concepts
                  Why Bother?
         (To classify medical questions?)

   Clinicians have questions when treating patients
   Researchers have gathered collections of these
   No good method exists to classify the questions
   How many times has a particular question been asked?
   Which questions should receive priority for
    evidence-based answers?
   What is the best way to treat acute pharyngitis?
   How should I approach a patient with a sore throat?

   What should I do with a patient with diabetes
    and insulin resistance?
   What should I do with a patient with diabetes
    who is resistant to taking insulin?
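
The two diabetes questions above show why keyword indexing alone struggles. A minimal sketch, assuming a small stop-word list invented for this illustration (it is not the lexicon used in the study), compares the keyword sets of the two questions:

```python
import re

# Hypothetical stop-word list for illustration; not taken from the study.
STOP_WORDS = {"what", "should", "i", "do", "with", "a", "and", "who", "is", "to"}

def keywords(question):
    """Lowercase the question, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", question.lower())
    return {w for w in words if w not in STOP_WORDS}

def jaccard(a, b):
    """Fraction of all keywords that the two sets share (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

q1 = "What should I do with a patient with diabetes and insulin resistance?"
q2 = "What should I do with a patient with diabetes who is resistant to taking insulin?"

k1, k2 = keywords(q1), keywords(q2)
print(sorted(k1 & k2))   # keywords the two questions have in common
print(jaccard(k1, k2))   # overlap score between the keyword sets
```

Under these assumptions the two questions share "patient", "diabetes", and "insulin", so a keyword index sees them as closely related even though one concerns a physiological state and the other a patient's reluctance.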
             Source Questions
   American researcher – observed clinicians
    at work
   British researchers – questions sent in by
    clinicians – answered by researchers
   Australian researchers – questions sent in
    by clinicians – answered by researchers

   4083 total questions
                    Source Vocabulary
   MRCON – a table from the Metathesaurus
       Lists the medical concepts by unique identifiers (CUI)
        and each string associated with a concept
       unique (string => 1 concept)
       ambiguous (string => 2+ concepts)
            COLD – ambient temperature, viral respiratory infection,
             chronic obstructive lung disease
       2,247,454 strings associated with concepts
   Non-medical Lexicon – from Roget's Thesaurus
       Query objects (why, when, how), identifiers (I, you,
        he), modifiers (soon, frequently)
       749 terms in this lexicon
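
The unique/ambiguous distinction can be sketched as a lookup from each string to the set of concepts it names. This is an illustration only: the rows and the identifiers (`CUI_A` and so on) are placeholders, not real UMLS CUIs, and the actual MRCON file layout is not reproduced here.

```python
from collections import defaultdict

# Hypothetical (string, CUI) rows; identifiers are placeholders, not real CUIs.
ROWS = [
    ("cold", "CUI_A"),               # ambient temperature
    ("cold", "CUI_B"),               # viral respiratory infection
    ("cold", "CUI_C"),               # chronic obstructive lung disease
    ("acute pharyngitis", "CUI_D"),
]

def build_index(rows):
    """Map each lowercased string to the set of concept identifiers it names."""
    index = defaultdict(set)
    for string, cui in rows:
        index[string.lower()].add(cui)
    return index

def classify(index, string):
    """Return 'unique' (1 concept), 'ambiguous' (2+ concepts), or 'unmatched'."""
    cuis = index.get(string.lower(), set())
    if not cuis:
        return "unmatched"
    return "unique" if len(cuis) == 1 else "ambiguous"

index = build_index(ROWS)
for term in ("COLD", "acute pharyngitis", "sore throat"):
    print(term, "->", classify(index, term))
```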
                 String Matching
   Parsing program (written in C)
   Separates individual questions into 3-word, 2-
    word, 1-word windows
   Matches the window against MRCON and our
   Generates a report of:
       Total number of words parsed
       Number of matches from unique, ambiguous, non-
        medical lists
       Strings that didn't match any of the lists
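
The windowed matching described above can be sketched as follows. The original parser was written in C and its exact rules are not given on the slide. This Python sketch assumes the longest window is tried first and that matched words are consumed before the parser moves on. The lexicon contents and the function name `match_question` are hypothetical.

```python
import re
from collections import Counter

def match_question(question, lexicons, max_window=3):
    """Greedy longest-first matching of 3-, 2-, then 1-word windows.

    `lexicons` maps a list name to a set of lowercase strings.
    Returns (report counter, list of unmatched words).
    """
    words = re.findall(r"[a-z0-9']+", question.lower())
    report = Counter()
    unmatched = []
    i = 0
    while i < len(words):
        for size in range(min(max_window, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + size])
            source = next(
                (name for name, terms in lexicons.items() if candidate in terms),
                None,
            )
            if source is not None:
                report[source] += 1   # one hit for this list
                i += size             # consume the matched words
                break
        else:
            unmatched.append(words[i])  # no window starting here matched
            i += 1
    report["words_parsed"] = len(words)
    return report, unmatched

# Hypothetical miniature lexicons for illustration.
LEXICONS = {
    "unique": {"acute pharyngitis", "diabetes"},
    "ambiguous": {"cold"},
    "non_medical": {"what", "is", "the", "how", "to"},
}

report, unmatched = match_question(
    "What is the best way to treat acute pharyngitis?", LEXICONS
)
print(dict(report))
print(unmatched)
```

With these toy lexicons, "acute pharyngitis" is caught as a single two-word string, while "best", "way", and "treat" fall into the unmatched list.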
   String – individual word or words that matched
   Hits – how often the string was found
   Words – total number of matching words (some strings have more than one
    word in them)

                           Strings      Hits      Words        %
   MRCON (unique)            4,534    24,844     30,186    42.3%
   MRCON (ambiguous)           574     9,256      9,769    13.7%
   Non-medical                 208    16,768     17,783    24.9%
   Unmatched                           2,321     13,624    19.1%
   100 strings occurred 7850 times – or
    57.6% of the total matches
   712 strings => 3+ hits, 85% of all hits
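
The percentage column can be reproduced from the word counts in the table. This sketch uses only figures shown on the slide; the row labels "unique", "ambiguous", and "non-medical" follow the three match lists named in the parser report. The last line checks that 7,850 occurrences correspond to 57.6% of the 13,624 unmatched words, which suggests that bullet refers to the unmatched strings.

```python
# Word counts taken from the table on this slide.
words = {
    "MRCON (unique)": 30_186,
    "MRCON (ambiguous)": 9_769,
    "Non-medical": 17_783,
    "Unmatched": 13_624,
}

total = sum(words.values())

# Each row's share of all parsed words, as a percentage with one decimal.
share = {name: round(100 * count / total, 1) for name, count in words.items()}

# Combined share of words matched by MRCON (unique plus ambiguous).
mrcon_share = round(
    100 * (words["MRCON (unique)"] + words["MRCON (ambiguous)"]) / total, 1
)

# Share of unmatched words covered by the 100 most frequent unmatched strings.
top100_share = round(100 * 7_850 / words["Unmatched"], 1)

print(total, share, mrcon_share, top100_share)
```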

   Our focus was on strings that didn't match
    one of the source vocabularies
       19.1% didn't match
       Hypothesis that additional terms not found in
        MRCON will be important for indexing
   Unmatched words – 2+ occurrences
                  Unique words    Total Number    Percent
   Verb                    261            3676      31.7%
   Noun                    186            2356      20.3%
   Preposition               9            2544      21.9%
   Adj/Adv/Conj            103            1095       9.5%
   Mix *                    72             810       7.0%
   Pronoun                  10             614       5.3%
   Integer                  70             502       4.3%

    * Can be more than one word type, depending on the context.
    For example, "attacks", "step", and "process" can each be a noun or a verb.

   MRCON – selected because of low
    rate of ambiguous string-CUI
     89% unique string matches
     11% ambiguous string matches

   Other tables have greater word
    coverage, but have more ambiguity
    for each of the words
   Our word-matching results were similar to
    other researchers
   Cimino matched 43% of words with Meta-1
    (we had 56% MRCON matches)
       Computers & Biomedical Research. Aug 1992;25(4):366-373.
   Hersh matched 60% of words to medical
    terminology & names dictionary
    (we had 79% combined lexicon matches)
       Proceedings/AMIA Annual Fall Symposium. p. 1997.

   Stop words – commonly removed by most
    normalization tools. Prepositions,
    conjunctions, pronouns
   Provide valuable contextual information.
       Blood FOR an HIV-positive patient
       Blood FROM an HIV-positive patient
       Aspirin AND warfarin
       Aspirin OR warfarin
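
A short sketch shows the effect on the example phrases above. The stop-word list is a hypothetical one chosen for this illustration; real normalization tools ship their own lists.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"for", "from", "and", "or", "a", "an", "the"}

def strip_stop_words(phrase):
    """Lowercase the phrase and drop any word found in STOP_WORDS."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]

pairs = [
    ("Blood FOR an HIV-positive patient", "Blood FROM an HIV-positive patient"),
    ("Aspirin AND warfarin", "Aspirin OR warfarin"),
]

# True means the two phrases become indistinguishable after stop-word removal.
collapsed = [strip_stop_words(a) == strip_stop_words(b) for a, b in pairs]
print(collapsed)
```

Both pairs collapse to the same word list once the stop words are gone, so the distinction the clinician intended is lost.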

   Integers
       186 distinct integers or integer word
       Occurred 647 times
       Additional modification of concepts
   Hyperkalemia – 5.3 mEq/L and 8.7 mEq/L
       Both are hyperkalemia, but the evaluation and
        management are markedly different
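
One way to keep such numbers attached to the concept they modify is to capture the value and its unit rather than discarding them. The sketch below is an assumption about how this could be done, not the study's method; the pattern handles only the mEq/L unit from the example, and the sample sentence is invented.

```python
import re

# Matches a decimal or integer value followed by the unit mEq/L (case-insensitive).
VALUE = re.compile(r"(\d+(?:\.\d+)?)\s*(mEq/L)", re.IGNORECASE)

def extract_values(text):
    """Return (value, unit) pairs found in the text, in order of appearance."""
    return [(float(number), unit) for number, unit in VALUE.findall(text)]

# Hypothetical question text for illustration.
question = "How should I manage a potassium of 8.7 mEq/L compared with 5.3 mEq/L?"
print(extract_values(question))
```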
   Verbs – largest category of unmatched
       Include action and relation concepts
       Non-medical lexicon contained some
            Treats, attends, increases, lessens, reduce, follows, starts,
             can, should, is, equal, improve
   Verb tense changes the meaning of a
       In a patient TAKING antibiotics
       In a patient who TOOK antibiotics

   Verbs may be conceptually related to
    medical concepts
       Diagnose    =>   Diagnosis
       Treat       =>   Treatment
       Evaluate    =>   Evaluation
       Prescribe   =>   Prescription
   In these cases the verb (relationship) is
    not equivalent to the noun (concept)
   We developed an application to
       Parse individual words from collections of medical
       Match the words (phrases) with lexical sources,
        codified by the UMLS
   Our results were better than those of previous
    investigators (for percentage of matched words)
   We still have some work to do...
               Related Experiments

   We attempted to cluster questions by
    sequences of semantic types
       Initial attempts mostly clustered common
        phrases such as “How should I” and “What is
       We may repeat this method after discarding
        'stop phrases'
                Future Work

   Family Practice Inquiries Network (FPIN)
    has 200 questions that have associated
    MeSH terms manually assigned by
   We will look at these question-term
    groups for clustering purposes (with the
    hypothesis that they will not make distinct
              Future Work

I will work with researchers at NLM to apply
  MetaMap to medical questions
 extract triplets (Medical Concept-Allowable
  Relation-Medical Concept) from questions.
 Insert the triplets into a vector-space
  model and look for clusters
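
As a rough sketch of the vector-space idea, each question can be represented as a count vector over its triplets and compared by cosine similarity. The triplets and relation names below are hypothetical, and they are written by hand; in the planned work they would come from MetaMap output.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical (concept, relation, concept) triplets per question.
questions = {
    "q1": [("antibiotic", "treats", "pharyngitis")],
    "q2": [("antibiotic", "treats", "pharyngitis"),
           ("pharyngitis", "causes", "fever")],
    "q3": [("insulin", "treats", "diabetes")],
}

# Each question becomes a vector whose dimensions are whole triplets.
vectors = {qid: Counter(triplets) for qid, triplets in questions.items()}

sim_12 = cosine(vectors["q1"], vectors["q2"])
sim_13 = cosine(vectors["q1"], vectors["q3"])
print(round(sim_12, 3), sim_13)
```

Questions that share a triplet score above zero and would tend to fall in the same cluster, while questions with no triplet in common score zero.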

