Codifying Semantic Information in Medical Questions Using Lexical Sources
Paul E. Pancoast, Arthur B. Smith, Chi-Ren Shyu

Research Purpose
- To find a method for classifying medical questions that are asked by clinicians
- Hypothesis: Simply indexing by keywords isn't enough to distinguish questions with different meanings but similar wording, or to group questions with similar meanings but different words.

Definitions
- Semantic Information: the meaning of the words
- Syntactic Information: the parts of speech of the words (word type, sentence part)
- Medical Questions: questions asked by a clinician
- Lexical Sources: sources of words and vocabularies
- UMLS: Unified Medical Language System

UMLS
- Ambitious project of the National Library of Medicine, begun in 1986
- Helps researchers retrieve and integrate electronic biomedical information from a variety of sources
- Links over 100 controlled vocabularies
- Assigns unique identifiers to medical concepts and strings
- Maps the hierarchical relationships between the medical concepts

Why Bother? (To classify medical questions?)
- Clinicians have questions when treating patients
- Researchers have gathered collections of these questions
- No good method exists to classify the questions
  - How many times has a particular question been asked?
  - Which questions should receive priority for evidence-based answers?

Examples
- Similar meaning, different words:
  - What is the best way to treat acute pharyngitis?
  - How should I approach a patient with a sore throat?
- Similar wording, different meaning:
  - What should I do with a patient with diabetes and insulin resistance?
  - What should I do with a patient with diabetes who is resistant to taking insulin?
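
The hypothesis can be illustrated with the four example questions above. The following is a minimal sketch, not the method used in the study: it scores each pair by the Jaccard overlap of its keyword sets. The stop list and the function names `keywords` and `jaccard` are assumptions made for this illustration.

```python
import re

# Assumption: a tiny illustrative stop list, not the one used in the study.
STOP_WORDS = {"what", "is", "the", "to", "how", "should", "i", "a", "with",
              "do", "and", "who", "way", "an"}

def keywords(question):
    """Lowercase the question, split on non-letters, and drop stop words."""
    words = re.findall(r"[a-z]+", question.lower())
    return {w for w in words if w not in STOP_WORDS}

def jaccard(a, b):
    """Jaccard overlap of two keyword sets: |A & B| / |A | B|."""
    ka, kb = keywords(a), keywords(b)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

# Pair with similar meaning but different words.
similar_meaning = (
    "What is the best way to treat acute pharyngitis?",
    "How should I approach a patient with a sore throat?",
)
# Pair with similar wording but different meaning.
different_meaning = (
    "What should I do with a patient with diabetes and insulin resistance?",
    "What should I do with a patient with diabetes who is resistant to taking insulin?",
)

print(jaccard(*similar_meaning))    # no shared keywords
print(jaccard(*different_meaning))  # half of the keywords are shared
```

Under these assumptions the pair that asks the same thing shares no keywords, while the pair that asks different things shares half of them, which is the failure of keyword indexing that the hypothesis describes.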

Methods
- Source Questions
  - American researcher: observed clinicians at work
  - British researchers: questions sent in by clinicians and answered by researchers
  - Australian researchers: questions sent in by clinicians and answered by researchers
  - 4,083 total questions

Methods
- Source Vocabulary
  - MRCON: a table from the Metathesaurus
    - Lists the medical concepts by unique identifiers (CUI) and each string associated with a concept
    - unique (string => 1 concept)
    - ambiguous (string => 2+ concepts)
    - COLD: ambient temperature, viral respiratory infection, chronic obstructive lung disease
    - 2,247,454 strings associated with concepts
  - Non-medical Lexicon: from Roget's Thesaurus
    - Query objects (why, when, how), identifiers (I, you, he), modifiers (soon, frequently)
    - 749 terms in this lexicon

String Matching
- Parsing program (written in C)
  - Separates individual questions into 3-word, 2-word, and 1-word windows
  - Matches the window against MRCON and our lexicon
  - Generates a report of:
    - Total number of words parsed
    - Number of matches from the unique, ambiguous, and non-medical lists
    - Strings that didn't match any of the lists

Results
- String: individual word or words that matched
- Hits: how often the string was found
- Words: total number of matching words (some strings have more than one word in them)

                    Strings     Hits    Words   % match
  MRCON Unique        4,534   24,844   30,186     42.3%
  MRCON Ambiguous       574    9,256    9,769     13.7%
  Non-medical           208   16,768   17,783     24.9%
  Unmatched           2,321            13,624     19.1%

Results
- 100 strings occurred 7,850 times, or 57.6% of the total matches
- 712 strings => 3+ hits, 85% of all hits
- Our focus was on strings that didn't match one of the source vocabularies
  - 19.1% didn't match
  - Hypothesis: additional terms not found in MRCON will be important for indexing

Results
- Unmatched words with 2+ occurrences

  Word type      Unique words   Total number   Percent
  Verb                    261          3,676     31.7%
  Noun                    186          2,356     20.3%
  Preposition               9          2,544     21.9%
  Adj/Adv/Conj            103          1,095      9.5%
  Mix *                    72            810      7.0%
  Pronoun                  10            614      5.3%
  Integer                  70            502      4.3%

  * Can be more than one word type, depending on the context. "Attacks", "step", and "process" can all be nouns or verbs.

Discussion
- MRCON: selected because of its low rate of ambiguous string-CUI combinations
  - 89% unique string matches
  - 11% ambiguous string matches
- Other tables have greater word coverage, but have more ambiguity for each of the words

Discussion
- Our word-matching results were similar to those of other researchers
  - Cimino matched 43% of words with Meta-1 (we had 56% MRCON matches). Computers & Biomedical Research. Aug 1992;25(4):366-373.
  - Hersh matched 60% of words to a medical terminology & names dictionary (we had 79% combined lexicon matches). Proceedings/AMIA Annual Fall Symposium. p. 1997.

Discussion
- Stop words: commonly removed by most normalization tools
  - Prepositions, conjunctions, pronouns
  - Provide valuable contextual information
    - Blood FOR an HIV-positive patient
    - Blood FROM an HIV-positive patient
    - Aspirin AND warfarin
    - Aspirin OR warfarin

Discussion
- Integers
  - 186 distinct integers or integer-word combinations
  - Occurred 647 times
  - Additional modification of concepts
    - Hyperkalemia: 5.3 mEq/L & 8.7 mEq/L
    - Both are hyperkalemia, but the evaluation and management are markedly different

Discussion
- Verbs: largest category of unmatched words
  - Include action and relation concepts
  - Non-medical lexicon contained some
    - Treats, attends, increases, lessens, reduce, follows, starts, can, should, is, equal, improve
  - Verb tense changes the meaning of a question
    - In a patient TAKING antibiotics
    - In a patient who TOOK antibiotics

Discussion
- Verbs may be conceptually related to medical concepts
  - Diagnose => Diagnosis
  - Treat => Treatment
  - Evaluate => Evaluation
  - Prescribe => Prescription
- In these cases the verb (relationship) is not equivalent to the noun (concept)

Summary
- We developed an application to
  - Parse individual words from collections of medical questions
  - Match the words (phrases) with lexical sources, codified by the UMLS
- Our results were better than those of previous investigators (for percentage of matched words)
- We still have some work to do...

Related Experiments
- We attempted to cluster questions by sequences of semantic types
- Initial attempts mostly clustered common phrases such as "How should I" and "What is the"
- We may repeat this method after discarding 'stop phrases'

Future Work
- Family Practice Inquiries Network (FPIN) has 200 questions with associated MeSH terms manually assigned by librarians
- We will look at these question-term groups for clustering purposes (with the hypothesis that they will not make distinct clusters)

Future Work
- I will work with researchers at NLM to
  - Apply MetaMap to medical questions
  - Extract triplets (Medical Concept-Allowable Relation-Medical Concept) from questions
    - Drug-treats-Disease
  - Insert the triplets into a vector-space model and look for clusters

Thank you!! ???
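
The string-matching step described under Methods can be sketched as follows. This is a minimal Python illustration, not the authors' C program. The toy `MRCON` and `NON_MEDICAL` tables, the placeholder concept labels (which are not real CUIs), the second sample question, and the greedy longest-window-first strategy are all assumptions made for the sketch.

```python
import re
from collections import Counter

# Toy stand-ins for the real sources (assumptions for illustration only).
# Each string maps to the set of concepts it names; the labels are made-up
# placeholders, not real CUIs.
MRCON = {
    "sore throat": {"concept-sore-throat"},
    "acute pharyngitis": {"concept-pharyngitis"},
    "cold": {"concept-temperature", "concept-viral-infection", "concept-copd"},
}
NON_MEDICAL = {"how", "should", "i", "what", "is", "the"}

def classify(window):
    """Return the list a window string belongs to, or None if it matches none."""
    concepts = MRCON.get(window)
    if concepts is not None:
        return "unique" if len(concepts) == 1 else "ambiguous"
    if window in NON_MEDICAL:
        return "non-medical"
    return None

def match_question(question, max_window=3):
    """Try 3-, then 2-, then 1-word windows at each position (assumed greedy)."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    matches, i = [], 0
    while i < len(words):
        for size in range(min(max_window, len(words) - i), 0, -1):
            window = " ".join(words[i:i + size])
            label = classify(window)
            if label is not None or size == 1:
                # A single word that matches nothing is recorded as unmatched.
                matches.append((window, label or "unmatched"))
                i += size
                break
    return len(words), matches

def report(questions):
    """Total words parsed, words per list, and counts of unmatched strings."""
    total_words, word_counts, unmatched = 0, Counter(), Counter()
    for question in questions:
        n_words, matches = match_question(question)
        total_words += n_words
        for window, label in matches:
            word_counts[label] += len(window.split())
            if label == "unmatched":
                unmatched[window] += 1
    return total_words, word_counts, unmatched

total, counts, unmatched = report([
    "How should I approach a patient with a sore throat?",
    "Is the cold acute pharyngitis?",  # invented sentence to exercise COLD
])
for label in ("unique", "ambiguous", "non-medical", "unmatched"):
    print(f"{label:12s} {counts[label]:3d} words  {100 * counts[label] / total:5.1f}%")
```

The percentage printed for each list is its word count divided by the total number of words parsed, which is how the "% match" column in the Results table relates to the "Words" column (for example, 30,186 of 71,362 words is 42.3%).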