Opportunities for Text Mining in Bioinformatics
(CS591-CXZ Text Data Mining Seminar)
Dec. 8, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Why Biology Text Mining?
• Strong motivations from biology side
– Difficulty for biologists to access literature
• No theory in biology, so we must keep all literature “alive”
• Observations about the same biology mechanism may be described in different terms (e.g., due to different perspectives of study)
– Many unanswered research questions
– Text mining may help better organize and link biology literature, and answer simple questions… (e.g., what do we know about this gene?)
Why Biology Text Mining? (cont.)
• Potentially high impact from CS side
– Any “discovery” from biology text could be potentially significant
– Biology text is relatively “easy” for mining
• Literature is cleaner (compared with web data)
• Biology text often has many annotations
• Many other kinds of biology data can be exploited (e.g., DNA/Protein sequences, gene expression information, metabolic networks)
– Simple techniques may work
Characteristics of Biology Text
• Large number of entities (e.g., genes, proteins) that have well-defined semantics
• No standard for terminology (inconsistencies)
• Ambiguities (e.g., many acronyms)
• Synonyms
• High complexity in phrases and sentence structures
Research Topics
• General goal: Applying known text mining techniques to help biology research
• Problem 1: Data/Information Integration
– How can we integrate text information (discovering terminology linkages)?
– How can we link text with databases (semantic interpretations of text on top of entities/relations in DB, e.g., entity extraction)?
– How can we integrate biology DBs (many fields are text)?
• Problem 2: Functional annotations
– How can we annotate a biological entity (e.g., a gene) with functional information extracted from literature?
– How can we annotate a set of related genes with functional information?
– How can we exploit the ontologies/thesauri in biology?
Research Topics (cont.)
• Problem 3: Data/Information Cleanup & Curation
– How can we detect suspicious data/information in existing databases?
– How can we automate many manual tasks of database curation?
• Problem 4: Research question answering
– How can we answer simple research questions? (e.g., what functional connections are there between these two genes?)
– How can we support exploratory access to, and digestion of, literature information? (e.g., a biology research workbench)