					Towards Evidence-Based Discovery


               Catherine Blake
   School of Information and Library Science
   University of North Carolina at Chapel Hill
       http://www.ils.unc.edu/~cablake
             cablake@email.unc.edu
                        Motivation
• Relentless increase in electronically available text
   – Life Sciences
       • 17 millionth entry added in April 2007
       • 5,200 journals indexed
       • 12,000 new articles each week!
   – Chemistry – more than 110,000 articles in 1 year alone
• Consequences:
   – Hundreds of thousands of relevant articles
   – Implicit connections between literature go unnoticed

      Shift from Retrieval to Synthesis
          Information Overload
“One of the diseases of this age is the multiplicity of
  books; they doth so overcharge the world that it is
  not able to digest the abundance of idle matter that
  is every day hatched and brought forth into the
  world”
                    - Barnaby Rich, 1613




                     Evidence-Based Discovery
“If I have seen further than others, it is by
  standing upon the shoulders of giants.”
                    - Sir Isaac Newton

“We can't solve problems using the same kind of
  thinking we used when we created them.”
                    - Albert Einstein


                    Goal: Facilitate Discovery from Text
• Facilitate: “to make easy or easier” (American Heritage Dictionary)
• Discovery: “a productive insight” (American Heritage Dictionary)
[Figure: research overview. Core: Human Discovery and Synthesis; Natural Language Processing; Human-assisted Discovery and Synthesis. Heterogeneous Literature: Genomics, News, Chemistry, DocSouth, Breast Cancer. Synthesis and Discovery Work Practices: Education, Discovery Science, Evidence-based Practice.]
                  Outline
• Motivation
• Case Studies
  – METIS
     • Human synthesis
     • Natural language processing
  – Claim Jumping through Scientific Literature
• Next Steps
• Summary
       Systematic Review Process
–   Formulate the problem
–   Locate and select studies
–   Assess quality of studies
–   Collect data
–   Analyze and present results
–   Interpret results
–   Improve and update review

• 28 months from initial idea to publication
• Increased demand due to evidence-based medicine
                             Manual Synthesis
“Guesswork guided by scientifically trained intuition”
                    - Rescher (1978)

[Figure: manual synthesis process. Sources: MEDLINE, Embase. Stages: Retrieval (Select) → Corpus → Extraction (Extract) → Facts → Verification (Verify) → Analysis (Analyze). Other labels: Hypothesis Projection, Context Information, Collaboration, Iteration.]
           Context Information
• Study Information (loosely coupled to review focus)
  – e.g. date, location, ...
• Population Information (loosely coupled to review focus)
  – e.g. gender, age, ...
• Risk Factor or Intervention (tightly coupled to review focus)
  – e.g. duration of exposure, confounders
• Disease (tightly coupled to review focus)
  – e.g. stage, confounders
   Collaborative Information Synthesis


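[Figure: collaborative information synthesis process. Sources: MEDLINE, Embase. Stages: Retrieval → Corpus → Extraction → Facts → Verification → Analysis. Other labels: Hypothesis Projection, Context Information, External Data, Collaboration, Iteration.]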
       Key: Estimate Missing Information
1. What are people with Breast Cancer exposed to?
   – Source: studies with Breast Cancer patients
   – Facts for each study: number of patients, age of patients,
     geographic location, risk-factor exposure, …
2. What are people in a similar population exposed to?
   – Source: database of risk factors (BRFSS)
   – Codebook: question asked; age, gender; % responses
3. Are these rates significantly different?

T. Tengs & N. D. Osgood (2001) “The link between smoking and Impotence: Two Decades of Evidence”, Preventive Medicine, 32:447-52
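
Step 3 asks whether the exposure rate reported in the studies differs from the rate in a comparable population. The slides do not say which statistical test is used, so the sketch below assumes a two-proportion z-test as one common choice. The function name and the example counts are hypothetical.

```python
import math

def two_proportion_z_test(exposed_a, total_a, exposed_b, total_b):
    """Two-sided z-test for a difference between two proportions.

    Group a could be the patients reported in the studies and group b the
    survey respondents from a similar population. This is an illustrative
    reading of the slide, not the METIS implementation.
    """
    p_a = exposed_a / total_a
    p_b = exposed_b / total_b
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (exposed_a + exposed_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # Everyone (or no one) is exposed in both groups: no difference to test.
        return 0.0, 1.0
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts, for illustration only.
z, p = two_proportion_z_test(120, 400, 90, 400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value would suggest that the study population and the reference population differ in exposure; with few studies or small samples an exact test may be a better fit.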
  More than Automated Meta-Analysis
                                            • Traditional analysis
                                               – same study design
 Systematic Review                             – medicine = RCT
                                               – epidemiology = cohort



                                            • Information Synthesis
                                               – any study that includes
           Information Synthesis                 required information
                                               – augment missing
Key                Entire     Main topic         information
      External     study      Secondary
      database                Information
          METIS Information Extractor
   • Semantic Grammar
   • Features: words, numbers, and semantic types in the
     Unified Medical Language System (UMLS)
   Example pattern:
     {term:’age’} {term:’of’} {number:10<n2<110} {term:’to’} {number:10<n2<110}
   Example sentence:
     The age of breast cancer subjects ranged between 20 to 64 years old.
   Semantic type of “breast cancer”:
     {semantic type: neoplastic process, or disease}


   • Information extracted:
      – risk factor exposure (tobacco and alcohol)
      – age (min, max, mean)
      – number of subjects with medical condition
      – gender
      – start and end dates
      – geographical location
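
To make the idea of a pattern over terms and constrained numbers concrete, here is a minimal sketch that pulls an age range out of a sentence. It uses a plain regular expression rather than a semantic grammar with UMLS semantic types, so it is only a rough stand-in for the extractor described above. The pattern and the function name are assumptions; the numeric bound 10 < n < 110 is taken from the slide.

```python
import re

# Hypothetical pattern: the word "age(s)", then the first "<number> to <number>" pair.
AGE_RANGE = re.compile(
    r"\bages?\b.*?\b(\d{1,3})\s*(?:to|-|and)\s*(\d{1,3})\b",
    re.IGNORECASE,
)

def extract_age_range(sentence):
    """Return (min_age, max_age) when both numbers satisfy 10 < n < 110, else None."""
    for match in AGE_RANGE.finditer(sentence):
        low, high = int(match.group(1)), int(match.group(2))
        if 10 < low < 110 and 10 < high < 110 and low <= high:
            return low, high
    return None

sentence = "The age of breast cancer subjects ranged between 20 to 64 years old."
print(extract_age_range(sentence))  # (20, 64)
```

The numeric constraint does much of the work here: it rejects number pairs that are unlikely to be adult ages, such as years or dosages.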
METIS Info Extractor – Evaluation
• Diverse text corpus
   – epidemiology, surgery, biology, ...
   – cohort studies, case-control trials, ...
• Evaluation
   – Metrics (precision, recall)
   – Annotators (developer, domain expert, expert
     annotator, novice)
   – Primary topic (breast cancer, impotence)
   – Secondary information (tobacco and alcohol
     consumption)
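
For reference, precision and recall can be computed by comparing the facts a system extracts with the facts an annotator marks as correct. The sketch below assumes facts are represented as (field, value) pairs; that representation and the example values are hypothetical and are not taken from the evaluation itself.

```python
def precision_recall(extracted, gold):
    """Compare a collection of extracted facts with annotator-approved facts."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted facts that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of correct facts that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical facts for one article.
system = {("min_age", 20), ("max_age", 64), ("gender", "female"), ("location", "Ohio")}
annotator = {("min_age", 20), ("max_age", 64), ("gender", "female"), ("mean_age", 45)}
print(precision_recall(system, annotator))  # (0.75, 0.75)
```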
METIS Info Extractor – Recall
[Figure: recall (y-axis, 0.0 to 1.0) against rank (x-axis, 1 to 5), with one series each for Development, Domain Expert, Expert Annotator, and Novice Annotator]
METIS Info Extractor – Precision
[Figure: precision (y-axis, 0.0 to 1.0) against rank (x-axis, 1 to 5), with one series each for Development, Domain Expert, Expert Annotator, and Novice Annotator]
            METIS Verifier


[Figure: METIS Verifier. Labels: converted article; electronic version of article; verify information extracted]
                 METIS Analyzer
• Meta-Analysis
   –   Developed for agricultural application
   –   Requires empirical studies with a quantitative outcome
   –   Unit of study is an article, not a person
   –   Result: a unitless metric called an effect size
• Two common meta-analysis techniques
   – Fixed-effects model
   – Random-effects model
• Evaluation: compared generated effect size with examples in
  textbooks and published articles
• Result: same effect size
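
The two techniques named above can be sketched as follows. The slides do not give the estimators used, so this assumes inverse-variance weighting for the fixed-effects model and the DerSimonian-Laird estimator for the random-effects model, both standard textbook choices. Function names and example numbers are hypothetical.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted pooled effect and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, 1.0 / total

def random_effects(effects, variances):
    """DerSimonian-Laird pooled effect, its variance, and tau^2."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled_fixed, _ = fixed_effect(effects, variances)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total
    df = len(effects) - 1
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Each study's weight now includes the between-study variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    re_total = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / re_total
    return pooled, 1.0 / re_total, tau2

# Made-up effect sizes and variances, one pair per article.
effects = [0.10, 0.35, 0.60]
variances = [0.02, 0.03, 0.05]
print(fixed_effect(effects, variances))
print(random_effects(effects, variances))
```

When the studies agree closely, tau^2 is zero and the two models return the same pooled effect; the random-effects variance grows as the studies disagree.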
     Synthetic Estimate Evaluation
[Figure: two bar charts of actual versus estimated control rates (y-axis: Control Rate, 0 to 1; x-axis: Article Identifier, 1 to 4, plus Average), one for tobacco consumption and one for alcohol consumption]
             Outline
• Motivation
• Case Studies
  – METIS
  – Claim Jumping
     • Human discovery
     • Natural language processing
     • Human-assisted discovery and synthesis
• Next Steps
• Summary
             Human Discovery
• Day-to-day activities of scientists reflect
   – the complex socio-technical environments in
     which successful creativity tools will eventually be
     embedded
   – the human cognitive processing surrounding
     creativity
• Unit of analysis: a paper or grant proposal
  How do chemists arrive at their research question?
How do chemists transform an idea into a publication?
                  Approach
• Recruitment
  – experienced scientists (7-45 yrs)
  – local chemists and chemical engineers
  – response rate 84% (21/25)
• Semi-structured interviews
• Critical incident technique
  1. seminal paper in their field
  2. recent paper authored by the participant
  3. paper authored by the participant that they were
     particularly proud of
                   Interview Questions
• Discovery Questions
   –   What is your definition of discovery?
   –   What evidence convinced you that the paper addressed the initial research questions?
   –   What factors limited the adoption and deployment of the discovery?
   –   How did you arrive at the research question?
   –   What, if any, existing evidence prompted the study/experiment?
   –   Were there any alternative explanations?

• Information Usage Questions
   – Other than the scientific literature, what information resources do you draw from to
     aid in your research processes?
   – How many articles did you read last month that related to each of those projects?
   – Is that typical of how many articles you read in a month for research projects?
   – Do you read articles for another purpose? If so, what?
   – How many hours do you spend reading journal articles for research projects?
   – Which journals do you typically read and draw from?
   – How would you characterize the journals that you read: are they only within your
     domain, or do you read journals that would be considered non-traditional in your
     research?
   – If you only have a few minutes to read an article, what parts would you read?
   – What do you do with the article once you have read it?
  Chemists and Chemical Engineers
• Compared with other scientists, chemists and
  chemical engineers
  – read more (Brown, 1999)
  – have more personal subscriptions to journals (Noble &
    Coughlin, 1997)
  – spend more time reading (Tenopir & King, 2003)
  – visit the library more often (Brown, 1999)
• Consequences
  – information disseminated quickly
  – information has a relatively short lifespan
        Human Discovery Findings
• Discovery definition
   – Novelty
   – Build on existing ideas
   – Simplicity
   – Balance theory and experimentation
   – Practical application

• Hypothesis generation
   – Discussion
   – Combine expertise
   – Previous experiments
   – Read literature

• Hypothesis validation
   – Iterative
   – Tightly coupled
         Causal Relationships
• Newspaper genre
  – Causal relationships (Khoo, Chan, & Niu, 1998)
• Biomedical genre
  – Causes and treats (Price & Delcambre, 2005)
  – Causal knowledge (Khoo, Chan, & Niu, 2000)
• Universal Grammar
  – Causatives (Comrie, 1974, 1981)
  – Action verbs (Thomson, 1987)
               Claim Definition
• “To assert in the face of possible contradiction”
• Example sentence reporting a claim
  – “This study showed that Tamoxifen reduces the
    breast cancer risk”
• Example Claim Framework
  – Tamoxifen (agent)
  – reduces (change)
  – breast cancer risk (object)
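
One way to hold a claim in code is a small record with one field per facet. The sketch below covers only the three facets in the example above; the class name, the field names, and the choice to make the change term optional are assumptions made for illustration, not the Claim Framework's own schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    """One claim, with facet names taken from the example on the slide."""
    agent: str
    obj: str                      # the "object" facet, renamed to avoid the builtin name
    change: Optional[str] = None  # optional here as an assumption
    sentence: str = ""            # source sentence the claim was read from

    def as_text(self):
        """Join the facets that are present into a short readable string."""
        parts = [self.agent, self.change, self.obj]
        return " ".join(part for part in parts if part)

claim = Claim(
    agent="Tamoxifen",
    change="reduces",
    obj="breast cancer risk",
    sentence="This study showed that Tamoxifen reduces the breast cancer risk",
)
print(claim.as_text())  # Tamoxifen reduces breast cancer risk
```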
         The Claim Framework
• Goal
  – go beyond genes and proteins
  – differentiate between different levels of
    confidence in the claim
  – consider claims made in the full text
• Working hypothesis
  – literature will report findings using constructs
    within the Claim Framework
  – human annotators will agree on facets
               Preliminary Results
• 29 articles from TREC Genomics
   –   Total number of sentences: 5535
   –   Sentences with >=1 claim: 1250 (22.6%)
   –   Total number of claims: 3228
   –   Average claims per sentence: 2.51
   –   Claims that did not fit in the Framework: 31

• Per article
   – Average number of sentences: 191
   – Average number of sentences with >=1 claim: 43
   Distribution of Claim Categories

 Category       Total     (%)    Pilot     (%)     Main     (%)
 Explicit        2489   77.11      332   83.42     2157   76.63
 Implicit          87    2.70        3    0.75       84    2.98
 Observation      298    9.23       24    6.03      274    9.73
 Correlation      174    5.39       12    3.02      162    5.75
 Comparison       165    5.11       27    6.85      138    4.9
 Total           3228     100      398     100     2830     100
                          All Documents
 Annotation            Total     (%)     Words   (Avg)
 Agent                  2894   89.65      5221    1.80
 Agent Direction         285    8.83       291    1.02
 Agent Modifier         1246   38.60      4448    3.57
 Object                 3197   99.04      6849    2.14
 Object Direction        271    8.40       283    1.04
 Object Modifier        1561   48.36      5383    3.44
 Change                 1897   58.77      1953    1.03
 Change Direction       1337   41.42      1358    1.02
 Change Modifier        1147   35.53      1618    1.41
 Claim Basis             165    5.11       394    2.39
 Claim Basis Dir.         42    1.30        43    1.02
 Claim Basis Mod.         86    2.66       266    3.09
 Total                  3228             28107    8.70
    Inter Annotator Agreement

 Information Facet     Kappa   Agreement
 Agent                 0.71    substantial
 Object                0.77    substantial
 Change                0.57    moderate
 Change+ChangeDir      0.88    almost perfect
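
The agreement labels in the table match the Landis and Koch (1977) scale for kappa. The slides do not say which kappa statistic was computed, so the sketch below assumes Cohen's kappa for two annotators. The function names and the toy labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Verbal label for a kappa value on the Landis and Koch (1977) scale."""
    if kappa < 0:
        return "poor"
    scale = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in scale:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")

# Toy example: did each annotator mark the same token as the agent?
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
print(landis_koch(0.71))  # substantial
```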
           Location of Claims
                      Sentences
 Section        With claim   Total   % section   % claim
 Abstract               98     309       31.72      7.84
 Introduction          357     979       36.47     28.56
 Method                  6    1121        0.54      0.48
 Result                293    1829       16.02     23.44
 Discussion            539    1406       38.34     43.12
 Total                1250    5535       22.58    100.00
User Study

Timothy S. Carey, MD, MPH
  Sarah Graham Kenan Professor of Medicine
  Director, Cecil G Sheps Center for Health Services Research

Ila Cote, PhD, DABT
  Acting Division Director
  US Environmental Protection Agency
  National Center for Environmental Assessment

Michael T Crimmins, PhD
  Mary Ann Smith Distinguished Professor of Chemistry
  UNC and Department Chair, Department of Chemistry

Paul Jones
  Clinical Associate Professor
  School of Information and Library Science
  Director of ibiblio.org

Rudy L Juliano, PhD
  Boshamer Distinguished Professor of Pharmacology
  Principal Investigator, Carolina Center of Cancer
  Nanotechnology Excellence

Steven W. Matson, PhD
  Professor and Chair
  Department of Biology

Robert C Millikan, DVM, PhD
  Barbara Sorenson Hulka Distinguished Professor
  Department of Epidemiology
  School of Public Health

Rosa Perelmuter, PhD
  Director, Moore Undergraduate Research Apprentice Program
  Professor of Spanish and Assistant Dean,
  Academic Advising Program

Jan F. Prins, PhD
  Professor of Computer Science and
  Chairman, Department of Computer Science

Alexander Tropsha, PhD
  Professor and Chair
  Director, Laboratory for Molecular Modeling

Suzanne West, PhD
  Researcher
  Health, Social and Economics Research
  RTI International
            Closing Comments
• Accelerate synthesis
  – Breast cancer study without METIS would take >13 years
  – Without synthetic estimate = systematic review
• Accelerate discovery
  – Connections between literature
  – Speculative and orthogonal views
• Human discovery and synthesis
  – As important if not more so than automation
      “Tap the vast reservoir of human knowledge”
                  - Louis Round Wilson, 1929
                       Acknowledgements
METIS
• Funded in part by
    – California Breast Cancer Research program
    – University of California, Irvine
• Thanks to user groups
    – Particularly to Dr. Adams and Dr. Tengs
• Academic mentoring
    – Primary Advisor: Dr. Wanda Pratt
    – Medical Mentor: Dr. Catherine Carpenter
    – Co-Advisors: Dr. Dennis Kibler and Dr. Michael Pazzani
    – Committee Member: Dr. Paul Dourish

Claim Jumping
• Funded in part by
    – Faculty fellowship from the Renaissance Computing Institute
    – UNC Faculty Award
• Thanks to collaborators
    – Nassib Nassar and Mats Rynge (RENCI)
    – Amol Bapat and Ryan Jones (SILS)

Chemists and Chemical Engineers Study
• Funded in part by
    – NSF Center for Environmentally Responsible Solvents and Processes
Questions and Comments
       Welcome
               Catherine Blake
           cablake@email.unc.edu
 School of Information and Library Science
 University of North Carolina at Chapel Hill
     http://www.ils.unc.edu/~cablake
                             Publication Bias
• Studies that find a correlation between a risk factor and
  disease are more likely to be published (Easterbrook et al., 1991;
  Ingelfinger et al., 1994)

• METIS provides a new way to explore this bias
                                      Bias introduced by authors, editors, funding, ...

				