A Vision for Using All the Data and Publications from Science on Web:
Mining this to Study the Structure of Science




                                                                                                                    1 1 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
               Mark B Gerstein
       Yale (Comp. Bio. & Bioinformatics)

 NSF Workshop on Knowledge Management and Visualization Tools
                   2008.03.11, 09:30-10:00




                                                                                                                      Gerstein.info/talks (c) 2008
        Slides downloadable from Lectures.GersteinLab.org
                             (Please read permissions statement.)


                (Textmining talk, fits into ~30' w. interrupting questions)




                                                                              Do not reproduce without permission
Rapid growth in DBs in science, changing the landscape
[Greenbaum & Gerstein, Nat. Biotech. ('03)]

Part of the Changing Landscape: Situation Facing DBs and Journals
• Distinctions Blurring
   Reading Journals via queries
     • Reading DB entries
   Towards reading literature with computers
     • Mining text and correlating papers
• Biology as a science of heterogeneous facts
   Well-suited to database storage
[Gerstein, Bioinformatics ('99); Gerstein & Junker. Nature Yearbook ('02)]

               Conventional Challenge
    • Hard to keep up with volume and growth of
      publications
    • Missed opportunities in connections between fields
    • Harness the power of technology to help scientists
      share information?

                        New Vision
    • Discover new scientific relationships
    • Study the Structure of Science itself

Overall Process of Web Mining
[Fig. removed until paper in press]
[Rzhetsky et al., Cell ('08, submitted)]

Overall Process of Web Mining
Digest Texts into Simpler Machine Understandable Form and then Synthesize
[Rzhetsky et al., Cell ('08, submitted)]

Overall Process of Web Mining
Doing better science: Finding new protein relationships (e.g. protein interactions), looking for inconsistencies in arguments, assembling consensus definitions automatically
Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. PNAS ('04); Iossifov et al. Probabilistic inference of molecular networks from noisy data sources. Bioinformatics ('04)
[Rzhetsky et al., Cell ('08, submitted)]

Overall Process of Web Mining
Mapping Science + Studying its Dynamics & Evolution
[Rzhetsky et al., Cell ('08, submitted)]

Overall Process of Web Mining
• Revealing patterns of collaboration
• Understanding basis of terms & nomenclature
• Tracking the evolution of ideas
• Models for the evolution of science
• Helping set policy & research directions
[Rzhetsky et al., Cell ('08, submitted)]

Overall Process of Web Mining
Making it understandable (through “mashup”)
[Rzhetsky et al., Cell ('08, submitted)]

Examples Illuminating Current State of Affairs:
Mining Simple Term Occurrence Statistics to Understand and Justify Directions in Science

Over-representation of crystallography among the Nobel Prizes, highlighted by the 2006 Nobels
[Seringhaus & Gerstein, Science (2007)]

The current state of mammalian gene annotation: a rationale for data driven research
[Figure: genes labeled in the plot: p53, TNF, VEGF, EGFR, ER, NFKB, TGFB, IL-6, COX2, BCL2]
Adapted from Su and Hogenesch, Genome Biology, 2007

Gene Name Skew
[Seringhaus et al. GenomeBiology (2008)]

Ex. Naming Issue: Starry Night
Starry night (P Adler, ’94)
[Seringhaus et al. GenomeBiology (2008)]

Naming Pathologies: Related to Single Genes
 (b) drop dead: flies with mutations in drop dead die rapidly after their brain rapidly deteriorates.
 (c) malvolio: gene needed for normal taste behaviour. Malvolio in Shakespeare's Twelfth Night tasted "with distempered appetite".
 (d) LOV: light, oxygen, or voltage (LOV) family of blue-light photoreceptor domains.
 (e) yuri: this gene was discovered on the anniversary of Yuri Gagarin's space flight. Mutants have problems with gravitaxis and cannot stay aloft.
 (f) tribbles: cells divide uncontrollably, like the eponymous Star Trek characters.
 (g) kuzbanian: mutants have uncontrollable bristle growth. Koozbanians are alien Muppets with uncontrollable hair growth; spelling was changed to avoid copyright infringement.
 (h) ring: really interesting new gene.
 (i) yippee: a graduate student’s reaction on cloning the gene
[Seringhaus et al. GenomeBiology (2008)]

Naming Pathologies: Involving Multiple Gene Names
 (j) kryptonite and superman: the kryptonite mutation suppresses the function of the SUPERMAN gene.
 (k) arleekin, valient, tungus: mutations in arleekin, valient, tungus and 29 other genes affect long-term memory. Named after Pavlov's dogs.
 (l) PKD1 (human) and lov-1 (worm): these are homologs, although their names do not suggest it.
 (m) MT-1: this label can refer to at least 11 different human genes.
 (n) BAF45 and BAF47: names for the same gene, reflecting a revision of the molecular weight of product.
[Seringhaus et al. GenomeBiology (2008)]

Examples Illuminating Current State of Affairs:
Using Network Representations to Make Maps of Science -- Studying the Publication Patterns of Genomics Consortia
[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]

Co-Authorship Publication Network of Struc. Genomics Consortia (NESG)
[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]

Co-authorship Networks comparing the 9 NIH Structural Genomics Centers
[Figure: network panels; labels recovered: Average Degree, (45), (19), (7)]
[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]

Different Representations of Publication Network of NESG
[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]

Clustering structures determined by struc. genomics consortia according to functional similarity: Is there a functional bias in consortia structures?

        Avg. Degree   Avg. Path   Clust. Coeff.   Diameter
PSI         24           2.6          37%             7
PDB          6           3.9          31%             9

[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]

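The four statistics in the table above (average degree, average path length, clustering coefficient, diameter) can be computed from a list of papers. The following is a minimal sketch, not the method used by Douglas et al.: the author names are invented, the clustering coefficient is taken as the mean of local clustering coefficients (the slides do not say which definition was used), and path statistics cover only reachable pairs.

```python
from collections import deque
from itertools import combinations


def build_coauthor_graph(papers):
    """Link every pair of authors who share at least one paper."""
    graph = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set())
        for a, b in combinations(set(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def bfs_distances(graph, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def network_stats(graph):
    """Average degree, average path length, mean local clustering, diameter."""
    n = len(graph)
    avg_degree = sum(len(nbs) for nbs in graph.values()) / n

    # Path lengths over all ordered pairs of distinct, mutually reachable nodes.
    lengths = []
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                lengths.append(d)

    # Local clustering: fraction of a node's neighbor pairs that are linked.
    coeffs = []
    for node, nbs in graph.items():
        k = len(nbs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbs, 2) if b in graph[a])
        coeffs.append(2 * links / (k * (k - 1)))

    return {
        "avg_degree": avg_degree,
        "avg_path": sum(lengths) / len(lengths),
        "clustering": sum(coeffs) / n,
        "diameter": max(lengths),
    }


# Hypothetical toy input: each paper is a list of author names.
papers = [["A", "B", "C"], ["C", "D"]]
stats = network_stats(build_coauthor_graph(papers))
```

On this toy input, author C bridges the two papers, so it has the highest degree but the lowest non-zero clustering coefficient among the connected authors.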
Making Larger Maps: Mapping the whole field of Melanoma Research
[Boyack, Kevin W., Mane, Ketan and Börner, Katy. (2004). Mapping Medline Papers, Genes, and Proteins Related to Melanoma Research. IV2004 Conference, London, UK, pp. 965-971.]

Backbone Map of Science & Soc. Sci.
[http://grants.nih.gov/grants/KM/OERRM/OER_KM_events/Borner.pdf
Boyack, Kevin W., Klavans, R. and Börner, Katy. (2005). Mapping the Backbone of Science. Scientometrics. 64(3), 351-374.]

Ranking Journal Influence - Eigenfactor.org
“Ranks journals much as Google ranks websites.”
Adjusts for citation differences among disciplines
[Figure: Ranking of journals in computer science (top of list).]

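To make the "much as Google ranks websites" comparison concrete, here is a minimal PageRank-style sketch over a journal citation table. It is an illustration of the general idea only and is not the Eigenfactor algorithm: the journal names, citation counts, damping value, and the choice to drop self-citations are all assumptions of this sketch.

```python
def rank_journals(citations, damping=0.85, iterations=100):
    """PageRank-style power iteration.

    citations[a][b] is the number of citations from journal a to journal b.
    Returns a dict of scores that sum to 1.
    """
    journals = sorted(
        set(citations) | {b for row in citations.values() for b in row}
    )
    n = len(journals)
    score = {j: 1.0 / n for j in journals}

    for _ in range(iterations):
        # Every journal receives a small uniform baseline share.
        new = {j: (1.0 - damping) / n for j in journals}
        for a in journals:
            # Assumption of this sketch: self-citations are ignored.
            out = {b: c for b, c in citations.get(a, {}).items() if b != a}
            total = sum(out.values())
            if total == 0:
                # A journal that cites nothing spreads its score evenly.
                for j in journals:
                    new[j] += damping * score[a] / n
            else:
                for b, c in out.items():
                    new[b] += damping * score[a] * c / total
        score = new
    return score


# Hypothetical citation counts between four made-up journals.
citations = {
    "A": {"B": 10, "C": 2},
    "B": {"C": 5},
    "C": {"B": 5},
    "D": {"B": 1},
}
scores = rank_journals(citations)
top_journal = max(scores, key=scores.get)
```

Because each journal's outgoing weight is normalized by its own citation total, a journal that cites heavily does not gain extra influence over one that cites sparingly.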
Examples Illuminating Current State of Affairs:
Analyzing the Dynamics of Science

'Omics terms over the years [Greenbaum et al. Gen. Res. ('01)]

Term            Google Hits   PubMed Hits   1st PubMed Hit Year
Genome          ~1880000      66171         1932
Proteome        ~63,000       703           1995
Transcriptome   3520          72            1997
Physiome        2980          15            1997
Metabolome      349           12            1998
Phenome         4980          6             1995
Morphome        238           2             1996
Interactome     56            2             1999
Glycome         46            1             2000
Secretome       21            1             2000
Ribonome        1             1             2000
Orfeome         42            -             -
Regulome        18            -             -
Cellome         17            -             -
Operome         8             -             -
Transportome    1             -             -
Functome        1             -             -

[Figure: PubMed Hits for Proteome]

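Counts like those in the table above can be produced from any dated text collection. The sketch below is one simple way to do it, assuming a list of (year, text) records; the records shown are invented placeholders rather than real abstracts, and the function counts matching records, not individual word occurrences.

```python
import re
from collections import defaultdict


def term_statistics(records, terms):
    """For each term, return (number of matching records, first year seen).

    records is an iterable of (year, text) pairs. Matching is whole-word
    and case-insensitive. Terms never seen get (0, None).
    """
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    hits = defaultdict(int)
    first_year = {}
    for year, text in records:
        for term, pattern in patterns.items():
            if pattern.search(text):
                hits[term] += 1
                first_year[term] = min(year, first_year.get(term, year))
    return {term: (hits[term], first_year.get(term)) for term in terms}


# Hypothetical toy records standing in for a literature database.
records = [
    (1997, "Transcriptome and proteome profiling"),
    (1995, "A study of the proteome of yeast"),
    (1990, "Mapping the human genome"),
]
omics = term_statistics(records, ["genome", "proteome", "transcriptome", "physiome"])
```

Whole-word matching matters here: without the `\b` anchors, a search for "genome" would also count hypothetical compound terms that merely contain it.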
Evolution of Science
Map based on “bursty” words in life sciences publications since 1980. Older fundamental research (center) led to four different areas (subgraphs in corners).
This and other domain maps at http://www.scimaps.org.
• K. Mane, K. Börner (2004). Mapping topics and topic bursts in PNAS. 101 (Supplement 1): 5287.

RNAi: Birth of a Field in the Literature Culminating in the 2006 Nobel
Source: Gerstein & Douglas. PLoS Comp. Bio. 3:e80 (2007)
PubNet.GersteinLab.org

The Social Dynamics of Innovation and Scientific Discovery: Making Models of Science
The Patterns of Discovery and the Spread of Ideas as Epidemics.
The population dynamics of authors in an emerging field is well described by models similar to those of epidemics, but that take into account contact processes and intentionality characteristic of human social dynamics. Panel (a) shows a SEIRZ model, (b) its best solution applied to the spread of Feynman diagrams in the USA, Japan and the Soviet Union, and (c) details the interpretation of the model parameters. The spread of ideas is characterized by relatively low contact rates (compared to infectious diseases) and very long lifetimes for the idea, as well as intentional structures to promote interaction between individuals during the learning process.
[Bettencourt et al., Physica A (2006)]

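As a rough illustration of the epidemic analogy, the sketch below integrates a generic SEIR model with a fixed-step Euler scheme. It is not the SEIRZ model of Bettencourt et al.: it omits the extra Z class and any idea-specific terms, and every parameter value is a made-up placeholder. Here S are researchers not yet exposed to the idea, E are exposed, I are active adopters, and R no longer use it.

```python
def simulate_seir(s0, e0, i0, r0, beta, kappa, gamma, dt=0.01, t_end=50.0):
    """Euler integration of a generic SEIR model.

    beta  : contact rate (S -> E through contact with adopters I)
    kappa : rate at which exposed individuals become adopters (E -> I)
    gamma : rate at which adopters stop using the idea (I -> R);
            a small gamma corresponds to a long lifetime for the idea
    Returns a list of (time, S, E, I, R) tuples.
    """
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    n = s + e + i + r  # total population, conserved by the model
    history = [(0.0, s, e, i, r)]
    steps = int(round(t_end / dt))
    for step in range(1, steps + 1):
        new_exposed = beta * s * i / n * dt
        new_adopters = kappa * e * dt
        new_recovered = gamma * i * dt
        s -= new_exposed
        e += new_exposed - new_adopters
        i += new_adopters - new_recovered
        r += new_recovered
        history.append((step * dt, s, e, i, r))
    return history


# Hypothetical parameters: modest contact rate, long idea lifetime.
history = simulate_seir(
    s0=990, e0=0, i0=10, r0=0, beta=0.5, kappa=0.2, gamma=0.05
)
final_time, s_end, e_end, i_end, r_end = history[-1]
```

Fitting such a model to author counts per year would require estimating the rates from data, which this sketch does not attempt.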
Examples Illuminating Current State of Affairs:
Mashing up the Text from Scientific Publications with other information sources to make Science more Understandable
• Mashing up scientific texts with streamed video, genome annotation, protein structure & interactions
• SciVee
    http://www.scivee.tv
    Partnership: NSF, PLoS, San Diego Supercomputer Center
    Pubcasts: video correlated with PLoS papers automatically displayed as video runs
    Videos: scientists upload their own without papers
• Journal of Visualized Experiments (JoVE)
    http://www.jove.com
    Monthly issues of theme-related videos
    Procedure walk-throughs, interviews
    High-quality video and sound

SciVee




JoVE




            Fusing Data & Papers
           to Annotate the Genome
• Ideal project for 21st century is annotating every base
  of the genome
   Want to attach all publications and results to the genome
   "Fly through Genome" as way to access and understand the
    literature




• Problem of a good browser....




[Gerstein, Science ('00), Nature ('06)]

               We need a Google Earth for the
              Genome; A Step in this Direction...




[Herr, Holloway, Börner, Emergent Mosaic of Wikipedian Activity, 2007]

Impediments to the Vision

DB Interoperation & Federated Information Architecture
• Need to perform a “distributed query” over many sites
   Conventional web links
   More complex interfaces
• Annotation of the human genome involves a massive federation of interoperating servers
   "Administered" by many disparate people and groups
[Smith et al., BMC Bioinfo. ('07)]

Impediment #1: Structuring the Information Correctly for Large-scale Query

Structuring the Information
Curated DB entries in Uniprot vs Unstructured scientific text
Structured Semantic Web [Berners-Lee et al, Sci. Am. (2001)] vs purely unstructured text mining
EF2_YEAST
   Descriptive Name: Elongation Factor 2
   Lots of references to papers
   Summary sentence describing function: This protein promotes the GTP-dependent translocation of the nascent protein chain from the A-site to the P-site of the ribosome.
[Smith & Gerstein, Science ('06), Tech Rev. (Jul. '07)]

Other Issues with the Current Situation between DBs & Journals
• Not always a clear linkage between papers & DBs
   Keeping entries in DB and paper in sync
• Data aliquot
   Huge datasets are handled but what of isolated facts
• How to connect key attributes of Journals with DBs
     Attribution for credit & accountability
     Time stamping of unchanging entries
     Citation and history
     Well worked out process of QC via refereeing and editing
• Readability of Papers
   Detailed data embedded into papers, making text hard to read
[Gerstein, Bioinformatics ('99); Gerstein & Junker. Nature Yearbook ('02)]

Structured Abstract Proposal as a Compromise
• Storing information in papers in machine interpretable fashion
   for automatic deposition into DBs
   Abstract + standardized view of all tables
• Cross-referencing it with a specific part of the global genome, proteome, and interactome
   Article written as annotation from the start
• Done in parallel to submission & revision of normal journal article
   Refereed & edited normally
   Capitalizes on peer review & incentives to publish
• Curators vs editors
   Author is in control of this process
   But it’s officiated by referees and editors
[Seringhaus & Gerstein, FEBS ('08); Gerstein et al., Nature ('07)]

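One way to picture "machine interpretable" storage is a nested record that can be serialized and flattened into rows for deposition into a DB. The sketch below is only an illustration: the field names and the flattening scheme are hypothetical, the entries are borrowed from the K. lactis example in this talk, and the two-hybrid values are left as None because the slides show them only as XXX.

```python
import json

# Hypothetical schema; species/gene/protein names come from the talk's example.
structured_abstract = {
    "species": "K. lactis",
    "genes": [
        {
            "gene": "KlSTE4",
            "proteins": [
                {
                    "protein": "KlSte4p",
                    "mutants": {
                        "deletion": [
                            "Sterile in both MATa and MATα",
                            "No defect in vegetative growth",
                        ]
                    },
                    # Values are unknown placeholders (XXX in the slides).
                    "interactions": {
                        "two_hybrid": {"KlGpa1p": None, "KlGpa2p": None}
                    },
                }
            ],
        }
    ],
}


def flatten(node, path=()):
    """Yield (path, value) rows, one per leaf, suitable for a flat DB table."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, path + (index,))
    else:
        yield path, node


rows = list(flatten(structured_abstract))

# Round-trip through JSON to show the record survives serialization.
serialized = json.dumps(structured_abstract, ensure_ascii=False)
restored = json.loads(serialized)
```

Each row's path records where the fact sits in the hierarchy (species, gene, protein, assay), which is the cross-referencing a curator would otherwise have to reconstruct from free text.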
Ex. Structured Abstract
[Seringhaus & Gerstein, BMC Bioinformatics (2007)]

                                                     • K.lactis (species)
                                                            KlSTE4 (gene)
                                                              • KlSte4p (protein)                             Ex. Structured
                                                                  – CLONED
                                                                       » Available at …                          Abstract
[Seringhaus & Gerstein, BMC Bioinformatics (2007)]



                                                                  – SEQUENCED
                                                                       » Sequence
                                                                           ATGTACGCTATAGGC….
                                                                  – MUTANTS                                      KlGPA1 (gene)
                                                                       » DELETION                                  • KlGpa1p (protein)
                                                                       » FUNCTIONAL ASSAYS                             – INTERACTIONS
                                                                       » Sterile in both MATa and                           » TWO-HYBRID




                                                                           MATα                                             » KlSte4 = XXX
                                                                       » No defect in vegetative                   • KlGpa1p* (protein)
                                                                           growth                                      – INTERACTIONS
                                                                       » STRAIN INFORMATION                                 » TWO-HYBRID
                                                                       » Available at….                                     » KlSte4 = XXX
                                                                  – INTERACTIONS                                 KlGPA2 (gene)
                                                                       » TWO-HYBRID                                • KlGpa2p (protein)
                                                                       » KlGpa1p (10x stronger) =                      – INTERACTIONS
                                                                           XXX
                                                                                                                            » TWO-HYBRID




                                                                                                                                                                                    43 (c) Mark Gerstein, 2008
                                                                       » Control (no partner) = XXX
                                                                                                                            » KlSte4 = XXX
                                                                       » KlGpa1p* = XXX
                                                                       » KlGpa2p = XXX
                                                                                                          • S.cerevisiae (species)
                                                                                                                 SCGPA1 (gene)
                                                                       » ScGpa1p = XXX (S.
                                                                           cerevisiae)                             • ScGpa1p (protein)
                                                                  – COMMENTS                                          – INTERACTIONS
                                                                       » Both KlSte4p and KlGpa1p                          » TWO-HYBRID
                                                                           required to induce mating in                    » KlSte4 = XXX
                                                                           K.lactis
                                                                                                                                             Do not reproduce without permission
Unsupervised Textmining vs Manually Curated and Structured
Documents: Not necessarily a conflict
• Structured abs. might be good training sets for mining
• Also, gateway to mining
[Smith et al., Bioinformatics ('07)]

Impediment #2:
Access Restrictions Inhibit Large-scale Query

Absence of social framework for protecting "data" on the web
• Researchers unclear on framework
    The ambiguity of the present copyright laws governing the
    protection of databases creates a situation where researchers
    are (practically) unclear about their rights to extract and
    combine data
      • Putting articles up on sites, "quoting" annotation
    Likewise, researchers are unsure how to get "credit" for
    combined data ("Mash ups")
      • Disincentive to data integration
• Information owners, unsure of how laws safeguard
  their information, overprotect their data with licenses
  and technological mechanisms that impede
  interoperation.
[Greenbaum & Gerstein, Nat. Biotech. ('03)]

Technological safeguards to "protect" data
• Limits on Bulk Downloads & Global Analysis
    Passwords and IP filtering
      • allow the database owner to limit access to specific
        users and computers
      • selectively cut off access to researchers performing
        bulk calculations.
    Data can also be presented piecemeal, in response to a
    specific user query
    Examples
      • Incyte Proteome database
      • Cellzome database of interactions.
• Databases can be stored in proprietary formats
    Extreme is encryption
• Watermarking adds overt or hidden digital fingerprints
    Slightly corrupting the data.
    Not that common in bio-DBs (but found in British Library).
[Greenbaum & Gerstein, Nat. Biotech. ('03)]

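To make the "Passwords and IP filtering" safeguard concrete, here is a minimal sketch of how a database front end might restrict access to specific computers and cut off clients that issue bulk queries. It is an illustration under stated assumptions, not a description of any database named above: the address ranges are reserved documentation addresses, and ALLOWED_NETWORKS, MAX_QUERIES_PER_CLIENT and handle_query are hypothetical names.

```python
import ipaddress

# Hypothetical allow-list (documentation-only address ranges, not a real policy).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]
# Hypothetical cap; a real service would likely count per time window.
MAX_QUERIES_PER_CLIENT = 3

query_counts = {}


def is_allowed(client_ip):
    """Return True if the client address falls inside an allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


def handle_query(client_ip):
    """Apply the allow-list first, then the per-client query cap."""
    if not is_allowed(client_ip):
        return "denied: address not on allow-list"
    count = query_counts.get(client_ip, 0) + 1
    query_counts[client_ip] = count
    if count > MAX_QUERIES_PER_CLIENT:
        return "denied: bulk-query limit reached"
    return "ok"


print(handle_query("203.0.113.5"))   # outside the allow-list
print(handle_query("192.0.2.10"))    # inside the allow-list
```

A per-client cap of this kind is exactly what blocks a global analysis: a researcher who needs every record looks, to the server, the same as a client to be cut off.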
Free text Issue is Part of this Larger Context
• Different traditions in academic publishing vs DB world
    Genome sequence is free
    but have to pay for article about it!
• Many free text initiatives
    PubMedCentral.NIH.gov & arXiv.org
• Tricky economics of free text
    potentially efficient
    but redistributes dollars in world of academic publishing
    who pays: readers or writers
[Greenbaum et al. (2003) Interdiscip Sci Rev 28: 293-302.]

Impediment #3:
Security Considerations Inhibit Large-scale Query

Vast Computer Security Costs in the "Wild West" Internet
[Greenbaum et al., Nat. Biotech. ('04); Smith et al., GenomeBiol. ('05)]

Vast difficulty in securing information servers in academia
• Mundane administration -- patches
• Makes building intricate systems for interoperation
  difficult, as researchers have to continually check their
  interfaces for "holes"
• Unique impact on research (vs business)
    Free and broad dissemination of ideas between labs and
    public is hallmark of research.
    Preserving openness precludes standard security practices
    often employed in a corporate or military environment -- e.g.
    private networks
    Academic computer users exhibit great variability, making
    effective security procedures more difficult
[Greenbaum et al., Nat. Biotech. ('04); Smith et al., GenomeBiol. ('05)]

A Vision for Harnessing the Volume of Information
on the Web to Study the Structure of Science
• Main Applications of Large-scale Mining
    New Scientific Discoveries (not disc. here)
    Understanding Areas of Study through Simple Zipf Stats
      • Crystallography Nobel, Genomics, Gene Naming
    Maps of Science
      • Studying a genomics consortium, Bigger Map, Ranking Journals
    Dynamics of Science
      • Watching and modeling the appearance of new terms, RNAi ex.
• Impediments to Large-scale Mining (as Distributed Query)
    (Semi) Structuring the Information in Journals
    Overcoming access restrictions
    Security Considerations

Acknowledgements
TopNet.GersteinLab.org
NIH, NSF, Keck
MS, MG
M Seringhaus, K Cheung, D Greenbaum, M Schultz, G Montelione,
S Douglas, A Smith, K Yip, P Cayting
Job opportunities currently for postdocs & students

								