
                  Text Mining to Study the Structure of Science




                                                                                                                                               1 1 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu
                  Mark B Gerstein
          Yale (Comp. Bio. & Bioinformatics)

         Special Session on the Future of Scientific Publishing
                          ISMB'08, Toronto




                                                                                                                                                 Lectures.GersteinLab.org
                            2008.07.23 10:45-11:15
           Slides downloadable from Lectures.GersteinLab.org
                                      (Please read permissions statement.)


(Text-mining talk; fits into ~25' but without discussing the last two impediments. The non-"FULL" version is missing a
               graphic from A Rzhetsky. Some images have the Picasa tag "kwismb08ppt".)



                                                                                                         Do not reproduce without permission
            GersteinLab.org Research [at ISMB'08]

• Macromolecular Motions [Fri 2:15]
   - Analyzing select populations of 3D-structures in detail, trying to
     understand their flexibility in terms of packing (MolMovDB.org)
• Molecular Networks [Mon 10:45]
   - Using molecular networks to integrate & mine functional genomics
     information and describe protein function on a large scale
     (Networks.GersteinLab.org)
• Human Genome Annotation [Mon 2:20]
   - Characterizing the function of non-coding regions, focusing on protein
     fossils and novel RNAs (Pseudogene.org + Tiling.GersteinLab.org)
• Architectures for Scientific Information
  [Poster Mon. night + Wed 11:15]
   - Ways of structuring and textmining scientific publications and relating
     this to genome annotation.

Rapid growth in DBs in science, changing the landscape
                                      [Greenbaum & Gerstein, Nat. Biotech. ('03)]
    Part of the Changing Landscape:
   Situation Facing DBs and Journals
• Distinctions Blurring
   - Reading Journals via queries
     • Reading DB entries
   - Towards reading literature with computers
     • Mining text and correlating papers
• Biology as a science of heterogeneous facts
   - Well-suited to database storage
                  [Gerstein, Bioinformatics ('99); Gerstein & Junker. Nature Yearbook ('02)]
              Conventional Challenge
    • Hard to keep up with volume and growth of
      publications
    • Missed opportunities in connections between fields
    • Harness the power of technology to help scientists
      share information?

                       New Vision
    • Discover new scientific relationships
    • Study the Structure of Science itself
      ("Science of Science" or Science 2.0)
    • Publications as the annotation for the human genome
Overall Process of Web Mining
      [Rzhetsky et al., Cell ('08)]
Overall Process of Web Mining
Digest Texts into Simpler Machine Understandable Form and then Synthesize
      [Rzhetsky et al., Cell ('08)]
Overall Process of Web Mining
Doing better science: Finding new protein relationships (e.g. protein interactions), looking for inconsistencies in arguments, assembling consensus definitions automatically

Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. PNAS ('04); Iossifov et al. Probabilistic inference of molecular networks from noisy data sources. Bioinformatics ('04)
              [Rzhetsky et al., Cell ('08)]
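
One very simple way to surface candidate protein relationships from text is to count how often two protein names appear in the same sentence. The sketch below is only an illustration of that idea, not the method of the papers cited on this slide; the name list `PROTEINS` and the sentences in `SENTENCES` are toy inputs made up for the example.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical toy inputs: a tiny name list and three invented sentences.
PROTEINS = {"p53", "MDM2", "BCL2"}
SENTENCES = [
    "MDM2 binds p53 and promotes its degradation.",
    "Loss of p53 alters BCL2 expression.",
    "p53 is regulated by MDM2 in many cell types.",
]

def cooccurrence_counts(sentences, names):
    """Count unordered pairs of names that appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Split into word-like tokens, keeping digits and hyphens.
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence))
        found = sorted(tokens & names)  # sorted so each pair has one key
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts

pair_counts = cooccurrence_counts(SENTENCES, PROTEINS)
for (a, b), n in pair_counts.most_common():
    print(a, b, n)
```

Raw co-occurrence is noisy (it cannot tell "binds" from "does not bind"), which is one reason the slide also stresses looking for inconsistencies and building consensus across many texts.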
Overall Process of Web Mining
Mapping Science + Studying its Dynamics & Evolution
      [Rzhetsky et al., Cell ('08)]
   Overall Process of Web Mining
• Revealing patterns of collaboration
• Understanding basis of terms & nomenclature
• Tracking the evolution of ideas
• Models for the evolution of science
• Helping set policy & research directions
         [Rzhetsky et al., Cell ('08)]
 Overall Process of Web Mining
 Making it understandable (through “mashup”)
       [Rzhetsky et al., Cell ('08)]
 Examples Illuminating
Current State of Affairs:

 Mining Simple Term
Occurrence Statistics to
Understand and Justify
 Directions in Science




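
As a rough illustration of what "simple term occurrence statistics" can mean in practice, the sketch below counts how many records mention a term, in which year it first appears, and how the hits are spread over the years. The `RECORDS` list is an invented toy corpus, and `term_stats` is a hypothetical helper, not code from any of the cited studies.

```python
from collections import defaultdict
import re

# Hypothetical toy corpus of (publication year, title) pairs.
RECORDS = [
    (1995, "Defining the proteome of a simple organism"),
    (1997, "Transcriptome profiling by serial analysis"),
    (1998, "The proteome and the transcriptome compared"),
    (1998, "Proteome maps from two-dimensional gels"),
]

def term_stats(records, term):
    """Return (total hits, first year seen, hits per year) for a term."""
    per_year = defaultdict(int)
    # Whole-word, case-insensitive match of the term.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for year, text in records:
        if pattern.search(text):
            per_year[year] += 1
    total = sum(per_year.values())
    first_year = min(per_year) if per_year else None
    return total, first_year, dict(per_year)

for term in ("proteome", "transcriptome", "physiome"):
    print(term, term_stats(RECORDS, term))
```

The same three numbers (hit count, first year, yearly trend) could in principle be collected from any bibliographic source that exposes year and text for each record.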
Over-representation of crystallography
 among the Nobel Prizes, highlighted
         by the 2006 Nobels




                        [Seringhaus & Gerstein, Science (2007)]
        The current state of mammalian gene annotation:
               a rationale for data driven research



[Figure: labelled genes include p53, TNF, VEGF, EGFR, ER, NFKB, TGFB, IL-6, COX2, BCL2]
                        Adapted from Su and Hogenesch, Genome Biology, 2007
Gene Name Skew




                 [Seringhaus et al. GenomeBiology (2008)]
                   Ex. Naming Issue: Starry Night

      Starry night (P Adler, '94)
[Seringhaus et al. GenomeBiology (2008)]
Naming Pathologies: Related to Single Genes
 (b) drop dead: flies with mutations in drop dead die rapidly after their brain rapidly deteriorates.
 (c) malvolio: gene needed for normal taste behaviour. Malvolio in Shakespeare's Twelfth Night tasted "with distempered appetite".
 (d) LOV: light, oxygen, or voltage (LOV) family of blue-light photoreceptor domains.
 (e) yuri: this gene was discovered on the anniversary of Yuri Gagarin's space flight. Mutants have problems with gravitaxis and cannot stay aloft.
 (f) tribbles: cells divide uncontrollably, like the eponymous Star Trek characters.
 (g) kuzbanian: mutants have uncontrollable bristle growth. Koozbanians are alien Muppets with uncontrollable hair growth; spelling was changed to avoid copyright infringement.
 (h) ring: really interesting new gene.
 (i) yippee: a graduate student's reaction on cloning the gene.
         [Seringhaus et al. GenomeBiology (2008)]
Naming Pathologies: Involving Multiple Gene Names
 (j) kryptonite and superman: the kryptonite mutation suppresses the function of the SUPERMAN gene.
 (k) arleekin, valient, tungus: mutations in arleekin, valient, tungus and 29 other genes affect long-term memory. Named after Pavlov's dogs.
 (l) PKD1 (human) and lov-1 (worm): these are homologs, although their names do not suggest it.
 (m) MT-1: this label can refer to at least 11 different human genes.
 (n) BAF45 and BAF47: names for the same gene, reflecting a revision of the molecular weight of product.
           [Seringhaus et al. GenomeBiology (2008)]
Examples Illuminating Current State of Affairs:
Using Network Representations to
Make Maps of Science -- Studying
   the Publication Patterns of
      Genomics Consortia




                    [Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]
 Co-Authorship
   Publication
Network of Struc.
   Genomics
Consortia (NESG)




                    [Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]
Co-authorship Networks comparing the 9 NIH Structural Genomics Centers
[Figure labels: Average Degree; (45), (19), (7)]
[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]
Different Representations of Publication Network of NESG
[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]
Clustering structures determined by struc. genomics consortia according to functional similarity: Is there a functional bias in consortia structures?

        Avg. Degree   Avg. Path   Clust. Coeff.   Diameter
PSI         24           2.6          37%            7
PDB          6           3.9          31%            9

[Douglas et al. GenomeBiol. ('05), pubnet.gersteinlab.org]
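
The four quantities in the table (average degree, average path length, clustering coefficient, diameter) can be computed for any co-authorship network. The sketch below does so for a tiny invented network; `PAPERS` and its author names are hypothetical, and the clustering coefficient used here is the average of local clustering values, which may differ from the definition used in the cited analysis.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy input: each paper is a list of invented author names.
PAPERS = [["A", "B", "C"], ["C", "D"], ["D", "E"]]

def build_graph(papers):
    """Link every pair of authors who share at least one paper."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set())
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def bfs_distances(graph, source):
    """Shortest hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def network_stats(graph):
    """Return (avg degree, avg local clustering, avg path length, diameter)."""
    n = len(graph)
    avg_degree = sum(len(nbs) for nbs in graph.values()) / n
    local = []
    for nbs in graph.values():
        k = len(nbs)
        if k < 2:
            local.append(0.0)  # fewer than two neighbours: no triangles possible
            continue
        links = sum(1 for u, v in combinations(nbs, 2) if v in graph[u])
        local.append(2 * links / (k * (k - 1)))
    clustering = sum(local) / n
    # Path statistics over all reachable ordered pairs (assumes >= 2 linked nodes).
    lengths = [d for s in graph for t, d in bfs_distances(graph, s).items() if t != s]
    return avg_degree, clustering, sum(lengths) / len(lengths), max(lengths)

avg_degree, clustering, avg_path, diameter = network_stats(build_graph(PAPERS))
print(avg_degree, round(clustering, 3), avg_path, diameter)
```

A dense, tightly knit consortium would show up as a high average degree and short paths, which is the kind of contrast the PSI and PDB rows illustrate.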
   Making
Larger Maps:
Mapping the
 whole field
of Melanoma
  Research




     [Boyack, Kevin W., Mane, Ketan and Börner, Katy. (2004). Mapping Medline Papers, Genes, and
        Proteins Related to Melanoma Research. IV2004 Conference, London, UK, pp. 965-971.]
             Backbone Map of Science & Soc. Sci.




[Boyack, Kevin W., Klavans, R. and Börner, Katy. (2005). Mapping the Backbone of Science. Scientometrics. 64(3), 351-374.
 http://grants.nih.gov/grants/KM/OERRM/OER_KM_events/Borner.pdf]
Examples Illuminating Current
      State of Affairs:
Analyzing the Dynamics
      of Science




'Omics terms over the years [Greenbaum et al. Gen. Res. ('01)]

Term            Google Hits   PubMed Hits   1st PubMed Hit Year
Genome          ~1880000      66171         1932
Proteome        ~63,000       703           1995
Transcriptome   3520          72            1997
Physiome        2980          15            1997
Metabolome      349           12            1998
Phenome         4980          6             1995
Morphome        238           2             1996
Interactome     56            2             1999
Glycome         46            1             2000
Secretome       21            1             2000
Ribonome        1             1             2000
Orfeome         42            -             -
Regulome        18            -             -
Cellome         17            -             -
Operome         8             -             -
Transportome    1             -             -
Functome        1             -             -

[Figure: Proteome, PubMed Hits]
                       Evolution of Science



Map based on "bursty" words in life sciences publications since 1980. Older fundamental research (center) led to four different areas (subgraphs in corners).

This and other domain maps at http://www.scimaps.org.

• K. Mane, K. Börner (2004). Mapping topics and topic bursts in PNAS. 101 (Supplement 1): 5287.
RNAi: Birth of a Field in the Literature Culminating in the 2006 Nobel

Source: Gerstein & Douglas. PLoS Comp. Bio. 3:e80 (2007)
PubNet.GersteinLab.org
The Social Dynamics of Innovation and
        Scientific Discovery:
      Making Models of Science

                                     The Patterns of
                                     Discovery and the
                                     Spread of Ideas as
                                     Epidemics.



                                     The population dynamics of authors in an emerging
                                     field are well described by models similar to those of
                                     epidemics, but that take into account contact
                                     processes and intentionality characteristic of human
                                     social dynamics. Panel (a) shows a SEIRZ model,
                                     (b) its best solution applied to the spread of
                                     Feynman diagrams in the USA, Japan and the
                                     Soviet Union, and (c) details the model parameters'
                                     interpretation. The spread of ideas is characterized
                                     by relatively low contact rates (compared to
                                     infectious diseases), and very long lifetimes for the
                                     idea, as well as intentional structures to promote
                                     interaction between individuals during the learning
                                     process.

                  [Bettencourt et al., Physica A (2006)]
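
To make the epidemic analogy concrete, the sketch below integrates a plain SEIR model, in which susceptible researchers (S) become exposed to an idea (E), then adopt it (I), and eventually stop using it (R). This is a deliberate simplification: it omits the Z class and the other terms of the SEIRZ model shown on the slide, and every parameter value is an arbitrary assumption chosen only so that the example runs.

```python
def simulate_seir(beta, kappa, gamma, s0, e0, i0, r0, dt=0.1, steps=1000):
    """Forward-Euler integration of a plain SEIR model (no Z class).

    beta  : contact rate between susceptibles and adopters (assumed value)
    kappa : rate at which exposed individuals become adopters (assumed value)
    gamma : rate at which adopters stop using the idea (assumed value)
    """
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    n = s + e + i + r
    history = [(s, e, i, r)]
    for _ in range(steps):
        new_exposed = beta * s * i / n   # S meets an adopter
        new_adopters = kappa * e         # E starts using the idea
        new_recovered = gamma * i        # I stops using the idea
        s += dt * (-new_exposed)
        e += dt * (new_exposed - new_adopters)
        i += dt * (new_adopters - new_recovered)
        r += dt * new_recovered
        history.append((s, e, i, r))
    return history

# Arbitrary illustrative parameters, not fitted to any data.
history = simulate_seir(beta=0.5, kappa=0.2, gamma=0.05, s0=990, e0=0, i0=10, r0=0)
peak_adopters = max(state[2] for state in history)
print(round(peak_adopters, 1))
```

A small `gamma` corresponds to the long lifetime of an idea noted on the slide, and a small `beta` to the low contact rate compared with infectious diseases.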
         Examples Illuminating Current State of Affairs:
         Mashing up the Text from Scientific
         Publications with other information
           sources to make Science more
                   Understandable
• Mashing up scientific texts with
  streamed video, genome annotation,
  protein structure & interactions
• SciVee
   - http://www.scivee.tv
   - Partnership: NSF, PLoS, San Diego Supercomputer Center
   - Pubcasts: video correlated with PLoS papers, automatically displayed as video runs
   - Videos: scientists upload their own without papers
• Journal of Visualized Experiments (JoVE)
   - http://www.jove.com
   - Monthly issues of theme-related videos
   - Procedure walk-throughs, interviews
   - High-quality video and sound
                                                    [Bourne et al., PLoS Comp Bio ('08)]
            Fusing Data & Papers
           to Annotate the Genome
• Ideal project for 21st century is annotating every base
  of the genome
   - Want to attach all publications and results to the genome
   - "Fly through Genome" as way to access and understand the
     literature
• Problem of a good browser....

                                        [Gerstein, Science ('00), Nature ('06)]
Impediments to the
      Vision




DB Interoperation & Federated Information Architecture
• Need to perform a “distributed query” over many sites
   - Conventional web links
   - More complex interfaces
• Annotation of the human genome involves a massive federation of interoperating servers
   - "Administered" by many disparate people and groups

[Smith et al., BMC Bioinfo. ('07)]
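
A "distributed query" can be pictured as asking every site the same question and merging whatever comes back. The sketch below is a minimal illustration under strong assumptions: the three "sites" are in-memory dictionaries standing in for independently administered servers, the site names and field names are hypothetical, and the reference count is invented. A real federation would replace each dictionary lookup with a network call and would need to reconcile differing identifiers and schemas.

```python
# Hypothetical in-memory stand-ins for independently administered sites.
SITES = {
    "site_a": {"EF2_YEAST": {"name": "Elongation Factor 2"}},
    "site_b": {"EF2_YEAST": {"references": 3}},  # count is invented
    "site_c": {},                                # this site knows nothing
}

def distributed_query(sites, key):
    """Ask every site for `key` and merge whatever each one returns."""
    merged = {}
    sources = []
    for site_name in sorted(sites):      # fixed order keeps results repeatable
        record = sites[site_name].get(key)
        if record is None:
            continue                     # site has no entry for this key
        sources.append(site_name)
        for field, value in record.items():
            merged.setdefault(field, value)  # first site wins on conflicts
    return merged, sources

merged, sources = distributed_query(SITES, "EF2_YEAST")
print(merged, sources)
```

Keeping the list of contributing sites alongside the merged record is one simple way to preserve attribution when data from several groups are combined.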
 Impediment #1:
  Structuring the
   Information
Correctly for Large-
   scale Query




                               Structuring the Information
• Curated DB entries in Uniprot vs Unstructured scientific text
• Structured Semantic Web [Berners-Lee et al, Sci. Am. (2001)] vs purely unstructured text mining

EF2_YEAST
   - Descriptive Name: Elongation Factor 2
   - Lots of references to papers
   - Summary sentence describing function: This protein promotes the GTP-dependent translocation of the nascent protein chain from the A-site to the P-site of the ribosome.

[Smith & Gerstein, Science ('06), Tech Rev. (Jul. '07)]
      Other Issues with the Current
   Situation between DBs & Journals
• Not always a clear linkage between papers & DBs
   - Keeping entries in DB and paper in sync
• Data aliquot
   - Huge datasets are handled but what of isolated facts
• How to connect key attributes of Journals with DBs
   - Attribution for credit & accountability
   - Time stamping of unchanging entries
   - Citation and history
   - Well worked out process of QC via refereeing and editing
• Readability of Papers
   - Detailed data embedded into papers, making text hard to read
                    [Gerstein, Bioinformatics ('99); Gerstein & Junker. Nature Yearbook ('02)]
    Structured Abstract Proposal as a
              Compromise
• Storing information in papers in machine interpretable
  fashion
   - for automatic deposition into DBs
   - Abstract + standardized view of all tables
• Cross-referencing it with a specific part of the global
  genome, proteome, and interactome
   - Article written as annotation from the start
• Done in parallel to submission & revision of normal
  journal article
   - Refereed & edited normally
   - Capitalizes on peer review & incentives to publish
• Curators vs editors
   - Author is in control of this process
   - But it's officiated by referees and editors
                          [Seringhaus & Gerstein, FEBS ('08); Gerstein et al., Nature ('07)]
Ex. Structured Abstract
[Seringhaus & Gerstein, BMC Bioinformatics (2007)]
Ex. Structured Abstract
[Seringhaus & Gerstein, BMC Bioinformatics (2007)]

• K.lactis (species)
   - KlSTE4 (gene)
      • KlSte4p (protein)
         – CLONED
            » Available at …
         – SEQUENCED
            » Sequence ATGTACGCTATAGGC….
         – MUTANTS
            » DELETION
            » FUNCTIONAL ASSAYS
            » Sterile in both MATa and MATα
            » No defect in vegetative growth
            » STRAIN INFORMATION
            » Available at….
         – INTERACTIONS
            » TWO-HYBRID
            » KlGpa1p (10x stronger) = XXX
            » Control (no partner) = XXX
            » KlGpa1p* = XXX
            » KlGpa2p = XXX
            » ScGpa1p = XXX (S. cerevisiae)
         – COMMENTS
            » Both KlSte4p and KlGpa1p required to induce mating in K.lactis
   - KlGPA1 (gene)
      • KlGpa1p (protein)
         – INTERACTIONS
            » TWO-HYBRID
            » KlSte4 = XXX
      • KlGpa1p* (protein)
         – INTERACTIONS
            » TWO-HYBRID
            » KlSte4 = XXX
   - KlGPA2 (gene)
      • KlGpa2p (protein)
         – INTERACTIONS
            » TWO-HYBRID
            » KlSte4 = XXX
• S.cerevisiae (species)
   - SCGPA1 (gene)
      • ScGpa1p (protein)
         – INTERACTIONS
            » TWO-HYBRID
            » KlSte4 = XXX
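
The outline in this example (species, then gene, then protein, then typed facts) maps naturally onto nested data that a database could ingest directly. The sketch below encodes one small fragment of the slide as a Python dictionary and serializes it to JSON. The field names (`species`, `genes`, `proteins`, `interactions`, `method`, `partner`, `value`) are assumptions made for illustration, not a published schema, and the "XXX" placeholder is kept exactly as it appears on the slide.

```python
import json

# Fragment of the slide's example; "XXX" placeholders are kept as given.
structured_abstract = {
    "species": "K.lactis",
    "genes": [
        {
            "gene": "KlGPA2",
            "proteins": [
                {
                    "protein": "KlGpa2p",
                    "interactions": [
                        {"method": "TWO-HYBRID", "partner": "KlSte4", "value": "XXX"}
                    ],
                }
            ],
        }
    ],
}

def list_interactions(abstract):
    """Flatten the nesting into (protein, partner, method) tuples."""
    rows = []
    for gene in abstract["genes"]:
        for protein in gene["proteins"]:
            for hit in protein["interactions"]:
                rows.append((protein["protein"], hit["partner"], hit["method"]))
    return rows

serialized = json.dumps(structured_abstract, indent=2)
print(serialized)
print(list_interactions(structured_abstract))
```

Once facts are stored this way, a query such as "all two-hybrid interactions reported in this paper" becomes a simple traversal rather than a text-mining problem.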
              Unsupervised Textmining
          vs Manually Curated and Structured
         Documents: Not necessarily a conflict
  • A relatively small number of structured abstracts
    might be a good training set for mining
  • Also, a gateway to mining (e.g. listing standard
    names for genes as a cast of characters,
    highlighting foreground vs. background concepts)
[Smith et al., Bioinformatics ('07)]

   Impediment #2:
Access Restrictions Inhibit Large-scale Query

     Absence of social framework for
      protecting "data" on the web
• Researchers unclear on framework
   The ambiguity of the present copyright laws governing the
    protection of databases creates a situation where researchers
    are (practically) unclear about their rights to extract and
    combine data
     • Putting articles up on sites, "quoting" annotation
   Likewise, researchers are unsure how to get "credit" for
    combined data ("mash-ups")
     • Disincentive to data integration
• Information owners, unsure of how laws safeguard
  their information, overprotect their data with licenses
  and technological mechanisms that impede
  interoperation.
[Greenbaum & Gerstein, Nat. Biotech. ('03)]

           Technological safeguards
               to "protect" data
• Limits on Bulk Downloads & Global Analysis
   Passwords and IP filtering
     • allow the database owner to limit access to specific
       users and computers
     • selectively cut off access to researchers performing
       bulk calculations
   Data can also be presented piecemeal, in response to a
    specific user query
   Examples
     • Incyte Proteome database
     • Cellzome database of interactions
• Databases can be stored in proprietary formats
   Extreme is encryption
• Watermarking adds overt or hidden digital fingerprints
   Slightly corrupts the data
   Not that common in bio-DBs (but found in British Library)
[Greenbaum & Gerstein, Nat. Biotech. ('03)]

                   Free text Issue is Part
                   of this Larger Context
• Different traditions in academic publishing vs DB world
   Genome sequence is free
   but have to pay for article about it!
• Many free text initiatives
   PubMedCentral.NIH.gov & arXiv.org
• Tricky economics of free text
   potentially efficient
   but redistributes dollars in world of academic publishing
   who pays: readers or writers
[Greenbaum et al. (2003) Interdiscip Sci Rev 28: 293-302.]

   Impediment #3:
Security Considerations Inhibit Large-scale Query

     Vast Computer Security Costs in the "Wild West" Internet
[Greenbaum et al., Nat. Biotech. ('04); Smith et al., Genome Biol. ('05)]

     Vast difficulty in securing information
             servers in academia
• Mundane administration: patches
• Security concerns make building intricate systems for interoperation
  difficult, as researchers have to continually check their
  interfaces for "holes"
• Unique impact on research (vs business)
   Free and broad dissemination of ideas between labs and
    public is a hallmark of research
   Preserving openness precludes standard security practices
    often employed in a corporate or military environment, e.g.
    private networks
   Academic computer users exhibit great variability, making
    effective security procedures more difficult
[Greenbaum et al., Nat. Biotech. ('04); Smith et al., Genome Biol. ('05)]

 A Vision for Harnessing the Volume of Information
   on the Web to Study the Structure of Science

• Main Applications of Large-scale Mining
   New Scientific Discoveries (not discussed here)
   Understanding Areas of Study through Simple Zipf Stats
     • Crystallography Nobel, Genomics, Gene Naming
   Maps of Science
     • Studying a genomics consortium, Bigger Map, Ranking Journals
   Dynamics of Science
     • Watching and modeling the appearance of new terms, RNAi ex.
• Impediments to Large-scale Mining (as Distributed Query)
   (Semi) Structuring the Information in Journals
   Overcoming access restrictions
   Security Considerations

 Acknowledgements
TopNet.GersteinLab.org

   M Seringhaus, K Cheung, D Greenbaum, M Schultz, A Rzhetsky,
   G Montelione, S Douglas, A Smith, K Yip, P Cayting

   NIH, NSF, Keck

 Job opportunities currently for postdocs & students

								