Shankar Subramaniam - Home Sociogenomics Initiative by wuyunqing


The Sociogenomics
A prototype environment for scalable, extensible, portable and
seamlessly accessible information
Core B Proposal: Aim 1

 a strategy for upload of data into the central
    server (SDSC/NCSA)

 • laboratory information management system

 • development of web interfaces (with
   background ftp scripts) for data upload

 • data formats/tools for standardization
Core B Proposal: Aim 2

 Development of the Glue Grant Website

 • web design

 • web development

 • User Interfaces
Core B Proposal: Aim 3

 databasing and data storage (backup
 • development of required ontologies
 • development of schema and
   implementation of data tables (assume we
   will use Oracle DB)
 • development of middle layer for
   query/input/output tools
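
As a rough illustration of the schema, data tables and middle layer listed above, the sketch below uses Python's built-in sqlite3 module purely as a stand-in for the Oracle database the slide assumes. The table names, column names and helper functions are hypothetical and are not taken from the proposal.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# and SQLite stands in for the Oracle database assumed in the proposal.
SCHEMA = """
CREATE TABLE sample (
    barcode   TEXT PRIMARY KEY,
    species   TEXT NOT NULL,
    tissue    TEXT NOT NULL
);
CREATE TABLE experiment (
    id        INTEGER PRIMARY KEY,
    barcode   TEXT NOT NULL REFERENCES sample(barcode),
    data_type TEXT NOT NULL,
    file_name TEXT NOT NULL
);
"""

def open_store(path=":memory:"):
    """Create the tables and return a connection with foreign keys enforced."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def add_sample(conn, barcode, species, tissue):
    """Middle-layer input: register one sample."""
    conn.execute("INSERT INTO sample VALUES (?, ?, ?)", (barcode, species, tissue))

def add_experiment(conn, barcode, data_type, file_name):
    """Middle-layer input: attach one data file to an existing sample."""
    cur = conn.execute(
        "INSERT INTO experiment (barcode, data_type, file_name) VALUES (?, ?, ?)",
        (barcode, data_type, file_name))
    return cur.lastrowid

def experiments_for(conn, species, data_type):
    """Middle-layer query: file names for one species and data type."""
    rows = conn.execute(
        "SELECT e.file_name FROM experiment e "
        "JOIN sample s ON s.barcode = e.barcode "
        "WHERE s.species = ? AND e.data_type = ? ORDER BY e.id",
        (species, data_type))
    return [row[0] for row in rows]
```

Keeping all SQL behind a few functions like these is one way to let the query and upload interfaces change without touching the tables directly.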
Core B Proposal: Aim 4

 Development of query tools

 • User Interfaces

 • query engines and background SQL

 • display of query outputs/formats/GUIs
Core B Proposal: Aim 5

 Development of dissemination/display tools
   for data and analysis

 • integration of analysis tools developed in
   Core C

 • integration of analysis outputs from Core
   C into the website

 • Development of data and analysis
   download tools
Core B Proposal: Aim 6

 Community pages for enrichment of the data

 • Community input pages

 • Wiki?

 • integration of public data and analysis
   (who will curate?)
Core C Proposal: Aim 1

 Proposed Informatics Pipeline (responding to the
    critique that the multiple omics were not
    well integrated)
 A. Primary data:
 1) Transcriptomics (RNA-seq) on specific brain
    regions or neuron populations
 2) ChIP-seq for a set of 10 pre-determined
    (from literature) and 10 data-driven TFs
 3) Promoter scans for cis-regulatory elements
 4) Epigenomics (BS-seq)
Core C Proposal: Aim 2

 B. Build Transcriptional Regulatory Networks
    from 1-4 of Aim 1
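
One simple way to combine data types 1-4 into a network is to keep a TF-to-target edge only when several independent data sources support it. The sketch below illustrates that idea only; it is not the proposal's method. The evidence names, the input format and the `min_support` threshold are all assumptions made for the example.

```python
from collections import defaultdict

def build_network(evidence, min_support=2):
    """Return {tf: sorted targets} for edges backed by >= min_support sources.

    `evidence` maps a source name (for example "chip_seq") to a set of
    (tf, target) pairs. Both the names and the format are hypothetical.
    """
    support = defaultdict(set)
    for source, edges in evidence.items():
        for edge in edges:
            support[edge].add(source)

    network = defaultdict(list)
    for (tf, target), sources in support.items():
        if len(sources) >= min_support:
            network[tf].append(target)
    return {tf: sorted(targets) for tf, targets in network.items()}

# Toy input with made-up TF and gene names.
evidence = {
    "chip_seq": {("TF1", "geneA"), ("TF1", "geneB")},
    "promoter_scan": {("TF1", "geneA"), ("TF2", "geneC")},
    "coexpression": {("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneC")},
}
print(build_network(evidence))
```

Raising `min_support` trades network size for confidence; a weighted score per evidence type would be a natural refinement.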
Core C Proposal: Aim 3

 C. Test predictions with RNAi and relevant
    drugs, followed by:

 - Targeted proteomics
 - Targeted metabolomics
 - Hormone measurements
 - Behavioral analysis
Core C Proposal: Aim 4

 D. Conduct within- and cross-species
    comparisons of networks with existing and
    new analytical tools.

 Focus on "module" as the unit of analysis?
 What is conserved (universal)?
 What is species specific?
 What varies genotypically within species?
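
If the module is taken as the unit of analysis, the questions above reduce to set comparisons once genes are mapped to shared ortholog groups. The following is a minimal sketch under that assumption; the ortholog-group IDs and the choice of Jaccard similarity are illustrative and do not come from the slides.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def compare_modules(module_a, module_b):
    """Split two modules (sets of ortholog-group IDs) into shared and specific parts."""
    return {
        "conserved": module_a & module_b,   # present in both species
        "only_a": module_a - module_b,      # specific to species A
        "only_b": module_b - module_a,      # specific to species B
        "jaccard": jaccard(module_a, module_b),
    }

# Toy modules with made-up ortholog-group IDs.
result = compare_modules({"OG1", "OG2", "OG3"}, {"OG2", "OG3", "OG4"})
print(result["conserved"], result["jaccard"])
```

The same function can compare modules between genotypes within one species, which covers the last question without new machinery.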
                Core B
Infrastructure for the SocioGenomics
 Overall Mission: To create a comprehensive
  information infrastructure that will
 - Enable persistent archiving of information
   pertaining to SGP
 - Provide a seamless interactive infrastructure for
   SGP researchers
 - Create a state-of-the-art SGP Workbench that
   will contain tools, databases and interfaces
 - Develop an overarching resource that will enable
   total research administration of SGP initiatives

 The infrastructure will be based on a Hub and Spoke Model.
       Hub & Spoke Model of SGP-IT

[Diagram: the SGP-IT hub at the center, connected to six spokes]
 • Research Centers
 • Core Technology
 • Individual Research
 • Public Repositories
 • Administration
 • Curations
Information Flow in SGP-IT

[Diagram: Core Information at the center, surrounded by LIMS, Experimental, Informatics, Clinical data, Research Centers and PI Laboratories]



Informatics System Architecture
The SGP-IT WebPortal

Create a Web Portal that will serve as the entry point
           to the SGP-IT infrastructure
  Website Users

• Anonymous Public
• Registered Public
• SGP Personnel
• Content Contributors
• Website Developers
 Website Developers
        • Overall Development
        • Cellular Data
        • Phenotype Data
        • Other Experimental Data
        • Clinical Data
        • Database Search Utilities

The organization of the website developers
as independent groups developing separate
sections of the website leads to a diverse
collection of pages and programs rather
than a centralized program.
       Website Content

Public Areas
• About CalSCI
• Data
• Home
• Login
• Membership
• Stem Cell Pages
• Protocols
• Reports
• Pathway Maps
• Sponsors

Restricted Areas
• Access Control Administration
• Data Administration
• E-mail Aliases
• Progress Reports
     –   View
     –   Upload
• System Committees
• Message Boards
• Barcode
• Data
• Review Committees
    Access to SGP-IT

Who will have access and how will it be regulated?
 Overview of Access Control Layers

[Diagram: an HTTP request from the user passes down through the stack below, and the HTTP response from the server returns to the user]
   Apache mod_perl
   Static HTML | Perl CGI | Java servlets
   Oracle database

Access Control Layers

1. Web server layer: General access control for all users of the website
2. Application layer: More specific access control and personalization for users of each application (username can be read from cookie set by Layer 1)
3. Database layer: Database gives different access privileges to each
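
The three layers can be pictured as a chain of checks that a request must pass in order. The sketch below is only an illustration: the user names, roles, application names and privilege sets are hypothetical, and it assumes the database layer assigns privileges per role, which the slide text does not spell out.

```python
# Hypothetical users, roles and privileges; none of these names come from the slides.
USERS = {"alice": "sgp_personnel", "bob": "registered_public"}
APP_ROLES = {
    "data_upload": {"sgp_personnel"},
    "data_browse": {"sgp_personnel", "registered_public"},
}
DB_PRIVILEGES = {
    "sgp_personnel": {"SELECT", "INSERT"},
    "registered_public": {"SELECT"},
}

def web_server_layer(cookies):
    """Layer 1: general access control; returns the known username or None.

    A real deployment would verify a signed session cookie rather than
    trusting the raw value as this sketch does.
    """
    user = cookies.get("username")
    return user if user in USERS else None

def application_layer(user, application):
    """Layer 2: per-application check using the username set by layer 1."""
    return USERS[user] in APP_ROLES.get(application, set())

def database_layer(user, operation):
    """Layer 3: privileges granted by the database (assumed here to be per role)."""
    return operation in DB_PRIVILEGES[USERS[user]]

def authorize(cookies, application, operation):
    """A request succeeds only if all three layers allow it."""
    user = web_server_layer(cookies)
    if user is None:
        return False
    return application_layer(user, application) and database_layer(user, operation)
```

Because each layer is checked independently, a mistake in one application's rules is still bounded by what the database role permits.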
      Access Control in SGP-IT Gateway

[Diagram: gateway zones ranging from "Completely open" to "Login required", covering Home & Info Pages, Reviews, Experimental Data and Clinical Data, all backed by the Oracle Database]

Also some special sections for sole use of:
• Review Panel Staff

Oracle Database: No direct human access (except for sysadmin)
Laboratory Information
 Management System

Creation of a LIMS for mandatory use: Importance of
             tracking and validating data
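CalSCI-IT Laboratory Information Management System (LIMS)

[Diagram: an IP barcode printer produces a barcode for each test tube/sample vial; the barcode follows the sample through preparation, processing and handling]
 • Sample tracking (barcode)
 • Data integrity/storage
 • Data input GUIs
 • Sample preparation
 • Sample processing
 • Sample handling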
         A barcode printer & a barcode scanner

  Zebra Z4M barcode printer - 300DPI - 4MB Total - 2MB Usable
  Zebranet Print Server II, External, Parallel

  Cyclone Scanner USB & Synapse Adapter
SGP-IT Data Center

Automatic Pipelining of Data from Research
  Laboratories in the Stem Cell Initiative
    Transfer of experimental
    Experimental data files are transferred on a periodic basis using the
    UNIX wget utility under control of a cron job.

[Diagram: phenotype, cellular, microarray, transfection and other data flow into SGP-IT from UCSD, UCB, U. Penn and other sites]
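
A minimal sketch of how such a periodic transfer could be set up: one helper builds a wget command line and another formats the matching crontab entry. The URL, directories and schedule are placeholders invented for the example, not values from the slides.

```python
import shlex

def wget_command(url, dest_dir, log_file):
    """Build a wget invocation that fetches only files newer than the local copies."""
    return [
        "wget",
        "--recursive", "--no-parent",      # stay inside the lab's export directory
        "--timestamping",                  # skip files that have not changed
        f"--directory-prefix={dest_dir}",  # where downloaded files are stored
        f"--append-output={log_file}",     # keep a transfer log for later steps
        url,
    ]

def crontab_line(minute, hour, command):
    """Format a crontab entry that runs `command` once a day at hour:minute."""
    return f"{minute} {hour} * * * {shlex.join(command)}"

# Placeholder host and paths (assumptions for illustration only).
cmd = wget_command("ftp://lab.example.org/export/",
                   "/data/incoming/lab",
                   "/data/logs/auto-ftp.log")
print(crontab_line(30, 1, cmd))
```

The printed line can be pasted into a crontab; with timestamping enabled, repeated runs transfer only new or changed files.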
      Parsing log files / loading db
  •   A set of cron jobs runs each night. Names of newly uploaded files are passed as
      arguments to loader scripts.
  •   Log files are created so that experiment schemas can be easily repopulated.

[Diagram: the auto-ftp log feeds four parallel pipelines, each running Data master, then Data loader, then Data@CalSCI]
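
The nightly loading step described above can be sketched as a loader that receives file names, loads each file, and appends an entry to a log that can later be replayed to repopulate the data. The function names and the log format are assumptions, and a plain Python list stands in for the database table.

```python
import csv

def load_file(path, table):
    """Append the rows of one tab-delimited file to `table` (a list standing in for a DB table)."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    table.extend(rows)
    return len(rows)

def run_loader(paths, table, log_path):
    """Load each newly uploaded file and record its name and row count in the log."""
    with open(log_path, "a") as log:
        for path in paths:
            count = load_file(path, table)
            log.write(f"{path}\t{count}\n")

def repopulate(log_path, table):
    """Rebuild `table` from scratch by replaying every file named in the log."""
    table.clear()
    with open(log_path) as log:
        for line in log:
            path, _count = line.rstrip("\n").split("\t")
            load_file(path, table)
```

A cron job would call `run_loader` with the file names found in the transfer log; `repopulate` covers the case where a schema has to be rebuilt from the original files.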
           Graphics generation
•   Scripts generate libraries of static images, HTML, and tab-delimited files. Data are
    pulled from experiment and barcode schemas.

[Diagram: four parallel flows, each taking data@CalSCI through a process step into a data/ directory]

•   Argument passed to CGI script to specify graphics, etc. to be

[Diagram: cal_data.cgi draws on data/Type and Data/specific to return text, graphics, links, etc. for a given data type with specific data]
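
To illustrate how a single argument could select pre-generated content, the sketch below maps a query-string parameter to entries in a static library. The parameter name `type`, the file paths and the function names are hypothetical; the real cal_data.cgi interface is not described in the slides.

```python
from urllib.parse import parse_qs

# Hypothetical layout of the pre-generated static library, keyed by data type.
STATIC_LIBRARY = {
    "microarray": {"image": "data/microarray/summary.png",
                   "table": "data/microarray/summary.txt"},
    "phenotype": {"image": "data/phenotype/summary.png",
                  "table": "data/phenotype/summary.txt"},
}

def resolve(query_string):
    """Map a query string such as 'type=microarray' to its pre-generated files."""
    params = parse_qs(query_string)
    data_type = params.get("type", [""])[0]
    return STATIC_LIBRARY.get(data_type)  # None for unknown types

def render(query_string):
    """Return a minimal HTML fragment linking the static image and table."""
    entry = resolve(query_string)
    if entry is None:
        return "<p>Unknown data type</p>"
    return f'<img src="{entry["image"]}"> <a href="{entry["table"]}">tab-delimited data</a>'
```

Looking the argument up in a fixed dictionary, rather than building a path from it, keeps arbitrary user input from reaching the file system.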
Example Data Flow
The SGP Data and
Knowledge System

   Data Analysis Infrastructure
  Biology: Genotype to Phenotype



AND NETWORKS                   MODELS

  Analysis Infrastructure
          Infrastructure for Data Integration and Analysis

[Diagram: Functional Genomics Workbench, Systemic Annotation Infrastructure and Biochemical Pathways Workbench, built over Biological Projects and Bioinformatics Data]

[Diagram: Gene Data, Transcriptome, Phenotypic Data, Cellular Data and Metabolome Data feed a Database with a Query & Analysis Infrastructure]
     The Parts-List: SGP Strategy

[Diagram labels: Cell-Specific Genes; Transcriptome; Stem Cell Pages; Gene Products; Signaling Proteins Invoked in Input-Specific Response; Gene Products Invoked by Input]
                   Reconstructing Networks

Legacy Data
 • Protein Interactions
 • Biochemical Pathways
 • Published Literature

CalSCI Data
 • Microarray Data
 • Yeast Two-Hybrid Data
 • RNAi Data
 • Protein Data
 • Perturbation Data
 • Microscopy Data
Example Annotation pipeline

[Diagram: Protein Class, GO and Other DB lead to "Unique proteins with Swiss-Prot IDs", which receive protein annotation in the three categories below]

Function                 Genes                  Enzymes
Domains                  LocusLink              E.C. classification
Structures               Chr. Mapping           Reaction
Sub-cellular location    RefSeq                 Substrates
PTM                      Mutation data          Products
Disease association      Phenotype variation    Compound data
                                 The SGP-IT Workbench

[Diagram labels: Sequence Annotation; Molecule Pages; Interaction Modules; Motif Libraries; Signaling Pathways; Organism; Cell & Tissue Specificity; Molecular Phenotypes; Data Structure; Pathway GUI; Expression Profiles; Proteomics Profiles; Interaction Profiles; Database; Temporary Storage; Editing Tool; Comparison Tool; Query Tools; Analysis Tools; Legacy Pathways; Signaling Networks; Data Store; Bulletin Board; Annotations System; Other Analysis]
    A Model Computing
Infrastructure for the Stem
    SGP IT Server Core
   An Exemplar SGP-IT Computer Infrastructure

[Diagram: servers reached from the Internet over a 100T public network and linked by a 1000T private network, with a SAN connection to storage]

 • Compute cluster: Intel PCs, n x 32 processors, n x 64 GB
 • Sun 420R, 4 x 450 MHz, 4 GB RAM: web application server
 • Sun E6500, 8 x 450 MHz, 8 GB RAM: web, S+ and compute server
 • Sun 420R, 4 x 450 MHz, 4 GB RAM: compute server
 • Sun 420R, 4 x 450 MHz, 4 GB RAM: web application server
 • Sun V100: web server
 • Sun V440, 4 x 450 MHz, 4 GB RAM
 • Sun 420, 4 x 450 MHz: computer server
 • Sun V480, 4 x 900 MHz, 16 GB RAM: CVS server, SRB server, NFS server
 • Storage: two 5-terabyte RAID-5 arrays (T3ES) and one 680 GB RAID-5 array (T3)
  SGP Informatics and
Systems Biology Proposal

         Cores B/C
      What should the
 experimentalists interaction
Creation of a Communication and Networking
Infrastructure across Core Institutions/Labs.

  • A. Videoconferencing infrastructure
  • B. Networking infrastructure: fiber,
    other connections between institutions
  • C. The SGP website
  • D. Web portal for instant
    messaging/conferencing/scheduling
Creation and Maintenance of

  •   cell samples – appropriately stored and
  •   barcoding and tracking system
  •   database containing anonymized and
      access-controlled metadata
  •   annotations/notes on the cells and IDs to
      link to data
Creation of Laboratory Information
Management Systems.

  •   Development of ontologies and metadata
      catalogues in consultation with experimentalists.
  •   Barcoding, printing and sensor systems and
      development of software interfaces.
  •   Pipelining and validation of bar codes and meta
  •   Creation and maintenance of LIMS databases and
  •   Need detailed catalogued experimental protocols.
Data Management, Sustenance and
 •   Design of information system
     –   What types of data will be accrued?
     –   How will they be collected and formatted?
     –   What are the ontological definitions?
     –   What relational databases are suitable?
     –   How will data be entered into databases?
     –   What will be the type of queries and what will be good interfaces?
     –   How will the database be access controlled?
     –   What will be the links between different databases?
     –   How will interoperability be achieved?
     –   What tools are needed to process the data?
     –   What types of statistical tools will we need?
     –   How will the data be stored and made accessible for other repositories?
     –   What will be the links with national repositories?
     –   Who will authenticate the data?
 •   Development of interfaces and analysis tools
     –   What types of interfaces are necessary?
     –   How interactive will the objects in these interfaces be? Will a client-server model work?
     –   How will analysis tools be made available? How will these be tied to biology?
     –   How will we deal with image representations?
 •   Development of interoperability systems
     –   Design of agent-based systems
     –   Integration/interpretation of ontologies
     –   Legacy knowledge incorporation
     –   Scientific annotation and literature embedding.
From genes to pathways to phenotypes.

 •   Develop a parts list database for stem cells along with
 •   Link gene function and polymorphisms to gene function,
     pathways and networks involved in cell differentiation and
 •   Identify cellular media conditions and their relationships to
 •   Use network reconstruction as hypothesis engine for
     knockouts and perturbations
 •   Use statistical tools for analyzing stem cell differentiation
 •   Develop links between transcription profiles, development and
