The Cancer Genome Atlas Project


        January 24, 2008




                              TCGA
                    Program

• Goal: find genomic alterations that cause
  cancer (mutations, copy number alterations (CNA), methylation, …)
• Pilot project
  – $100M (NCI/NHGRI)
  – 3 years
  – 3 diseases
     • brain (glioblastoma multiforme)
     • lung (squamous)
     • ovarian (serous cystadenocarcinoma)

               Organization

• Biospecimen Core Resource (BCR)
• Genome Sequencing Centers (GSCs) (3)
• Cancer Genome Characterization Centers
  (CGCCs) (7)
• Data Coordinating Center (DCC)
• Project Team (NCI/NHGRI)
• Steering Committee (NCI/NHGRI & PIs)
• External Scientific Committee
• Working Groups


                     PIs
BCR    IGC/TGEN        Robert Penny
GSC    Baylor          Richard Gibbs
       Broad           Eric Lander
       WashU           Rick Wilson
CGCC   Broad/DFCI      Matthew Meyerson
       Harvard/B&W     Raju Kucherlapati
       JHU             Steve Baylin
       LBL             Joe Gray
       MSKCC           Marc Ladanyi
       Stanford        Rick Myers
       UNC             Chuck Perou
DCC    SRA             Ari Kahn




                    URLs

• project site: http://cancergenome.nih.gov
• gforge: http://gforge.nci.nih.gov (search for
  TCGA)
• data: http://tcga-data.nci.nih.gov
• portal: http://tcga-portal.nci.nih.gov
  [coming]



                      Data Types
Institution    Analysis                        Platform
Broad/DFCI     Transcription and Copy Number   Affymetrix U133 Plus 2.0 & SNP Array 6.0
Harvard/B&W    Transcription and Copy Number   Agilent 244K Array
LBL            Transcription                   Affymetrix Exon 1.0 ST Array
MSKCC          Copy Number                     Agilent 244K Array
JHU            Methylation                     Illumina GoldenGate
UNC            Transcription                   Agilent 44K Array
Stanford       Copy Number                     Illumina Infinium 550K BeadChip Array
Broad          Somatic Mutations               DNA sequencing
Baylor         Somatic Mutations               DNA sequencing
WashU          Somatic Mutations               DNA sequencing

                    Data Levels
• raw
  – low-level data for a single sample, not normalized
    (e.g., trace file, .cel file)
• processed
  – single-sample, normalized & interpreted (e.g.
    mutation call, amplification call for a locus, .snp, .chp)
• segmented (n/a for mutation & expression)
  – single-sample, aggregation of loci into regions (e.g.
    amplification call for a region of a sample)
• summary finding (aka “region of interest”)
  – cross-sample findings (e.g. minimal common region
    of amplification across a sample set)
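
The "segmented" and "summary finding" levels can be illustrated with a minimal sketch. It assumes a hypothetical layout in which each processed call is a (position, amplified) pair for one sample; the function names and the simple run-merging rule are illustrative only and are not the method used by any TCGA center.

```python
from typing import List, Optional, Tuple

Locus = Tuple[int, bool]   # (position, amplified?) for one sample; hypothetical layout
Region = Tuple[int, int]   # (start, end), inclusive


def segment(calls: List[Locus]) -> List[Region]:
    """Merge runs of consecutive amplified loci into regions (segmented level)."""
    regions: List[Region] = []
    start = prev = None
    for pos, amplified in sorted(calls):
        if amplified:
            if start is None:
                start = pos
            prev = pos
        elif start is not None:
            regions.append((start, prev))
            start = None
    if start is not None:
        regions.append((start, prev))
    return regions


def minimal_common_region(per_sample: List[Region]) -> Optional[Region]:
    """Intersect one amplified region per sample (summary level).

    Returns None when the regions do not all overlap.
    """
    if not per_sample:
        return None
    start = max(r[0] for r in per_sample)
    end = min(r[1] for r in per_sample)
    return (start, end) if start <= end else None


sample_a = [(100, False), (200, True), (300, True), (400, True), (500, False)]
sample_b = [(100, False), (200, False), (300, True), (400, True), (500, True)]
regions_a = segment(sample_a)
regions_b = segment(sample_b)
common = minimal_common_region([regions_a[0], regions_b[0]])
print(regions_a, regions_b, common)
```

Here the per-sample regions are single-sample results, while the intersection is a cross-sample finding, which mirrors the distinction the slide draws between the two levels.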

                              Flow
[Flow diagram; nodes and labels as recoverable from the slide]
• Tissue Source (MD Anderson, Henry Ford, …)
• BCR
  – 1. check pathology, quality/quantity
  – 2. extract analytes
  – 3. prepare data file
• BCR outputs: sample data; DNA, mRNA; DNA (WGA)
• CGCC
• GSC
• DCC (“tracking database”)
• Destinations: Bulk Download, caTissue Core, caArray, caIntegrator, NCBI Trace Archive


                   Data Formats

• BCR
  – XML (tags are CDEs)
  – images
• GSC
  – Called mutations (Genboree LFF format)
  – Linking table
     • sample-trace-target
• CGCC
  – MAGE-TAB
     • IDF: Investigation Definition Format
     • SDRF: Sample and Data Relationship Format
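
As a rough illustration of the CGCC format, an SDRF file is a tab-delimited table with one header row, so it can be read with a standard CSV reader. The sketch below uses made-up sample and file names; the column headings are common MAGE-TAB ones, but which columns a given TCGA submission contains is an assumption here.

```python
import csv
import io

# Hypothetical SDRF content: tab-delimited, one header row.
SDRF_TEXT = (
    "Source Name\tExtract Name\tArray Data File\n"
    "sample-1\textract-1\tsample-1.cel\n"
    "sample-2\textract-2\tsample-2.cel\n"
)


def read_sdrf(handle):
    """Return (header, rows) from a tab-delimited SDRF stream.

    Rows are kept as lists rather than dicts because SDRF files
    may repeat a column heading.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]
    return header, rows


def files_by_source(header, rows):
    """Map each source name to the data files listed for it."""
    src = header.index("Source Name")
    data = header.index("Array Data File")
    mapping = {}
    for row in rows:
        mapping.setdefault(row[src], []).append(row[data])
    return mapping


header, rows = read_sdrf(io.StringIO(SDRF_TEXT))
result = files_by_source(header, rows)
print(result)
```

For a file on disk, the same functions would take an open file handle in place of the in-memory string.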


          Where Does/Will the Data Go?
•   ftp site (now with a simple web wrapper: “portal #1”)
•   “tracking database”
•   repositories with caBIG APIs
     –   caArray
     –   caTissue CORE
     –   caIntegrator
     –   NCIA
•   NCBI trace archive
•   a richer “portal #2”
     –   more convenient download capability
     –   filtering datasets by clinical information
     –   summary level data
     –   genome browser view
     –   gene info page
     –   visualization on pathways
     –   etc.



