The problem with Copy Number Variations - Wellcome Trust Centre

									Copy Number Variations

      DTC BioInformatics Course
         Hilary Term 2012
    WTHCG, Tuesday 7th of February

  Jean-Baptiste Cazier
  Head of Statistical Genetics and Functional Genomics Analysis


•   Lecture
     – Definitions
          •   More important than it may seem
     – Identification
          •   Technology, Algorithmic, Design
     – Classic studies
          •   McCarroll & Korn, GSV, WTCCC, Obesity

                                                     Short Break

     – The special case of Cancer
          •   More problems
     – High Throughput Sequencing
          •   Cancer Case again
     – Conclusions
                                                     Lunch Break

•   Practical
     – OncoSNP
     – Applications with CGH and SNP data in R
     – CNV integration

•   Acronyms:
     – CNP:
          •   Copy Number Polymorphisms
     – CNV:
          •   Copy Number Variations
     – CNA:
          •   Copy Number Aberrations
          •   Copy Number Alterations

                                                             Finding the missing heritability of complex diseases
                                                             TA Manolio et al. Nature 461, 747-753 (2009)

•   Creation: Germline vs Somatic
     – Does the CNV come from the original cell, or did it arise later in only a few cells ?
          •   Very many CNVs are shared within a population, like SNPs or STRs
          •   Somatic propagation of CNVs is a hallmark of Cancer


•   Genome-Wide Association provided
    some success in the identification of
    variants for many diseases:
     – AMD, Coeliac disease, Type 2
         Diabetes, Prostate Cancer,
         Colorectal Cancer, etc.

•   However most variants are ‘only’
    statistically significant:
     – 80% fall outside of coding regions

•   The case of Missing Heritability:
     – Whatever the number of variants
        identified, they usually account for
        only a small proportion of the heritability

                                               Finding the missing heritability of complex diseases
                                               TA Manolio et al. Nature 461, 747-753 (2009)

                            Missing Heritability

•   Need to find other “reasons” to explain the difference.

•   Heritability definition
     –    Proportion of phenotypic variance attributable to additive
          genetic factors

[Figure: Feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio)]

•   The Common Variant Common Disease model is challenged
     –    Look for more markers
             •   Rarer with strong effect
             •   Common with lower effect
             •   Gene-Gene interaction
             •   Shared environment
     –    This is essentially a question of power
             •   Groups are joining forces in very large consortium
             •   Better technological coverage of the rarer variants
     –    More variant types
             •   Copy Number Variation
             •   InDels, Segmental Duplications.

•   Comparable phenotyping in meta analysis ?

•   The ‘Dark Matter’
     –    Does it really exist ?
     –    Can we see it beyond its influence ?

                                                                   Finding the missing heritability of complex diseases
                                                                   TA Manolio et al. Nature 461, 747-753 (2009)
                         Gain, Loss, etc

•   Normal:
     – 2 copies of each chromosome are inherited, one from each parent

•   Deletion:
     – Homozygous: 0 copies left
     – Hemizygous: 1 copy left
     – Sizeable event:
           => Not an InDel

•   Gain
     –   Can be 3, 4, 5, … copies
     –   Most often nearby duplication, but not always
     –   Sizeable event:
           => Not LINE, SINE, repeats, etc.

                                                Copy Number Variation in Human Health, Disease, and Evolution
                                                Zhang F et al, Ann. Rev. of Gen. and Hum. Gen. 2009 (10) 451-481

•   Copy Neutral Loss of Heterozygosity
     – Not Copy Number Polymorphism per se, but needs to be addressed
                        CNV in color

[Figure: types of chromosome aberration, each marked + or - in a row labelled "SNP array"]
                                                   Chromosome aberrations in solid tumors
                                                   Albertson DG et al. Nature Genetics 34, 369 - 376 (2003)

  a) Aberrations leading to aneuploidy.
  b) Aberrations leaving the chromosome apparently intact

                                                                   •   4 main mechanisms in the
                                                                       generation of CNV:

                                                                        – NAHR
                                                                            •   Non-Allelic Homologous Recombination

                                                                        – NHEJ
                                                                            •   Non-Homologous End-Joining

                                                                        – FoSTeS
                                                                            •   Fork Stalling and Template Switching

                                                                        – L1 retrotransposition

Copy Number Variation in Human Health, Disease, and Evolution
Zhang F et al, Ann. Rev. of Gen. and Hum. Gen. 2009 (10) 451-481

•   Identification: a Genome-Wide test
     – Karyotyping
     – Spectral Karyotyping (SKY)

     –   Comparative Genomic Hybridization (CGH)
     –   Array CGH (aCGH)
     –   “SNP”- array
     –   High- Throughput Sequencing

•   Validation: a local test
     –   qPCR: quantitative Polymerase Chain Reaction
     –   MLPA: Multiplex Ligation-dependent Probe Amplification
     –   Fluorescent In-Situ Hybridization (FISH)
     –   Sequencing

                      Array technology

•   Array CGH
    – Agilent, Nimblegen
    – 2 channels: compare hybridization level to a common background reference
    – Usually 42 million probes genome-wide
        •   Resolution up to 200bp

•   SNP array
    – Illumina, Affymetrix
    – Test one or few samples at a time
    – Initially developed for genotyping
        •   2 channels: allele A/B
    – Increasing density of markers
        •   From 10,000 Linkage SNPs
        •   Up to 5M SNPs and CNV probes

                SNP-array signature

 •   Sample data for a number of different copy number and LOH events.
     – The Log R Ratio scales with copy number
     – The distribution of the B allele frequency is governed by a more complex
       relationship with allowable genotypes.
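
The two signals above can be illustrated with a minimal sketch. It assumes an idealised array on which the total intensity is exactly proportional to the number of copies and on which each copy carries either allele A or allele B; real arrays show compressed and noisy values. The function names are hypothetical and do not come from any of the tools named in this lecture.

```python
import math


def idealised_log_r_ratio(copy_number, reference_copies=2):
    """log2(observed copies / reference copies), assuming a perfectly linear signal."""
    if copy_number == 0:
        return float("-inf")  # no DNA at all; real arrays report a large negative value
    return math.log2(copy_number / reference_copies)


def expected_baf_clusters(copy_number):
    """Allowable B allele frequencies k/n when k of the n copies carry allele B."""
    if copy_number == 0:
        return []  # BAF is undefined without DNA (pure noise in practice)
    return [k / copy_number for k in range(copy_number + 1)]


for n in range(5):
    print(n, idealised_log_r_ratio(n), expected_baf_clusters(n))
```

Copy neutral LOH is the case this table cannot express through copy number alone: the Log R Ratio stays at 0 (two copies) but only the BAF values 0 and 1 remain, because the heterozygous cluster at 0.5 is lost.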


            Copy Number Loss
[Figure: SNP array, real data, with a Neutral region labelled]

       Copy Number Loss and Gain
[Figure: SNP array]

            Mixed Cell Population
[Figure: SNP array]

            Copy Neutral LOH
[Figure: SNP array]


      Automatic recognition of CNVs

•   Originally done by visual inspection
     – Problem of reproducibility
     – Problem of accuracy
     – With increasing probe density, it becomes impossible to see events by eye

•   Automation and test
     – Moving average
     – Probe selection / compilation
     – Segmentation, Hidden Markov Model
     – Significance testing

•   Need to compile data with uncertainty

Moving average
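
As a minimal sketch of the idea, the moving average below smooths a series of Log R Ratios along a chromosome and flags probes whose smoothed value falls under a threshold. The window size, the threshold of -0.3 and the example values are all assumptions chosen for illustration, not settings taken from any tool.

```python
def moving_average(values, window=5):
    """Centred moving average; the window shrinks at the two ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical Log R Ratios: normal probes, a five-probe loss, then normal probes again.
log_r = [0.1, -0.1, 0.0, 0.1, -0.6, -0.5, -0.7, -0.6, -0.5, 0.0, 0.1, -0.1]
smooth = moving_average(log_r, window=3)

# Indices of probes whose smoothed signal suggests a loss (assumed threshold).
candidate_loss = [i for i, value in enumerate(smooth) if value < -0.3]
print(candidate_loss)
```

A wider window suppresses more noise but blurs the breakpoints and can hide short events, which is one reason segmentation and Hidden Markov Models are preferred.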

  Automatisation by use of Hidden
          Markov Model
• Select automatically the optimal Copy Number sequence
  over a chromosome to fit the Model

• Evaluate the probability of the sequence of intensity
  signal fitting this model
   – Can test various models and select the most appropriate

• The Model can be trained simply by feeding “typical”
  data sets
   –   Look for minimum number of changes
   –   Look for maximum instability
   –   Select a most likely default state
   –   …

• Definition:
    – Find the underlying states giving the observation
    – Underlying states are the number of copies: 0, 1, 2, …
    – Observation is the Signal Intensity
    – Defined by 3 probabilistic entities

• Start Value:
      (P(0), P(1), P(2))
• State Transition:
      (P(0|0), P(1|0), P(2|0),
       P(0|1), P(1|1), P(2|1),
       P(0|2), P(1|2), P(2|2))
• Emission probability:
      (P(Obs|0), P(Obs|1), P(Obs|2))

[Figure: trellis of the copy number states 0, 1 and 2 at each observation Obs1, Obs2, Obs3, Obs4, …, ObsN]
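
The most likely sequence of copy number states under such a model can be found with the Viterbi algorithm. The sketch below uses the three states of the slide (0, 1, 2) and Gaussian emissions on the signal intensity. Every number in it (start values, probability of staying in a state, mean signal per state, standard deviation) is an assumption made up for illustration, not a trained model and not the parameters of any published tool.

```python
import math

STATES = [0, 1, 2]           # copy number states, as on the slide
START = [0.01, 0.04, 0.95]   # (P(0), P(1), P(2)): assumed start values
STAY = 0.98                  # assumed probability of keeping the same state
MEANS = [-3.0, -0.6, 0.0]    # assumed mean signal for 0, 1 and 2 copies
SD = 0.25                    # assumed standard deviation of the signal


def log_emission(obs, state):
    """Log density of a Gaussian emission, log P(Obs | state)."""
    z = (obs - MEANS[state]) / SD
    return -0.5 * z * z - math.log(SD * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Log P(cur | prev): stay with probability STAY, otherwise split evenly."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(observations):
    """Return the most likely copy number state for each observation."""
    if not observations:
        return []
    score = [math.log(START[s]) + log_emission(observations[0], s) for s in STATES]
    back = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + log_transition(best_prev, cur) + log_emission(obs, cur)
            )
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical signal: two copies, a four-probe hemizygous loss, two copies again.
signal = [0.05, -0.1, 0.0, -0.55, -0.7, -0.6, -0.65, 0.1, -0.05]
print(viterbi(signal))
```

A high probability of staying in the same state corresponds to "look for minimum number of changes"; lowering it lets the model follow more unstable profiles.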

CNAM employs a powerful optimal segmenting algorithm using dynamic programming to
  detect inherited and de novo CNVs on a per-sample (univariate) and multi-sample
  (multivariate) basis.
Unlike Hidden Markov Models, which assume the means of different copy number states
    are consistent, optimal segmenting properly delineates CNV boundaries in the
    presence of mosaicism, even at a single probe level, and with controllable sensitivity
    and false discovery rate.

               Available software

• Graphical Interface:
   –   Agilent
   –   Golden Helix
   –   Nexus
   –   Partek
   –   BeadStudio/GenomeStudio
   –   Golf
   –   CNAT
   –   CNAG
   –   dChip
   –   …

• Command line
   –   QuantiSNP
   –   PennCNV
   –   BirdSuite
   –   OncoSNP *
   –   …

• R packages
   –   Somatics *
   –   DNAcopy
   –   CNVtools
   –   Aroma
   –   …

• Uneven field of quality and specificity

* Cancer Specific tools

              Development of array

•   In 2008 McCarroll and Korn published the identification of CNPs and CNVs
    using the Affymetrix SNP 6.0 high-resolution array that they designed

               SNP 6.0 by McCarroll

•   “We designed a hybrid genotyping array (Affymetrix SNP 6.0) to simultaneously
    measure 906,600 SNPs and copy number at 1.8 million genomic locations. By
    characterizing 270 HapMap samples, we developed a map of human CNV (at 2-kb
    breakpoint resolution) informed by integer genotypes for 1,320 copy number
    polymorphisms (CNPs)” (McCarroll et al.)

•   Published the analysis together with the chip design and an algorithm suite: BirdSuite
     – Performs both genotyping and CNV identification
     – First calls known CNPs
     – Then looks for new CNVs

•   80% of observed copy number differences due to common CNPs (MAF>5%),
•   > 99% derived from inheritance rather than new mutation.
•   Found a common deletion polymorphism in perfect LD with Crohn’s disease SNPs
     – 2kb upstream of IRGM
     – Affects level of expression
               High density of probes
•   Can identify smaller events
     – E.g. Important to spot residual event in translocation/fusion genes

•   Gain confidence in SNP-regions by increasing the number of probes

•   Can get better resolutions, i.e. more accurate breakpoints:
     – Can split existing large regions into smaller ones

•   Better coverage of CNP
     – These regions were mostly not covered by SNP-only arrays
     – Beware of overrepresentation of these regions

•   Tiling across the genome
     – More exhaustive picture
                                 Increase density

[Figure: Copy Number profiles of the same region on the 10K, 250K Nsp, 250K and 6.0 arrays]

Loss of 65Kb region confidently identified only with SNP 6.0, Bryan Young et al, Cancer Research UK
                        Too much data ?

[Figure: Copy Number and Log 2 Ratio for Run I and Run II, with t-tests on Run I, on Run II, and of I and II]

Replicates increase the signal to noise ratio and help avoid false positives and false negatives
But it costs twice as much !
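
The gain from a replicate can be quantified under a simple assumption that is not stated on the slide: the two runs measure the same true log2 ratio shift μ with independent noise of equal standard deviation σ. Averaging the two runs then halves the variance and improves the signal to noise ratio by a factor of √2:

```latex
\bar{x} = \frac{x_{I} + x_{II}}{2}, \qquad
\operatorname{Var}(\bar{x}) = \frac{\sigma^{2}}{2}, \qquad
\frac{|\mu|}{\operatorname{sd}(\bar{x})} = \sqrt{2}\,\frac{|\mu|}{\sigma}
```

Doubling the cost therefore buys about a 41% improvement in signal to noise, and less if the two runs share systematic errors.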
                    Potential Issues

•   Interpretation
     – What to use as a baseline ? i.e. define the Ratio

•   Variations in probe coverage:
     – Gaps
     – Overlapping probes

•   Inaccurate reference
     – Reference build is inaccurate
     – Probes cannot match the locus accurately

•   Systematic error
     – Autocorrelation with GC content
     – Preparation, e.g. genome amplification
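
One simple way to reduce the GC-related systematic error is to fit the signal against the GC fraction around each probe and subtract the fitted trend. The sketch below does this with an ordinary least-squares line. It assumes the relationship is linear and that a GC fraction is available for every probe; it is an illustration of the principle and not the correction used by any particular package.

```python
def gc_correct(log_r, gc_fraction):
    """Subtract the linear GC trend from the Log R Ratios, keeping their mean level."""
    n = len(log_r)
    mean_gc = sum(gc_fraction) / n
    mean_lr = sum(log_r) / n
    sxx = sum((g - mean_gc) ** 2 for g in gc_fraction)
    sxy = sum((g - mean_gc) * (l - mean_lr) for g, l in zip(gc_fraction, log_r))
    slope = sxy / sxx if sxx > 0 else 0.0  # no trend can be fitted if GC is constant
    return [l - slope * (g - mean_gc) for g, l in zip(gc_fraction, log_r)]


# Hypothetical probes whose signal rises with local GC content.
gc = [0.30, 0.40, 0.50, 0.60]
log_r = [-0.10, 0.00, 0.10, 0.20]
print(gc_correct(log_r, gc))
```

In practice the trend should be estimated on probes believed to be copy-neutral, otherwise a large genuine aberration in a GC-rich or GC-poor region can bias the fit.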
Overlapping probes in regions of CNP

Probes in repeat elements

                    SNPs in probes

•   The special case of rodents:

•   There can be many strains derived from a limited
    number of founders
     – Full sequencing has been limited
     – The reference used for the probe
       generation can be far from the strain
     – This will lead to probe failures across the genome

                                                     Gauguier et al, in preparation
        Systematic SNPs in probes

•   There can be mosaicism
     – Grouping of SNPs in specific regions
•   Generates systematic drops in hybridization at
    specific loci
•   Can be misinterpreted as deletion
     – Be aware of the regions with SNPs
         • And correct for the lack of hybridization
     – Design specific probes for the strain

                                                       Gauguier et al, in preparation
              Large CNV Surveys

• Two projects were run in parallel to identify and characterize CNVs
  in Human:
    – The Genome Structural Variation Consortium (GSV)
        • CNV discovery project to identify common CNVs using aCGH by Nimblegen,
        • Detection in 20 CEU, 20 YRI, 1 reference
        • Assayed in 450 HapMap samples

    – The Wellcome Trust Case Control Consortium (WTCCC)
        • Test for association to diseases of CNVs in the WTCCC
            – 16,000 cases, WTCCC plus Breast cancer
            – 3,000 common controls

The GSV study design

   The GSV study outcome

Localization     Function of CNVs

         The GSV study outcome (II)

•   Designed an array with 42 million probes
     –   cover 11,700 CNV larger than 443 bp
     –   8,599 validated independently

•   Generated reference genotypes for 4,978 CNVs on 450 samples

•   Identified 30 loci with CNVs that are candidates for influencing phenotype
•   Striking effect of purifying selection
     – Acts on exonic and intronic deletions
     – So functional variants should be rare

•   But most of the common CNVs are already well tagged by the existing SNPs
     – May need to look elsewhere to solve the missing heritability

                  The WTCCC study

•   Use the WTCCC cohort of 16,000 samples and 3,000 common controls.
     – Bipolar, type 1 diabetes, type 2 diabetes, coronary artery disease, hypertension,
       rheumatoid arthritis, Crohn’s disease + Breast Cancer
     – 1,500 1958 Birth Cohort and 1,500 National Blood Donor

•   Designed a specific array using GSV set, McCarroll,1M and WTCCC1
     – 104,000 probes targeting 12,000 putative loci

•   Perform assay using the Agilent platform by Oxford Gene Technology
    (OGT) against a common pooled reference sample

•   Attempt to design a robust pipeline to call all CNV across the different
     – Use CNVtools by Plagnol and local by Cardin (“Chiamesque”)

     http://www.wtccc.org.uk/ccc1/plus_typing_array.shtml
                 The WTCCC results

•   3,900 CNV identified
          • 3,100 validated after QC

•   Concordance of 99.8% on known 420 duplicates

•   Remaining 8,000 CNVs from original selection:
     – False positive in discovery
     – Too noisy, but genuine
     – Genuine but very rare

•   19 CNVs taken forward to replication with Bayes Factor: ~10^-4 p-value
     – 14 failed to replicate either using tagged SNPs or direct typing
     – 5 associations

          The WTCCC conclusions

•   Each CNV behaves uniquely
         • Size, genomic location, biological sample type, sample preparation

     – Designed 16 different pipelines
         • Key parameters:
              –   Normalization
              –   Integration of the 10 probes
         • Impossible to define a one-pipeline-fits-all approach
     – Shows the importance of having duplicates and a large amount of diverse data

•   Confirmed the overrepresentation of CNVs in intronic regions

•   Confirmed the high level of tagging by SNP 6.0 or HapMap2 SNPs
     – MAF > 10% : 75% tagged at r² > 0.8
     – MAF < 5% : 40% tagged at r² > 0.8

•   Found few new CNVs associated with phenotype
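
To make the tagging criterion concrete, the sketch below computes r² as the squared Pearson correlation between the copy number genotype of a CNV and the B-allele count of a nearby SNP across samples. Using unphased genotype dosages is a simplifying assumption, and the sample data are invented for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker cannot tag anything
    return sxy * sxy / (sxx * syy)


# Hypothetical data for eight samples.
cnv_copies = [2, 2, 1, 2, 1, 0, 2, 1]     # copy number at a deletion CNP
tagging_snp = [0, 0, 1, 0, 1, 2, 0, 1]    # B-allele count, always 2 - copies
unlinked_snp = [0, 1, 1, 0, 2, 1, 0, 0]   # B-allele count of an unrelated SNP

for name, snp in [("tagging", tagging_snp), ("unlinked", unlinked_snp)]:
    r2 = r_squared(cnv_copies, snp)
    print(name, round(r2, 3), "tagged" if r2 > 0.8 else "not tagged")
```

A CNV that is tagged at r² > 0.8 by a SNP already on the genotyping array has, in effect, already been tested by the SNP-based GWAS.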
       Conclusions of these studies

•   Both identified many CNV in the human genome

•   Characterization of CNV is very difficult, and not easily streamlined
     – Careful interpretation of association results
     – Some artifacts will survive confirmation

•   Many CNVs co-localize with variants identified by GWAS
     – Good functional candidate

•   But, most of the common CNVs are already well tagged with SNPs
     – This will not bring new common variant in common disease
         • i.e. these will not solve the mystery of missing heritability.

•   Rare CNVs can still be associated with diseases, just as SNPs can
                      Success stories

•   Autism

     Pinto D et al.
     Nature. 2010

•   Obesity


Functional impact of global rare copy number variation in autism spectrum disorders.
Pinto D, et al. Nature. 2010 Jul 15;466(7304):368-72.

                                                           a)   Affymetrix 6.0 array data for five patients with
                                                                deletions at 16p11.2 is shown. Log2 ratios of the five
                                                                samples are highlighted in dark red, with other
                                                                samples in the same genotyping plate shown in grey.
                                                                The structure of extensive segmental duplication that
                                                                extends to the flanking regions is shown.
                                                           b)   Three probands in whom the 16p11.2 SH2B1-
                                                                containing deletion co-segregates with severe early-
                                                                onset obesity alone.
                                                           c)   Two probands harbouring larger de novo 16p11.2
                                                                deletions that also encompass a known autism-
                                                                associated locus and are associated with
                                                                developmental delay and severe early-onset obesity.

                                                           •    MLPA probes for genes in the region of interest are
                                                                shown. The MLPA target regions labelled as C are
                                                                control probes located either on chromosome 16 but
                                                                outside the deleted region or on other chromosomes.
                                                                Patient MLPA traces are in red, overlaid upon the
                                                                normal control MLPA traces in black. Arrows point to
                                                                the deleted probes.
Large, rare chromosomal deletions associated with severe
early-onset obesity.
Bochukova EG, et al Nature. 2010 Feb 4;463(7281):666-70
                             Success stories

a)   aCGH data showing the location of the 16p11.2 deletion. The data show the log2 intensity ratio for a deletion
     carrier compared to an undeleted control sample. Grey bars connected by a broken line denote the segmental
     duplication flanking the deletion region. Vertical bars indicate the positions of the probe pairs used for MLPA
     validation. Note that CGH and genotyping array probes targeted against segmental duplications may not
     accurately report copy number due to the increased number of homologous sequences in the diploid state.
     Genome coordinates are according to the hg18 build of the reference genome.
b)   MLPA validation of 16p11.2 deletions. Representative MLPA results are shown, illustrating one instance of
     maternal transmission and two instances of de novo deletions. Genotyping data excluded the possibility of non-
     paternity. Each panel shows the relative magnitude of the normalised, integrated signal at each probe location, in
     order of chromosomal position of the MLPA probe pairs as indicated in (a). Each panel corresponds to its
     respective position on the associated pedigree, as shown.

A new highly penetrant form of obesity due to deletions on chromosome 16p11.2.
Walters RG et al Nature. 2010 Feb 4;463(7281):671-5.
        What more with CNV then ?

•   Copy Number Variations are key in Cancer

•   Cancers are characterised by somatic variations
     – They are therefore mostly unique
     – Cannot be tagged
     – Relatively common event
     – Identification is still difficult, but it is essential


Schematic illustration of chromosomal evolution in
   human solid tumor progression.
The stages of progression are arranged with the
    earlier lesions at the top.
    Cells may begin to proliferate excessively owing
    to loss of tissue architecture, abrogation of
    checkpoints and other factors. In general,
    relatively few aberrations occur before the
    development of in situ cancer.
A sharp increase in genome complexity (the number of
    independent chromosomal aberrations) in many
    (but not all) tumors coincides with the
    development of in situ disease.
The types and range in aberration number varies
    markedly between tumors, for example
    HCT116, a mismatch repair–defective cell line, and
    T47D, a mismatch repair–proficient cell line.

                                                       Chromosome aberrations in solid tumors
                                                       Albertson DG et al. Nature Genetics 34, 369 - 376 (2003)

                Germline vs. Somatic

•   Germline variants
     – The aberration exists from the start, and is inherited
     – Such variants are more likely to be common Copy Number Polymorphisms,
       predisposing variants.
     – Approach similar to non-cancer studies

•   Somatic events
     – Aberrations happen during the life-time
     – Happen more than once
     – Heterogeneous events;
       => Each cancer is unique

     – In Tumours, recurrent aberrations are more likely to be linked to the cancer as a
       selective advantage

      We want to identify the regions with recurrent events
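
A minimal sketch of how recurrence could be scored: given the aberrant intervals called in each tumour on one chromosome, sweep along the chromosome and report the stretches covered in at least a chosen number of samples. The sketch assumes half-open intervals `(start, end)` that do not overlap within a sample; the sample names and coordinates are invented.

```python
def recurrent_regions(aberrations, min_samples=2):
    """Return the regions covered by aberrations in at least min_samples samples.

    aberrations maps a sample name to a list of (start, end) half-open intervals.
    """
    events = []
    for intervals in aberrations.values():
        for start, end in intervals:
            events.append((start, 1))   # an aberration opens
            events.append((end, -1))    # an aberration closes
    # At equal positions closings sort before openings, so touching intervals do not overlap.
    events.sort()

    regions, depth, region_start = [], 0, None
    for position, delta in events:
        previous = depth
        depth += delta
        if previous < min_samples <= depth:
            region_start = position
        elif depth < min_samples <= previous and position > region_start:
            regions.append((region_start, position))
    return regions


# Hypothetical deletions called in three tumours on the same chromosome.
calls = {
    "tumour_A": [(100, 500)],
    "tumour_B": [(300, 700)],
    "tumour_C": [(450, 600), (900, 1000)],
}
print(recurrent_regions(calls, min_samples=2))
print(recurrent_regions(calls, min_samples=3))
```

The region shared by every affected sample (here the output for `min_samples=3`) corresponds to a minimally involved region, which is where a target gene is most likely to lie.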

                       More issues

•   Interpretation
     – What to use as a baseline ? i.e. define the Ratio

•   A within-sample baseline of 2 copies is no longer a safe assumption

•   Heterogeneity of tissue
     – Biopsy can be “contaminated” by normal tissue
     – Cancers are usually made up of a set of co-existing clones

•   CNVs are unique
     – Each one has its own breakpoints

•   Systematic error
     – Preparation, e.g. genome amplification
     – Sample quality
    Copy Number Variations in Cancer
•   It is possible to analyse tumour samples
    using classic Copy Number tools, but the
    results are likely to be unsatisfactory as
    many model assumptions are violated:

     – The normalisation of SNP genotyping
       data can be affected by tumour samples
       containing large scale chromosomal
       aberrations
     – Most aberrations do not follow the classic
       diploidy and cannot fit usual clusters
     – So Genotype Calls might be forced on
       the wrong model AA/AB/BB:
          •   Deletions should be 0 or A / B,
          •   Copy Neutral LOH should be AA/BB
          •   Triploid should be AAA/AAB/ABB/BBB
     – There can be intra-tumour heterogeneity
          •   E.g. Mix of triploid and tetraploid
     – There can be contamination with normal
       cells (stromal contamination)

Integrated genotype calling and association analysis of SNPs, common copy
number polymorphisms and rare CNVs.
Korn et al. Nat Genet. 2008 Oct;40(10):1253-60

     A deletion found in tumour AML sample at 8p using unpaired analysis.
[Figure: Tumour sample vs Baseline]

     Same deletion found in corresponding diagnostic AML sample at 8p
[Figure panels: Tumour sample vs Baseline; Normal sample vs Baseline]

                    Need for pairing
[Figure panels: Tumour sample vs Baseline; Normal sample vs Baseline; Tumour sample vs Normal sample]

Normal-Tumor Pairs
[Figure: Removed the outlier, colored by type, paired]

•    Proportion of Cells, “c,” in a heterogeneous
    tumour sample harboring a Somatic genetic event

•    BAF and the logR ratio plots from one
    chromosome reveal three somatic hemizygous
    deletions occurring in three different proportions
    of cells.

•    Frequency distribution showing the number of
    SNPs included in the somatic deletions by the
    proportion of cells, “c,” in which these events
    occur. Some somatic deletions occur in over 80%
    of cells. Assuming that only cancer cells harbor
    somatic deletions, the proportion of cancer cells is
    then estimated as 80% in this sample.

•    Schematic illustrating the relationship between
    the chronology of somatic events during
    tumorigenesis and the proportion of cancer cells
    with these events. Early somatic events are
    present in all (or a great majority of) cancer cells,
    whereas late somatic events are only present in
    subsets of cells.

SNP arrays in heterogeneous tissue: highly accurate collection of both germline
and somatic genetic information from unpaired single tumor samples.
Assié et al Am J Hum Genet. 2008 Apr;82(4):903-15
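
The link between the proportion of cells “c” and the two array signals can be sketched with a simple mixture argument. Assume a heterozygous SNP, a fraction c of cells that have lost one allele (hemizygous deletion) and a fraction 1 - c of normal diploid cells, with signal proportional to the number of allele copies. This is an illustrative derivation under those assumptions and not necessarily the estimator used in the cited paper; the function names are hypothetical.

```python
import math


def expected_baf_retained_allele(c):
    """BAF of the retained allele: 1 retained copy per cell out of (2 - c) copies on average."""
    return 1.0 / (2.0 - c)


def expected_log_r_ratio(c):
    """Log R Ratio when the average copy number drops from 2 to (2 - c)."""
    return math.log2((2.0 - c) / 2.0)


def cell_fraction_from_baf(baf):
    """Invert the mixture: baf is the upper BAF band, between 0.5 and 1."""
    return 2.0 - 1.0 / baf


for c in (0.0, 0.2, 0.5, 0.8, 1.0):
    baf = expected_baf_retained_allele(c)
    print(c, round(baf, 3), round(expected_log_r_ratio(c), 3), round(cell_fraction_from_baf(baf), 3))
```

Under these assumptions the heterozygous BAF band splits symmetrically around 0.5 as c grows, which is why deletions present in different proportions of cells appear as distinct pairs of bands on the same chromosome.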
      Mixing proportion identification

•   Estimating copy number and mixing
    proportions from simulated data
    using OncoSNP.

•   The estimated copy number states
    and mixing proportions (grey) are
    comparable to the true values used
    for the simulations (black).

•   In the two regions of copy number 3
    that are incorrectly classified as
    copy number 4, an examination of
    the Bayes Factor shows that
    although the data favors the 4n
    amplification state, there is also
    strong support for the true
    state (3n amplification).

Identification of DNA copy number changes and loss-of-heterozygosity events in heterogeneous tumor samples: a
Bayesian Mixtures of Genotypes approach on SNP array data
Yau C et al In preparation
             Normal-Tumour Titration

•   intra-tumor heterogeneity (red)

•   stromal contamination only (black)

•   Both models infer the level of normal
    DNA contamination with good accuracy
    up to 50% contamination
•   At higher contamination levels, the
    stromal contamination only model has
    superior performance as it is able to
    borrow strength from all SNPs to infer
    the contamination level.
•   This provides more power to detect
    duplications at high contamination levels
    than the intra-tumor heterogeneity
    model.

Identification of DNA copy number changes and loss-of-heterozygosity events in heterogeneous tumor samples: a
Bayesian Mixtures of Genotypes approach on SNP array data
Yau C et al. In preparation

                   Detection of alterations
Detecting chromosomal alterations in cancer cell line
    and tumor samples.
      The intra-tumor heterogeneity model (red)
          indicates that approximately 50% of cells
          contain a different breakpoint location to
          the others, whereas this feature is missed
          entirely by the stromal contamination only
          model (black)

      The near-triploid status of the cell line HT29 is
         correctly identified and copy number
         estimates are correctly derived even
         though the Log R Ratios are centered on
         zero for the copy number 3 state.

      The two heterogeneous deletions are
            separated by an unaltered region,
            however, there is still good agreement
            between the mixing proportion estimates
            given by the intra-tumor heterogeneity and
            stromal-only models. This suggests we do
            not pay too severe a penalty when assuming
            independent mixing proportions in the
            intra-tumor heterogeneity model.

Identification of DNA copy number changes and loss-of-heterozygosity events in heterogeneous tumor samples: a
Bayesian Mixtures of Genotypes approach on SNP array data
Yau C et al. In preparation
                                     Recurrent events
 Overview of all genetic aberrations found with SNP
    array in 45 adult and adolescent ALL cases.
    Minimally involved regions are shown to the
    right of each chromosome.

 For each type of aberration, each line represents a
     different case.
         –     Blue lines are regions of uniparental disomy,
         –     light green lines are hemizygous deletions,
         –     dark green lines are homozygous deletions,
         –     red lines are copy-number gains.

 Note the high frequency of deletions involving
    chromosomes 9p21.3, 9p13.2, 7p12.2, 12p13.2,
    and 13q14.2 corresponding to the CDKN2A,
    PAX5, IKZF1, ETV6, and RB1 loci, respectively.


Microdeletions are a general feature of adult and adolescent acute lymphoblastic
leukemia: Unexpected similarities with pediatric disease.
Paulsson K et al, Proc Natl Acad Sci U S A. 2008 May 6;105(18):6708-13
                     Overlap of recurrences

•   Aberrations observed on chromosomes 11 and
    13 are shown with their bands, a subset of
    potential target genes in AML and regions of
     – gain (red),
     – loss (green)
     – aUPD (blue).

•    The scale at the bottom shows the length of
    each chromosome in megabases (Mb). The
    color gradient above each kind of aberration
    summarizes the data for that aberration.

•   Beware that GC content can induce systematic
    false-positive aberration calls

    Novel regions of acquired uniparental disomy discovered in acute myeloid leukemia.

    Gupta et al. Genes Chromosomes Cancer. 2008 Sep;47(9):729-39.

•   CLL

                           Typical workflow

•   Normalisation
     – GC Content Correction
     – Paired
     – Unpaired with appropriate baseline
•   Determination of Aberrations
     – Correct Genotype
     – Copy Number
•   Identification of recurrent locations
•   Test against germline sample if possible
     – Could it be an at-risk variant ?
•   Test against known variations
•   Validation
     – Identify breakpoints precisely
          •   Sequencing
     – Identify the frequency
     – Identify the associated risk
     – Perform functional analysis
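
The GC content correction step of the workflow can be sketched as follows. This is a minimal illustration, not the method of any particular tool: it assumes per-probe log ratios and GC fractions are already available, and the function name gc_correct and the choice of 10 bins are hypothetical. Subtracting a per-bin median is only safe when no GC bin is dominated by a genuine aberration.

```python
from collections import defaultdict
from statistics import median


def gc_correct(log_ratios, gc_fractions, n_bins=10):
    """Subtract the median log ratio of each GC bin from the probes in that bin.

    log_ratios   : per-probe log R ratios
    gc_fractions : per-probe GC fraction in [0, 1], same order as log_ratios
    n_bins       : number of equal-width GC bins (an arbitrary choice here)
    """
    if len(log_ratios) != len(gc_fractions):
        raise ValueError("inputs must have the same length")
    # Assign each probe to a GC bin; a GC fraction of exactly 1.0 goes to the last bin.
    bins = [min(int(gc * n_bins), n_bins - 1) for gc in gc_fractions]
    by_bin = defaultdict(list)
    for b, lr in zip(bins, log_ratios):
        by_bin[b].append(lr)
    # The bin median estimates the systematic GC-driven shift for that bin.
    bin_median = {b: median(values) for b, values in by_bin.items()}
    return [lr - bin_median[b] for b, lr in zip(bins, log_ratios)]


# GC-rich probes read high and GC-poor probes read low before correction.
raw = [0.2, 0.3, 0.1, -0.2, -0.1, -0.3]
gc = [0.65, 0.66, 0.67, 0.31, 0.32, 0.33]
corrected = gc_correct(raw, gc)
```

After correction both groups of probes are centred on zero, so the GC-driven offset is no longer mistaken for a gain or a loss.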
       High Throughput sequencing

•   More data
     – Better resolution ?

             Array tools extension ?

•   Direct application of SNP array tools to HTS “fails”
     – Too much noise:
         • Highly variable coverage at any given location
         • No clearly defined BAF

•   Develop specific tools on same concept
     – e.g. OncoSeq
•   Sequence data contain more than coverage information
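
To make the two array-style signals concrete for sequencing data, here is a minimal sketch of how a log ratio and a B allele frequency could be derived from read counts. It is not how OncoSeq works; it assumes a matched tumour/normal pair, and the function names, the pseudocount and the minimum depth are hypothetical choices.

```python
import math


def depth_log_ratio(tumour_depth, normal_depth, tumour_total, normal_total, pseudo=0.5):
    """log2 ratio of library-size-normalised tumour depth over normal depth.

    The pseudocount avoids taking the log of zero in windows with no reads.
    """
    t = (tumour_depth + pseudo) / tumour_total
    n = (normal_depth + pseudo) / normal_total
    return math.log2(t / n)


def b_allele_frequency(ref_count, alt_count, min_depth=10):
    """Fraction of reads carrying the B (alternate) allele at one SNP.

    Returns None when coverage is too low for the estimate to be meaningful,
    which is the 'no clearly defined BAF' problem at poorly covered sites.
    """
    depth = ref_count + alt_count
    if depth < min_depth:
        return None
    return alt_count / depth


# A window with 1.5x the normal depth and a heterozygous SNP skewed to 1/3.
lrr = depth_log_ratio(150, 100, 1_000_000, 1_000_000, pseudo=0.0)
baf = b_allele_frequency(ref_count=20, alt_count=10)
```

Both estimates are count-based, so their variance depends directly on local coverage; this is the extra noise that array-oriented tools do not expect.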

•   Issue with CNV and Exome
     – Difficult to set up a baseline when there is sporadic coverage
     – Exome sequencing is not recommended for CNV or cancer analysis

•   Single-end vs Paired-end
     – Breakpoints are more likely to fall in the unsequenced part of the insert than within the reads
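
The paired-end argument is simple arithmetic, sketched below under the assumption that a breakpoint lands uniformly at random within a fragment. The function name and the example library (500 bp fragments, 75 bp reads) are illustrative only; with long reads on short fragments the balance reverses.

```python
def breakpoint_location_odds(fragment_len, read_len):
    """Return (P(breakpoint inside a read), P(breakpoint in the unsequenced part)).

    Assumes two reads of read_len, one from each end of the fragment, and a
    breakpoint position that is uniform along the fragment.
    """
    sequenced = min(2 * read_len, fragment_len)  # reads cannot cover more than the fragment
    p_in_reads = sequenced / fragment_len
    return p_in_reads, 1.0 - p_in_reads


p_reads, p_unsequenced = breakpoint_location_odds(fragment_len=500, read_len=75)
```

For this library, a breakpoint that falls within a fragment is more than twice as likely to sit between the reads as inside them, which is why the pairing information matters.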

•   Cancer samples differ from the reference in many ways:
     –   Copy Number Variations
     –   Complex rearrangement
     –   Large number of mutations
     –   Heterogeneity of clones at given “location”

•   Very high false-positive rate with current methods
     – Difficult to have a Gold Standard

•   OncoSeq
     –   Extension from SNP arrays to evaluate copy number, LOH and mosaicism

•   Pindel
     –   Extension of InDel search for small events
     –   Requires paired-end reads

•   BreakDancer
     –   Also includes inter- and intra-chromosomal translocations
     –   Requires paired-end

•   Genome STRiP
     –   Essentially deletions
     –   Requires a population of at least 20 samples to gain power and capture heterogeneity

•   CREST
     –   Somatic structural variation

•   …
                             Genome STRiP

     Discovery and genotyping of genome structural polymorphism by sequencing on a population scale (2011)
     Handsaker R et al. Nature Genetics 43, 269-276

a)   Millions of end-sequence pairs from sequencing libraries show aberrant
     alignment locations, appearing to span vast genomic distances. Almost all
     of these observations derive not from true structural variants but from
     chimeric inserts in molecular sequencing libraries.

b)   A set of 'coherently aberrant' end-sequence pairs from many genomes. At this genomic locus, paired-end
     sequences (sequences of the two ends of the inserts in a molecular library) fall into two classes:
      i.     end-sequence pairs that show the genomic spacing expected given the insert size distribution of
             each sequencing library;
      ii.   end-sequence pairs that align to genomic locations unexpectedly far apart but which relate to their
            expected insert size distributions by a shared correction factor. A unifying model in which these eight
            read pairs from five genomes arise from a shared deletion allele converts all of these aberrant read
            pairs to likely observations. In the right panel, the black tick marks indicate genomic distance
            between left and right end sequences; the black curves indicate insert size distributions of the
            molecular library from which each sequence-pair was drawn.
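
The idea of 'coherently aberrant' read pairs can be illustrated with a small sketch. This is not the Genome STRiP model, which is likelihood-based; it is a simplified stand-in that assumes all pairs passed in already map to one candidate locus, and the names ReadPair, coherent_deletion, z_cutoff and min_samples are hypothetical.

```python
from statistics import median
from typing import NamedTuple


class ReadPair(NamedTuple):
    sample: str      # genome the pair was sequenced from
    span: int        # observed genomic distance between the two end alignments
    lib_mean: float  # expected insert size of the pair's library
    lib_sd: float    # standard deviation of that insert size


def coherent_deletion(pairs, z_cutoff=4.0, min_samples=2):
    """Return (deletion_size, supporting_samples), or None if there is no coherent signal."""
    # Keep pairs whose span is unexpectedly large for their own library.
    aberrant = [p for p in pairs if (p.span - p.lib_mean) / p.lib_sd > z_cutoff]
    if not aberrant:
        return None
    # Each aberrant pair implies a deletion length: the shared correction factor.
    implied = [p.span - p.lib_mean for p in aberrant]
    size = median(implied)
    # A coherent pair looks like an ordinary observation once `size` is subtracted.
    support = [p for p, d in zip(aberrant, implied) if abs(d - size) <= z_cutoff * p.lib_sd]
    samples = sorted({p.sample for p in support})
    if len(samples) < min_samples:
        return None
    return size, samples


pairs = [
    ReadPair("A", 2300, 300, 30),  # implies a ~2 kb deletion
    ReadPair("B", 2410, 400, 40),  # different library, same implied deletion
    ReadPair("B", 2390, 400, 40),
    ReadPair("C", 310, 300, 30),   # normal spacing
    ReadPair("D", 9000, 300, 30),  # isolated outlier, e.g. a chimeric insert
]
result = coherent_deletion(pairs)
```

Pairs from libraries with different insert sizes agree on one deletion length, whereas the isolated outlier does not, so only the former count as support.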
     Genome STRucture in Populations
a)   Population-scale sequence data contain two
     classes of information: technical features of the
     sequence data within a genome and population-
     scale patterns that span all the genomes analyzed.
     Technical features include breakpoint-spanning
     reads, paired-end sequences and local variation in
     read depth of coverage. Genome STRiP combines
     these with population-scale patterns that span
     many genomes, including: the sharing of structural
     alleles by multiple genomes; the pattern of
     sequence heterogeneity within a population; the
     substitution of alternative structural alleles for each
     other; and the haplotype structure of human
     genome polymorphism.

b)   'Variation discovery' involves identifying the
     structural alleles that are segregating in a
     population. The power to observe a variant in any
     one genome is only partial, but the evidence
     defining a segregating site can be derived from
     many genomes at once. 'Population genotyping'
     requires accurately determining the allelic state of
     each variant in every diploid genome in a
     population.

     Discovery and genotyping of genome structural polymorphism by sequencing on a population scale (2011)
     Handsaker R et al. Nature Genetics 43, 269-276
                 Clipping REveals Structure
a)   Illustration of SV analysis using discordantly mapped
     paired-end reads versus mapping using soft-clipping

b)   An example of using soft-clipping signature to identify an
     interchromosomal translocation. Reference genome
     sequence is at the top and next-generation sequencing
     reads are shown below it with highlighted mismatches to
     the reference (red letters) and soft-clipping
     subsequences not aligned to the reference (gray letters).
     In this example, the soft-clipping subsequences map to
     chromosome X, revealing a chromosome 2 to
     chromosome X translocation.

c)   The five-step CREST algorithm:
      1.   extraction of soft-clipped reads in the binary
           alignment/map (bam) file;
      2.   assembly of soft-clipped reads at a putative
           breakpoint into a contig;
      3.   mapping of the contig against the reference
           genome to identify candidate partner breakpoints;
      4.   identification of all possible soft-clipped reads at the partner
           breakpoint and assembly into a contig;
      5.   alignment of the contig derived from the partner
           back to the reference genome. A match to the
           initial breakpoint is considered an SV.

     CREST maps somatic structural variation in cancer genomes with base-pair resolution (2011)
     Jianmin Wang et al. Nature Methods 8, 652–654
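
Step 1 of the algorithm, finding soft-clipped reads, can be sketched from the SAM fields alone. This is not CREST's own code: it is a minimal illustration that assumes the 1-based leftmost position and the CIGAR string of each alignment have already been read from the BAM file, and the function name and the min_clip threshold are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def soft_clip_breakpoints(pos, cigar, min_clip=5):
    """Return a list of (reference_position, side, clip_length) for one alignment.

    pos   : 1-based leftmost aligned reference position (SAM POS field)
    cigar : SAM CIGAR string, e.g. "60M40S"
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips may flank soft clips at either end; they carry no sequence, so drop them.
    ops = [(n, op) for n, op in ops if op != "H"]
    found = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        # Clipped at the start: the putative breakpoint is the first aligned base.
        found.append((pos, "left", ops[0][0]))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        # Clipped at the end: the putative breakpoint is the last aligned base.
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        found.append((pos + ref_len - 1, "right", ops[-1][0]))
    return found


right = soft_clip_breakpoints(1000, "60M40S")
left = soft_clip_breakpoints(500, "30S70M")
```

Positions reported by several independent reads are the candidates whose clipped sequences would then be assembled into a contig in step 2.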

•   CNPs are very common in the human genome
     – It is easier to envisage a functional role for them

•   Common ones are well tagged by existing markers
     – They do not reveal many new loci, but may explain function

•   Hard to characterize uniformly

•   Few have yet been proven functional

•   Still key in cancer
     – More challenging to identify
     – More essential for the understanding

•   Calling them from Next Generation Sequencing data is a brand new challenge

•   Catalogue of CNPs
     – GSV and WTCCC effort
     – Use of the 1000 Genomes Project

•   Methods
     – Improvements in the algorithms
     – Improvements in computing power

•   Other technologies
     – Use of expression data
     – Use of High-Throughput-Sequencing
     – Single molecule sequencing

                    Useful references

•   Collections of known aberrations:
     – Mitelman Database of Chromosome Aberrations in Cancer
          •   http://cgap.nci.nih.gov/Chromsomes/Mitelman
          •   cytogenetically confirmed

     – Database of Genomic Variants
          •   Zhang, J et al. Development of bioinformatics resources for display and analysis of copy number
              and other structural variants in the human genome. Cytogenet. Genome Res. (2006).

     – Redon, R. et al. Global variation in copy number in the human genome.
       Nature, (2006).

     – Iafrate, A.J. et al. Detection of large-scale variation in the human genome.
       Nat. Genet. (2004).

     – McCarroll & Korn (2008)
          •   Based on SNP 6.0 in 270 HapMap samples

     – Genome Structural Variation Consortium
          •   Conrad D et al. Origins and functional impact of copy number variation in the human genome
              Nat. Genet. (2009)

•   OncoSNP
     –   Analytical tool for characterising copy number alterations and LOH events in cancer samples from SNP genotyping data
     –   A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism
         genotyping data
            Yau C et al. (2010) Genome Biology 11:R92

•   R- packages:
     –   DNAcopy:
            •   A Package for Analyzing DNA Copy data,
            •   A faster circular binary segmentation algorithm for the analysis of array CGH data.
                     Venkatraman, E. S. and Olshen, A. B. (2007). Bioinformatics, 23: 657 – 663

     –   snapCGH:
            •   Segmentation, Normalization and Processing of aCGH Data,
            •   BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data.
                     Marioni, J. C., Thorne, N. P., and Tavaré, S. (2006). Bioinformatics 22: 1144 – 1146

     –   beadarraySNP:
            •   package for the analysis of Illumina genotyping BeadArray data,
            •   High-resolution copy number analysis of paraffin-embedded archival tissue using SNP BeadArrays.
                     Oosting J et al. Genome Res. 2007 Mar;17(3):368-76

     –   CNVtools:
            •   R package for robust CNV case control and quantitative trait association.
            •   A robust statistical method for case-control association testing with Copy Number Variation.
                     Barnes C, Plagnol V, Fitzgerald T, Redon R, Marchini J, Clayton D, Hurles ME. Nature Genetics, 2008 Oct;40(10):1245-52

•   Web interface:
     –   Integration of CNV results across multiple samples
     –   http://www.well.ox.ac.uk/~jcazier/GWA_Viewer.html
