Using genetics to study human history and natural selection
David Reich, Harvard Medical School Department of Genetics, Broad Institute
[Title-slide background: a stretch of DNA sequence with variant bases marked]
A 2-part talk:
Section 1: How human history affects human genetic variation
Section 2: Detecting selection by the pattern of genetic variation and finding disease genes
Section 1
How does human history affect genetic variation?
A genome-wide survey of Linkage Disequilibrium
Linkage disequilibrium is a phenomenon whereby genetic variants are associated: people who have one tend to have a second as well
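The association described above is commonly quantified by the coefficient D and its normalized form D', the statistic plotted on the later slides. The following is a minimal sketch, assuming two biallelic sites with alleles A/a and B/b and known haplotype counts; the function name and argument names are illustrative and do not come from the talk.

```python
def d_and_d_prime(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D') for two biallelic sites from the four haplotype counts.

    Assumption: the counts are phased haplotypes (chromosomes), not genotypes.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at site 1
    p_B = (n_AB + n_aB) / n          # frequency of allele B at site 2
    d = n_AB / n - p_A * p_B         # D = P(AB) - P(A) * P(B)
    # D' divides D by the largest magnitude D could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime


# People who carry A tend to carry B as well: positive D, D' between 0 and 1.
d, d_prime = d_and_d_prime(40, 10, 10, 40)
print(round(d, 3), round(d_prime, 3))
```

D' keeps the sign of D, so the sign depends on which allele at each site is labeled A or B; fixing that labeling in one population gives a common sign convention when comparing populations.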
Linkage Disequilibrium Explained
[Figure: Emergence of Variations Over Time; Variations in Chromosomes Within a Population. Labels: Common Ancestor, Disease Mutation. Axis: time, ending at the present]
What Determines Extent of LD?
[Figure: chromosomes carrying a Disease-Causing Mutation, shown at 2,000 generations ago, 1,000 generations ago, and the present]
How Far Does Association (LD) Extend Between Neighboring Common Sites?
• Theoretical: 3-8 kb (range of uncertainty)
[Figure: distance scale marked at 0 kb, 5 kb, 10 kb, 20 kb, 40 kb, 80 kb, and 160 kb]
Strategy for Assessing Extent of LD
[Figure: sampling scheme along a scale marked at 0 kb, 5 kb, 10 kb, 20 kb, 40 kb, 80 kb, and 160 kb. Axis: distance from core single nucleotide polymorphism (SNP)]
• 19 regions
• 44 Caucasian samples from Utah
• a great deal of DNA sequencing per sample
A Genome-Wide Assessment of Linkage Disequilibrium
Disease Gene Mapping
Human history
MYSTERY: What explains the long-range LD?
Important event in population history?
Positive Control: 48 Swedes
[Figure: Linkage Disequilibrium D' (0 to 1) versus Distance Between SNPs (Base Pairs), out to 160 kb. Series: Utah LD Curve; Sweden LD; Sweden LD With Sign of D' set by Utah]
Identical pattern to Utah
96 Nigerians (Yoruba)
[Figure: Linkage Disequilibrium D' (-0.2 to 1) versus Distance Between SNPs (Base Pairs), out to 160 kb. Series: Utah LD Curve; Nigeria LD; Nigeria LD with sign of D' set by Utah]
Much Less LD
Associations in Africans a SUBSET of those in Caucasians
MUST be influenced by population history
Confirmation of less LD in Africans from Direct DNA Sequencing
[Figure: Mean |D'| (0 to 1) versus distance (500 bp to 160 kb) for Nigerian and Utah samples. Numeric point labels as extracted: 313 101 174 86 6 48 4 63 20 83 98 16 67 56]
Anna DiRienzo also shows this pattern
More evidence from Genotyping ~5,000 SNPs (Gabriel et al. 2002)
[Figure: Mean |D'| (0 to 1) versus Distance (bp), 0 to 150,000. Series: Caucasian, African-American, Asian, Yoruban]
K. Kidd, J. Kidd, Sarah Tishkoff also show this
Explanation: Bottleneck or 'Founder Effect' in History of North Europeans
• likely <10 founding chromosomes
What was this event? (1) Out of Africa? (2) Founding of Europe?
[Figure labels: Ancestral Population; ~100,000 years ago; North Europeans; Yoruba; Ancestors]
Open Mysteries
• What caused the bottleneck event? The "Out of Africa" migration?
• How many people were involved? When did it occur?
• Can we better understand when the founder event occurred, and how many people were involved?
Acknowledgements for Section 1
Collaborators: Michele Cargill, Stacey Bolk, James Ireland, Pardis C. Sabeti, Daniel J. Richter, Thomas Lavery, Rose Kouyoumjian, Shelli F. Farhadian, Ryk Ward, Eric S. Lander
Samples: Leif Groop, Richard Cooper, Charles Rotimi
Section 2
Using Long-Range Linkage Disequilibrium to Detect Positive Selection in the Genome
Overview
1. The difficulty of detecting genomic regions affected by natural selection
2. The long-range haplotype test
3. Results for two genes: G6PD and CD40 ligand
Existing formal tests for selection
DNA sequence analysis (weak):
• Tajima's D
• HKA test
• McDonald and Kreitman
• Fu and Li's D
• Ka/Ks ratio
Genotyping-based tests: not general at present
Our test is based on the relationship between allele frequency and extent of linkage disequilibrium
No selection
Young alleles:
• low frequency
• long-range LD
Old alleles:
• low or high frequency
• short-range LD
Positive Selection
Young alleles:
• high frequency
• long-range LD
The signal of selection
[Figure: Linkage Disequilibrium (Homozygosity) versus frequency, contrasting Positive Selection with Neutrality]
Paradigm of the Core Region
[Figure: a gene with core markers and long-range markers (C/T, A/G, A/G, C/T, C/T). Labels: Core Haplotypes 1-5; Long-range multi-SNP haplotypes]
Decay of LD
[Figure: a gene with core markers and long-range markers (C/T, A/G, A/G, C/T, C/T, C/T), with example alleles at each marker. Label: Long-range multi-SNP haplotypes]
Decay of homozygosity
(probability, at any distance, that any two haplotypes that start out the same have all the same SNP genotypes)
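The definition above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: it assumes every input haplotype already carries the same core haplotype and is written as a string of alleles at the long-range markers, ordered by increasing distance from the core. The function name `ehh` and the example data are hypothetical.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, n_markers):
    """Probability that two randomly chosen haplotypes (all sharing one core
    haplotype) are identical at the first `n_markers` long-range markers."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes to form a pair")
    # Group chromosomes by their extended haplotype out to the chosen marker.
    groups = Counter(h[:n_markers] for h in haplotypes)
    identical_pairs = sum(comb(count, 2) for count in groups.values())
    return identical_pairs / comb(n, 2)


# Hypothetical chromosomes that all carry the same core haplotype.
carriers = ["CTA", "CTA", "CTT", "CGA"]
for k in range(len(carriers[0]) + 1):
    print(k, round(ehh(carriers, k), 3))
```

With these four made-up chromosomes the value stays at 1.0 through the first marker, drops to 0.5 at the second, and to about 0.17 at the third, which is the kind of decay with distance that the slide describes.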
[Figure: example haplotypes annotated with homozygosity values of 100%, 75%, 35%, and 18%]
Two genes associated with malaria resistance
G6PD (1960s):
• well-established association with malaria resistance
• selection demonstrated in 2001 by Tishkoff et al.
CD40 ligand (2002):
• recent association by Sabeti et al.
• involved in immune regulation
Experimental Design
• G6PD: markers spanning -480 kb to +220 kb around the gene (telomere at the + end); 11 SNPs in core, 14 at long distances
• CD40 ligand (TNFSF5): markers spanning -180 kb to +520 kb around the gene (telomere at the + end); 7 SNPs in core, 14 at long distances
• DNA samples from 231 African men: Yoruba (Nigeria), Beni (Nigeria), Shona (Zimbabwe)
• Perfect phase (X chromosome)
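Core haplotypes
G6PD: core haplotypes 1-9; Africans (230), non-Africans (95)
CD40 ligand: core haplotypes 1-6; Africans (231), non-Africans (91)
[Table: counts of each core haplotype in Africans and non-Africans, with the "A-" protective haplotype marked. Cell values as extracted, row and column assignment not preserved: 38 72 4 28 28 14 41 5 / 4 61 13 / 5 91 9 78 30 1 / 77 21 7 7 / 17]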
G6PD: long-range haplotype diversity
[Figure: long-range haplotype diversity panels for G6PD-corehap1, G6PD-corehap3, G6PD-corehap4, G6PD-corehap5, G6PD-corehap6, G6PD-corehap7, G6PD-corehap8 ("A-" protective haplotype), and one panel whose label is truncated to "G6PD-corehap"]
G6PD: homozygosity vs. distance
[Figure: EHH (extended haplotype homozygosity) versus distance from the core region (kb)]
G6PD: computer simulation vs. data
Core haplotype 8: P << 0.0008
[Figure: Relative EHH versus core haplotype frequency]
G6PD: P-values from simulation
[Figure: P-value versus distance from the core region (kb)]
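The slides do not say how the simulation P-values were computed. One common convention, shown here purely as an assumption, is an empirical P-value: the fraction of simulated neutral values that are at least as extreme as the observed one, with an add-one correction so the estimate is never exactly zero. The function name and the numbers below are hypothetical.

```python
def empirical_p_value(observed, simulated):
    """One-sided empirical P-value with an add-one correction.

    `simulated` holds the test statistic (for example, relative EHH) from
    neutral simulations; larger values are treated as more extreme.
    """
    if not simulated:
        raise ValueError("need at least one simulated value")
    at_least_as_extreme = sum(1 for s in simulated if s >= observed)
    return (at_least_as_extreme + 1) / (len(simulated) + 1)


# Made-up numbers for illustration only (not data from the talk).
neutral_values = [1.0, 2.0, 3.0, 6.0]
print(empirical_p_value(5.0, neutral_values))
```

With this convention the smallest reportable P-value is 1 / (number of simulations + 1), so resolving very small P-values requires a correspondingly large number of simulations.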
G6PD also stands out in comparison to 7 control regions
[Figure: Relative EHH versus core haplotype frequency]
CD40 ligand: long-range haplotype diversity
[Figure: long-range haplotype diversity panels for CD40 ligand corehap1, corehap2, corehap3, corehap4, and corehap5]
CD40 ligand: homozygosity vs. distance
[Figure: EHH (extended haplotype homozygosity) versus distance from the core region (kb)]
CD40 ligand: computer simulation vs. data
Core haplotype 4: P << 0.0011
[Figure: Relative EHH versus core haplotype frequency]
CD40 ligand: P-values from simulation
[Figure: P-value versus distance from the core region (kb)]
CD40 ligand also stands out in comparison to 7 control regions
[Figure: Relative EHH versus core haplotype frequency]
Malaria resistance arose in the last 10,000 years in Africa
Long-range linkage disequilibrium also gives a direct estimate of the date:
• ~2,500 years ago for G6PD
• ~6,500 years ago for CD40 ligand
Traditional tests fail to detect the effect
• Tajima's D, HKA test, McDonald and Kreitman, Fu and Li's D, Ka/Ks ratio: not significant in our data
The long-range haplotype test is a powerful way to detect selection in the last 10,000 years
Conclusions: Powerful general approach for detecting selection
Screen the genome for Positive Selection
Conclusions: Genome-wide screen for natural selection
We can find disease genes without patients!
What's coming…
1. Generalization of the long-range haplotype test
2. Application of the approach genome-wide
• Haplotype map data set
• Disease gene screen data sets
Acknowledgements for Section 2
Pardis C. Sabeti, John Higgins, Haninah Z.P. Levine, Daniel J. Richter, Stephen F. Schaffner, Stacey Gabriel, Jill V. Platko, Nicholas J. Patterson, Gavin J. McDonald, Hans C. Ackerman, Sarah J. Campbell, David Altshuler, Richard Cooper, Ryk Ward, Eric S. Lander
Note
The 3rd section of the talk is not included here because it presents data that have not yet been published.