Embed
Email

Application to estimate haplotypes for multiallelic present-absent

Document Sample

Shared by: yurtgc548
Categories
Tags
Stats
views:
0
posted:
12/11/2011
language:
pages:
8
Application to estimate haplotypes for

multiallelic present-absent loci



Robert Nowak



Electronics Systems Institute, Warsaw University of Technology

Nowowiejska 15/19, 00-665 Warsaw r.m.nowak@elka.pw.edu.pl





Summary. The article presents an algorithm and an application to estimate haplo-

type frequencies from genotype data for unrelated individuals. Presented approach

can handle loci with multiple alleles as well as silent (null) alleles. The mathematical

model and an expanded Expectation-Maximization algorithm is described. The com-

puter program, called NullHap, available freely at http://staff.elka.pw.edu.pl/

~rnowak2/nullhap implements presented ideas. Comparison with known software to

estimate haplotypes: Arlequin, PHASE and Haplo-IHP proves the advantage pre-

sented method.







1 Introduction

Laboratory techniques used to determine local haplotypes [4] are often too

expensive for large-scale studies. The lack of phase information (as depicted

in Fig. 1) provided by the popular typing methods could be overcome by us-

ing likelihood-based calculations [3], which estimates haplotype frequencies in

population, and reconstruct of haplotype pair in an individual. This approach

is more cost-effective and powerful than linkage analysis [2], and gives more

information than single marker-based methods [7].









Fig. 1. The lack of phase information and ambiguous of the heterozygous status for

present-absent genes after popular typing techniques.

2 Robert Nowak



The maximum likelihood approach to haplotype estimation from unrelated

individuals was widely discussed, and a number of programs is available for

that purpose. The Expectation - Maximisation algorithm (EM), proposed in

[6] is the most popular (used in tested applications Arlequin [5] and PHASE

[9]), but also others methods were applied: Bayesian networks [8], Monte Carlo

[1] and Hidden Markov Model.

The polymorphism for present-absent genes, where ambiguous as regards

to the heterozygous status of the individuals (Fig. 1) is poorly analyzed in

literature. The only available computer program designed to handle this prob-

lem, proposed in [10] can estimate haplotypes only for biallelic loci.





2 Algorithm

In a diploid organism, when the k polymorphic loci is analyzed and each locus

has li different variants (optionally including null variant) for i–th locus, the

k

number of different haplotypes is H = i=1 li , and the number of different

1

haplotype resolutions is R = 2 H ∗ (H + 1). The lack of phase information as

well as ambiguous to the heterozygous state for locus with null variant cause

that only the G (equation 1) different genotypes could be observed.



k

(li − δi )(li + 1 − δi ) + 2δi 1 locus with null allele

G = , δi = (1)

2 0 locus without null allele

i=1



The number of haplotype resolutions rj which gives genotype j (when

phase information is lost) is described by equation 2, where sj is the number

of observed heterozygous and tj is the number of observed (not null) alleles for

loci with null allele. The number of haplotype resolutions grows exponentially

with number of observed loci, thus full space search can not be applied.



2sj −1 ∗ 3tj for sj > 0

rj = 3tj +1 (2)

2 for sj = 0



2.1 Maximum likelihood approach to estimate haplotypes



A sample of genotype data from unrelated n individuals is simplified to a

vector S = (n1 , n2 , ..., nG ), where G is the number of different genotype data

(with lack of phase information), and nj is the number of individuals having

G

j-th genotype, j=0 nj = n. In maximum likelihood approach haplotype

probability hi is estimated to maximise the probability of the given sample

(equation 6).

The condition probability of sample S, given each genotype probability

gi , assume unrelatedness of individuals in sample is provided in equation (3),

where α not depends on gi .

Application to estimate haplotypes for multiallelic present-absent loci 3





G F

n! n n

P (S |g1 , g2 , ..., gG ) = ∗ g j =α gj j (3)

n1 ! ∗ n2 ! ∗ ... ∗ nG ! j=1 j j=1



The probability of genotype gj is the sum of probabilities of respective hap-

lotype pairs zmn , and with Hardy-Weinberg assumption, is calculated from

haplotype probabilities, as shown in equation 4, where zmn is the probability

of haplotype pair m and n, rj is a number of haplotype pairs for given j-th

genotype, and hm , hn are the probabilities of haplotypes m and n respectively.

rj

h2

m for m = n

gj = zmn , where zmn = (4)

2 hm hn for m = n

i=0



The estimation of haplotypes probability to maximise the probability of ob-

served sample can be described as optimization, the equation 6 summarizes

considered approach.



G rj

arg max P (S |h1 , h2 , ..., hH ) = arg max ( zmn )nj , (5)

h1 ,h2 ,...,hH h1 ,h2 ,...,hH j=1 i=0



h2

m for m = n

zmn =

2 hm hn for m = n



2.2 Extended EM algorithm



An Expectation-Maximization algorithm alternates between performing an

expectation step E (t) , which computes an expectation value of unknown pa-

rameters, here the probabilities of haplotype pairs, and a maximization step

M (t) , which computes the value of parameters by maximising the probability

of observed data. The parameters found on the M (t) step are then used to

begin another E (t+1) step, and the process is repeated until the parameters

are changed.

The EM algorithm peaks at local optimum. In presented solution this algo-

rithm is used to maximize equation 6. The algorithm details for polymorphism

with k observed loci, li variants for i-th locus, and sample S = (n1 , n2 , ..., nG )

are supplied below.



Initiation



The initial haplotype pair probabilities (the E 0 step) are set as described

in equation 6. For each haplotype pair the probability depends only on the

number of resolutions for given genotype. It is similar to initiation in [6].



(0) 1

zmn = where the mn gives the j genotype (6)

rj

4 Robert Nowak



Maximization step

The genotype probability is calculated as sum of responding haplotype pairs

probabilities, and the next values of haplotype pair probabilities are calculated

to maximize given sample probability. Details in equation 7.

(t) rj

(t+1) nj zmn (t) (t)

zmn = ∗ (t) , where mn gives genotypej, gj = zx (7)

n gj x



Expectation step

Haplotype probabilities hm are calculated from given haplotype pair proba-

bilities zmn , it is a half of sum of each haplotype pair probability when given

haplotype occurs. The next expected haplotype pair probabilities are calcu-

lated using haplotype probabilities as described in equation 8.

(t)

(t+1) (hm )2 for m = n 1 (t) (t)

zmn = (t) (t) where h(t) = (

m zim + zmj ) (8)

2 hm hn for m = n 2 i j



Stop conditions

Algorithm stops when the stability of estimations between following steps

are obtained. In presented approach it means the absolute difference between

calculated probabilities is less then ǫ (equation 9).

R

(t+1) (t)

|zi − zi | < ǫ (9)

i=1



Finally, the haplotype probabilities are calculated (by another E step), as

well as the conditional probability of the haplotype pair, given genotype are

estimated using equation 10.

zmn zmn

zmn |gj = = rj (10)

gj x zx







3 NullHap: validation and comparison with other

applications to haplotype estimation

Considered algorithm was used in application called NullHap. The implemen-

tation in C++, depends on the Boost libraries and faif library [12]. The source

code and binaries for Windows NT/2000/XP and Debian Linux are available

at project web page [11].

Application was tested on simulated and real data sets. Results were com-

pared with previously mentioned programs: Arlequin [5], PHASE [9] and

Haplo-IHP [10]

The main advantage of NullHap over others known applications is ability

to handle problems when multiallelic locus containing null variant (Tab. 1).

Application to estimate haplotypes for multiallelic present-absent loci 5



name biallelic multiallelic null variants

Arlequin + + -

PHASE + + -

Haplo-IHP + - +

NullHap + + +

Table 1. Short comparison of analyzed applications to estimate haplotypes







Test on generated data sets



To compare the ability to calculate the haplotypes probabilities for consid-

ered applications the most probable sample was generated for five polymor-

phisms with varying locus characteristics. An example of assumed and es-

timated probabilities is shown in Tab. 2 (others examples are available at

web [11]). In Tab. 3 the results of simulations are summarized. The error was

calculated as provided in equation 11, where x is assumed probability, and x∗

is calculated one.

N

1 x − x∗

error = | | (11)

N i=1

x







haplotype probability hi

assumed Arlequin PHASE Haplo-IHP NullHap

A0B0 0.2 0.042 0.04 0.25 0.2

A0B1 0.2 0.126 0.12 0.25 0.2

A1B0 0.2 0.147 0.14 0.25 0.2

A1B1 0.2 0.442 0.42 0.25 0.2

A2B0 0.1 0.021 0.07 0 0.1

A2B1 0.1 0.221 0.21 0 0.1

error - 77% 67% 50% 0%

Table 2. Example of simulated data. Two loci polymorphism was considered: prob-

abilities were assumed for each haplotype, next the sample was generated, than

haplotype probabilities were estimated, finally results were compared.







Next, the effect of sample size on the exactness of the method were in-

vestigated, therefore the minimal sample size required to provide reliable es-

timations could be obtained. The 10 samples of 25,50,100,200,400 and 1000

from an infinite population in Hardy - Weinberg equilibrium characterized by

haplotype distribution as those given 4-th and 5-th example in tab. 3. The

haplotype probability were estimated, and median of errors (equation 11) are

illustrated in fig. 2.

6 Robert Nowak



No example error

description Arlequin PHASE Haplo-IHP NullHap

1 biallelic loci: A(A0, A1), B(B0, B1), 61% 50% 0.7% 0%

C(C0, C1), null variants: A0, B0 and C0

2 biallelic loci: A(A1, A2), B(B1, B2), no 0% 0% 0% 0%

null variants

3 multiallelic loci: A(A0, A1, A2), B(B0, 62% 62% 100% 0%

B1, B2), null variants: A0, B0

4 multiallelic loci: A(A1, A2, A3), B(B1, 0% 1% 78% 0%

B2, B3), no null variants

5 multiallelic loci, A(A0, A1, A2), B(B0, 77% 67% 50% 0%

B1), null variants: A0, B0

Table 3. Haplotype estimation probability error for considered applications









Fig. 2. Sample size effect on organism described in tab. 2. The figure describes the

median of error in function of sample size for two examples (example 4 and 5 from

tab. 3.





Test on real data sets



The applications able to analyze multiallelic loci (Arlequin, PHASE and Null-

Hap) were used to calculate HLA (DRB1-DQB1) data set for 99 Polish sup-

plied by [5]. The case contains two multiallelic loci (36 and 14 variants),

without null variants. The results, shown in tab. 4, are very similar for each

computer program.

The data set containing the KIR loci for 200 Irish [10], was used to check

the applications takes into account null variants (Haplo-IHP and NullHap).

Each locus is biallelic and has null variants. Results provided in tab. 5 shows

similar results.

Tests of simulated samples proves ability to estimate assumed haplotype

probabilities. In case of polymorphisms able to calculate by other known com-

puter programs the results produced by NullHap were similar.

Application to estimate haplotypes for multiallelic present-absent loci 7



haplotype Arlequin NullHap Phase haplotype Arlequin NullHap Phase

1500 0602 0.16 0.156 0.153 1301 0603 0.055 0.055 0.05

0700 0200 0.1 0.101 0.101 0401 0302 0.035 0.035 0.035

0101 0501 0.091 0.091 0.091 0104 0501 0.035 0.035 0.035

0301 0200 0.076 0.076 0.075 0806 0402 0.025 0.025 0.025

1101 0301 0.066 0.066 0.063 0700 0303 0.02 0.02 0.02

Table 4. The haplotype estimations for HLA data set for 99 Polish [5]. The differ-

ence between estimated probabilities between programs is less then 2%.





2DS2 2DL2 3DL3 2DL5B 2DL1 3DS1 Haplo-IHP NullHap

0 0 1 0 1 0 0.546 0.55

0 0 1 0 1 1 0.101 0.1

1 1 0 1 1 1 0.095 0.094

1 1 0 0 0 0 0.075 0.083

1 1 0 0 1 0 0.064 0.056

1 1 0 1 1 0 0.028 0.028

0 0 0 0 1 0 0.019 0.028

1 1 1 1 1 0 0.021 0.021

1 1 1 0 1 1 0.019 0.019

0 0 0 1 1 1 0.009 0.01

Table 5. The haplotype estimations for KIR data set, 200 Irish, [10]. The difference

of estimated probabilities between programs is about 3%.







Performance tests



The analysis of computational time in different scenarios is presented in tab. 6.

All computations were achieved on Celeron M 1.6GHz, 1GB RAM, under

Debian Linux or Windows XP. The number of loci and sample size influence

to computational time.





4 Conclusion

Presented application can effectively estimate haplotype probabilities with

performance similar to other computer programs. It should be emphasized

that the main advantage of created application is the ability to estimate hap-

lotypes for every type of polymorphism, in particular polymorphisms with

multiallelic loci with null variants. The tests for simulated data shows that

NullHap obtain results closer to assumed than others computer programs.

The demonstrated application is under development, and some improve-

ments are planned such as the ability to consider the cases of missing data

and the additional step removing unimportant haplotypes, to speed-up com-

putations. The new versions will be available at project web side.

8 Robert Nowak



number of time for application

loci haplotypes observations Arlequin Phase HaploIHP NullHap

2 6 100 0.13s 46s 0.5s 0.07s

2 9 100 0.06s 47s 0.15s 0.04s

3 8 100 0.04s 69s 0.58s 0.02s

2 504 99 0.22s 53s - 37s

2 540 99 0.34s 58s - 39s

7 128 200 - - 145s 13s

9 512 200 - - 1300s(8s) 209s

11 2048 200 - - 24h(10s) 3h

Table 6. Computational time for considered applications (HaploIHP in parenthesis

with greedy algorithm). Results only for applications able to estimate haplotypes

for given polymorphism.







The application would be confirmed by larger studies involving experimen-

tal work on allele typing.





References

1. Boettcher P, Pagnacco G, Stella A (2004) A Monte Carlo Approach for Estima-

tion of Haplotype Probabilities in Half-Sib Families. J. Dairy Sci., 87:4303–4310

2. Botstein D, Risch N (2003) Discovering genotypes underlying human phe-

notypes: past successes for Mendelian disease, future approaches for complex

disease. Nat.Genet., 33:228–237

3. Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete

data via the em algorithm. Journal of the Royal Statistical Society, 39: 1–39

4. Douglas J at al. (2001) Experimentally derived haplotypes substantially in-

crease the efficiency of linkage disequilibrium studies. Nat. Genet., 28:361–364

5. Excoffier L, Laval G, Schneider S (2005) Arlequin ver. 3.0: An integrated

software package for population genetics data analysis. Evolutionary Bioinfor-

matics Online, 1:47–50

6. Excoffier L, M. Slatkin M (1995) Maximum-likelihood estimation of molecular

haplotype frequencies in a diploid population. Mol. Biol. Evol., 12:921–927

7. Morris R, Kaplan N (2002) On the advantage of haplotype analysis in the

presence of multiple disease susceptibility alleles. Gen. Epidem., 23:221–233

8. Niu T, Qin Z, Xu X, Liu J (2002) Bayesian haplotype inference for multiple

linked single nucleotide polymorphisms. Am. J. Hum. Genet., 70:157 - 169

9. Stephens M, Smith N, Donnelly P (2001) A new statistical method for haplo-

type reconstruction from population data. Am. J. Hum. Genet., 68:978–989

10. Yoo Y, Tang J, Kaslow R, Zhang K (2007) Haplotype inference for present-

absent genotype data using previously identified haplotypes and haplotype pat-

terns. Bioinformatics, 23:2399–2406

11. NullHap web side: http://staff.elka.pw.edu.pl/~ rnowak2/nullhap/

12. The faif library http://faif.sourceforge.net/



Related docs
Other docs by yurtgc548
项目概述
Views: 0  |  Downloads: 0
雅比斯的禱告The Prayer of Jabez
Views: 0  |  Downloads: 0
無投影片標題
Views: 1  |  Downloads: 0
温故校园
Views: 0  |  Downloads: 0
没有幻灯片标题
Views: 0  |  Downloads: 0
氫能源
Views: 0  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!