Application to estimate haplotypes for
multiallelic present-absent loci
Robert Nowak
Electronics Systems Institute, Warsaw University of Technology
Nowowiejska 15/19, 00-665 Warsaw r.m.nowak@elka.pw.edu.pl
Summary. The article presents an algorithm and an application to estimate haplo-
type frequencies from genotype data for unrelated individuals. Presented approach
can handle loci with multiple alleles as well as silent (null) alleles. The mathematical
model and an expanded Expectation-Maximization algorithm is described. The com-
puter program, called NullHap, available freely at http://staff.elka.pw.edu.pl/
~rnowak2/nullhap implements presented ideas. Comparison with known software to
estimate haplotypes: Arlequin, PHASE and Haplo-IHP proves the advantage pre-
sented method.
1 Introduction
Laboratory techniques used to determine local haplotypes [4] are often too
expensive for large-scale studies. The lack of phase information (as depicted
in Fig. 1) provided by the popular typing methods could be overcome by us-
ing likelihood-based calculations [3], which estimates haplotype frequencies in
population, and reconstruct of haplotype pair in an individual. This approach
is more cost-effective and powerful than linkage analysis [2], and gives more
information than single marker-based methods [7].
Fig. 1. The lack of phase information and ambiguous of the heterozygous status for
present-absent genes after popular typing techniques.
2 Robert Nowak
The maximum likelihood approach to haplotype estimation from unrelated
individuals was widely discussed, and a number of programs is available for
that purpose. The Expectation - Maximisation algorithm (EM), proposed in
[6] is the most popular (used in tested applications Arlequin [5] and PHASE
[9]), but also others methods were applied: Bayesian networks [8], Monte Carlo
[1] and Hidden Markov Model.
The polymorphism for present-absent genes, where ambiguous as regards
to the heterozygous status of the individuals (Fig. 1) is poorly analyzed in
literature. The only available computer program designed to handle this prob-
lem, proposed in [10] can estimate haplotypes only for biallelic loci.
2 Algorithm
In a diploid organism, when the k polymorphic loci is analyzed and each locus
has li different variants (optionally including null variant) for i–th locus, the
k
number of different haplotypes is H = i=1 li , and the number of different
1
haplotype resolutions is R = 2 H ∗ (H + 1). The lack of phase information as
well as ambiguous to the heterozygous state for locus with null variant cause
that only the G (equation 1) different genotypes could be observed.
k
(li − δi )(li + 1 − δi ) + 2δi 1 locus with null allele
G = , δi = (1)
2 0 locus without null allele
i=1
The number of haplotype resolutions rj which gives genotype j (when
phase information is lost) is described by equation 2, where sj is the number
of observed heterozygous and tj is the number of observed (not null) alleles for
loci with null allele. The number of haplotype resolutions grows exponentially
with number of observed loci, thus full space search can not be applied.
2sj −1 ∗ 3tj for sj > 0
rj = 3tj +1 (2)
2 for sj = 0
2.1 Maximum likelihood approach to estimate haplotypes
A sample of genotype data from unrelated n individuals is simplified to a
vector S = (n1 , n2 , ..., nG ), where G is the number of different genotype data
(with lack of phase information), and nj is the number of individuals having
G
j-th genotype, j=0 nj = n. In maximum likelihood approach haplotype
probability hi is estimated to maximise the probability of the given sample
(equation 6).
The condition probability of sample S, given each genotype probability
gi , assume unrelatedness of individuals in sample is provided in equation (3),
where α not depends on gi .
Application to estimate haplotypes for multiallelic present-absent loci 3
G F
n! n n
P (S |g1 , g2 , ..., gG ) = ∗ g j =α gj j (3)
n1 ! ∗ n2 ! ∗ ... ∗ nG ! j=1 j j=1
The probability of genotype gj is the sum of probabilities of respective hap-
lotype pairs zmn , and with Hardy-Weinberg assumption, is calculated from
haplotype probabilities, as shown in equation 4, where zmn is the probability
of haplotype pair m and n, rj is a number of haplotype pairs for given j-th
genotype, and hm , hn are the probabilities of haplotypes m and n respectively.
rj
h2
m for m = n
gj = zmn , where zmn = (4)
2 hm hn for m = n
i=0
The estimation of haplotypes probability to maximise the probability of ob-
served sample can be described as optimization, the equation 6 summarizes
considered approach.
G rj
arg max P (S |h1 , h2 , ..., hH ) = arg max ( zmn )nj , (5)
h1 ,h2 ,...,hH h1 ,h2 ,...,hH j=1 i=0
h2
m for m = n
zmn =
2 hm hn for m = n
2.2 Extended EM algorithm
An Expectation-Maximization algorithm alternates between performing an
expectation step E (t) , which computes an expectation value of unknown pa-
rameters, here the probabilities of haplotype pairs, and a maximization step
M (t) , which computes the value of parameters by maximising the probability
of observed data. The parameters found on the M (t) step are then used to
begin another E (t+1) step, and the process is repeated until the parameters
are changed.
The EM algorithm peaks at local optimum. In presented solution this algo-
rithm is used to maximize equation 6. The algorithm details for polymorphism
with k observed loci, li variants for i-th locus, and sample S = (n1 , n2 , ..., nG )
are supplied below.
Initiation
The initial haplotype pair probabilities (the E 0 step) are set as described
in equation 6. For each haplotype pair the probability depends only on the
number of resolutions for given genotype. It is similar to initiation in [6].
(0) 1
zmn = where the mn gives the j genotype (6)
rj
4 Robert Nowak
Maximization step
The genotype probability is calculated as sum of responding haplotype pairs
probabilities, and the next values of haplotype pair probabilities are calculated
to maximize given sample probability. Details in equation 7.
(t) rj
(t+1) nj zmn (t) (t)
zmn = ∗ (t) , where mn gives genotypej, gj = zx (7)
n gj x
Expectation step
Haplotype probabilities hm are calculated from given haplotype pair proba-
bilities zmn , it is a half of sum of each haplotype pair probability when given
haplotype occurs. The next expected haplotype pair probabilities are calcu-
lated using haplotype probabilities as described in equation 8.
(t)
(t+1) (hm )2 for m = n 1 (t) (t)
zmn = (t) (t) where h(t) = (
m zim + zmj ) (8)
2 hm hn for m = n 2 i j
Stop conditions
Algorithm stops when the stability of estimations between following steps
are obtained. In presented approach it means the absolute difference between
calculated probabilities is less then ǫ (equation 9).
R
(t+1) (t)
|zi − zi | < ǫ (9)
i=1
Finally, the haplotype probabilities are calculated (by another E step), as
well as the conditional probability of the haplotype pair, given genotype are
estimated using equation 10.
zmn zmn
zmn |gj = = rj (10)
gj x zx
3 NullHap: validation and comparison with other
applications to haplotype estimation
Considered algorithm was used in application called NullHap. The implemen-
tation in C++, depends on the Boost libraries and faif library [12]. The source
code and binaries for Windows NT/2000/XP and Debian Linux are available
at project web page [11].
Application was tested on simulated and real data sets. Results were com-
pared with previously mentioned programs: Arlequin [5], PHASE [9] and
Haplo-IHP [10]
The main advantage of NullHap over others known applications is ability
to handle problems when multiallelic locus containing null variant (Tab. 1).
Application to estimate haplotypes for multiallelic present-absent loci 5
name biallelic multiallelic null variants
Arlequin + + -
PHASE + + -
Haplo-IHP + - +
NullHap + + +
Table 1. Short comparison of analyzed applications to estimate haplotypes
Test on generated data sets
To compare the ability to calculate the haplotypes probabilities for consid-
ered applications the most probable sample was generated for five polymor-
phisms with varying locus characteristics. An example of assumed and es-
timated probabilities is shown in Tab. 2 (others examples are available at
web [11]). In Tab. 3 the results of simulations are summarized. The error was
calculated as provided in equation 11, where x is assumed probability, and x∗
is calculated one.
N
1 x − x∗
error = | | (11)
N i=1
x
haplotype probability hi
assumed Arlequin PHASE Haplo-IHP NullHap
A0B0 0.2 0.042 0.04 0.25 0.2
A0B1 0.2 0.126 0.12 0.25 0.2
A1B0 0.2 0.147 0.14 0.25 0.2
A1B1 0.2 0.442 0.42 0.25 0.2
A2B0 0.1 0.021 0.07 0 0.1
A2B1 0.1 0.221 0.21 0 0.1
error - 77% 67% 50% 0%
Table 2. Example of simulated data. Two loci polymorphism was considered: prob-
abilities were assumed for each haplotype, next the sample was generated, than
haplotype probabilities were estimated, finally results were compared.
Next, the effect of sample size on the exactness of the method were in-
vestigated, therefore the minimal sample size required to provide reliable es-
timations could be obtained. The 10 samples of 25,50,100,200,400 and 1000
from an infinite population in Hardy - Weinberg equilibrium characterized by
haplotype distribution as those given 4-th and 5-th example in tab. 3. The
haplotype probability were estimated, and median of errors (equation 11) are
illustrated in fig. 2.
6 Robert Nowak
No example error
description Arlequin PHASE Haplo-IHP NullHap
1 biallelic loci: A(A0, A1), B(B0, B1), 61% 50% 0.7% 0%
C(C0, C1), null variants: A0, B0 and C0
2 biallelic loci: A(A1, A2), B(B1, B2), no 0% 0% 0% 0%
null variants
3 multiallelic loci: A(A0, A1, A2), B(B0, 62% 62% 100% 0%
B1, B2), null variants: A0, B0
4 multiallelic loci: A(A1, A2, A3), B(B1, 0% 1% 78% 0%
B2, B3), no null variants
5 multiallelic loci, A(A0, A1, A2), B(B0, 77% 67% 50% 0%
B1), null variants: A0, B0
Table 3. Haplotype estimation probability error for considered applications
Fig. 2. Sample size effect on organism described in tab. 2. The figure describes the
median of error in function of sample size for two examples (example 4 and 5 from
tab. 3.
Test on real data sets
The applications able to analyze multiallelic loci (Arlequin, PHASE and Null-
Hap) were used to calculate HLA (DRB1-DQB1) data set for 99 Polish sup-
plied by [5]. The case contains two multiallelic loci (36 and 14 variants),
without null variants. The results, shown in tab. 4, are very similar for each
computer program.
The data set containing the KIR loci for 200 Irish [10], was used to check
the applications takes into account null variants (Haplo-IHP and NullHap).
Each locus is biallelic and has null variants. Results provided in tab. 5 shows
similar results.
Tests of simulated samples proves ability to estimate assumed haplotype
probabilities. In case of polymorphisms able to calculate by other known com-
puter programs the results produced by NullHap were similar.
Application to estimate haplotypes for multiallelic present-absent loci 7
haplotype Arlequin NullHap Phase haplotype Arlequin NullHap Phase
1500 0602 0.16 0.156 0.153 1301 0603 0.055 0.055 0.05
0700 0200 0.1 0.101 0.101 0401 0302 0.035 0.035 0.035
0101 0501 0.091 0.091 0.091 0104 0501 0.035 0.035 0.035
0301 0200 0.076 0.076 0.075 0806 0402 0.025 0.025 0.025
1101 0301 0.066 0.066 0.063 0700 0303 0.02 0.02 0.02
Table 4. The haplotype estimations for HLA data set for 99 Polish [5]. The differ-
ence between estimated probabilities between programs is less then 2%.
2DS2 2DL2 3DL3 2DL5B 2DL1 3DS1 Haplo-IHP NullHap
0 0 1 0 1 0 0.546 0.55
0 0 1 0 1 1 0.101 0.1
1 1 0 1 1 1 0.095 0.094
1 1 0 0 0 0 0.075 0.083
1 1 0 0 1 0 0.064 0.056
1 1 0 1 1 0 0.028 0.028
0 0 0 0 1 0 0.019 0.028
1 1 1 1 1 0 0.021 0.021
1 1 1 0 1 1 0.019 0.019
0 0 0 1 1 1 0.009 0.01
Table 5. The haplotype estimations for KIR data set, 200 Irish, [10]. The difference
of estimated probabilities between programs is about 3%.
Performance tests
The analysis of computational time in different scenarios is presented in tab. 6.
All computations were achieved on Celeron M 1.6GHz, 1GB RAM, under
Debian Linux or Windows XP. The number of loci and sample size influence
to computational time.
4 Conclusion
Presented application can effectively estimate haplotype probabilities with
performance similar to other computer programs. It should be emphasized
that the main advantage of created application is the ability to estimate hap-
lotypes for every type of polymorphism, in particular polymorphisms with
multiallelic loci with null variants. The tests for simulated data shows that
NullHap obtain results closer to assumed than others computer programs.
The demonstrated application is under development, and some improve-
ments are planned such as the ability to consider the cases of missing data
and the additional step removing unimportant haplotypes, to speed-up com-
putations. The new versions will be available at project web side.
8 Robert Nowak
number of time for application
loci haplotypes observations Arlequin Phase HaploIHP NullHap
2 6 100 0.13s 46s 0.5s 0.07s
2 9 100 0.06s 47s 0.15s 0.04s
3 8 100 0.04s 69s 0.58s 0.02s
2 504 99 0.22s 53s - 37s
2 540 99 0.34s 58s - 39s
7 128 200 - - 145s 13s
9 512 200 - - 1300s(8s) 209s
11 2048 200 - - 24h(10s) 3h
Table 6. Computational time for considered applications (HaploIHP in parenthesis
with greedy algorithm). Results only for applications able to estimate haplotypes
for given polymorphism.
The application would be confirmed by larger studies involving experimen-
tal work on allele typing.
References
1. Boettcher P, Pagnacco G, Stella A (2004) A Monte Carlo Approach for Estima-
tion of Haplotype Probabilities in Half-Sib Families. J. Dairy Sci., 87:4303–4310
2. Botstein D, Risch N (2003) Discovering genotypes underlying human phe-
notypes: past successes for Mendelian disease, future approaches for complex
disease. Nat.Genet., 33:228–237
3. Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete
data via the em algorithm. Journal of the Royal Statistical Society, 39: 1–39
4. Douglas J at al. (2001) Experimentally derived haplotypes substantially in-
crease the efficiency of linkage disequilibrium studies. Nat. Genet., 28:361–364
5. Excoffier L, Laval G, Schneider S (2005) Arlequin ver. 3.0: An integrated
software package for population genetics data analysis. Evolutionary Bioinfor-
matics Online, 1:47–50
6. Excoffier L, M. Slatkin M (1995) Maximum-likelihood estimation of molecular
haplotype frequencies in a diploid population. Mol. Biol. Evol., 12:921–927
7. Morris R, Kaplan N (2002) On the advantage of haplotype analysis in the
presence of multiple disease susceptibility alleles. Gen. Epidem., 23:221–233
8. Niu T, Qin Z, Xu X, Liu J (2002) Bayesian haplotype inference for multiple
linked single nucleotide polymorphisms. Am. J. Hum. Genet., 70:157 - 169
9. Stephens M, Smith N, Donnelly P (2001) A new statistical method for haplo-
type reconstruction from population data. Am. J. Hum. Genet., 68:978–989
10. Yoo Y, Tang J, Kaslow R, Zhang K (2007) Haplotype inference for present-
absent genotype data using previously identified haplotypes and haplotype pat-
terns. Bioinformatics, 23:2399–2406
11. NullHap web side: http://staff.elka.pw.edu.pl/~ rnowak2/nullhap/
12. The faif library http://faif.sourceforge.net/