Causal Inference 因果推论
of Intermediate 中级 Phenotypes 表型 and
Biomarkers 生物标记 in Rheumatoid Arthritis 风湿性关节炎
[An Application of Machine Learning 机器学习 Techniques to Genetic
Epidemiology 遗传流行病学]
Wentian Li 李问天, Ph.D
Feinstein Institute for Medical Research
Wentian Li, North Shore LIJ Health System, 11/23/2011
Genetic Association
Association 相关 is not equivalent to a causal 因果的 relationship
An association between wrinkles and cancer risk does not mean that
one causes 导致 the other
Age is a confounding factor 混杂因素
When do we need to
know cause and effect?
Rarely discussed in genetic analysis because
genotype is always the cause 原因, and
phenotype is always the effect 效果
In epidemiology 流行病学, a factor 因素-disease 疾病 association can arise in three situations: (1) the
factor is a cause; (2) reverse causality; (3) a third,
confounding factor
For two intermediate phenotypes (biomarkers),
the causal arrow can point either way
Causal Inference in
Machine Learning
Large text databases (e.g. Google)
Observational data (no controlled experiment,
and no other approach to determine
causality)
A two-variable association alone indeed cannot be used
to claim causality
The key is a third variable, together with the
conditional 条件的 association given that
third variable
Data Mining and Knowledge Discovery (2000) v4, pp.163-192
An Example
Association?             X and Y   X and Z   Y and Z
Unconditional            YES       YES       YES
Conditional on 3rd var   YES       NO        YES
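The pattern in this example can be reproduced exactly with a small binary chain X -> Y -> Z. The sketch below is only an illustration: the probabilities are made-up values, not taken from the talk, and a chain is just one model that gives this pattern (a common cause Y of X and Z would give it too).

```python
from itertools import product

# Hypothetical parameters for a binary chain X -> Y -> Z (illustrative only).
P_X1 = 0.3                       # P(X=1)
P_Y1_GIVEN_X = {0: 0.2, 1: 0.7}  # P(Y=1 | X=x)
P_Z1_GIVEN_Y = {0: 0.1, 1: 0.6}  # P(Z=1 | Y=y)


def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one


# Exact joint distribution P(x, y, z) implied by the chain.
JOINT = {
    (x, y, z): bern(P_X1, x) * bern(P_Y1_GIVEN_X[x], y) * bern(P_Z1_GIVEN_Y[y], z)
    for x, y, z in product((0, 1), repeat=3)
}


def marginal(keep):
    """Sum the joint over every variable whose index is not in `keep`."""
    out = {}
    for state, p in JOINT.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def associated(a, b, given=None, tol=1e-12):
    """True if variables a and b (indices 0..2) are dependent, optionally given a third."""
    cond = () if given is None else (given,)
    p_abc = marginal((a, b) + cond)
    p_ac = marginal((a,) + cond)
    p_bc = marginal((b,) + cond)
    p_c = marginal(cond)
    for key, p in p_abc.items():
        c = key[2:]
        # Independence holds iff P(a,b,c) * P(c) == P(a,c) * P(b,c) everywhere.
        if abs(p * p_c[c] - p_ac[(key[0],) + c] * p_bc[(key[1],) + c]) > tol:
            return True
    return False


X, Y, Z = 0, 1, 2
unconditional = [associated(X, Y), associated(X, Z), associated(Y, Z)]
conditional = [associated(X, Y, Z), associated(X, Z, Y), associated(Y, Z, X)]
print("unconditional:", unconditional)
print("conditional:  ", conditional)
```

Each list is ordered as in the table (X and Y, X and Z, Y and Z), with each conditional test conditioning on the remaining third variable.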
Cooper’s Local Causality Discovery
(LCD) Rule
Six assumptions: 1. database completeness. 2.
discrete variables. 3. Bayesian network model
(directed acyclic 非环式的 graph: no loops). 4. …
5. no selection bias. 6. valid statistical testing.
Three variables: x, y, z
Hidden 潜在的 variables are allowed (but are not in the
dataset)
Determine three correlations: unconditional
C(x,y) and C(y,z), and conditional C(x,z|y)
Between two variables, there are
only 6 (4) causal relationships
(allowing a confounding variable)
[Diagram: six two-variable graphs, labeled as follows]
no relationship
causing
confounding
confounding + causing
reverse causing (marked "NO")
confounding + reverse causing (marked "NO")
Number of causal relationships
among three variables
6x6x6=216 possibilities
4x4x6=96 if x is not caused by either y or z
(but can receive an arrow from a hidden
variable) [Cooper’97 paper]
2x2x6=24 if x doesn’t even receive an arrow
from hidden confounding variables [Li and
Wang, unpublished]
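The three counts above can be checked by brute-force enumeration. This is a minimal sketch under my own encoding (the names `PAIR_RELATIONS` and `count_models` are hypothetical): each pair of variables gets one of the six relationships, and, like the products on the slide, it counts combinations without checking that the resulting graph is acyclic.

```python
from itertools import product

# The six pairwise relationships between variables a and b, written as
# (direct arrow, hidden common cause present?).
PAIR_RELATIONS = [
    (None, False),    # no relationship
    ("a->b", False),  # causing
    ("b->a", False),  # reverse causing
    (None, True),     # confounding
    ("a->b", True),   # confounding + causing
    ("b->a", True),   # confounding + reverse causing
]


def count_models(x_has_no_observed_cause=False, x_has_no_hidden_cause=False):
    """Count combinations of relationships for the pairs (x,y), (x,z), (y,z)."""
    count = 0
    for xy, xz, yz in product(PAIR_RELATIONS, repeat=3):
        x_pairs = (xy, xz)  # in both of these pairs, x plays the role of "a"
        if x_has_no_observed_cause and any(arrow == "b->a" for arrow, _ in x_pairs):
            continue
        if x_has_no_hidden_cause and any(hidden for _, hidden in x_pairs):
            continue
        count += 1
    return count


all_models = count_models()
x_not_caused = count_models(x_has_no_observed_cause=True)
x_no_arrow_at_all = count_models(x_has_no_observed_cause=True,
                                 x_has_no_hidden_cause=True)
print(all_models, x_not_caused, x_no_arrow_at_all)
```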
Given a causal model…
Unconditional 无条件 association
between any two variables can be
determined by whether they are
connected by a path with no collider
(no node where two arrows meet head-to-head)
Conditional 条件的 association
can be determined by the so-called
“d-separation” rule
“CCC” causal inference rule
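(Cooper version) if x is not an effect of y or z, and C(x,y)+, C(y,z)+, but C(x,z|y)-
(+ means an association is present, - means it is absent),
then there are only three possible causal models:
x => y => z
x <= h => y => z (h is a hidden common cause of x and y)
x => y => z with h => x and h => y (hidden confounding plus x causing y)
In all three models, y causes z.
(Silverstein et al. version) if C(x,y)+, C(y,z)+,
C(x,z)+, but C(x,z|y)-, C(x,y|z)+, C(y,z|x)+, then...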
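The Cooper version of the rule can be checked by enumerating every model in which x is not caused by y or z and keeping those consistent with the three tests. The sketch below is one possible way to do this, not the procedure from the cited papers. It assumes that an association is present exactly when the two variables are d-connected (the faithfulness assumption), gives each hidden confounder its own node, and uses hypothetical helper names (`build_dag`, `d_connected`).

```python
from itertools import product

# Relationships allowed when the first variable is not an effect of the
# second: (direct arrow first->second?, hidden common cause?).
FIRST_NOT_EFFECT = [(False, False), (True, False), (False, True), (True, True)]
# For the pair (y, z) the direct arrow may be absent or point either way.
YZ_RELATIONS = [(d, h) for d in (None, "yz", "zy") for h in (False, True)]


def build_dag(xy, xz, yz):
    """Return {child: set(parents)}; each hidden confounder is its own node."""
    parents = {"x": set(), "y": set(), "z": set()}
    for (a, b), (direct, hidden) in ((("x", "y"), xy), (("x", "z"), xz)):
        if direct:
            parents[b].add(a)
        if hidden:
            parents[a].add("h_" + a + b)
            parents[b].add("h_" + a + b)
    direct, hidden = yz
    if direct == "yz":
        parents["z"].add("y")
    elif direct == "zy":
        parents["y"].add("z")
    if hidden:
        parents["y"].add("h_yz")
        parents["z"].add("h_yz")
    return parents


def d_connected(parents, a, b, given=()):
    """d-connection test via the moralized ancestral graph."""
    # 1. Keep only a, b, the conditioning set, and all of their ancestors.
    keep, stack = set(), [a, b, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralize: link each node to its parents and link co-parents together.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in product(ps, ps):
            if p != q:
                adj[p].add(q)
    # 3. Delete the conditioning nodes and test whether a still reaches b.
    seen, stack = set(given), [a]
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node] - seen)
    return False


matches = []
for xy, xz, yz in product(FIRST_NOT_EFFECT, FIRST_NOT_EFFECT, YZ_RELATIONS):
    dag = build_dag(xy, xz, yz)
    if (d_connected(dag, "x", "y")
            and d_connected(dag, "y", "z")
            and not d_connected(dag, "x", "z", given=("y",))):
        matches.append((xy, xz, yz))

for model in matches:
    print(model)
```

Under these assumptions, every surviving model has a direct y -> z arrow with no hidden confounder between y and z, and no relationship between x and z; the survivors differ only in how x and y are related.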
In a three-way
correlated set
If one of the variables (x) is not an effect (only a cause)
AND
the correlation between x and z is lost conditional on y,
THEN
y causes z
x: gene
y,z: two intermediate phenotypes
The use of a not-an-effect variable has
an amazing parallel in epidemiology
Called an “instrumental variable”
Martijn Katan’s idea on the cholesterol 胆固醇-cancer 癌症
association: he proposed using
a genotype (apolipoprotein 载脂蛋白 E) as the
third variable (Lancet 1986, i:507-508)
Katan did not use conditional correlation
This idea is now called “Mendelian
randomization”
Rheumatoid Arthritis (RA)
An autoimmune 自我免疫的 disease
Chronic inflammation 炎症 of joints 关节
Three times more likely to occur in women than
in men
Age of onset: 40-60 years
Twin 双胞胎 concordance rates: 12-15% for MZ (monozygotic
单合子, 单卵双生), 5% for DZ (dizygotic 异卵双生)
Genetic and environmental (e.g. smoking) risk
factors
MHC/HLA: the main genetic
contribution to RA
MHC (Major Histocompatibility Complex 主要组织相容性复合体) or HLA (Human Leukocyte Antigens 人类白血球抗原):
HLA-DRB1 gene on chromosome 6 (6p21.3)
The RA-associated alleles are HLA-DRB1*0401, *0404,
*0408 (Caucasian), not *0402, *0403, *0407
In Asian populations, different DRB1 alleles are
associated with RA (e.g. *0405, *0901)
A group of DRB1 risk alleles carry the “shared epitope”
(SE) 共同表位, or rheumatoid epitope, encoded at amino acid
positions 70-74 in the third hypervariable region
Two autoantibodies are strongly
associated with RA: RF and anti-CCP
RF (rheumatoid factor 类风湿因子): 80% of RA patients
are RF positive
anti-CCP (anti-cyclic citrullinated peptide antibody 抗环瓜氨酸肽抗体, 抗CCP抗体):
an even better predictor of RA at an
early stage
HLA-DRB1, RF, and anti-CCP are all associated with RA,
and they are associated with each other, so the CCC
rule can be applied!
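张利方, 阎有功, 黄前川, 等, “抗环瓜氨酸肽抗体在类风湿性关节炎诊断中的应用”, 免疫学杂志, 2004, 20:52-57
[Zhang Lifang, Yan Yougong, Huang Qianchuan, et al., “Application of anti-cyclic citrullinated peptide antibody in the diagnosis of rheumatoid arthritis”, Immunological Journal, 2004, 20:52-57]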
Q: Between RF and anti-CCP,
which one is the
cause and which is the
effect?
1723 Caucasian RA patients
         anti-CCP positive      anti-CCP negative
         SE+       SE-          SE+       SE-
RF+      960       128          95        74
RF-      84        19           214       149
Association between RF and DRB1 genotype is lost
conditional on anti-CCP
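One way to see this in the counts from the 1723-patient table is to compare the RF-SE odds ratio in the pooled data with the odds ratios within each anti-CCP stratum. The sketch below uses a plain Pearson chi-square test (1 degree of freedom, no continuity correction) as an assumed test statistic; the slides do not say which test the original analysis used, and the function name `association` is hypothetical.

```python
import math

# (anti-CCP, RF, SE) -> number of patients, copied from the table of 1723 patients.
COUNTS = {
    ("+", "+", "+"): 960, ("+", "+", "-"): 128,
    ("+", "-", "+"): 84,  ("+", "-", "-"): 19,
    ("-", "+", "+"): 95,  ("-", "+", "-"): 74,
    ("-", "-", "+"): 214, ("-", "-", "-"): 149,
}
CCP, RF, SE = 0, 1, 2  # positions within each key


def association(a, b, given=None, level=None):
    """Odds ratio, chi-square and p-value for a vs b, optionally within one level of `given`."""
    cells = {("+", "+"): 0, ("+", "-"): 0, ("-", "+"): 0, ("-", "-"): 0}
    for key, n in COUNTS.items():
        if given is None or key[given] == level:
            cells[(key[a], key[b])] += n
    pp, pm = cells[("+", "+")], cells[("+", "-")]
    mp, mm = cells[("-", "+")], cells[("-", "-")]
    total = pp + pm + mp + mm
    odds_ratio = (pp * mm) / (pm * mp)
    chi2 = total * (pp * mm - pm * mp) ** 2 / (
        (pp + pm) * (mp + mm) * (pp + mp) * (pm + mm))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return odds_ratio, chi2, p_value


rf_se_pooled = association(RF, SE)
rf_se_ccp_pos = association(RF, SE, given=CCP, level="+")
rf_se_ccp_neg = association(RF, SE, given=CCP, level="-")
ccp_se_rf_pos = association(CCP, SE, given=RF, level="+")
ccp_se_rf_neg = association(CCP, SE, given=RF, level="-")

for name, result in [("RF-SE pooled", rf_se_pooled),
                     ("RF-SE | anti-CCP+", rf_se_ccp_pos),
                     ("RF-SE | anti-CCP-", rf_se_ccp_neg),
                     ("CCP-SE | RF+", ccp_se_rf_pos),
                     ("CCP-SE | RF-", ccp_se_rf_neg)]:
    print(f"{name}: OR={result[0]:.2f} chi2={result[1]:.2f} p={result[2]:.3g}")
```

With these counts the pooled RF-SE odds ratio is about 2.9, while within the anti-CCP positive and negative strata it drops to about 1.7 and 0.9. The anti-CCP-SE odds ratio, by contrast, stays well above 1 within both RF strata (about 5.8 and 3.1).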
By the CCC rule, anti-CCP
is the cause, RF
is the effect
Or, anti-CCP is upstream and
RF is downstream in a pathway
Discussions/Issues
There is evidence that RA patients become anti-CCP
positive before becoming RF positive
The three-way correlation might be lost in normal
controls (here we have a “case-only” analysis)
Other factors may lie between anti-CCP and RF
(so the cause-effect relationship may not be direct)
It is not clear where the smoking factor comes in (an
analysis with smoking data could be intriguing!)
Revisit Katan’s “Mendelian Randomization”
(MR) by LCD [Wang, Li, unpublished]
MR: needs a not-an-effect variable (gene)
LCD: needs a variable that is not a cause
MR: conditional association is not used
LCD: conditional association is used
MR: only needs a counterexample (e.g. Apo E2 samples have low cholesterol, but NOT high cancer risk)
LCD: needs complete information on the (G, IP, D) trio for all samples (e.g. Apo genotype, cholesterol level, cancer status)
Co-Authors
Mingyi WANG (Zhejiang Univ,
Computer Science Department,
causal inference)
Patricia Irigoyen, Peter Gregersen
(North Shore LIJ, RA data)