Cancer Biomarkers, Insufficient Data, Multiple Explanations
Do not trust everything you read
 Side Topic: Biomarkers from mass spec data

  SELDI: Surface Enhanced Laser Desorption/Ionization. It
   combines chromatography with TOF-MS.
 Advantages of SELDI technology:
  Uses small amounts (< 1 µl / 500-1000 cells) of sample
   (biopsies, microdissected tissue).
  Quickly obtains protein maps from multiple samples
   under the same conditions.
  Ideal for discovering biomarkers quickly.
ProteinChip Arrays

 Chemical surfaces: Hydrophobic, Anionic, Cationic, Metal Ion, Normal Phase
 Biological surfaces: PS10 or PS20, Antibody - Antigen, Receptor - Ligand, DNA - Protein
SELDI Process
 1. Add crude extract
 2. Washing
 3. Add EAM
 4. Detection by TOF-MS
EAM = Energy absorbing molecule
                           copy from
Protein mapping
 [Figure: stacked spectra, horizontal axis ticks from 7500 to 15000. Panels: C-B8 (C), D-B9 (C), E-B10 (N), C-B14 (N), D-B15 (C).]
Biomarker Discovery

  Markers can be easily
   found by comparing
   protein maps.
  SELDI is faster and
   more reproducible than
   2D PAGE.
  Has been used to
   discover protein
   biomarkers of diseases
   such as ovarian, breast,
   prostate, and bladder
   cancers.
                             (Modified from Ciphergen Web Site)
Inferring biomarkers
 Conrads-Zhou-Petricoin-Liotta-Veenstra (Cancer
  diagnosis using proteomic patterns, Expert Rev. Mol. Diagn. 3:4 (2003)):
  a genetic algorithm found 10 proteins (out
  of 15000 peaks!) as biomarkers to separate about
  100 spectra.
 Question:
   Are they the only separator?
   Might they be by-products of other, more important proteins?
   Could some of them be noise or pollutants?
Multiple Decision List Experiment
(Research work with Jian Liu)
 We answer the first question. Last 2 are not our
 Decision List:
    If $M_1$ then $O_1$
     else if $M_2$ then $O_2$
       ...
        else if $M_{j-1}$ then $O_{j-1}$
          else $O_j$
 Each term $M_i$ is a monomial of boolean variables. Each
  $O_i$ denotes positive or negative, for classification.
 k-decision list: the $M_i$'s are monomials of size k. It is
  pac-learnable (Rivest, 1987).
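
A minimal sketch of how a decision list classifies one sample: the first monomial that fires decides the label, otherwise the default label applies. The variable names `p1`..`p3` and the rules themselves are made up for illustration and are not taken from the experiment.

```python
from typing import Dict, List, Tuple

# A monomial is a conjunction of literals, each a (variable, required value) pair.
Monomial = Tuple[Tuple[str, bool], ...]
# A rule pairs a monomial M_i with its output label O_i.
Rule = Tuple[Monomial, str]


def satisfies(sample: Dict[str, bool], monomial: Monomial) -> bool:
    """True if every literal of the monomial holds in the sample."""
    return all(sample[var] == value for var, value in monomial)


def classify(sample: Dict[str, bool], rules: List[Rule], default: str) -> str:
    """Return the label of the first rule whose monomial fires, else the default."""
    for monomial, label in rules:
        if satisfies(sample, monomial):
            return label
    return default


# Hypothetical 2-decision list over made-up peak indicators p1..p3.
rules = [
    ((("p1", True), ("p2", False)), "cancer"),
    ((("p3", True),), "normal"),
]
first = classify({"p1": True, "p2": False, "p3": True}, rules, "cancer")
second = classify({"p1": False, "p2": False, "p3": True}, rules, "cancer")
third = classify({"p1": False, "p2": True, "p3": False}, rules, "cancer")
print(first, second, third)  # cancer normal cancer
```

Rule order matters: the first sample also satisfies the second rule, but the first rule decides it.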
Learning Algorithm for k-DL
 k-decision list Algorithm:
Stage i: find a term that
   Covers only samples from the yet-uncovered cancer (or normal) set
   Has the largest cover
 Add the term to the decision list as the i-th term, with
  $O_i$ = cancer (or normal)
Repeat until, say, the cancer set is fully covered. Then
  the last $O_j$ = normal.
 Multiple k-decision list: at each stage, find many
  terms that have the same coverage.
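
The greedy stages above can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes boolean features, treats "same coverage" as "covers exactly the same set of remaining samples", considers monomials of size up to k, and stops once the remaining samples all share one label. The function names and the toy data are hypothetical.

```python
from itertools import combinations, product
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, bool], str]       # (boolean features, label)
Monomial = Tuple[Tuple[str, bool], ...]    # conjunction of (variable, value)


def monomials(variables: List[str], k: int):
    """Yield every monomial over 1..k distinct variables."""
    for size in range(1, k + 1):
        for chosen in combinations(variables, size):
            for values in product([True, False], repeat=size):
                yield tuple(zip(chosen, values))


def covered(term: Monomial, samples: List[Sample], idx: List[int]) -> frozenset:
    """Indices (among idx) of the samples on which the term fires."""
    return frozenset(
        i for i in idx if all(samples[i][0][v] == val for v, val in term)
    )


def multiple_k_dl(samples: List[Sample], k: int):
    """Greedy k-DL learner that records all equivalent terms at each node."""
    variables = sorted(samples[0][0])
    remaining = list(range(len(samples)))
    nodes = []
    while len({samples[i][1] for i in remaining}) > 1:
        best, best_terms, best_label = frozenset(), [], None
        for term in monomials(variables, k):
            cov = covered(term, samples, remaining)
            labels = {samples[i][1] for i in cov}
            if len(labels) != 1:           # empty or mixed cover: unusable
                continue
            if len(cov) > len(best):       # strictly larger cover wins
                best, best_terms, best_label = cov, [term], labels.pop()
            elif cov == best:              # equivalent term: identical cover
                best_terms.append(term)
        if not best_terms:
            raise ValueError("no pure term found; data not separable by a k-DL")
        nodes.append((best_terms, best_label))
        remaining = [i for i in remaining if i not in best]
    default = samples[remaining[0]][1] if remaining else None
    return nodes, default


# Toy data: feature c duplicates feature a, so two terms are equivalent.
toy = [
    ({"a": True, "b": True, "c": True}, "cancer"),
    ({"a": True, "b": False, "c": True}, "cancer"),
    ({"a": False, "b": True, "c": False}, "normal"),
    ({"a": False, "b": False, "c": False}, "normal"),
]
nodes, default = multiple_k_dl(toy, k=1)
print(nodes, default)
```

Choosing one term per node independently yields a distinct decision list, so the number of equivalent lists is the product of the per-node term counts.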
Thousands of Solutions
# proteins   # nodes     training    # of equivalent terms     testing accuracy
  used       in 3-DL     accuracy    in each DL node           max        min
    3           4        (43,79)     (1,3,7,7)                 (45/81)    (44/81)
    5           5        (42,81)     (1,5,1,24,24)             (44/81)    (43/81)
    8           5        (43,81)     (1,9,1,29,92)             (43/81)    (42/81)
   10           5        (43,81)     (1,9,3,46,175)            (43/81)    (42/81)
   15           6        (45,80)     (1,10,5,344,198,575)      (46/81)    (43/71)
   20           5        (45,81)     (1,10,5,121,833)          (45/81)    (43/73)
   25           4        (45,81)     (1,14,2,1350)             (46/81)    (44/80)
   30           4        (45,81)     (1,18,2,2047)             (46/81)    (44/80)
  --------------- 2-DL ------------------------------------------------------------
   50           4        (45/81)     (1,4,8,703)               (46/81)    (42/78)
  100           4        (45/81)     (1,4,60,2556)             (46/81)    (42/76)

WCX I: Performance of multiple decision lists. Notation: (x/y) = (#normal/#cancer)

   Example: 1*4*60*2556 > ½ million 2-decision lists (involving 7.8 proteins each on average) are
    "perfect" for the (45+81) training spectra, but most fail to classify all 45+81+46+81 spectra perfectly (on average,
    such a random hypothesis removes the ovaries of 3 healthy women and leaves one cancer undetected).
    Why should we trust 10 particular proteins that are perfect for 216 spectra?
   Need a new learning theory to deal with small amounts of data and too many relevant attributes.
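
A quick check of the count quoted in the example, using the per-node numbers of equivalent terms from the last table row (100 proteins, 2-DL). The variable names are illustrative.

```python
from math import prod

# Equivalent-term counts per decision-list node, copied from the table row
# for 100 proteins (2-DL).
equivalent_terms = [1, 4, 60, 2556]

# One term is chosen per node, independently, so the counts multiply.
num_lists = prod(equivalent_terms)
print(num_lists)  # 613440
```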
Excursion into Learning Theory

OED: Induction is "the process of inferring
 a general law or principle from the
 observations of particular instances".
Science is induction: from observed data
 to physical laws.

But, how? …
Occam's Razor
 Commonly attributed to William of Ockham
  (1290-1349). It states: Entities should not be
  multiplied beyond necessity.
 Commonly explained as: when you have choices,
  choose the simplest theory.
 Bertrand Russell: "It is vain to do with more
  what can be done with fewer."
 Newton (Principia): "Natura enim simplex est, et
  rerum causis superfluis non luxuriat".
Example. Inferring a DFA
A DFA accepts: 1, 111, 11111, 1111111;
 and rejects: 11, 1111, 111111. What is it?

 [Figure: two candidate DFAs; every transition is labeled 1.]

There are actually infinitely many DFAs
 satisfying these data.
The first DFA makes a nontrivial inductive
 inference, the 2nd does not.
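
Since the figure did not survive in this transcript, here is a sketch of two DFAs that fit the data. Both are assumptions about what the slide showed: a two-state machine that accepts odd-length strings of 1s, and an eight-state chain that merely memorizes the examples (a missing transition is treated as an implicit dead state).

```python
def run_dfa(transitions, start, accepting, word):
    """Run a DFA on word; a missing transition means the word is rejected."""
    state = start
    for symbol in word:
        key = (state, symbol)
        if key not in transitions:
            return False
        state = transitions[key]
    return state in accepting


# Assumed DFA 1: two states, accepts strings of 1s of odd length.
parity = ({(0, "1"): 1, (1, "1"): 0}, 0, {1})

# Assumed DFA 2: a chain 0 -> 1 -> ... -> 7 that only reproduces the data.
chain = ({(i, "1"): i + 1 for i in range(7)}, 0, {1, 3, 5, 7})

accepted = ["1", "111", "11111", "1111111"]
rejected = ["11", "1111", "111111"]

# Both machines are consistent with every example.
for dfa in (parity, chain):
    assert all(run_dfa(*dfa, w) for w in accepted)
    assert not any(run_dfa(*dfa, w) for w in rejected)

# They disagree on unseen input: nine 1s.
on_parity = run_dfa(*parity, "1" * 9)
on_chain = run_dfa(*chain, "1" * 9)
print(on_parity, on_chain)  # True False
```

The smaller machine generalizes beyond the examples, while the chain says nothing useful about strings longer than the data.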
Example. History of Science
  Maxwell's (1831-1879) equations say that: (a) an oscillating magnetic
  field gives rise to an oscillating electric field; (b) an oscillating electric field
  gives rise to an oscillating magnetic field. Item (a) was known from M.
  Faraday's experiments. However, (b) is a theoretical inference by Maxwell,
  guided by his aesthetic appreciation of simplicity. The existence of such
  electromagnetic waves was demonstrated by the experiments of H. Hertz in
  1888, 9 years after Maxwell's death, and this opened the new field of radio
  communication. Maxwell's theory is even relativistically invariant. This was
  long before Einstein's special relativity. As a matter of fact, it is even likely
  that Maxwell's theory influenced Einstein's 1905 paper on relativity, which
  was actually titled 'On the electrodynamics of moving bodies'.
 J. Kemeny, a former assistant to Einstein, explains the transition from the
  special theory to the general theory of relativity: At the time, there were no
  new facts that failed to be explained by the special theory of relativity.
  Einstein was purely motivated by his conviction that the special theory was
  not the simplest theory which can explain all the observed facts. Reducing
  the number of variables obviously simplifies a theory. By the requirement of
  general covariance Einstein succeeded in replacing the previous
  'gravitational mass' and 'inertial mass' by a single concept.
 Double helix vs. triple helix: Watson & Crick, 1953
Counter Example.
 Once upon a time, there was a little girl named Emma.
  Emma had never eaten a banana, nor had she ever
  been on a train. One day she had to journey from New
  York to Pittsburgh by train. To relieve Emma's anxiety,
  her mother gave her a large bag of bananas. At Emma's
  first bite of her banana, the train plunged into a tunnel. At
  the second bite, the train broke into daylight again. At the
  third bite, Lo! into a tunnel; the fourth bite, La! into
  daylight again. And so on all the way to Pittsburgh.
  Emma, being a bright little girl, told her grandpa at the
  station: "Every odd bite of a banana makes you blind;
  every even bite puts things right again." (N.R. Hanson,
  Perception & Discovery)
PAC Learning (L. Valiant, 1984)
 Fix a distribution for the sample space (P(v) for
  each v in the sample space). A concept class C is
  pac-learnable (probably approximately correct
  learnable) iff there exists a learning algorithm A
  such that, for each f in C and ε (0 < ε < 1),
  algorithm A halts in a polynomial number of
  steps and examples, and outputs a concept h in
  C which satisfies the following. With probability
  at least 1 - ε,
          $\sum_{v : f(v) \neq h(v)} P(v) < \varepsilon$
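
The error measure in this definition is the probability mass of the points where h and f disagree. A small sketch on a made-up finite sample space (the space, distribution, and concepts are all hypothetical):

```python
def error(f, h, dist):
    """Total probability of the points v where h(v) differs from f(v)."""
    return sum(p for v, p in dist.items() if f(v) != h(v))


# Hypothetical sample space {0, ..., 7} with the uniform distribution.
dist = {v: 1 / 8 for v in range(8)}


def f(v):
    """Target concept: v is at least 4."""
    return v >= 4


def h(v):
    """Hypothesis: v is at least 5 (wrong only at v = 4)."""
    return v >= 5


err = error(f, h, dist)
print(err)  # 0.125
```

With ε = 0.2 this h would count as approximately correct; with ε = 0.1 it would not.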
Simplicity means understanding
 We will prove that given a set of positive and negative
  data, any consistent concept of size `reasonably' shorter
  than the size of data is an `approximately' correct
  concept with high probability. That is, if one finds a
  shorter representation of data, then one learns. The
  shorter the conjecture is, the more efficiently it explains
  the data, hence the more precise the future prediction.
 Let α < 1, β ≥ 1, m be the number of examples, and s be
  the length (in number of bits) of the smallest concept in
  C consistent with the examples. An Occam algorithm is a
  polynomial time algorithm which finds a hypothesis h in
  C consistent with the examples and satisfying
             $K(h) \le s^{\beta} m^{\alpha}$
Occam Razor Theorem
Theorem. A concept class C is polynomially pac-learnable if there is an
   Occam algorithm for it. That is, with probability > 1 - ε, $\sum_{v : f(v) \neq h(v)} P(v) < \varepsilon$.
   Proof. Fix an error tolerance ε (0 < ε < 1). Choose m such that
           $m \ge \max\{ (2 s^{\beta} / \varepsilon)^{1/(1-\alpha)},\ (2/\varepsilon) \log(1/\varepsilon) \}$.
   This is polynomial in s and 1/ε. Let m be as above. Let S be a set of
   r concepts, and let f be one of them.
Claim. The probability that any concept h in S satisfies P(f ≠ h) ≥ ε and
   is consistent with m independent examples of f is less than $(1-\varepsilon)^m r$.
   Proof: Let $E_h$ be the event that hypothesis h agrees with all m
   examples of f. If P(h ≠ f) ≥ ε, then h is a bad hypothesis. That is, h
   and f disagree with probability at least ε on a random example. The
   set of bad hypotheses is denoted by B. Since the m examples of f
   are independent, for each h in B,
            $P(E_h) \le (1-\varepsilon)^m$.
   Since there are at most r bad hypotheses,
            $P(\bigcup_{h \in B} E_h) \le (1-\varepsilon)^m r$.
Proof of the theorem continues
The postulated Occam algorithm finds a
  hypothesis of Kolmogorov complexity at most
  $s^{\beta} m^{\alpha}$. The number r of hypotheses of this
  complexity satisfies
       $\log r \le s^{\beta} m^{\alpha}$.
By assumption on m, $r \le (1-\varepsilon)^{-m/2}$ (use
  $\varepsilon < -\log(1-\varepsilon) < \varepsilon/(1-\varepsilon)$ for 0 < ε < 1). Using the claim, the
  probability of producing a hypothesis with error
  larger than ε is less than
         $(1-\varepsilon)^m r \le (1-\varepsilon)^{m/2} < \varepsilon$.
The last inequality is by substituting m.
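
To get a feel for the bound on m, the following sketch evaluates it numerically. It assumes the natural logarithm for the second term (the slides write only "log") and uses made-up parameter values; `occam_sample_size` is a hypothetical helper name.

```python
import math


def occam_sample_size(s, eps, alpha, beta):
    """Smallest integer m with m >= max{(2 s^beta / eps)^(1/(1-alpha)),
    (2/eps) log(1/eps)}, taking log as the natural logarithm (assumption)."""
    if not (0 < eps < 1 and alpha < 1 and beta >= 1):
        raise ValueError("need 0 < eps < 1, alpha < 1, beta >= 1")
    complexity_term = (2 * s**beta / eps) ** (1 / (1 - alpha))
    confidence_term = (2 / eps) * math.log(1 / eps)
    return math.ceil(max(complexity_term, confidence_term))


# Made-up parameters: an 8-bit smallest concept, tolerance 0.25.
m = occam_sample_size(s=8, eps=0.25, alpha=0.5, beta=1.0)
print(m)  # 4096

# The final step of the proof: (1 - eps)^(m/2) < eps for this m.
final_step_holds = (1 - 0.25) ** (m / 2) < 0.25
print(final_step_holds)  # True
```

Even for this tiny concept the bound asks for thousands of examples, far more than the few hundred spectra in the biomarker experiment.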
Inadequate Data, Too Many Relevant Attributes
Data in biotechnology is often expensive
 or hard to get.
Pac-learning theory, MDL, SVM, and decision
 tree algorithms all need sufficient data.
Similar situation in expression arrays,
 where too many attributes are relevant:
 which ones to choose?
Epicurus: Multiple Explanations
 Greek philosopher of science Epicurus (342-270 BC)
  proposed the Principle of Multiple
  Explanations: If more than one theory is
  consistent with the observations, keep all
  theories. 1500 years before Occam's razor!
 "There are also some things for which it is not enough to state a
  single cause, but several, of which one, however, is the case. Just
  as if you were to see the lifeless corpse of a man lying far away, it
  would be fitting to state all the causes of death in order that the
  single cause of this death may be stated. For you would not be able
  to establish conclusively that he died by the sword or of cold or of
  illness or perhaps by poison, but we know that there is something of
  this kind that happened to him." [Lucretius]
Can the two theories be integrated?

When we do not have enough data,
 Epicurus said that we should just be happy
 to keep all the alternative consistent
 hypotheses, not selecting the simplest
 one. But how can such a philosophical
 idea be converted to concrete
 mathematics and learning algorithms?
A Theory of Learning with Insufficient Data
 Definition. With the pac-learning notation, a concept class is
   polynomially Epicurus-learnable iff the learning algorithm always
   halts within time and number of examples p(|f|, 1/ε), for some
   polynomial p, with a list of hypotheses of which one is probably
   approximately correct.

 Definition. Let α < 1 and β ≥ 1 be constants, m be the number of
   examples, and s be the length (in number of bits) of the smallest
   concept in C consistent with the examples. An Epicurus algorithm is
   a polynomial time algorithm which finds a collection of hypotheses
   $h_1, \ldots, h_k$ in C such that
      they are all consistent with the examples;
      they satisfy $K(h_i) \le s^{\beta} m^{\alpha}$ for i = 1, ..., k, where K(x) is the Kolmogorov
       complexity of x;
      they are mutually error-independent with respect to the true hypothesis
       h, that is, $h_1 \Delta h, \ldots, h_k \Delta h$ are mutually independent, where $h_i \Delta h$ is
       the symmetric difference of the two concepts.
Theorem. A concept class C is polynomially
   Epicurus-learnable if there is an Epicurus
   algorithm outputting k hypotheses and using
     $m \ge \frac{1}{k} \max\{ (2 s^{\beta}/\varepsilon)^{1/(1-\alpha)},\ (2/\varepsilon) \log(1/\varepsilon) \}$
   examples, where 0 < ε < 1 is the error tolerance.

 This theorem gives a sample-size vs.
  learnability tradeoff. When k = 1, it
  becomes the old Occam's Razor theorem.
 Admittedly, the error-independence requirement is
  too strong to be practical.
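
The tradeoff can be seen numerically by transcribing the stated bound. This sketch only evaluates the formula as written on the slide; it assumes the natural logarithm, uses made-up parameters, and `epicurus_sample_size` is a hypothetical helper name.

```python
import math


def epicurus_sample_size(s, eps, alpha, beta, k):
    """Smallest integer m with
    m >= (1/k) max{(2 s^beta / eps)^(1/(1-alpha)), (2/eps) log(1/eps)},
    taking log as the natural logarithm (assumption)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    occam_bound = max(
        (2 * s**beta / eps) ** (1 / (1 - alpha)),
        (2 / eps) * math.log(1 / eps),
    )
    return math.ceil(occam_bound / k)


# Made-up parameters; k = 1 reproduces the Occam's Razor bound.
sizes = {k: epicurus_sample_size(s=8, eps=0.25, alpha=0.5, beta=1.0, k=k)
         for k in (1, 2, 4, 8)}
print(sizes)  # {1: 4096, 2: 2048, 4: 1024, 8: 512}
```

Keeping k error-independent hypotheses divides the required number of examples by k, which is the sample-size vs. learnability tradeoff stated above.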
Proof. Let m be as in the theorem, let C contain r concepts with f one of them,
   and let $h_1, \ldots, h_k$ be the k error-independent hypotheses from the Epicurus
   algorithm.

Claim. The probability that $h_1, \ldots, h_k$ in C satisfy $P(f \neq h_i) \ge \varepsilon$ and are
   consistent with m independent examples of f is less than $(1-\varepsilon)^{km} \binom{r}{k}$.
Proof of Claim. Let $E(h_1, \ldots, h_k)$ be the event that hypotheses $h_1, \ldots, h_k$ all agree with
   all m examples of f. If $P(h_i \neq f) \ge \varepsilon$ for i = 1, ..., k, then, since the m
   examples of f are independent and the $h_i$ are mutually f-error-independent,
            $P(E(h_1, \ldots, h_k)) \le (1-\varepsilon)^{km}$.
    Since there are at most $\binom{r}{k}$ sets of bad hypothesis choices,
              $P(\bigcup \{ E(h_1, \ldots, h_k) : \text{all } h_i \text{ are bad hypotheses} \}) \le (1-\varepsilon)^{km} \binom{r}{k}$.

   The postulated Epicurus algorithm finds k consistent hypotheses of
   Kolmogorov complexity at most $s^{\beta} m^{\alpha}$. The number r of hypotheses of this
   complexity satisfies $\log r \le s^{\beta} m^{\alpha}$. By assumption on m, $\binom{r}{k} \le (1-\varepsilon)^{-km/2}$.
   Using the claim, the probability that all k hypotheses have error larger than
   ε is less than
             $(1-\varepsilon)^{km} \binom{r}{k} \le (1-\varepsilon)^{km/2}$.
    Substituting m we find that the right-hand side is at most ε.     ■
When there is not enough data to assure
 that the Occam learning converges, do
 Epicurus learning and leave the final
 selection to the experts.
