Posted on: 1/28/2011
Cancer Biomarkers, Insufficient Data, Multiple Explanations
Do not trust everything you read.

Side topic: biomarkers from mass-spectrometry data

SELDI: Surface-Enhanced Laser Desorption/Ionization. It combines chromatography with time-of-flight mass spectrometry (TOF-MS).

Advantages of SELDI technology:
- Uses small amounts of sample (< 1 µl / 500-1000 cells): biopsies, microdissected tissue.
- Quickly obtains protein maps from multiple samples under the same conditions.
- Ideal for discovering biomarkers quickly.

ProteinChip arrays:
- Chemical surfaces: hydrophobic, anionic, cationic, metal ion, normal phase.
- Biological surfaces: PS10 or PS20 (antibody-antigen, receptor-ligand, DNA-protein).

The SELDI process:
1. Add crude extract.
2. Wash.
3. Add EAM (energy-absorbing molecule).
4. Detect by TOF-MS.
(Copied from http://www.bmskorea.co.kr/new01_21-1.htm)

Protein mapping
[Figure: SELDI protein maps over the m/z range 7500-15000 for five samples (C-B8, D-B9, E-B10, C-B14, D-B15), each labeled C (cancer) or N (normal).]

Biomarker discovery
Markers can be found easily by comparing normal and cancer protein maps. SELDI is faster and more reproducible than 2D PAGE. It has been used to discover protein biomarkers of diseases such as ovarian, breast, prostate, and bladder cancers. (Modified from the Ciphergen web site.)

Inferring biomarkers
Conrads, Zhou, Petricoin, Liotta, and Veenstra (Cancer diagnosis using proteomic patterns, Expert Rev. Mol. Diagn. 3(4):411-420, 2003) used a genetic algorithm to find 10 proteins (out of 15000 peaks!) as a biomarker separating about 100 spectra.
Questions: Are they the only separator? Might they be by-products of other, more important proteins? Could some of them be noise or pollutants?

Multiple decision list experiment (research work with Jian Liu)
We answer the first question; the last two are not our business.
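The idea of finding markers "by comparing protein maps" can be made concrete with a minimal sketch: rank m/z peaks by how strongly their mean intensity differs between the normal and cancer groups. The function name `rank_peaks` and the toy spectra are illustrative, not the genetic-algorithm method used in the cited paper.

```python
# Illustrative sketch: rank m/z peaks by separation of group means.
# (Toy data; real SELDI analyses use far more spectra and better statistics.)
from statistics import mean

def rank_peaks(normal, cancer):
    """Each argument is a list of spectra; each spectrum is a dict {mz: intensity}.
    Returns the m/z values sorted by absolute difference of the group means."""
    mzs = set()
    for s in normal + cancer:
        mzs.update(s)
    scores = {}
    for mz in mzs:
        n = mean(s.get(mz, 0.0) for s in normal)
        c = mean(s.get(mz, 0.0) for s in cancer)
        scores[mz] = abs(n - c)
    return sorted(mzs, key=lambda mz: scores[mz], reverse=True)

normal = [{7500: 1.0, 10000: 0.2}, {7500: 0.9, 10000: 0.3}]
cancer = [{7500: 1.1, 10000: 0.9}, {7500: 0.8, 10000: 1.0}]
print(rank_peaks(normal, cancer)[0])  # 10000: its intensity separates the groups
```

As the rest of the lecture argues, many different peak subsets can separate a small sample equally well, so a top-ranked peak is a candidate, not a validated biomarker.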
Decision lists
A decision list:
  if M1 then O1
  else if M2 then O2
  ...
  else if M(j-1) then O(j-1)
  else Oj
Each term Mi is a monomial (conjunction) of boolean variables; each Oi denotes positive or negative, for classification. A k-decision list is one whose Mi's are monomials of size at most k. The class is pac-learnable (Rivest, 1987).

Learning algorithm for k-DL
Stage i: find a term that
- covers only examples from the yet-uncovered cancer (or normal) set, and
- has the largest such cover.
Add the term to the decision list as the i-th term, with Oi = cancer (or normal). Repeat until, say, the cancer set is fully covered; then the last Oj = normal.
Multiple k-decision list: at each stage, find many terms that have the same coverage.

Thousands of solutions

 # proteins    # nodes   training    # of equivalent terms     testing accuracy
 used in 3-DL            accuracy    in each DL node           max       min
 ------------------------------------------------------------------------------
   3           4         (43,79)     (1,3,7,7)                 (45/81)   (44/81)
   5           5         (42,81)     (1,5,1,24,24)             (44/81)   (43/81)
   8           5         (43,81)     (1,9,1,29,92)             (43/81)   (42/81)
  10           5         (43,81)     (1,9,3,46,175)            (43/81)   (42/81)
  15           6         (45,80)     (1,10,5,344,198,575)      (46/81)   (43/71)
  20           5         (45,81)     (1,10,5,121,833)          (45/81)   (43/73)
  25           4         (45,81)     (1,14,2,1350)             (46/81)   (44/80)
  30           4         (45,81)     (1,18,2,2047)             (46/81)   (44/80)
 --------------------------------- 2-DL ---------------------------------------
  50           4         (45/81)     (1,4,8,703)               (46/81)   (42/78)
 100           4         (45/81)     (1,4,60,2556)             (46/81)   (42/76)

Table I: performance of multiple decision lists on WCX data. Notation: (x/y) = (#normal/#cancer).

Example: 1 × 4 × 60 × 2556 > half a million 2-decision lists (involving 7.8 proteins each on average) are "perfect" for the (45+81) training spectra, but most fail to classify all 45+81+46+81 spectra perfectly. (On average, such a random hypothesis would cut off three healthy women's ovaries and leave one cancer undetected.) Why, then, should we trust the 10 particular proteins that are perfect for 216 spectra? We need a new learning theory to deal with a small amount of data and too many relevant attributes.
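The greedy stage-by-stage procedure above can be sketched in a few lines. This is a minimal toy implementation, assuming boolean feature vectors and exhaustive enumeration of monomials (feasible only for small n and k); the function names are mine, not from the paper.

```python
from itertools import combinations, product

def monomials(n, k):
    """All monomials (conjunctions of <= k literals) over n boolean variables.
    A literal is (index, value); x satisfies it iff x[index] == value."""
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for vals in product([0, 1], repeat=size):
                yield tuple(zip(idxs, vals))

def satisfies(x, mono):
    return all(x[i] == v for i, v in mono)

def learn_k_dl(pos, neg, k):
    """Greedy k-decision-list learner: at each stage pick the monomial that
    covers the most yet-uncovered examples, all belonging to one class."""
    n = len((pos + neg)[0])
    remaining = [(x, 1) for x in pos] + [(x, 0) for x in neg]
    dl = []
    while remaining:
        best, best_cov, best_label = None, [], None
        for m in monomials(n, k):
            cov = [(x, y) for x, y in remaining if satisfies(x, m)]
            labels = {y for _, y in cov}
            if len(labels) == 1 and len(cov) > len(best_cov):
                best, best_cov, best_label = m, cov, labels.pop()
        if best is None:
            raise ValueError("data not separable by a k-decision list")
        dl.append((best, best_label))
        remaining = [e for e in remaining if e not in best_cov]
    return dl

def classify(dl, x):
    for mono, label in dl:
        if satisfies(x, mono):
            return label
    return 0  # default label

pos = [(1, 0), (1, 1)]          # toy data: positive iff x0 == 1
neg = [(0, 0), (0, 1)]
dl = learn_k_dl(pos, neg, k=1)
print(classify(dl, (1, 0)), classify(dl, (0, 1)))  # 1 0
```

Keeping, at each stage, every monomial that attains the best coverage (instead of just one) yields the multiple-k-decision-list variant that produced the thousands of equivalent terms in the table.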
Excursion into learning theory
OED: Induction is "the process of inferring a general law or principle from the observations of particular instances". Science is induction: from observed data to physical laws. But how?

Occam's razor
Commonly attributed to William of Ockham (1290-1349). It states: "Entities should not be multiplied beyond necessity." Commonly explained as: when you have choices, choose the simplest theory.
Bertrand Russell: "It is vain to do with more what can be done with fewer."
Newton (Principia): "Natura enim simplex est, et rerum causis superfluis non luxuriat."

Example: inferring a DFA
A DFA accepts 1, 111, 11111, 1111111 and rejects 11, 1111, 111111. What is it?
[Figure: two DFAs consistent with these data -- a small one accepting every odd-length string of 1s, and a larger chain-like one encoding only the observed strings.]
There are actually infinitely many DFAs satisfying these data. The first DFA makes a nontrivial inductive inference; the second does not.

Example: history of science
Maxwell's (1831-1879) equations say that (a) an oscillating magnetic field gives rise to an oscillating electric field, and (b) an oscillating electric field gives rise to an oscillating magnetic field. Item (a) was known from M. Faraday's experiments; (b) was a theoretical inference by Maxwell, driven by his aesthetic appreciation of simplicity. The existence of such electromagnetic waves was demonstrated by the experiments of H. Hertz in 1888, eight years after Maxwell's death, and this opened the new field of radio communication. Maxwell's theory is even relativistically invariant -- long before Einstein's special relativity. In fact, it is likely that Maxwell's theory influenced Einstein's 1905 paper on relativity, which was actually titled "On the electrodynamics of moving bodies". J. Kemeny, a former assistant to Einstein, explains the transition from the special theory to the general theory of relativity: at the time, there were no new facts that failed to be explained by the special theory of relativity.
Einstein was motivated purely by his conviction that the special theory was not the simplest theory that could explain all the observed facts. Reducing the number of variables obviously simplifies a theory; by the requirement of general covariance, Einstein succeeded in replacing the previous "gravitational mass" and "inertial mass" by a single concept.
Another example: the double helix vs. the triple helix -- Watson & Crick, 1953.

Counterexample
Once upon a time, there was a little girl named Emma. Emma had never eaten a banana, nor had she ever been on a train. One day she had to journey from New York to Pittsburgh by train. To relieve Emma's anxiety, her mother gave her a large bag of bananas. At Emma's first bite of her banana, the train plunged into a tunnel. At the second bite, the train broke into daylight again. At the third bite, lo! into a tunnel; at the fourth bite, la! into daylight again. And so on all the way to Pittsburgh. Emma, being a bright little girl, told her grandpa at the station: "Every odd bite of a banana makes you blind; every even bite puts things right again." (N.R. Hanson, Perception and Discovery)

PAC learning (L. Valiant, 1984)
Fix a distribution over the sample space (P(v) for each v in the sample space). A concept class C is pac-learnable (probably approximately correct learnable) iff there exists a learning algorithm A such that, for each f in C and each ε (0 < ε < 1), algorithm A halts in a polynomial number of steps and examples and outputs a concept h in C satisfying the following: with probability at least 1 - ε,
    Σ_{v : f(v) ≠ h(v)} P(v) < ε.

Simplicity means understanding
We will prove that, given a set of positive and negative data, any consistent concept of size "reasonably" shorter than the size of the data is an "approximately" correct concept with high probability. That is, if one finds a shorter representation of the data, then one learns. The shorter the conjecture, the more efficiently it explains the data, and hence the more precise the future prediction.
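The error quantity in the pac-learning definition, Σ_{v : f(v) ≠ h(v)} P(v), can be computed directly when the sample space is finite. A minimal sketch, with a made-up target, hypothesis, and distribution:

```python
def pac_error(f, h, dist):
    """PAC error of hypothesis h against target f under distribution dist:
    the total probability mass of the points where f and h disagree."""
    return sum(p for v, p in dist.items() if f(v) != h(v))

# Toy sample space {0, 1, 2, 3}: target f = "v is even", hypothesis h = "v < 2".
f = lambda v: v % 2 == 0
h = lambda v: v < 2
dist = {0: 0.4, 1: 0.1, 2: 0.3, 3: 0.2}   # P(v); must sum to 1
print(pac_error(f, h, dist))  # f and h disagree on v=1 and v=2: 0.1 + 0.3
```

Pac-learnability asks that, with probability at least 1 - ε over the drawn examples, the output hypothesis makes this quantity smaller than ε.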
Occam algorithms
Let α < 1, β ≥ 1, m be the number of examples, and s be the length (in bits) of the smallest concept in C consistent with the examples. An Occam algorithm is a polynomial-time algorithm that finds a hypothesis h in C consistent with the examples and satisfying
    K(h) ≤ s^β m^α,
where K(h) is the Kolmogorov complexity of h.

Occam's Razor Theorem
Theorem. A concept class C is polynomially pac-learnable if there is an Occam algorithm for it. That is, with probability > 1 - ε,
    Σ_{v : f(v) ≠ h(v)} P(v) < ε.

Proof. Fix an error tolerance ε (0 < ε < 1). Choose m such that
    m ≥ max{ (2 s^β / ε)^{1/(1-α)}, (2/ε) ln(1/ε) }.
This is polynomial in s and 1/ε. Let m be as above, let S be a set of r concepts, and let f be one of them.

Claim. The probability that some concept h in S satisfies P(f ≠ h) ≥ ε and is consistent with m independent examples of f is less than (1 - ε)^m r.

Proof of claim: Let E_h be the event that hypothesis h agrees with all m examples of f. If P(h ≠ f) ≥ ε, then h is a bad hypothesis: h and f disagree with probability at least ε on a random example. Denote the set of bad hypotheses by B. Since the m examples of f are independent, P(E_h) ≤ (1 - ε)^m. Since there are at most r bad hypotheses, P(U_{h in B} E_h) ≤ (1 - ε)^m r. ∎

Proof of the theorem, continued. The postulated Occam algorithm finds a hypothesis of Kolmogorov complexity at most s^β m^α. The number r of hypotheses of this complexity satisfies log r ≤ s^β m^α. By the assumption on m, r ≤ (1 - ε)^{-m/2} (use ε < -ln(1 - ε) < ε/(1 - ε) for 0 < ε < 1). By the claim, the probability of producing a hypothesis with error larger than ε is less than (1 - ε)^m r ≤ (1 - ε)^{m/2} < ε, where the last inequality follows by substituting m. ∎

Inadequate data, too many relevant attributes?
Data in biotechnology are often expensive or hard to obtain. Pac-learning theory, MDL, SVMs, and decision-tree algorithms all need sufficient data. The situation is similar for expression arrays, where too many attributes are relevant: which ones should we choose?
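The sample-size bound in the proof is easy to evaluate numerically. A minimal sketch (the function name is mine; natural log is assumed, matching the inequality ε < -ln(1 - ε) used in the proof):

```python
import math

def occam_sample_size(s, alpha, beta, eps):
    """Smallest integer m satisfying the Occam bound
    m >= max{ (2 s^beta / eps)^(1/(1-alpha)), (2/eps) ln(1/eps) }."""
    m1 = (2 * s**beta / eps) ** (1.0 / (1.0 - alpha))
    m2 = (2.0 / eps) * math.log(1.0 / eps)
    return math.ceil(max(m1, m2))

# Example: s = 10 bits, alpha = 0.5, beta = 1, error tolerance eps = 0.5.
print(occam_sample_size(s=10, alpha=0.5, beta=1, eps=0.5))  # 1600
```

Note how the first term dominates for modest ε: halving ε (or doubling s) inflates the required sample size polynomially, which is exactly why small biomedical data sets fall short of the bound.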
Epicurus: multiple explanations
The Greek philosopher of science Epicurus (342-270 BC) proposed the Principle of Multiple Explanations: if more than one theory is consistent with the observations, keep all theories. This was 1500 years before Occam's razor!
"There are also some things for which it is not enough to state a single cause, but several, of which one, however, is the case. Just as if you were to see the lifeless corpse of a man lying far away, it would be fitting to state all the causes of death in order that the single cause of this death may be stated. For you would not be able to establish conclusively that he died by the sword or of cold or of illness or perhaps by poison, but we know that there is something of this kind that happened to him." [Lucretius]

Can the two theories be integrated?
When we do not have enough data, Epicurus says we should simply keep all the alternative consistent hypotheses, not select the simplest one. But how can such a philosophical idea be converted into concrete mathematics and learning algorithms?

A theory of learning with insufficient data
Definition. With the pac-learning notation, a concept class is polynomially Epicurus-learnable iff the learning algorithm always halts within time and number of examples p(|f|, 1/ε), for some polynomial p, with a list of hypotheses of which one is probably approximately correct.

Definition. Let α < 1 and β ≥ 1 be constants, m be the number of examples, and s be the length (in bits) of the smallest concept in C consistent with the examples. An Epicurus algorithm is a polynomial-time algorithm that finds a collection of hypotheses h1, ..., hk in C such that:
- they are all consistent with the examples;
- K(hi) ≤ s^β m^α for i = 1, ..., k, where K(x) is the Kolmogorov complexity of x;
- they are mutually error-independent with respect to the true hypothesis h; that is, h1 Δ h, ..., hk Δ h are mutually independent, where hi Δ h is the symmetric difference of the two concepts.
Theorem. A concept class C is polynomially Epicurus-learnable if there is an Epicurus algorithm for it that outputs k hypotheses and uses
    m ≥ (1/k) max{ (2 s^β / ε)^{1/(1-α)}, (2/ε) ln(1/ε) }
examples, where 0 < ε < 1 is the error tolerance.
This theorem gives a sample-size vs. learnability tradeoff. When k = 1, it becomes the old Occam's Razor theorem. Admittedly, the error-independence requirement is too strong to be practical.

Proof. Let m be as in the theorem, let C contain r concepts with f one of them, and let h1, ..., hk be the k error-independent hypotheses produced by the Epicurus algorithm.

Claim. The probability that h1, ..., hk in C all satisfy P(f ≠ hi) ≥ ε and are consistent with m independent examples of f is less than (1 - ε)^{km} C(r, k), where C(r, k) is the binomial coefficient.

Proof of claim: Let E(h1, ..., hk) be the event that hypotheses h1, ..., hk all agree with all m examples of f. If P(hi ≠ f) ≥ ε for i = 1, ..., k, then, since the m examples of f are independent and the hi's are mutually f-error-independent, P(E(h1, ..., hk)) ≤ (1 - ε)^{km}. Since there are at most C(r, k) sets of bad hypothesis choices, P(U E(h1, ..., hk) over bad choices) ≤ (1 - ε)^{km} C(r, k). ∎

Proof of the theorem, continued. The postulated Epicurus algorithm finds k consistent hypotheses of Kolmogorov complexity at most s^β m^α. The number r of hypotheses of this complexity satisfies log r ≤ s^β m^α. By the assumption on m, C(r, k) ≤ (1 - ε)^{-km/2}. By the claim, the probability that all k hypotheses have error larger than ε is less than (1 - ε)^{km} C(r, k) ≤ (1 - ε)^{km/2}; substituting m, the right-hand side is at most ε. ∎

When there is not enough data to ensure that Occam learning converges, do Epicurus learning and leave the final selection to the experts.
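The sample-size vs. learnability tradeoff is concrete: the bound shrinks linearly in k. A minimal sketch (function name mine; natural log assumed, as in the proof):

```python
import math

def epicurus_sample_size(s, alpha, beta, eps, k):
    """Sample size from the Epicurus-learning theorem:
    m >= (1/k) * max{ (2 s^beta / eps)^(1/(1-alpha)), (2/eps) ln(1/eps) }.
    With k = 1 this reduces to the Occam's Razor bound."""
    bound = max((2 * s**beta / eps) ** (1.0 / (1.0 - alpha)),
                (2.0 / eps) * math.log(1.0 / eps))
    return math.ceil(bound / k)

# Tradeoff for s = 10 bits, alpha = 0.5, beta = 1, eps = 0.5:
for k in (1, 2, 4):
    print(k, epicurus_sample_size(s=10, alpha=0.5, beta=1, eps=0.5, k=k))
# keeping more (error-independent) hypotheses needs proportionally fewer examples
```

The price, as the slide notes, is that one of the k hypotheses is probably approximately correct, but the theory does not say which; the final selection is left to the experts.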