IST 511 Information Management: Information and Technology
Probabilistic Reasoning
Dr. C. Lee Giles
David Reese Professor, College of Information Sciences and Technology
Professor of Computer Science and Engineering
Professor of Supply Chain and Information Systems
The Pennsylvania State University, University Park, PA, USA
giles@ist.psu.edu  http://clgiles.ist.psu.edu
Special thanks to J. Lafferty, T. Cover, R.V. Jones

Today
• What are probabilities
• What is information theory
• What is probabilistic reasoning
 – Definitions
 – Why important
 – How used: decision making
 – Decision trees
• Impact on information science

Tomorrow
Topics used in IST
• Data mining, information extraction
• Metadata; digital libraries, scientometrics
• Others?

Theories in Information Sciences
We enumerate some of these theories in this course.
Issues:
 – Unified theory?
 – Domain of applicability
 – Conflicts
Theories here are
 – Very algorithmic
 – Some quantitative
 – Some qualitative
Quality of theories
 – Occam's razor
 – Subsumption of other theories (foundational)
Theories of reasoning
 – Cognitive, algorithmic, social
Probability vs. all the others

Probability Theory
• The branch of mathematics concerned with the analysis of random phenomena.
• Randomness: a non-order or non-coherence in a sequence of symbols or steps, such that there is no intelligible pattern or combination.
• The central objects of probability theory are random variables, stochastic processes, and events:
• mathematical abstractions of non-deterministic events or measured quantities that may either be single occurrences or evolve over time in an apparently random fashion.
Uncertainty
 – A lack of knowledge about an event
 – Can be represented by a probability
  • Ex: roll a die, draw a card
 – Can be represented as an error
Statistic (a measure in statistics)
 – Can use probability in determining that measure

Founders of Probability Theory
Blaise Pascal (1623-1662, France) and Pierre Fermat (1601-1665, France)
Laid the foundations of probability theory in a correspondence on a dice game posed by a French nobleman.

Sample Spaces – measures of events
Collection (list) of all possible outcomes
 – EXPERIMENT: ROLL A DIE – e.g., all six faces of a die
 – EXPERIMENT: DRAW A CARD – e.g., all 52 cards in a deck

Types of Events
Simple event
 – Outcome from a sample space with one characteristic in simplest form
 – e.g.: King of clubs from a deck of cards
Joint event
 – Conjunction (AND); disjunction (OR)
 – Contains several simple events
 – e.g.: A red ace from a deck of cards (ace of hearts OR ace of diamonds)

Visualizing Events
Excellent ways of determining probabilities:
Contingency table (a neat way to look at events):

           Ace   Not Ace   Total
 Black      2      24        26
 Red        2      24        26
 Total      4      48        52

Tree diagram: the full deck of cards splits into Red and Black cards, and each branch splits again into Ace and Not an Ace.

Review of Probability Rules
Given two events G, H:
 1) P(G OR H) = P(G) + P(H) - P(G AND H); for mutually exclusive events, P(G AND H) = 0
 2) P(G AND H) = P(G) P(H|G), also written as P(H|G) = P(G AND H)/P(G)
 3) If G and H are independent, P(H|G) = P(H), thus P(G AND H) = P(G) P(H)
 4) P(G) ≥ P(G AND H); P(H) ≥ P(G AND H)

Odds
Another way to express probability is in terms of odds d:
 d = p/(1-p), where p is the probability of an outcome.
Example: What are the odds of getting a six on a die throw?
We know that p = 1/6, so d = (1/6)/(1 - 1/6) = (1/6)/(5/6) = 1/5.
Gamblers often turn it around and say that the odds against getting a six on a die roll are 5 to 1.
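The probability rules and the odds formula above can be checked directly by enumerating a sample space. A minimal sketch using the 52-card deck (the event names and helper `p` are ours, not part of the original notes):

```python
from fractions import Fraction

# Sample space: a standard 52-card deck, every card equally likely.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

def p(event):
    """Probability of an event = favorable outcomes / total outcomes."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

ace = lambda card: card[0] == "A"
red = lambda card: card[1] in ("hearts", "diamonds")

# Rule 1: P(G OR H) = P(G) + P(H) - P(G AND H)
p_or = p(lambda c: ace(c) or red(c))
assert p_or == p(ace) + p(red) - p(lambda c: ace(c) and red(c))

# Rule 3: rank and color are independent, so P(G AND H) = P(G) P(H)
assert p(lambda c: ace(c) and red(c)) == p(ace) * p(red)

# Odds of drawing an ace: d = p / (1 - p)
d = p(ace) / (1 - p(ace))
print(d)  # 1/12, i.e. odds of 1 to 12 in favor
```

Using exact `Fraction` arithmetic avoids floating-point noise, so the equalities in the rules hold exactly.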
Probabilistic Reasoning
• Reasoning using probabilistic methods
• Reasoning with uncertainty
• Rigorous reasoning vs. heuristics or biases

Heuristics and Biases in Reasoning
Tversky & Kahneman showed that people often do not follow the rules of probability.
Instead, decision making may be based on heuristics (heuristic decision making):
 – Lower cognitive load, but may lead to systematic errors and biases
Example heuristics
 – Representativeness
 – Availability
 – Conjunction fallacy

Gambling/Predictions
A fair coin is flipped (H = heads, T = tails).
 – Which is a more likely sequence? A) H T H T T H  B) H H H H H H
 – What result is more likely to follow A)? To follow B)?

Representativeness Heuristic
The sequence "H T H T T H" is seen as more representative of, or similar to, a prototypical coin sequence, while each sequence in fact has the same probability of occurring.
The likelihood of the next flip following both A and B is the same: 1/2 H, 1/2 T.
The T for B) is no more likely; the events are independent. Believing otherwise is the Gambler's Fallacy.
When is this not the case?

The Linda Problem
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.
Please choose the most likely alternative:
 (a) Linda is a bank teller
 (b) Linda is a bank teller and is active in the feminist movement

Conjunction Fallacy
Nearly 90% choose the second alternative (bank teller and active in the feminist movement), even though it is logically incorrect (the conjunction fallacy).
[Venn diagram: bank tellers who are not feminists, feminists who are not bank tellers, and feminist bank tellers in the overlap]
P(A) ≥ P(A,B); P(B) ≥ P(A,B)
Kahneman and Tversky (1982)

How to Avoid These Mistakes
Such mistakes can cause bad decisions and loss of
 • Profits
 • Lives
 • Health
 • Justice (prosecutor's fallacy)
 • Etc.
Instead, use probabilistic methods.

Example: Reasoning with an Uncertain Agent
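The claim that "H T H T T H" and "H H H H H H" are equally likely can be verified by enumerating every sequence of six fair flips. A small sketch (variable names are ours):

```python
from itertools import product

# All 2^6 = 64 equally likely sequences of six fair coin flips.
sequences = list(product("HT", repeat=6))
p_seq = 1 / len(sequences)  # each particular sequence has probability 1/64

a = tuple("HTHTTH")
b = tuple("HHHHHH")
assert a in sequences and b in sequences

# Both sequences occur exactly once in the sample space,
# so neither is "more likely" than the other.
print(p_seq)  # 0.015625 for BOTH sequences

# The conjunction fallacy in one line: the outcomes satisfying
# "A and B" are a subset of those satisfying A alone, so
# P(A and B) can never exceed P(A).
p_starts_h = sum(1 for s in sequences if s[0] == "H") / len(sequences)
p_starts_h_and_all_h = sum(1 for s in sequences if s == b) / len(sequences)
assert p_starts_h_and_all_h <= p_starts_h
```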
[Diagram: the agent's internal model connects sensors and actuators to the environment; the "?" marks indicate uncertainty at each interface]

An Old Problem … Getting Somewhere

Types of Uncertainty
Uncertainty in prior knowledge
 – E.g., some causes of a disease are unknown and are not represented in the background knowledge of a medical-assistant agent
Uncertainty in actions
 – E.g., actions are represented with relatively short lists of preconditions, while these lists are in fact arbitrarily long
For example, to drive your car in the morning:
 • It must not have been stolen during the night
 • It must not have flat tires
 • There must be gas in the tank
 • The battery must not be dead
 • The ignition must work
 • You must not have lost the car keys
 • No truck should obstruct the driveway
 • You must not have suddenly become blind or paralytic
 • Etc.
Not only would it not be possible to list all of them, but would trying to do so even be efficient?

Questions
 – How do we represent uncertainty in knowledge?
 – How do we reason (make inferences) with uncertain knowledge?
 – Which action do we choose under uncertainty?

Handling Uncertainty
Possible approaches:
 1. Default reasoning
 2. Worst-case reasoning
 3. Probabilistic reasoning

Default Reasoning
Creed: the world is fairly normal; abnormalities are rare.
So an agent assumes normality until there is evidence to the contrary.
E.g., if an agent sees a bird x, it assumes that x can fly, unless it has evidence that x is a penguin, an ostrich, a dead bird, a bird with broken wings, …

Worst-Case Reasoning
Creed: just the opposite! The world is ruled by Murphy's Law.
Uncertainty is defined by sets, e.g., the set of possible outcomes of an action, or the set of possible positions of a robot.
The agent assumes the worst case, and chooses the actions that maximize a utility function in this case.
Example: adversarial search.

Probabilistic Reasoning
Creed: the world is not divided between "normal" and "abnormal", nor is it adversarial.
Possible situations have various likelihoods (probabilities).
The agent has probabilistic beliefs – pieces of knowledge with associated probabilities (strengths) – and chooses its actions to maximize the expected value of some utility function.

Notion of Probability
You drive on Atherton often, and you notice that 70% of the time there is a traffic slowdown at the exit to Park. The next time you plan to drive on Atherton, you will believe that the proposition "there is a slowdown at the exit to Park" is true with probability 0.7.

Axioms of Probability
 – The probability of a proposition A is a real number P(A) between 0 and 1
 – P(True) = 1 and P(False) = 0
 – P(A v B) = P(A) + P(B) - P(A ∧ B)
From these it follows that P(A v ¬A) = 1 = P(A) + P(¬A), so P(¬A) = 1 - P(A).

Interpretations of Probability: Frequency and Subjective
Frequency Interpretation
Draw a ball from a bag containing n balls of the same size, r red and s yellow. The probability that the proposition A = "the ball is red" is true corresponds to the relative frequency with which we expect to draw a red ball: P(A) = r/n.
Subjective Interpretation
There are many situations in which there is no objective frequency interpretation:
 – On a windy day, just before paragliding from the top of El Capitan, you say "there is probability 0.05 that I am going to die"
 – You have worked hard in this class and you believe that the probability that you will get an A is 0.9

Random Variables
A proposition that takes the value True with probability p and False with probability 1-p is a random variable with distribution (p, 1-p).
If a bag contains balls having 3 possible colors – red, yellow, and blue – the color of a ball picked at random from the bag is a random variable with 3 possible values.
The (probability) distribution of a random variable X with n values x1, x2, …, xn is (p1, p2, …, pn), with P(X = xi) = pi and Σi=1..n pi = 1.

Expected Value
Random variable X with n values x1, …, xn and distribution (p1, …, pn).
E.g., X is the state reached after doing an action A under uncertainty.
Function U of X, e.g., U is the utility of a state.
The expected value of U after doing A is E[U] = Σi=1..n pi U(xi).

Toothache Example
A certain dentist is only interested in two things about any patient: whether he has a toothache and whether he has a cavity.
Over years of practice, she has constructed the following joint distribution:

            Toothache   ¬Toothache
 Cavity       0.04        0.06
 ¬Cavity      0.01        0.89

Joint Probability Distribution
Given k random variables X1, …, Xk, the joint distribution of these variables is a table in which each entry gives the probability of one combination of values of X1, …, Xk.
Example: in the table above, the left column gives P(Cavity ∧ Toothache) and P(¬Cavity ∧ Toothache).

The Joint Distribution Says It All
P(Toothache) = P((Toothache ∧ Cavity) v (Toothache ∧ ¬Cavity))
             = P(Toothache ∧ Cavity) + P(Toothache ∧ ¬Cavity)
             = 0.04 + 0.01 = 0.05
P(Toothache v Cavity) = P((Toothache ∧ Cavity) v (Toothache ∧ ¬Cavity) v (¬Toothache ∧ Cavity))
                      = 0.04 + 0.01 + 0.06 = 0.11

Conditional Probability
Definition: P(A ∧ B) = P(A|B) P(B)
Read P(A|B) as: the probability of A given that we know B.
P(A) is called the prior probability of A; P(A|B) is called the posterior or conditional probability of A given B.

Example
P(Cavity ∧ Toothache) = P(Cavity|Toothache) P(Toothache)
P(Cavity) = 0.1
P(Cavity|Toothache) = P(Cavity ∧ Toothache) / P(Toothache) = 0.04/0.05 = 0.8

Generalization
P(A ∧ B ∧ C) = P(A|B,C) P(B|C) P(C)

Conditional Independence
Propositions A and B are independent iff:
 P(A|B) = P(A), equivalently P(A ∧ B) = P(A) P(B)
A and B are independent given C iff:
 P(A|B,C) = P(A|C), equivalently P(A ∧ B|C) = P(A|C) P(B|C)

Independence and Negation
Let A and B be independent, i.e., P(A|B) = P(A) and P(A ∧ B) = P(A) P(B).
What about A and ¬B?
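The toothache computations above can be reproduced mechanically from the joint table. A minimal sketch (the dictionary representation and variable names are ours):

```python
# Joint distribution over (Cavity, Toothache), from the dentist's table.
joint = {
    (True, True): 0.04,   # cavity and toothache
    (True, False): 0.06,  # cavity, no toothache
    (False, True): 0.01,  # no cavity, toothache
    (False, False): 0.89, # neither
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # a valid distribution

# Marginal: P(Toothache) = sum over both cavity values.
p_toothache = sum(p for (cav, tooth), p in joint.items() if tooth)
print(round(p_toothache, 2))  # 0.05

# Conditional: P(Cavity | Toothache) = P(Cavity AND Toothache) / P(Toothache)
p_cav_given_tooth = joint[(True, True)] / p_toothache
print(round(p_cav_given_tooth, 2))  # 0.8

# Disjunction: P(Toothache OR Cavity) sums every cell where either holds.
p_or = sum(p for (cav, tooth), p in joint.items() if cav or tooth)
print(round(p_or, 2))  # 0.11
```

This is the sense in which "the joint distribution says it all": every marginal, conditional, and disjunction is just a sum over cells of the table.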
P(A|¬B) = P(A ∧ ¬B)/P(¬B)
A = (A ∧ B) v (A ∧ ¬B), so P(A) = P(A ∧ B) + P(A ∧ ¬B)
Thus P(A ∧ ¬B) = P(A) - P(A)P(B) = P(A) × (1 - P(B)), and P(¬B) = 1 - P(B)
Therefore P(A|¬B) = P(A): A and ¬B are independent as well.

Bayes' Rule
P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A), so
 P(B|A) = P(A|B) P(B) / P(A)

Example
Given:
 – P(Cavity) = 0.1
 – P(Toothache) = 0.05
 – P(Cavity|Toothache) = 0.8
Bayes' rule tells us:
 P(Toothache|Cavity) = (0.8 x 0.05)/0.1 = 0.4

Generalization
P(A ∧ B ∧ C) = P(A ∧ B|C) P(C) = P(A|B,C) P(B|C) P(C)
P(A ∧ B ∧ C) = P(A ∧ B|C) P(C) = P(B|A,C) P(A|C) P(C)
So P(B|A,C) = P(A|B,C) P(B|C) / P(A|C)

Web Size Estimation – Capture/Recapture Analysis
Consider the web page coverage of search engines a and b:
 – pa is the probability that engine a has indexed a given page, pb the same for engine b, and pa,b the joint probability that both have indexed it.
 – sa is the number of unique pages indexed by engine a, sb by engine b, and sa,b by both; N is the number of web pages, so pa = sa/N, pb = sb/N, and pa,b = sa,b/N.
 – If the engines index independently, pa,b = pa pb, which gives the web size estimate N = sa sb / sa,b.
 – The overlap can be estimated from queries: if nb is the number of documents returned by engine b for a query and na,b the number returned by both engines, then sa,b/sb ≈ na,b/nb, averaged over queries.
Lower bound estimate of the size of the Web: N̂ = sa° nb / na,b (averaged over queries), where sa° is the known index size of engine a.
 – Assumes random sampling
 – Extensions: Bayesian estimates, more engines (Bharat & Broder, WWW7 '98), etc.
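The capture/recapture estimate can be illustrated numerically. A sketch with made-up figures (all counts below are hypothetical, chosen only to show the arithmetic, not the Science '98 values):

```python
# Capture/recapture sketch of a Lawrence & Giles-style web size estimate.
# All numbers are hypothetical illustrations of the formula N = s_a * n_b / n_ab.

s_a = 100_000_000  # known index size of engine a (hypothetical)

# Per-query overlap counts: n_b = results returned by engine b,
# n_ab = results returned by BOTH engines, for several sample queries.
queries = [
    {"n_b": 200, "n_ab": 40},
    {"n_b": 150, "n_ab": 35},
    {"n_b": 300, "n_ab": 55},
]

# The fraction of b's results that a also indexed, pooled over queries,
# estimates P(page indexed by a | page indexed by b). Under the
# independence assumption this equals p_a = s_a / N.
overlap = sum(q["n_ab"] for q in queries) / sum(q["n_b"] for q in queries)

n_hat = s_a / overlap  # lower-bound estimate of the web size N
print(f"estimated overlap fraction: {overlap:.3f}")  # 0.200
print(f"estimated web size: {n_hat:.2e}")            # 5.00e+08
```

The estimate is a lower bound in part because real engines do not index independently: pages indexed by one engine are more likely to be indexed by the other, which inflates `overlap` and deflates `n_hat`.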
Lawrence & Giles, Science '98

What We Just Covered
 – Types of uncertainty
 – Default/worst-case/probabilistic reasoning
 – Probability
 – Random variables and expected value
 – Joint distributions
 – Conditional probability
 – Conditional independence
 – Bayes' rule

The Most Common Use of the Term "Information Theory"
• Shannon founded information theory with a landmark paper published in 1948.
• He had already founded digital circuit design theory in 1937: as a 21-year-old master's student at MIT, he wrote a thesis demonstrating that the electrical application of Boolean algebra could construct and resolve any logical, numerical relationship. It has been claimed that this was the most important master's thesis of all time.
• Shannon contributed to the basic work on code breaking.
• He coined the term "bit".

Information Theory (in the classical sense)
A model of the innate information content of something: documents, images, messages, DNA. Other models?
Information – that which reduces uncertainty.
Entropy – a measure of information content.
 – Conditional entropy: information content based on a context or other information.
Formal limitations on what can be
 – Compressed
 – Communicated
 – Represented

Claude Shannon, 1948
Shannon noted that information content depends on the probability of the events, not just on the number of outcomes.
Uncertainty is the lack of knowledge about an outcome. Entropy is a measure of that uncertainty (or randomness)
 – in information
 – in a system

Information Theory – another definition
Defines the amount of information in a message or document as the minimum number of bits needed to encode all possible meanings of that message, assuming all messages are equally likely.
What would be the minimum message to encode the days-of-the-week field in a database? Seven equally likely values need ⌈log2 7⌉ = 3 bits.
A type of compression!

Fundamental Questions Addressed by Entropy and Information Theory
 – What is the ultimate data compression for an information source?
 – How much data can be sent reliably over a noisy communications channel?
 – How accurately can we represent an object (e.g., an image) as a function of the number of bits used?
 – Good feature selection for data mining and machine learning.

Information Content I(x)
Define the amount of information gained after observing an event x with probability p(x) as I(x), where:
 I(x) = log2(1/p(x)) = -log2 p(x)
Examples
 – Flip a coin, x = heads: p(heads) = 1/2, so I(heads) = 1
 – Roll a die, x = 6: p(6) = 1/6, so I(6) = 2.58…
More information is gained from observing a die toss than a coin flip. Why? There are more possible events.

Properties of Information I(x)
 – If p(x) = 1 then I(x) = 0: if we know with certainty the outcome of an event, there is no information gained by its occurrence.
 – I(x) ≥ 0: the occurrence of an event provides some or no information, but it never results in a loss of information.
 – I(x) > I(y) for p(x) < p(y): the less probable an event is, the more information we gain from its occurrence.
 – I(x,y) = I(x) + I(y) for independent events: information is additive.

Entropy H(X)
The entropy H(X) of a random variable is the expectation (average) of the amount of information gained over all possible outcomes:
 H(X) = E[I(X)] = Σx p(x) I(x) = Σx p(x) log2(1/p(x))
Entropy is the average amount of uncertainty in an event; it is the amount of information in a message or document.
A message in which everything is known (p(x) = 1) has zero entropy.

Entropy as a Function of Probability
[Plot of H vs. p(x) for a binary variable: H is 0 at p(x) = 0 and p(x) = 1, and peaks at p(x) = 1/2.]
Maximum entropy occurs when all p(x)'s are equal!

Examples of Entropy
Average over all possible outcomes to calculate the entropy.
If all events are equally likely, there is more entropy when more events can occur.
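The definitions of I(x) and H(X) above translate directly into code. A small sketch comparing the coin and the die (function names are ours):

```python
from math import log2

def information(p):
    """I(x) = log2(1/p): information gained from an event of probability p, in bits."""
    return log2(1 / p)

def entropy(dist):
    """H(X) = sum_x p(x) log2(1/p(x)), the average information of X."""
    return sum(p * log2(1 / p) for p in dist if p > 0)

print(round(information(1 / 2), 3))  # 1.0 bit for a fair coin landing heads
print(round(information(1 / 6), 3))  # 2.585 bits for rolling a six

coin = [1 / 2, 1 / 2]
die = [1 / 6] * 6
print(round(entropy(coin), 3))  # 1.0
print(round(entropy(die), 3))   # 2.585 -- more outcomes, more entropy

# A certain outcome carries no information: entropy of a sure event is 0.
assert entropy([1.0]) == 0.0
```

For equally likely outcomes the entropy is just log2 of the number of outcomes, which is why the die (log2 6 ≈ 2.585) beats the coin (log2 2 = 1).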
 – More possibilities (events) for a die than a coin => entropy(die) > entropy(coin).

Joint Entropy
H(X,X) = H(X)

Mutual Information I(x;y)
 I(x;y) = H(y) - H(y|x)
I(x;y) is how much information x gives about y on average.
 – Entropy is a special case: H(x) = I(x;x)
 – Symmetric: I(x;y) = I(y;x)
  • The uncertainty of x after seeing y is the same as the uncertainty of y after seeing x
 – Nonnegative: I(x;y) ≥ 0

Other Methods for Making Decisions: Decision Trees
 – Powerful/popular for classification and prediction
 – Represent rules
  • Rules can be expressed in English: IF Age <= 43 & Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No
  • Rules can be expressed using SQL for queries
 – Useful for exploring data to gain insight into the relationships of a large number of candidate input variables to a target (output) variable
You use mental decision trees often! Game: "I'm thinking of…" "Is it …?"

Decision for Playing Tennis

 Outlook   Temperature  Humidity  Windy  Class
 sunny     hot          high      false  N
 sunny     hot          high      true   N
 overcast  hot          high      false  P
 rain      mild         high      false  P
 rain      cool         normal    false  P
 rain      cool         normal    true   N
 overcast  cool         normal    true   P
 sunny     mild         high      false  N
 sunny     cool         normal    false  P
 rain      mild         normal    false  P
 sunny     mild         normal    true   P
 overcast  mild         high      true   P
 overcast  hot          normal    false  P
 rain      mild         high      true   N

[Decision tree: the root splits on Outlook (sunny / overcast / rain); the sunny branch splits on Humidity (high => N, normal => P); overcast => P; the rain branch splits on Windy (true => N, false => P).]

Grade Decision Tree
 Percent >= 90%?             Yes => Grade = A
 No: 89% >= Percent >= 80%?  Yes => Grade = B
 No: 79% >= Percent >= 70%?  Yes => Grade = C
 No: etc.

Decision Trees Written as Decision Rules
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

Decision Tree Template
Drawn top-to-bottom or left-to-right:
 – Top (or left-most) node = root
 – Descendant node(s) = child node(s)
 – Bottom (or right-most) node(s) = leaf node(s)
 – A unique path from the root to each leaf = a rule

Decision Tree – What Is It?
A structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules.
A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable.

Decision Tree Types
 – Binary trees: only two choices in each split; can be non-uniform (uneven) in depth
 – N-way trees (e.g., ternary trees): three or more choices in at least one split (3-way, 4-way, etc.)
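The split criteria discussed in these notes can be computed by hand for the play-tennis table above. A sketch of the entropy-based information gain (the standard ID3-style quantity, which we assume is the intended "entropy (information gain)" measure) and the Gini purity for the Outlook split; function names are ours:

```python
from collections import Counter
from math import log2

# The 14-row play-tennis data set, reduced to (outlook, class).
data = [
    ("sunny", "N"), ("sunny", "N"), ("overcast", "P"), ("rain", "P"),
    ("rain", "P"), ("rain", "N"), ("overcast", "P"), ("sunny", "N"),
    ("sunny", "P"), ("rain", "P"), ("sunny", "P"), ("overcast", "P"),
    ("overcast", "P"), ("rain", "N"),
]

def entropy(labels):
    """H = sum_c p(c) log2(1/p(c)) over the class proportions."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())

def gini(labels):
    """Sum of squared class proportions: 1.0 means a pure node."""
    total = len(labels)
    return sum((c / total) ** 2 for c in Counter(labels).values())

classes = [c for _, c in data]
subsets = [[c for o, c in data if o == v] for v in ("sunny", "overcast", "rain")]
weight = lambda s: len(s) / len(data)

# Information gain of splitting on Outlook: H(class) - H(class | outlook).
gain = entropy(classes) - sum(weight(s) * entropy(s) for s in subsets)
print(round(gain, 3))  # 0.247

# Gini purity rises after the same split (the overcast child is pure).
g_before = gini(classes)
g_after = sum(weight(s) * gini(s) for s in subsets)
print(round(g_before, 3), "->", round(g_after, 3))  # 0.541 -> 0.657
```

Both measures agree here that splitting on Outlook helps: it removes about a quarter of a bit of class uncertainty and raises the size-weighted purity.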
Scoring
Often it is useful to show the proportion of the data in each of the desired classes.

Decision Tree Splits (Growth)
The best split at the root or a child node is defined as the one that does the best job of separating the data into groups where a single class predominates in each group.
 – Example: US population data, with categorical input variables/attributes including zip code, gender, and age
 – Split the above according to the "best split" rule

Example: Good and Poor Splits
[Figure: an example of a good split vs. a poor split]

Split Criteria
The best split is defined as one that does the best job of separating the data into groups where a single class predominates in each group.
The measure used to evaluate a potential split is a purity measure.
 – The purity measure answers the question: "Based upon a particular split, how good a job did we do of separating the two classes from each other?"
We calculate this purity measure for every possible split and choose the one that gives the highest possible value.
 – The best split is the one that increases the purity of the subsets by the greatest amount
 – A good split also creates nodes of similar size, or at least does not create very small nodes
 – There must be a stopping criterion

Methods for Choosing the Best Split
Purity (diversity) measures:
 – Gini (population diversity)
 – Entropy (information gain)
 – Information gain ratio
 – Chi-square test
 – Others

Gini (Population Diversity)
The Gini measure of a node is the sum of the squares of the proportions of the classes.
 – Root node: 0.5^2 + 0.5^2 = 0.5 (even balance)
 – Leaf node: 0.1^2 + 0.9^2 = 0.82 (close to pure)

Pruning
Decision trees can often be simplified or pruned:
 – CART
 – C5
 – Stability-based methods

Decision Tree Advantages
 1. Easy to understand
 2. Map nicely to a set of domain rules
 3. Applicable to real problems
 4. Make no prior assumptions about the data
 5. Able to process both numerical and categorical data

Decision Tree Disadvantages
 1. Sensitive to initial conditions
 2. The output attribute must be categorical
 3.
Limited to a small number of output attributes
 4. Decision tree algorithms can be unstable
 5. Trees created from numeric datasets can be complex (scaling)

What We Covered
 • Probabilistic reasoning
 • Flaws in human decision making
 • Decision trees
 • Information theory

Propositions
 • Decision making is not easy
 • Humans often make mistakes
 • In some cases animals are smarter (empirical learning)
 • Probabilistic methods help
  – Data sensitive
  – Bayes methods
 • Information theory measures the amount of information in a message or document(s)
  – Uses: filtering, data mining
 • Decision trees are useful for learning rules

Questions
 • The role of reasoning in information science
 • The impact of probabilistic reasoning on information science
 • The role of decision making in information science
 • What next?
