					IST 511 Information Management: Information
              and Technology
            Probabilistic reasoning

                              Dr. C. Lee Giles
       David Reese Professor, College of Information Sciences
                         and Technology
          Professor of Computer Science and Engineering
        Professor of Supply Chain and Information Systems
       The Pennsylvania State University, University Park, PA
            Special thanks to J. Lafferty, T. Cover, R.V. Jones
What are probabilities
What is information theory
What is probabilistic reasoning
   –   Definitions
   –   Why important
   –   How used – decision making
   –   Decision trees
Impact on information science
Topics used in IST
• Data mining, information extraction
• Metadata; digital libraries, scientometrics
• Others?
 Theories in Information Sciences
Enumerate some of these theories in this course.
   – Unified theory?
   – Domain of applicability
   – Conflicts
Theories here are
   – Very algorithmic
   – Some quantitative
   – Some qualitative
Quality of theories
   – Occam’s razor
   – Subsumption of other theories (foundational)
Theories of reasoning
   – Cognitive, algorithmic, social
             Probability vs all the others
Probability theory
• the branch of mathematics concerned with analysis of
  random phenomena.
    • Randomness: a non-order or non-coherence in a sequence of
      symbols or steps, such that there is no intelligible pattern or
      combination
• The central objects of probability theory are random
  variables, stochastic processes, and events
• mathematical abstractions of non-deterministic events or
  measured quantities that may either be single occurrences
  or evolve over time in an apparently random fashion.
    – A lack of knowledge about an event
    – Can be represented by a probability
        • Ex: roll a die, draw a card
    – Can be represented as an error
Statistic (a measure in statistics)
    – Can use probability in determining that measure
     Founders of Probability Theory

       Blaise Pascal                    Pierre Fermat
    (1623-1662, France)              (1601-1665, France)

Laid the foundations of probability theory in a
correspondence on a dice game posed by a French nobleman.
   Sample Spaces – measures of events
Collection (list) of all possible outcomes
   – e.g.: All six faces of a die:

   e.g.: All 52 cards in a deck:
              Types of Events

Simple event
  – Outcome from a sample space with one
    characteristic in simplest form
  – e.g.: King of clubs from a deck of cards
Joint event
  – Conjunction (AND); disjunction (OR)
  – Contains several simple events
  – e.g.: A red ace from a deck of cards
    (ace of hearts OR ace of diamonds)
                      Visualizing Events

Excellent ways of determining probabilities:
Contingency tables (a neat way to look at events):

                                       Ace      Not Ace     Total
                             Black          2   24          26
                             Red            2   24          26
                             Total          4   48          52
Tree diagrams:

   Full Deck of Cards → Red Cards   → Ace | Not an Ace
                      → Black Cards → Ace | Not an Ace
         Review of Probability Rules
Given 2 events: G, H

1)   P(G OR H) = P(G) + P(H) − P(G AND H); for mutually
     exclusive events, P(G AND H) = 0
2)   P(G AND H) = P(G)P(H|G), also written as
     P(H|G) = P(G AND H)/P(G)
3)   If G and H are independent, P(H|G) = P(H), thus
     P(G AND H) = P(G)P(H)
4)   P(G) ≥ P(G)P(H); P(H) ≥ P(G)P(H)
Another way to express probability is in terms of odds, d

d = p/(1-p)
p = probability of an outcome

Example: What are the odds of getting a six on a dice throw?
We know that p=1/6, so
d = 1/6/(1-1/6) = (1/6)/(5/6) = 1/5.

Gamblers often turn it around and say that the odds against
  getting a six on a dice roll are 5 to 1.
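The odds conversion above can be sketched in Python (the helper name `odds` is my own):

```python
def odds(p):
    """Convert a probability p into odds d = p / (1 - p)."""
    return p / (1 - p)

# Odds of rolling a six: p = 1/6 -> d = (1/6)/(5/6) = 1/5
d = odds(1/6)
print(d)   # ≈ 0.2, i.e. 1 to 5 (or 5 to 1 against)
```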
          Probabilistic Reasoning
• Reasoning using probabilistic methods

• Reasoning with uncertainty

• Rigorous reasoning vs heuristics or biases
       Heuristics and Biases in Reasoning
Tversky & Kahneman showed that people often do not
  follow rules of probability

Instead, decision making may be based on heuristics
(heuristic decision making)

Lower cognitive load but may lead to systematic errors and biases

Example heuristics
   – Representativeness
   – Availability
   – Conjunctive fallacy
A fair coin is flipped. H = heads, T = tails
- Which is the more likely sequence?

A) H T H T T H

B) H H H H H H

- Which result is more likely to
   -   follow A)?
   -   follow B)?
           Representativeness Heuristic

The sequence “H T H T T H” is seen as more representative
  of, or similar to, a prototypical coin sequence,
even though each sequence has the same probability of occurring.

The likelihood of a flip following A) or B) is the same:
½ H; ½ T

The T after B) is no more likely; the flips are independent events
Gambler’s Fallacy

When is this not the case?
Linda is 31 years old, single, outspoken, and very bright. She
majored in philosophy. As a student, she was deeply concerned
with issues of discrimination and social justice, and also
participated in anti-nuclear demonstrations.

Please choose the most likely alternative:
   (a) Linda is a bank teller
   (b) Linda is a bank teller and is active in the feminist movement
               Conjunction Fallacy

Nearly 90% choose the second alternative (bank teller and
  active in the feminist movement), even though it is
  logically incorrect (conjunction fallacy)

(Venn diagram: bank tellers who are not feminists |
 feminist bank tellers | feminists who are not bank tellers)

        P(A) ≥ P(A,B); P(B) ≥ P(A,B)            Kahneman and Tversky (1982)
   How to avoid these mistakes

Such mistakes can cause bad decisions and loss of
• Profits
• Lives
• Health
• Justice (prosecutor’s fallacy)
• Etc.

Instead, use probabilistic methods
Reasoning with an Uncertain Agent


          Types of Uncertainty

Uncertainty in prior knowledge
   E.g., some causes of a disease are unknown and are not
   represented in the background knowledge of a medical-
   assistant agent
Uncertainty in actions
   E.g., actions are represented with relatively short lists of
   preconditions, while these lists are in fact arbitrarily long

For example, to drive your car in the morning:
• It must not have been stolen during the night
• It must not have flat tires
• There must be gas in the tank
• The battery must not be dead
• The ignition must work
• You must not have lost the car keys
• No truck should obstruct the driveway
• You must not have suddenly become blind or paralytic

Not only would it not be possible to list all of them, but
would trying to do so be efficient?

How to represent uncertainty in knowledge?

How to reason (make inferences) with uncertain knowledge?

Which action to choose under uncertainty?
       Handling Uncertainty

Possible Approaches:
1. Default reasoning
2. Worst-case reasoning
3. Probabilistic reasoning
             Default Reasoning

Creed: The world is fairly normal. Abnormalities are rare.
So, an agent assumes normality until there is evidence to the contrary.
E.g., if an agent sees a bird x, it assumes that x can fly, unless it
   has evidence that x is a penguin, an ostrich, a dead bird, a
   bird with broken wings, …
       Worst-Case Reasoning

Creed: Just the opposite! The world is ruled by
Murphy’s Law
Uncertainty is defined by sets, e.g., the set of possible
outcomes of an action, or the set of possible positions
of a robot
The agent assumes the worst case, and chooses the
action that maximizes a utility function in this case
Example: Adversarial search
        Probabilistic Reasoning

Creed: The world is not divided between “normal” and
  “abnormal”, nor is it adversarial. Possible situations have
  various likelihoods (probabilities)
The agent has probabilistic beliefs – pieces of knowledge with
  associated probabilities (strengths) – and chooses its actions
  to maximize the expected value of some utility function
            Notion of Probability

You drive on Atherton often, and you notice that 70%
of the time there is a traffic slowdown at the exit to Park. The
next time you plan to drive on Atherton, you will believe that
the proposition “there is a slowdown at the exit to Park” is True
with probability 0.7

P(A ∨ ¬A) = P(A) + P(¬A) = 1
P(¬A) = 1 − P(A)

The probability of a proposition A is a real number P(A)
  between 0 and 1
P(True) = 1 and P(False) = 0
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Axioms of probability
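The inclusion-exclusion axiom can be checked by brute-force enumeration; a small sketch over a deck of cards (the set names `A` and `B` are my own):

```python
from fractions import Fraction

# Model a 52-card deck as (rank, suit); A = "ace", B = "red card"
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in range(1, 14) for suit in suits]

A = {card for card in deck if card[0] == 1}                       # 4 aces
B = {card for card in deck if card[1] in ("hearts", "diamonds")}  # 26 red cards

def P(event):
    return Fraction(len(event), len(deck))

# Inclusion-exclusion: P(A v B) = P(A) + P(B) - P(A ^ B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))   # 28/52 = 7/13
```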
    Interpretations of Probability

      Frequency Interpretation

Draw a ball from a bag containing n balls of the same size, r
  red and s yellow.
The probability that the proposition A = “the ball is red” is true
  corresponds to the relative frequency with which we expect
  to draw a red ball  P(A) = r/n
        Subjective Interpretation

There are many situations in which there is no objective
 frequency interpretation:
  – On a windy day, just before paragliding from the top of El Capitan,
    you say “there is probability 0.05 that I am going to die”
  – You have worked hard in this class and you believe that the
    probability that you will get an A is 0.9
           Random Variables

A proposition that takes the value True with
   probability p and False with probability 1-p is a
   random variable with distribution (p,1-p)
If a bag contains balls having 3 possible colors – red,
   yellow, and blue – the color of a ball picked at
   random from the bag is a random variable with 3
   possible values
The (probability) distribution of a random variable X
   with n values x1, x2, …, xn is:
              (p1, p2, …, pn)
   with P(X=xi) = pi and Σi=1,…,n pi = 1
                Expected Value

Random variable X with n values x1,…,xn and distribution (p1,…,pn)
  E.g., X is the state reached after doing an action A under uncertainty
Function U of X
  E.g., U is the utility of a state
The expected value of U after doing A is
           E[U] = Σi=1,…,n pi U(xi)
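A minimal sketch of the expected-value formula, with made-up outcome probabilities and utilities:

```python
# Expected value E[U] = sum_i p_i * U(x_i), with an action leading to
# one of three states. The states, probabilities, and utilities below
# are hypothetical, for illustration only.
states = ["success", "partial", "failure"]
p = {"success": 0.5, "partial": 0.3, "failure": 0.2}   # distribution
U = {"success": 100, "partial": 40, "failure": -10}    # utilities

expected_utility = sum(p[x] * U[x] for x in states)
print(expected_utility)   # 0.5*100 + 0.3*40 + 0.2*(-10) ≈ 60.0
```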
         Toothache Example

A certain dentist is only interested in two things
  about any patient, whether he has a toothache and
  whether he has a cavity
Over years of practice, she has constructed the
  following joint distribution:

                   Toothache    ¬Toothache

      Cavity        0.04          0.06
      ¬Cavity       0.01          0.89
  Joint Probability Distribution

k random variables X1, …, Xk
The joint distribution of these variables is a table in
which each entry gives the probability of one
combination of values of X1, …, Xk
                  Toothache   ¬Toothache

         Cavity     0.04        0.06
         ¬Cavity    0.01        0.89

 Each entry is a joint probability, e.g.
 P(Cavity ∧ Toothache) = 0.04 and P(¬Cavity ∧ ¬Toothache) = 0.89
       Joint Distribution Says It All

                        Toothache   ¬Toothache

               Cavity     0.04        0.06
               ¬Cavity    0.01        0.89

P(Toothache) = P((Toothache ∧ Cavity) ∨ (Toothache ∧ ¬Cavity))
             = P(Toothache ∧ Cavity) + P(Toothache ∧ ¬Cavity)
             = 0.04 + 0.01 = 0.05
P(Toothache ∨ Cavity)
= P((Toothache ∧ Cavity) ∨ (Toothache ∧ ¬Cavity)
                        ∨ (¬Toothache ∧ Cavity))
= 0.04 + 0.01 + 0.06 = 0.11
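These marginalizations can be reproduced by summing table entries; a small sketch (the dictionary layout is my own):

```python
# The dentist's joint distribution from the table above.
# Keys are (cavity, toothache) truth-value pairs.
joint = {
    (True,  True):  0.04,
    (True,  False): 0.06,
    (False, True):  0.01,
    (False, False): 0.89,
}

# Marginalize: P(Toothache) = sum over Cavity values
p_toothache = sum(p for (cav, tooth), p in joint.items() if tooth)
# P(Toothache v Cavity): sum over entries where either holds
p_either = sum(p for (cav, tooth), p in joint.items() if cav or tooth)

print(p_toothache)  # 0.04 + 0.01 ≈ 0.05
print(p_either)     # 0.04 + 0.01 + 0.06 ≈ 0.11
```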
              Conditional Probability

   – P(AB) = P(A|B) P(B)
Read P(A|B): Probability of A given that
we know B
P(A) is called the prior probability of A
P(A|B) is called the posterior or conditional probability of A given

                            Toothache   ¬Toothache

                  Cavity      0.04        0.06
                  ¬Cavity     0.01        0.89

P(Cavity ∧ Toothache) = P(Cavity|Toothache) P(Toothache)
P(Cavity) = 0.1
P(Cavity|Toothache) = P(Cavity ∧ Toothache) / P(Toothache)
                    = 0.04/0.05 = 0.8

P(A  B  C) = P(A|B,C) P(B|C) P(C)
       Conditional Independence

Propositions A and B are independent iff:
             P(A|B) = P(A)
     ⇔ P(A ∧ B) = P(A) P(B)
A and B are independent given C iff:
           P(A|B,C) = P(A|C)
     ⇔ P(A ∧ B|C) = P(A|C) P(B|C)
     Conditional Independence

Let A and B be independent, i.e.:
             P(A|B) = P(A)
           P(A ∧ B) = P(A) P(B)
What about A and ¬B?

         P(A|¬B) = P(A ∧ ¬B)/P(¬B)
         A = (A ∧ B) ∨ (A ∧ ¬B)
         P(A) = P(A ∧ B) + P(A ∧ ¬B)
         P(A ∧ ¬B) = P(A) − P(A)P(B) = P(A) × (1 − P(B))
         P(¬B) = 1 − P(B)
         P(A|¬B) = P(A), so A and ¬B are independent too
                Bayes’ Rule

P(A  B) = P(A|B) P(B)
       = P(B|A) P(A)

                         P(A|B) P(B)
            P(B|A) =

   – P(Cavity) = 0.1
   – P(Toothache) = 0.05
   – P(Cavity|Toothache) = 0.8
Bayes’ rule tells:
   – P(Toothache|Cavity) = (0.8 x 0.05)/0.1
                      = 0.4
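The same numbers, plugged into Bayes' rule:

```python
# Bayes' rule on the toothache numbers above:
# P(Toothache|Cavity) = P(Cavity|Toothache) P(Toothache) / P(Cavity)
p_cavity = 0.1
p_toothache = 0.05
p_cavity_given_toothache = 0.8

p_toothache_given_cavity = p_cavity_given_toothache * p_toothache / p_cavity
print(p_toothache_given_cavity)   # (0.8 * 0.05) / 0.1 ≈ 0.4
```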

P(ABC) = P(AB|C) P(C)
        = P(A|B,C) P(B|C) P(C)
P(ABC) = P(AB|C) P(C)
        = P(B|A,C) P(A|C) P(C)

                   P(A|B,C) P(B|C)
   P(B|A,C) =
        Web Size Estimation - Capture/Recapture Analysis

Consider the web page coverage of search engines a and b
    – pa: probability that engine a has indexed a given page; pb likewise for b; pa,b their joint probability
          pa,b = pa|b pb ≈ pa pb   (independence assumption)
    – sa: number of unique pages indexed by engine a; N: number of web pages
          pa = sa/N;  pa,b = sa,b/N ≈ (sa/N)(sb/N)  ⇒  N ≈ sa sb / sa,b   (web size)
    – nb: number of documents returned by b for a query; na,b: number of documents returned
      by both engines a & b for a query
          sb / sa,b ≈ ⟨nb / na,b⟩ over queries

Lower bound estimate of size of the Web:
          N̂ ≈ sa ⟨nb / na,b⟩ over queries;  sa known
    – random sampling assumption
    – extensions: Bayesian estimate, more engines (Bharat, Broder, WWW7 ‘98), etc.

                                                                       Lawrence, Giles, Science’98
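A rough sketch of the capture/recapture estimate under the random-sampling assumption; the index size and per-query overlap counts below are made up for illustration:

```python
# Capture/recapture estimate: N ≈ s_a * s_b / s_{a,b}, where the ratio
# s_b / s_{a,b} is estimated by averaging per-query overlaps n_b / n_{a,b}.
s_a = 100_000                               # known index size of engine a (assumed)
overlaps = [(50, 10), (80, 20), (60, 15)]   # hypothetical (n_b, n_{a,b}) per query

# Average n_b / n_{a,b} over queries, then scale by s_a
ratio = sum(nb / nab for nb, nab in overlaps) / len(overlaps)
N_hat = s_a * ratio
print(N_hat)   # lower-bound estimate of the web's size
```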
          What we just covered

Types of uncertainty
Default/worst-case/probabilistic reasoning
Random variable/expected value
Joint distribution
Conditional probability
Conditional independence
Bayes rule
  The most common use of the term “information theory”

•Shannon founded information theory with a
landmark paper published in 1948.
•He founded digital computer and digital circuit
design theory in 1937:
   •As a 21-year-old master's student at MIT, he wrote a thesis
   demonstrating that electrical applications of Boolean
   algebra could construct and resolve any logical,
   numerical relationship. It has been claimed that this was
   the most important master's thesis of all time.
•Shannon contributed to the basic work on code breaking
•Coined the term “bit”
Information Theory (in classical sense)

A model of innate information content of something
   Documents, images, messages, DNA
    Other models?
    – That which reduces uncertainty
    – A measure of information content
    – Conditional Entropy
       • Information content based on a context or other information
Formal limitations on what can be
    – Compressed
    – Communicated
    – Represented
             Claude Shannon 1948

Shannon noted that the information content depends on the
  probability of the events, not just on the number of possible events.
Uncertainty is the lack of knowledge about an outcome.
Entropy is a measure of that uncertainty (or randomness)
   – in information
   – in a system
       Information Theory – another view

Defines the amount of information in a message or document
   as the minimum number of bits needed to encode all
   possible meanings of that message, assuming all messages
   are equally likely
What would be the minimum message to encode the days of
   the week field in a database?
A type of compression!
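For the days-of-the-week example, a fixed-length binary code needs ⌈log2 7⌉ bits; a quick check:

```python
import math

# Minimum bits for a fixed-length code over 7 equally likely days:
# ceil(log2 7) = 3 bits
days = 7
bits = math.ceil(math.log2(days))
print(bits)             # 3

# The information-theoretic lower bound per symbol is log2(7) ≈ 2.81 bits
print(math.log2(days))
```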
   Fundamental Questions Addressed by
     Entropy and Information Theory

What is the ultimate data compression for an information source?
How much data can be sent reliably over a noisy
  communications channel?
How accurately can we represent an object (e.g. image, etc.)
  as a function of the number of bits used?
Good feature selection for data mining and machine learning
           Information Content I(x)

Define the amount of information gained after observing an
  event x with probability p(x) is I(x) where:
   –     I(x) = log2(1/p(x)) = - log2 p(x)

   – Flip a coin, x = heads
       • p(heads) = 1/2; I(heads) = 1
   – Roll a die, x = 6
       • p(6) = 1/6; I(6) = 2.58..

More information is gained from observing a die toss than a
 coin flip. Why? There are more possible events.
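The definition of I(x) in code (the function name `information` is my own):

```python
import math

def information(p):
    """Information gained from observing an event of probability p, in bits."""
    return -math.log2(p)

print(information(1/2))   # coin lands heads: 1 bit
print(information(1/6))   # die shows a six: ≈ 2.585 bits
```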
      Properties of Information I(x)

p(x) = 1 ⇒ I(x) = 0
    – If we know with certainty the outcome of an event,
      there is no information gained by its occurrence
I(x) ≥ 0
    – The occurrence of an event provides some or no
      information, but it never results in a loss of information
I(x) > I(y) for p(x) < p(y)
    – The less probable an event is, the more
      information we gain from its occurrence
I(x ∧ y) = I(x) + I(y) for independent events: additive
                      Entropy H(x)

Entropy H(x) of a random variable is the expectation (average) of the
  amount of information gained from that variable over all
  possible outcomes:

H(x) = E[I(x)] = Σx p(x) I(x) = Σx p(x) log2(1/p(x))

Entropy is the average amount of uncertainty in an event.
Entropy is the amount of information in a message or document.
A message in which everything is known, p(x) = 1, has zero entropy.
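The entropy formula in code (the helper name `entropy` is my own):

```python
import math

def entropy(probs):
    """H = sum of p * log2(1/p) over outcomes with nonzero probability, in bits."""
    return sum(p * math.log2(1/p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1 bit
print(entropy([1/6] * 6))    # fair die: ≈ 2.585 bits
print(entropy([1.0]))        # certain outcome: 0 bits
```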
     Entropy as a function of probability



    Max entropy occurs when all p(x)’s are equal!
                Examples of Entropy

Average over all possible outcomes to calculate the entropy.

If all events are equally likely, there is more entropy when more events can occur.
     More possibilities (events) for a die than a coin =>
         entropy die > entropy coin
            Joint Entropy

H(x,x) = H(x)
         Mutual Information I(x;y)

I(x;y) = H(y) – H(y|x)
I is how much information x gives about y on the average
      Mutual Information I(x;y)

– Entropy is a special case: H(x) = I(x;x)
– Symmetric: I(x;y) = I(y;x)
   • The uncertainty of x after seeing y is the same as the uncertainty of
     y after seeing x
– Nonnegative: I(x;y) ≥ 0
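A toy sketch of I(x;y) = H(y) − H(y|x) on a made-up joint distribution where x fully determines y (all names here are my own):

```python
import math

def H(probs):
    """Entropy in bits of a distribution given as probabilities."""
    return sum(p * math.log2(1/p) for p in probs if p > 0)

# Made-up joint distribution P(x, y): x fully determines y.
joint = {("a", 0): 0.5, ("b", 1): 0.5}

# Marginals p(x) and p(y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

# H(y|x) = sum over x of p(x) * H(y | X = x)
H_y_given_x = sum(
    px * H([joint[(x, y)] / px for y in p_y if (x, y) in joint])
    for x, px in p_x.items()
)
I_xy = H(p_y.values()) - H_y_given_x
print(I_xy)   # x determines y here, so I(x;y) = H(y) = 1 bit
```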
 Other methods for making decisions

Decision Trees
   – Powerful/popular for classification & prediction
   – Represent rules
      • Rules can be expressed in English
          – IF Age <=43 & Sex = Male
            & Credit Card Insurance = No
            THEN Life Insurance Promotion = No
      • Rules can be expressed using SQL for query
   – Useful to explore data to gain insight into relationships
     of a large number of candidate input variables to a
     target (output) variable
You use mental decision trees often!
   – Game: “I’m thinking of…” “Is it …?”
                         Decision for playing tennis

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

Resulting tree:
   Outlook?
   – sunny    → Humidity? (high → N; normal → P)
   – overcast → P
   – rain     → Windy? (true → N; false → P)
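The "best split" intuition behind this tree can be quantified as information gain; a sketch using the class counts from the table above (the helper names are my own):

```python
import math

# Information gain of splitting the play-tennis data on Outlook,
# computed from the 14 rows above.
def entropy(counts):
    """Entropy in bits of a node with the given class counts."""
    n = sum(counts)
    return sum(-c/n * math.log2(c/n) for c in counts if c)

# Class counts (P, N): overall and within each Outlook value
total = (9, 5)                                             # 9 Play, 5 Don't
by_outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

n = sum(total)
remainder = sum(sum(c)/n * entropy(c) for c in by_outlook.values())
gain = entropy(total) - remainder
print(round(gain, 3))   # ≈ 0.247 bits: Outlook is a good first split
```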
                  Grade decision tree

Percent >= 90%?
   – Yes → Grade = A
   – No → 89% >= Percent >= 80%?
        – Yes → Grade = B
        – No → 79% >= Percent >= 70%?
             – Yes → Grade = C
             – No → Etc...
Decision tree
Written decision rules

 If tear production rate = reduced then recommendation = none.
 If age = young and astigmatic = no and tear production rate = normal
    then recommendation = soft
 If age = pre-presbyopic and astigmatic = no and tear production
    rate = normal then recommendation = soft
 If age = presbyopic and spectacle prescription = myope and
    astigmatic = no then recommendation = none
 If spectacle prescription = hypermetrope and astigmatic = no and
    tear production rate = normal then recommendation = soft
 If spectacle prescription = myope and astigmatic = yes and
    tear production rate = normal then recommendation = hard
 If age = young and astigmatic = yes and tear production rate = normal
    then recommendation = hard
 If age = pre-presbyopic and spectacle prescription = hypermetrope
    and astigmatic = yes then recommendation = none
 If age = presbyopic and spectacle prescription = hypermetrope
    and astigmatic = yes then recommendation = none
             Decision Tree Template

Drawn top-to-bottom or left-to-right
Top (or left-most) node = Root node
Descendent node(s) = Child node(s)
Bottom (or right-most) node(s) = Leaf node(s)
Unique path from root to each leaf = Rule
        Decision Tree – What is it?

A structure that can be used to divide up a large
  collection of records into successively smaller sets
  of records by applying a sequence of simple
  decision rules
A decision tree model consists of a set of rules for
  dividing a large heterogeneous population into
  smaller, more homogeneous groups with respect
  to a particular target variable
              Decision Tree Types

Binary trees – only two choices in each split. Can be non-
  uniform (uneven) in depth
N-way trees or ternary trees – three or more choices in at least
  one of its splits (3-way, 4-way, etc.)
Often it is useful to show the proportion of the data in each of
  the desired classes
        Decision Tree Splits (Growth)

The best split at root or child nodes is defined as one
  that does the best job of separating the data into
  groups where a single class predominates in each
   – Example: US Population data input categorical
     variables/attributes include:
      • Zip code
      • Gender
      • Age
   – Split the above according to the above “best split” rule
Example: Good & Poor Splits

                     Split Criteria

The best split is defined as one that does the best job of
  separating the data into groups where a single class
  predominates in each group
Measure used to evaluate a potential split is a purity measure
   – The purity measure answers the question, "Based upon a particular
     split, how good of a job did we do of separating the two classes away
     from each other?" We calculate this purity measure for every possible
     split and choose the one that gives the highest possible value.
   – The best split is one that increases purity of the sub-sets by the
     greatest amount
   – A good split also creates nodes of similar size or at least does not
     create very small nodes
   – Must have a stopping criterion
 Methods for Choosing Best Split

Purity (Diversity) Measures:
   – Gini (population diversity)

   – Entropy (information gain)

   – Information Gain Ratio

   – Chi-square Test

   – others
        Gini (Population Diversity)

The Gini measure of a node is the sum of the squares of the
  proportions of the classes.

                    Root Node: 0.5^2 + 0.5^2 = 0.5 (even balance)

                 Leaf Nodes: 0.1^2 + 0.9^2 = 0.82 (close to pure)
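The Gini measure as defined above, in code (the function name `gini` is my own):

```python
# Gini measure of a node: sum of squared class proportions
# (1.0 = pure node, 0.5 = even two-class balance).
def gini(proportions):
    return sum(p * p for p in proportions)

print(gini([0.5, 0.5]))   # root node: 0.5 (even balance)
print(gini([0.1, 0.9]))   # leaf node: ≈ 0.82 (close to pure)
```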

Decision Trees can often
be simplified or pruned:
   – CART
   – C5
   – Stability-based
          Decision Tree Advantages

1.   Easy to understand
2.   Map nicely to a set of domain rules
3.   Applied to real problems
4.   Make no prior assumptions about the data
5.   Able to process both numerical and categorical data
        Decision Tree Disadvantages

1.   Sensitive to initial conditions

2.   Output attribute must be categorical

3.   Small number of output attributes

4.   Decision tree algorithms can be unstable

5.   Trees created from numeric datasets can be complex (scaling)
              What we covered

•   Probabilistic reasoning
•   Flaws in human decision making
•   Decision trees
•   Information theory
• Decision making is not easy
   • Humans often make mistakes
       • In some cases animals are smarter (empirical learning)
• Probabilistic methods help
   • Data sensitive
   • Bayes methods
• Information theory
   • measures the amount of information in a message or document(s)
   • Uses
       • Filtering
       • Data mining
• Decision trees are useful for learning rules

• Role of reasoning in information science
• Impact of probabilistic reasoning on
  information science
• Role of decision making in information science
• What next?
