INFORMATION THEORY

 Communication theory deals with systems for transmitting information
  from one point to another.




 Information theory was born with the discovery of the fundamental
  laws of data compression and transmission.



 Introduction

    Information theory answers two fundamental questions:

•   What is the ultimate data compression?
    Answer: The Entropy H.
•   What is the ultimate transmission rate?
    Answer: Channel Capacity C.

    But its reach extends beyond communication theory. In the early days it
    was thought that increasing the transmission rate over a channel
    increases the error rate. Shannon showed that this is not true as long
    as the rate is below the channel capacity.

    Shannon further showed that random processes have an irreducible
    complexity below which they cannot be compressed.



    Information Theory (IT) relates to other fields:

•   Computer Science: shortest binary program for computing a string.
•   Probability Theory: fundamental quantities of IT are used to estimate
    probabilities.
•   Inference: an approach to predicting the digits of pi; inferring the
    behavior of the stock market.
•   Computation vs. communication: computation is communication limited
    and vice versa.


    Information theory had its beginnings at the start of the 20th century,
    but it really took off after WW II.

•   Wiener: extracting signals of a known ensemble from noise of a
    predictable nature.

•   Shannon: encoding messages chosen from a known ensemble so that
    they can be transmitted accurately and rapidly even in the presence of
    noise.

    IT: The study of efficient encoding and its consequences in the form of
    speed of transmission and probability of error.




 Historical Perspective

•   Follows S. Verdu, “Fifty years of Shannon theory,” IEEE Trans. Information
    Theory, vol. 44, pp. 2057-2078, Oct. 1998.

•   Shannon published “A mathematical theory of communication” in
    1948. It lays down the fundamental laws of data compression and
    transmission.

•   Nyquist (1924): transmission rate is proportional to the log of the
    number of levels in a unit duration.
    - Can the transmission rate be improved by replacing Morse code by an
    'optimum' code?

•   Whittaker (1929): lossless interpolation of bandlimited functions.

•   Gabor (1946): time-frequency uncertainty principle.


•   Hartley (1928): muses on the physical possibilities of transmission
    rates.

    - Introduces a quantitative measure for the amount of information
    associated with n selections of states:


                                      H = n log s

    where s = number of symbols available in each selection
          n = number of selections.

    - Information = outcome of a selection among a finite number of
    possibilities.
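
    A quick numerical sketch of Hartley's measure; the alphabet size and the
    number of selections below are assumed values chosen for illustration:

```python
from math import log2

# Hartley's measure H = n * log(s): s symbols per selection, n selections.
# The numbers below are assumptions chosen purely for illustration.
s = 26   # symbols available in each selection (e.g., letters of the alphabet)
n = 5    # number of selections (e.g., a 5-letter word)
H = n * log2(s)
print(f"H = {n} * log2({s}) = {H:.1f} bits")   # about 23.5 bits
```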




 Data Compression

• Shannon uses the definition of entropy

                  H = - ∑ p_i log p_i      (sum over i = 1, ..., n)

  as a measure of information.
  Rationale: (1) continuous in the probabilities.
             (2) increasing with n for equiprobable r.v.
             (3) additive – the entropy of a pair of independent r.v. is equal
  to the sum of the entropies of the individual r.v.
• For memoryless sources, entropy satisfies
  Shannon's Theorem 3: Given any δ > 0 and ε > 0, we can find an N_0 such
  that sequences of any length N ≥ N_0 fall into two classes:
  (1) A set whose total probability is less than ε.
  (2) The remainder set, all of whose members have probabilities p satisfying

                  | (log 1/p) / N - H | < δ
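
  The theorem can be illustrated numerically. The sketch below assumes a
  memoryless binary source with p(1) = 0.2 and a tolerance δ = 0.2 (both are
  assumptions, not values from the slides); it adds up the probability of the
  "remainder" class for growing block lengths N:

```python
from math import lgamma, log, log2

# Assumed memoryless binary source with p(1) = 0.2.
p1 = 0.2
H = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)        # source entropy, ~0.722 bits

def log2_comb(n, k):
    # log2 of the binomial coefficient C(n, k), computed via lgamma
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

def prob_remainder_set(N, delta):
    # Total probability of length-N sequences whose probability p satisfies
    # | (log2 1/p) / N - H | < delta  (class (2) of Theorem 3).
    total = 0.0
    for k in range(N + 1):                           # k = number of ones
        lp = k * log2(p1) + (N - k) * log2(1 - p1)   # log2 prob. of one such sequence
        if abs(-lp / N - H) < delta:
            total += 2.0 ** (log2_comb(N, k) + lp)   # there are C(N,k) such sequences
    return total

for N in (10, 100, 1000):
    print(f"N = {N:4d}: P(remainder set) = {prob_remainder_set(N, 0.2):.4f}")
# The excluded class has probability 1 minus the printed value; it can be
# made smaller than any epsilon by taking N large enough, as the theorem asserts.
```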

 Reliable Communication

•   Shannon: …..redundancy must be introduced to combat the particular
    noise structure involved … a delay is generally required to approach
    the ideal encoding.
•   Defines channel capacity
                        C = max ( H(X) - H(X|Y) )
• It is possible to send information at the rate C through the channel
    with as small a frequency of errors or equivocation as desired by
    proper encoding. This statement is not true for any rate greater than C.
•   Defines differential entropy of a continuous random variable as a
    formal analog to the entropy of a discrete random variable.
•   Shannon obtains the formula for the capacity of:
    - power-constrained
    - white Gaussian channel
    - flat transfer function
                                        PN 
                             C  W log      
                                        N 
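
    A small numerical sketch of the capacity formula, with assumed example
    values (a 3 kHz channel at 30 dB SNR); the last line previews the
    wideband limit mentioned in the next bullet:

```python
from math import log, log2, log10

# Capacity of an ideal band-limited white Gaussian channel, C = W log2((P+N)/N).
# Example numbers are assumptions: a 3 kHz channel at 30 dB SNR.
W = 3000.0        # bandwidth, Hz
snr = 1000.0      # P/N
C = W * log2(1.0 + snr)
print(f"C = {C:.0f} bits/s")                          # about 29,900 bits/s

# Letting W grow with fixed P/N0 gives the wideband limit Eb/N0 >= ln 2,
# i.e. the -1.6 dB figure quoted in the next bullet.
print(f"Eb/N0 limit = {10 * log10(log(2)):.2f} dB")   # about -1.59 dB
```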

•   The minimum energy necessary to transmit one bit of information is
    1.6 dB below the noise psd.

•   Some interesting points about the capacity relation:
    - Since any strictly bandlimited signal has infinite duration, the rate of
    information of any finite codebook of bandlimited waveforms is equal
    to zero.
    - Transmitted signals must approximate statistical properties of white
    noise.

•   Generalization to dispersive/nonwhite Gaussian channels by Shannon's
    “water-filling” formula.

•   Constraints other than power constraints are of interest:
    - Amplitude constraints
    - Quantized constraints
    - Specific modulations.


 Zero-Error Channel Capacity

•   Example of typing a text: with a non-zero probability of mistyping each
    letter, the probability of an error in the text tends to 1 as the length
    increases.

•   By designing a code that takes into account the statistics of the typist's
    mistakes, the prob. of error can be made → 0.

•   Example: consider mistakes made by mistyping neighboring letters.
    The alphabet {b, i, t, s} has no neighboring letters, hence it will have
    zero probability of error.

•   Zero-error capacity: the rate at which information can be encoded with
    zero prob. of error.




 Error Exponent

•   Rather than focus on the channel capacity, study the error probability
    (EP) as a function of block length.

•   EP decreases exponentially as a function of blocklength in Gaussian and
    discrete memoryless channels.

•   The exponent of the minimum achievable EP, as a function of the rate, is
    referred to as the reliability function.

•   An important rate that serves as a lower bound to the reliability function
    is the cutoff rate.

•   Was long thought to be the “practical” limit to transmission rate.

•   Turbo codes refuted that notion.


 ERROR CONTROL MECHANISMS

 Error Control Strategies

•   The goal of 'error control' is to reduce the effect of noise in order to
    reduce or eliminate transmission errors.
•   'Error-control coding' refers to adding redundancy to the data. The
    redundant symbols are subsequently used to detect or correct
    erroneous data.




•   The error control strategy depends on the channel and on the specific
    application.

    - Error control for one-way channels is referred to as forward error
    control (FEC). It can be accomplished by:
        * Error detection and correction – hard detection.
        * Reducing the probability of an error – soft detection.
    - For two-way channels: error detection is a simpler task than error
    correction. Retransmit the data only when an error is detected:
    automatic repeat request (ARQ).

•   In this course, we focus on wireless data communications; hence we
    will not delve into error concealment techniques such as interpolation,
    used in audio and video recording.
•   Error schemes may be priority based, i.e., providing more protection to
    certain types of data than others. For example, in wireless cellular
    standards, the transmitted bits are divided into three classes: bits that
    get double code protection, bits that get single code protection, and
    bits that are not protected.
 Block and Convolutional Codes

•   Error control codes can be divided into two large classes: block codes
    and convolutional codes.
•   Information bits are encoded with an alphabet Q of q distinct symbols.




•   Designers of early digital communications systems tried to improve
    reliability by increasing power or bandwidth.


•   Shannon taught us how to buy performance with a less expensive
    resource: complexity.

•   Formal definition of a code C: a set of 2^k n-tuples.

•   Encoder: the set of 2^k pairs (m, c), where m is the data word and c
    is the code word.

•   Linear code: the set of codewords is closed under modulo-2 addition.

•   Error detection and correction correspond to terms in the Fano
    inequality:

                   H(X|Y) ≤ H(e) + P(e) log(2^k - 1)

    - Error detection reduces H(e).
    - Error correction reduces P(e) log(2^k - 1).


 BASIC DEFINITIONS

    Define Entropy, Relative Entropy, Mutual Information

 Entropy, Mutual Information
  A measure of uncertainty of a random variable.
  Let X be a discrete random variable (r.v.) with alphabet A (x ∈ A) and
  probability mass function p(x) = Pr{X = x}.

•   (D1) The entropy H(X) of a discrete r.v. X is defined as

               H(X) = - ∑_{x∈A} p(x) log p(x)   bits

    where log is to the base 2.
•   Comments: (1) Simplest example: the entropy of a fair coin is 1 bit.
                (2) Adding terms of zero probability does not change the
    entropy (0 log 0 = 0).
                (3) Entropy depends only on the probabilities of X, not on
    the actual values.
                (4) Entropy is H(X) = E[ log 1/p(X) ].
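
    A minimal sketch of (D1) and the comments above as a small Python helper;
    the probability vectors are arbitrary examples:

```python
from math import log2

def entropy(probs):
    # H(X) = -sum p(x) log2 p(x); terms with p(x) = 0 contribute 0 (comment 2).
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # fair coin: 1.0 bit (comment 1)
print(entropy([0.5, 0.5, 0.0]))   # adding a zero-probability outcome: still 1.0
print(entropy([0.9, 0.1]))        # depends only on the probabilities: ~0.469 bits
```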

 Properties of Entropy

•   (P1)    H(X) ≥ 0
            0 ≤ p(x) ≤ 1, hence log [ 1/p(x) ] ≥ 0.
•   [E]     X = 1 with probability p, X = 0 with probability 1 - p:
            H(X) = - p log p - (1 - p) log(1 - p)
                 = H(p)




•   [E]      X = a with prob. 1/2
                 b with prob. 1/4
                 c with prob. 1/8
                 d with prob. 1/8

          H(X) = - 1/2 log 1/2 - 1/4 log 1/4 - 1/8 log 1/8 - 1/8 log 1/8 = 1.75 bits

 Another interpretation of entropy

    Use the minimum number of questions to determine the value of X:
          Is X = a?
              no
          Is X = b?
               no
          Is X = c?
    It turns out that the expected number of binary questions is 1.75.
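
    A short sketch checking both calculations: the entropy and the expected
    number of questions under the questioning scheme above (the question
    counts are simply read off that scheme):

```python
from math import log2

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
H = -sum(q * log2(q) for q in p.values())
print(f"H(X) = {H} bits")                       # 1.75

# Number of yes/no questions used by the scheme above: "Is X=a?" (1),
# then "Is X=b?" (2), then "Is X=c?" (3); "no" three times identifies d.
questions = {"a": 1, "b": 2, "c": 3, "d": 3}
E_q = sum(p[x] * questions[x] for x in p)
print(f"Expected number of questions = {E_q}")  # 1.75
```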


•   (D2) The joint entropy H(X,Y) is defined as

          H(X,Y) = - ∑_{x∈A} ∑_{y∈B} p(x,y) log p(x,y)

    or
          H(X,Y) = - E[ log p(X,Y) ]

    where (X,Y) ~ p(x,y).

•   (D3) Conditional entropy H(Y|X):

          H(Y|X) = ∑_{x∈A} p(x) H(Y | X = x)

                 = - ∑_{x∈A} p(x) ∑_{y∈B} p(y|x) log p(y|x)

                 = - ∑_{x∈A} ∑_{y∈B} p(x,y) log p(y|x)

                 = - E[ log p(Y|X) ]
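
    A small sketch of (D2) and (D3) on an assumed joint distribution; it also
    verifies numerically the chain rule (P2) stated next:

```python
from math import log2

# A small assumed joint distribution p(x, y), used only for illustration.
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

def marginal_x(pxy):
    px = {}
    for (x, _), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
    return px

def H_joint(pxy):
    # (D2): H(X,Y) = -sum p(x,y) log2 p(x,y)
    return -sum(p * log2(p) for p in pxy.values() if p > 0)

def H_Y_given_X(pxy):
    # (D3): H(Y|X) = -sum p(x,y) log2 p(y|x)
    px = marginal_x(pxy)
    return -sum(p * log2(p / px[x]) for (x, _), p in pxy.items() if p > 0)

def H_X(pxy):
    return -sum(p * log2(p) for p in marginal_x(pxy).values() if p > 0)

print(f"H(X,Y)        = {H_joint(p_xy):.3f} bits")                   # 1.750
print(f"H(X) + H(Y|X) = {H_X(p_xy) + H_Y_given_X(p_xy):.3f} bits")   # 1.750 (chain rule, P2)
```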


•   (P2)  (Chain rule)

          H(X,Y) = H(X) + H(Y|X)

    Proof:
          H(X,Y) = - ∑_x ∑_y p(x,y) log p(x,y)

                 = - ∑_x ∑_y p(x,y) log [ p(x) p(y|x) ]

                 = - ∑_x ∑_y p(x,y) log p(x) - ∑_x ∑_y p(x,y) log p(y|x)

                 = - ∑_x p(x) log p(x) - ∑_x ∑_y p(x,y) log p(y|x)

          H(X,Y) = H(X) + H(Y|X)
    Entropy: A measure of uncertainty of a r.v.
              The amount of information required on the average to
              describe the r.v.
    Relative entropy: A measure of the distance between two
    distributions.

•   (D4) The relative entropy or Kullback-Leibler distance between two
    probability mass functions p(x) and q(x) is defined as

          D(p || q) = ∑_x p(x) log [ p(x) / q(x) ]

    Relative entropy is 0 iff p = q.
    Mutual information: A measure of the amount of information one r.v.
    contains about another r.v.

•   (D5) Given two r.v. X, Y ~ p(x,y) with marginal distributions p(x) and
    p(y), the mutual information is the relative entropy between the joint
    distribution p(x,y) and the product distribution p(x)p(y):

       I(X;Y) = ∑_{x,y} p(x,y) log [ p(x,y) / ( p(x) p(y) ) ]

              = D( p(x,y) || p(x) p(y) )



•    (E)  A = {0, 1}
          p(0) = 1 - r,  p(1) = r,    q(0) = 1 - s,  q(1) = s

          D(p || q) = (1 - r) log [ (1 - r) / (1 - s) ] + r log ( r / s )

          For r = 1/2, s = 1/4:
          D(p || q) = 0.2075 bits
          D(q || p) = 0.1887 bits
     In general D(p || q) ≠ D(q || p).
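
     A quick check of these numbers (r = 1/2 and s = 1/4 as above):

```python
from math import log2

def D(p, q):
    # Relative entropy D(p||q) = sum p(x) log2( p(x)/q(x) )
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

r, s = 0.5, 0.25
p = [1 - r, r]    # p(0), p(1)
q = [1 - s, s]    # q(0), q(1)
print(f"D(p||q) = {D(p, q):.4f} bits")   # 0.2075
print(f"D(q||p) = {D(q, p):.4f} bits")   # 0.1887  -- not symmetric
```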
 Properties of MI:

    I(X;Y) = ∑_{x,y} p(x,y) log [ p(x,y) / ( p(x) p(y) ) ]

           = ∑_{x,y} p(x,y) log [ p(x|y) / p(x) ]

           = - ∑_{x,y} p(x,y) log p(x) + ∑_{x,y} p(x,y) log p(x|y)

           = - ∑_x p(x) log p(x) - [ - ∑_{x,y} p(x,y) log p(x|y) ]

           = H(X) - H(X|Y)
•   (P1) I(X;Y) = H(X) - H(X|Y)

    Interpretation: Mutual Information (MI) is the reduction in the
    uncertainty of X due to the knowledge of Y.
    X says about Y as much as Y says about X:

•   (P2) I(X;Y) = H(Y) - H(Y|X) = I(Y;X)

•   (P3) I(X;X) = H(X)    (the entropy is the self-information of X)

         Since H(X,Y) = H(X) + H(Y|X) (chain rule), it follows that
         H(Y|X) = H(X,Y) - H(X), hence

•   (P4) I(X;Y) = H(X) + H(Y) - H(X,Y)
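
    A numerical sanity check of the definition (D5) and of (P4), using an
    assumed 2x2 joint distribution:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# A small assumed joint distribution, used only to check (P1)-(P4) numerically.
p_xy = {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.375, (1, 1): 0.125}

px = [sum(p for (x, _), p in p_xy.items() if x == i) for i in (0, 1)]
py = [sum(p for (_, y), p in p_xy.items() if y == j) for j in (0, 1)]

# I(X;Y) from the definition (D5)
I = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in p_xy.items() if p > 0)

print(f"I(X;Y)               = {I:.4f} bits")
print(f"H(X) + H(Y) - H(X,Y) = {H(px) + H(py) - H(list(p_xy.values())):.4f} bits")  # (P4), same value
```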




 Multiple Variables – Chain Rules

    In this section, some of the results of the previous section are
    extended to multiple variables.
•   (T1) Chain Rule for Entropy:
         Let X_1, X_2, ..., X_n ~ p(x_1, x_2, ..., x_n). Then

         H(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} H(X_i | X_{i-1}, ..., X_1)

    For example,
                H(X_1, X_2) = H(X_1) + H(X_2 | X_1)
           H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 | X_1)
                            = H(X_1) + H(X_2 | X_1) + H(X_3 | X_2, X_1)

•   (D6) The conditional mutual information of random variables X
    and Y given Z is defined by
         I(X;Y|Z) = H(X|Z) - H(X|Y,Z)



•   (T2) Chain rule for mutual information:

       I(X_1, ..., X_n ; Y) = ∑_{i=1}^{n} I(X_i ; Y | X_{i-1}, ..., X_1)

    Proof (for n = 2):

       I(X_1, X_2 ; Y) = ∑_{x1,x2,y} p(x_1,x_2,y) log [ p(x_1,x_2,y) / ( p(x_1,x_2) p(y) ) ]

    Using p(x_1,x_2,y) = p(x_1,x_2 | y) p(y),

       I(X_1, X_2 ; Y) = ∑_{x1,x2,y} p(x_1,x_2,y) log [ p(x_1,x_2 | y) / p(x_1,x_2) ]

                       = - ∑_{x1,x2} p(x_1,x_2) log p(x_1,x_2) + ∑_{x1,x2,y} p(x_1,x_2,y) log p(x_1,x_2 | y)

                       = H(X_1, X_2) - H(X_1, X_2 | Y)

                       = H(X_1) + H(X_2 | X_1) - [ H(X_1 | Y) + H(X_2 | X_1, Y) ]

       I(X_1, X_2 ; Y) = I(X_1 ; Y) + I(X_2 ; Y | X_1)

    This can be generalized to arbitrary n.


•   (D7) The conditional relative entropy D( p(y|x) || q(y|x) ) is the relative
    entropy between the corresponding conditional distributions averaged
    over x:

      D( p(y|x) || q(y|x) ) = ∑_x p(x) ∑_y p(y|x) log [ p(y|x) / q(y|x) ]

•   (T3) Chain rule for relative entropy

      D( p(x,y) || q(x,y) ) = D( p(x) || q(x) ) + D( p(y|x) || q(y|x) )

    Proof:
      D( p(x,y) || q(x,y) ) = ∑_{x,y} p(x,y) log [ p(x,y) / q(x,y) ]

                            = ∑_{x,y} p(x,y) log [ p(x) / q(x) ] + ∑_{x,y} p(x,y) log [ p(y|x) / q(y|x) ]

                            = D( p(x) || q(x) ) + D( p(y|x) || q(y|x) )
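
    A numerical check of (T3) on assumed 2x2 joint distributions p(x,y) and
    q(x,y), chosen only for illustration:

```python
from math import log2

# Assumed joint distributions p(x,y) and q(x,y) on {0,1} x {0,1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
q_xy = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.3}

def marg_x(pxy):
    return {x: sum(p for (xx, _), p in pxy.items() if xx == x) for x in (0, 1)}

px, qx = marg_x(p_xy), marg_x(q_xy)

D_joint = sum(p * log2(p / q_xy[k]) for k, p in p_xy.items() if p > 0)
D_marg = sum(px[x] * log2(px[x] / qx[x]) for x in (0, 1) if px[x] > 0)
D_cond = sum(p * log2((p / px[x]) / (q_xy[(x, y)] / qx[x]))
             for (x, y), p in p_xy.items() if p > 0)

print(f"D(p(x,y)||q(x,y))                 = {D_joint:.4f}")
print(f"D(p(x)||q(x)) + D(p(y|x)||q(y|x)) = {D_marg + D_cond:.4f}")   # same value
```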



 Jensen’s Inequality


•    (D8) f(x) is convex over an interval (a,b) if for every
     x_1, x_2 ∈ (a,b) and 0 ≤ λ ≤ 1,

          f( λx_1 + (1-λ)x_2 ) ≤ λ f(x_1) + (1-λ) f(x_2)

     f is strictly convex if the strict inequality holds.

•    (D9) f(x) is concave if -f is convex.

     A convex function always lies below any chord (straight line connecting
     two points on the curve). Convex functions are very important in IT.




   Simple results for convex functions:
• (T4) If f''(x) ≥ 0, the function is convex.
   Proof: Taylor expansion around x_0:

          f(x) = f(x_0) + f'(x_0)(x - x_0) + (1/2) f''(x*)(x - x_0)^2

   where x* lies between x_0 and x.
   Let x_0 = λx_1 + (1-λ)x_2. Since the last term in the Taylor expansion is
   non-negative,

          x = x_1:  f(x_1) ≥ f(x_0) + f'(x_0)(x_1 - x_0)
                          = f(x_0) + f'(x_0)[ (1-λ)(x_1 - x_2) ]
          x = x_2:  f(x_2) ≥ f(x_0) + f'(x_0)(x_2 - x_0)
                          = f(x_0) + f'(x_0)[ λ(x_2 - x_1) ]

   Multiplying the first inequality by λ, the second by (1-λ), and adding:

          λ f(x_1) + (1-λ) f(x_2) ≥ f(x_0) = f( λx_1 + (1-λ)x_2 )

   The relation meets the definition of a convex function.



•   (T5) (Jensen's Inequality)
         (1) If f is convex and X is a r.v.,
                            E[f(X)] ≥ f(E[X])
         (2) If f is strictly convex and E[f(X)] = f(E[X]),
                   then X = E[X], i.e., X is a constant.
    Proof: Let the number of discrete points be 2: x_1, x_2.
           From the definition of convex functions:
          E[f(X)] = p_1 f(x_1) + p_2 f(x_2) ≥ f( p_1 x_1 + p_2 x_2 ) = f(E[X])
    Induction: suppose the theorem is true for k-1 points.
    Let p_i' = p_i / (1 - p_k), i = 1, ..., k-1; this makes {p_i'} a set of
    probabilities. Then

          E[f(X)] = ∑_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1 - p_k) ∑_{i=1}^{k-1} p_i' f(x_i)

                  ≥ p_k f(x_k) + (1 - p_k) f( ∑_{i=1}^{k-1} p_i' x_i )

                  ≥ f( p_k x_k + (1 - p_k) ∑_{i=1}^{k-1} p_i' x_i )

                  = f( ∑_{i=1}^{k} p_i x_i ) = f(E[X])

    From Jensen's inequality follow a number of fundamental IT theorems.
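
    A minimal numerical check of Jensen's inequality for the convex function
    t log t, with an assumed four-point distribution:

```python
from math import log2

# Check E[f(X)] >= f(E[X]) for the convex function f(t) = t log2 t.
f = lambda t: t * log2(t)
xs = [1.0, 2.0, 4.0, 8.0]   # assumed support
ps = [0.4, 0.3, 0.2, 0.1]   # assumed probabilities

Ef = sum(p * f(x) for p, x in zip(ps, xs))
fE = f(sum(p * x for p, x in zip(ps, xs)))
print(f"E[f(X)] = {Ef:.3f}  >=  f(E[X]) = {fE:.3f}")   # 4.600 >= 3.584
```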
•   (T6) (Information inequality) For probability mass functions p(x), q(x):

                         D(p || q) ≥ 0

    with equality iff p(x) = q(x) for all x.
    Proof:
              - D(p || q) = - ∑_{x∈A} p(x) log [ p(x) / q(x) ]

                          = ∑_{x∈A} p(x) log [ q(x) / p(x) ]

                          ≤ log ∑_{x∈A} p(x) [ q(x) / p(x) ]     (Jensen; log is concave)

                          = log ∑_{x∈A} q(x) = log 1 = 0

    If p(x) = q(x), equality is clearly obtained.
    If equality holds, it means that q(x) = p(x) (since the log is strictly
    concave).



•   (T7) (Non-negativity of MI)

         For any r.v. X, Y:   I(X;Y) ≥ 0
         I(X;Y) = 0 iff X, Y are independent.

    Proof: Follows from the relation I(X;Y) = D( p(x,y) || p(x)p(y) ) ≥ 0.
    From the information inequality, the equality holds iff
    p(x,y) = p(x)p(y), i.e., X, Y are independent.

          Let |A| be the number of elements in the set A.
•   (T8) H(X) ≤ log |A|, with H(X) = log |A| iff
          X has a uniform distribution over A.

     Proof: Let u(x) = 1 / |A| be the uniform distribution.

           D(p || u) = ∑ p(x) log [ p(x) / u(x) ] = ∑ p(x) log( |A| p(x) )
                     = log |A| - H(X) ≥ 0

     Interpretation: uniform distribution achieves maximum entropy.
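
     A quick numerical illustration of (T8), comparing an assumed non-uniform
     distribution on a 4-symbol alphabet with the uniform one:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Assumed distribution on an alphabet of size |A| = 4.
p = [0.7, 0.2, 0.05, 0.05]
u = [0.25, 0.25, 0.25, 0.25]          # uniform distribution on A
print(f"H(p) = {H(p):.3f} bits <= log2|A| = {log2(4):.3f} bits = H(u) = {H(u):.3f} bits")
```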

•   (T9) (Conditioning reduces entropy)

                                H(X|Y) ≤ H(X)

                      H(X|Y) = H(X) iff X and Y are independent.
    Proof: It follows from 0 ≤ I(X;Y) = H(X) - H(X|Y).
    Interpretation: on the average, knowing Y can only reduce the
    uncertainty about X.

    Example (for a particular joint distribution p(x,y), x ∈ {1,2}, y ∈ {1,2}):

        p(x=1) = ∑_y p(1,y) = 1/8,     p(x=2) = ∑_y p(2,y) = 7/8

        H(X) = H(1/8, 7/8) = 0.544 bits

        H(X|Y=1) = - ∑_x p(x|1) log p(x|1) = 0.3113 bits
        H(X|Y=2) = - ∑_x p(x|2) log p(x|2) = 0.75 bits

        H(X|Y) = (3/4) H(X|Y=1) + (1/4) H(X|Y=2) = 0.4210 bits

    The uncertainty of X is decreased if Y=1 is observed, it is increased if
    Y=2 is observed, and it is decreased on the average.
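
    The sketch below reproduces this qualitative behaviour with an assumed
    2x2 joint distribution (chosen for illustration; it is not necessarily the
    distribution used in the slide's example):

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# An assumed 2x2 joint distribution p(x,y) showing the same qualitative behaviour.
p_xy = {(1, 1): 0.0, (2, 1): 0.75, (1, 2): 0.125, (2, 2): 0.125}

py = {1: p_xy[(1, 1)] + p_xy[(2, 1)], 2: p_xy[(1, 2)] + p_xy[(2, 2)]}
px = [p_xy[(1, 1)] + p_xy[(1, 2)], p_xy[(2, 1)] + p_xy[(2, 2)]]

HX = H(px)
HX_y1 = H([p_xy[(x, 1)] / py[1] for x in (1, 2)])
HX_y2 = H([p_xy[(x, 2)] / py[2] for x in (1, 2)])
HX_Y = py[1] * HX_y1 + py[2] * HX_y2

print(f"H(X)     = {HX:.3f} bits")     # 0.544
print(f"H(X|Y=1) = {HX_y1:.3f} bits")  # 0.000 -- decreased
print(f"H(X|Y=2) = {HX_y2:.3f} bits")  # 1.000 -- increased
print(f"H(X|Y)   = {HX_Y:.3f} bits")   # 0.250 -- decreased on the average
```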

•   (T10) (Independence bound for entropy)

         Let X_1, X_2, ..., X_n ~ p(x_1, x_2, ..., x_n). Then

         H(X_1, X_2, ..., X_n) ≤ ∑_{i=1}^{n} H(X_i)

    with equality iff the X_i are independent.
    Proof (two-variable case):
             H(X_1, X_2) = H(X_1) + H(X_2 | X_1)      (chain rule)
                         ≤ H(X_1) + H(X_2)            (T9)


•   (T11) (Log Sum Inequality)
     For non-negative numbers a_1, ..., a_n and b_1, ..., b_n,

        ∑_{i=1}^{n} a_i log( a_i / b_i ) ≥ ( ∑_{i=1}^{n} a_i ) log [ ( ∑_{i=1}^{n} a_i ) / ( ∑_{i=1}^{n} b_i ) ]

     with equality iff a_i / b_i = const.

     Proof:
     Let α_i ≥ 0 with ∑ α_i = 1.
     The function f(t) = t log t is strictly convex since its second
     derivative > 0, hence by Jensen's inequality

           ∑ α_i f(t_i) ≥ f( ∑ α_i t_i )

     Set α_i = b_i / ∑_j b_j and t_i = a_i / b_i. Then

           ∑_i ( b_i / ∑_j b_j ) ( a_i / b_i ) log( a_i / b_i ) ≥ [ ∑_i ( a_i / ∑_j b_j ) ] log [ ∑_i ( a_i / ∑_j b_j ) ]

     i.e.

           ∑_i ( a_i / ∑_j b_j ) log( a_i / b_i ) ≥ ( ∑_i a_i / ∑_j b_j ) log ( ∑_i a_i / ∑_j b_j )

     Multiplying both sides by ∑_j b_j gives the log sum inequality.
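
     A quick numerical check of the log sum inequality for assumed positive
     numbers a_i, b_i:

```python
from math import log2

# Assumed positive numbers (a_i / b_i is not constant, so the inequality is strict).
a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]

lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * log2(sum(a) / sum(b))
print(f"{lhs:.4f} >= {rhs:.4f}")   # equality would require a_i/b_i constant
```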




•   (T12) Convexity of Relative Entropy: D(p || q) is convex in the pair (p,q):

    D( λp_1 + (1-λ)p_2 || λq_1 + (1-λ)q_2 ) ≤ λ D(p_1 || q_1) + (1-λ) D(p_2 || q_2)

    Proof: apply the log sum inequality to each term of the left-hand side:

    [ λp_1 + (1-λ)p_2 ] log [ ( λp_1 + (1-λ)p_2 ) / ( λq_1 + (1-λ)q_2 ) ]
        ≤ λp_1 log ( λp_1 / λq_1 ) + (1-λ)p_2 log [ (1-λ)p_2 / ( (1-λ)q_2 ) ]
        = λp_1 log ( p_1 / q_1 ) + (1-λ)p_2 log ( p_2 / q_2 )

    Summing over x gives λ D(p_1 || q_1) + (1-λ) D(p_2 || q_2).

•   (T13) Concavity of Entropy: H(p) is a concave function of p.

    Proof: H(p) = log |A| - D(p || u), where u is the uniform distribution on A.
           Since D is convex, H is concave.

•   (D10) The r.v. X, Y, Z form a Markov chain X → Y → Z if

                     p(x, y, z) = p(x) p(y|x) p(z|y)

                 (Z is conditionally independent of X given Y.)




•   (T14) Data Processing Inequality:
     If X → Y → Z, then
                            I(X;Y) ≥ I(X;Z)
     Proof: By the chain rule for information (T2),
            I(X; Y,Z) = I(X;Z) + I(X;Y|Z)
     also   I(X; Y,Z) = I(X;Y) + I(X;Z|Y)
             Since X and Z are independent given Y, I(X;Z|Y) = 0.
             It follows that
                     I(X;Y) = I(X; Y,Z) = I(X;Z) + I(X;Y|Z) ≥ I(X;Z)
    Equality holds iff I(X;Y|Z) = 0, i.e., X → Z → Y also forms a Markov chain.

     In particular, if Z = g(Y) we have
                            I(X;Y) ≥ I(X; g(Y))
     A function of the data Y cannot increase the information about X.
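
     A small sketch of the last statement, using an assumed binary symmetric
     channel followed by a deliberately information-destroying function g:

```python
from math import log2

def mutual_info(p_xy):
    # I(X;Y) computed from a joint distribution given as {(x, y): prob}.
    px, py = {}, {}
    for (x, y), p in p_xy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in p_xy.items() if p > 0)

# Assumed example: X is a fair bit, Y is X through a binary symmetric channel
# with crossover probability 0.1, and Z = g(Y) with g a constant map
# (a processing step that throws the observation away).
p_xy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
g = lambda y: 0
p_xz = {}
for (x, y), p in p_xy.items():
    p_xz[(x, g(y))] = p_xz.get((x, g(y)), 0.0) + p

print(f"I(X;Y)    = {mutual_info(p_xy):.4f} bits")   # ~0.531
print(f"I(X;g(Y)) = {mutual_info(p_xz):.4f} bits")   # 0.0000 <= I(X;Y)
```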



 Application – Sufficient Statistic

    Use the data processing inequality to clarify the idea of a sufficient
    statistic.

•   (D11) A function T(X) is a sufficient statistic relative to the family
           { f_θ(x) } if
     X is independent of θ given T(X), i.e., θ → T(X) → X.
     T(X) provides all the information about θ.
     In general, we have θ → X → T(X):
     { f_θ(x) } is a family of distributions, X is a sample from a
     distribution in the family, and T(X) is a function of the sample.
     Hence, by the data processing inequality,

       I(θ; T(X)) ≤ I(θ; X)

      For a sufficient statistic, I(θ; X) = I(θ; T(X)), which means that MI is
      preserved.


•   Example
     (1) Let X_1, ..., X_n, X_i ∈ {0,1}, be an i.i.d. sequence; the
     distribution parameter is θ = Pr(X_i = 1).
          Define T(X_1, ..., X_n) = ∑_{i=1}^{n} X_i.

    How to show independence of X and θ? Show that given T, all
    sequences with k ones are equally likely, independent of θ:

        Pr{ (X_1, ..., X_n) = (x_1, ..., x_n) | ∑_{i=1}^{n} X_i = k }

            = 1 / C(n,k)    if ∑ x_i = k    (k ones out of n)
            = 0             otherwise

     where C(n,k) is the binomial coefficient. Thus

              θ → ∑ X_i → (X_1, ..., X_n)

     forms a Markov chain and T is a sufficient statistic.
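
     A short numerical check that the conditional probability of a sequence
     given T does not depend on θ (the sample sequence and the θ values below
     are assumptions):

```python
from math import comb

def p_seq_given_T(seq, theta):
    # Pr{ (X_1,...,X_n) = seq | sum X_i = k } for i.i.d. Bernoulli(theta) bits.
    n, k = len(seq), sum(seq)
    p_seq = theta ** k * (1 - theta) ** (n - k)              # prob. of this sequence
    p_T = comb(n, k) * theta ** k * (1 - theta) ** (n - k)   # prob. that T = k
    return p_seq / p_T                                       # = 1 / C(n, k)

seq = (1, 0, 1, 1, 0)                  # an assumed sample with n = 5, k = 3
for theta in (0.2, 0.5, 0.9):
    print(theta, p_seq_given_T(seq, theta))   # always 0.1 = 1/C(5,3), independent of theta
```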


 Fano’s Inequality

  Suppose we know r.v. Y and wish to guess the value of a correlated r.v.
  X. Intuition says that if H(X|Y) = H(X), knowing Y will not help.
  Conversely, if H(X|Y) = 0, then X can be estimated with no error. We
  now consider all the cases in between.

  Let X ~ p(x). Observe Y ~ p(y|x). From Y calculate the estimate

        X̂ = g(Y)

  X → Y → X̂ form a Markov chain (X̂ is conditionally independent of X
  given Y). The probability of error is defined as

        P_e = Pr{ X̂ ≠ X }




•   (T15) Fano's Inequality

      H(P_e) + P_e log( |A| - 1 ) ≥ H(X|Y)

    A weaker inequality:

      1 + P_e log |A| ≥ H(X|Y)

    or
      P_e ≥ ( H(X|Y) - 1 ) / log |A|

    where |A| is the alphabet size.

    Proof:    Define the error event E = 1 if X̂ ≠ X
                                      E = 0 if X̂ = X
     Then
      H(E, X | Y) = H(X|Y) + H(E | X, Y)               (*)
     by the chain rule, and H(E | X, Y) = 0: given X and Y (and hence
     X̂ = g(Y)), there is no uncertainty about whether an error occurred.



 Alternative Expansion

               H(E, X | Y) = H(E|Y) + H(X | E, Y)

   Since conditioning reduces entropy,

       H(E|Y) ≤ H(E) = - P_e log P_e - (1 - P_e) log(1 - P_e) = H(P_e)

   and

       H(X | E, Y) = Pr(E=0) H(X | Y, E=0) + Pr(E=1) H(X | Y, E=1)
                   ≤ (1 - P_e) · 0 + P_e log( |A| - 1 )              (**)

   Given E = 1, X̂ ≠ X, so H(X | Y, E=1) is bounded by the log of the number
   of remaining outcomes, log( |A| - 1 )   (T8).
   From (*) and (**) we get Fano's inequality:

               H(X|Y) ≤ H(P_e) + P_e log( |A| - 1 )
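
   A numerical check of Fano's inequality on an assumed joint distribution
   with |A| = 3, using the MAP estimate X̂ = g(y) = argmax_x p(x|y):

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Assumed joint distribution p(x, y), x in {0,1,2}, y in {0,1}.
p_xy = {(0, 0): 0.30, (1, 0): 0.05, (2, 0): 0.05,
        (0, 1): 0.05, (1, 1): 0.30, (2, 1): 0.25}

py = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}
HX_given_Y = sum(py[y] * H([p_xy[(x, y)] / py[y] for x in range(3)]) for y in (0, 1))

# Probability of error of the MAP guess: Pe = 1 - sum_y max_x p(x, y)
Pe = 1.0 - sum(max(p_xy[(x, y)] for x in range(3)) for y in (0, 1))

bound = H([Pe, 1 - Pe]) + Pe * log2(3 - 1)       # H(Pe) + Pe log(|A| - 1)
print(f"H(X|Y) = {HX_given_Y:.3f} <= {bound:.3f} = H(Pe) + Pe*log2(|A|-1),  Pe = {Pe:.2f}")
```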



