MidTerm by dandanhuanghuang


600.103 MidTerm Exam (4 pages) + Extra Credit/Homework (3 pages)


Question 1: Binomial Distribution
Write a Python program called binomial_coef that takes two arguments, n and k, and returns
the binomial coefficient "n choose k": n! / (k! (n-k)!).


Run your binomial_coef on the following and verify that you are computing Pascal’s triangle:

      for n in range(9): print([binomial_coef(n, k) for k in range(n+1)])
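
Hint: one possible shape for binomial_coef is sketched below. It assumes Python 3 and the usual definition n! / (k! (n-k)!). The multiplicative loop is only an illustrative choice, not the required solution; a factorial-based version works just as well for small n.

```python
def binomial_coef(n, k):
    """Return n choose k using exact integer arithmetic."""
    if k < 0 or k > n:
        return 0
    k = min(k, n - k)  # symmetry: C(n, k) == C(n, n - k)
    result = 1
    for i in range(1, k + 1):
        # Before this step result holds C(n - k + i - 1, i - 1);
        # the integer division below is always exact.
        result = result * (n - k + i) // i
    return result


# Each printed list is one row of the triangle (one value of n).
for n in range(5):
    print([binomial_coef(n, k) for k in range(n + 1)])
```

The loop deliberately stops at n = 4, so the later rows are still yours to verify.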

What is the next row of Pascal’s triangle? 1, 6 and then what?

Use your binomial_coef function to write binomial_prob(k, n, p), the probability of exactly k successes
in n trials when each trial succeeds with probability p. For example, if we toss a fair coin (p = ½)
twice (n = 2), then the probability is ¼ for k=0, ½ for k=1, and ¼ for k=2.
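
Hint: a minimal sketch of binomial_prob, assuming the argument order (k, n, p) used later in this exam and the standard formula C(n, k) * p^k * (1-p)^(n-k). Here math.comb (Python 3.8+) stands in for your own binomial_coef; that substitution is an assumption made only to keep the example self-contained.

```python
from math import comb  # Python 3.8+; stands in for your own binomial_coef


def binomial_prob(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


# Fair coin tossed twice
print([binomial_prob(k, 2, 0.5) for k in range(3)])  # [0.25, 0.5, 0.25]
```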

Check the Python function with the following in R:

       table(rbinom(1e6, 2, 0.5))/1e6

Let’s use your binomial_prob function to model the distribution of “the” in text. Suppose that the next
word is “the” with probability p=0.05. Fill out the following table with the probability of seeing k
instances of “the” in a sample of 100 words of text. Use both R and Python to compute these numbers.
Round probabilities to hundredths.

         k       0           1           2           3           4           5           6           7
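
Hint: the Python half of this table could be produced along the following lines. This is a sketch under the same assumptions as before: binomial_prob takes (k, n, p), and math.comb is used in place of your own binomial_coef.

```python
from math import comb


def binomial_prob(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


n, p = 100, 0.05
# One probability per column of the table, rounded to hundredths
row = [round(binomial_prob(k, n, p), 2) for k in range(8)]
print("k   ", list(range(8)))
print("prob", row)
```

In R, dbinom(0:7, 100, 0.05) computes the same quantities and can serve as the cross-check.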

Consider words with p between 0.007 and 0.008. Fill out the following table to hundredths with
binomial_prob(k, 100, p), for these two values of p and these six values of k.

             k        0             1              2             3             4              5



We can estimate p for a word, w, in a text with p(w) = freq(w)/N, where freq(w) is the number of times
that w appears in the text, and N is the number of words in the text.
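
Hint: the estimate p(w) = freq(w)/N can be sketched as below. The function name word_probs and the toy sentence are illustrative assumptions; a word is taken to be a run of ASCII letters, with no case folding, to match the tr command used later in this exam.

```python
import re
from collections import Counter


def word_probs(text):
    """Return (N, freq): the word count and a map from word to count."""
    words = re.findall(r"[A-Za-z]+", text)
    return len(words), Counter(words)


sample = "In the beginning God created the heaven and the earth"  # toy text
N, freq = word_probs(sample)
print(N, freq["the"], freq["the"] / N)
```

For the real questions, pass the contents of your Genesis file in place of the toy sentence.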

What is N? How many words are there in Genesis?

What is freq(‘the’) in Genesis? (How many times does ‘the’ appear in Genesis?)

What is p(‘the’)? That is, what is freq(‘the’)/N?

Which words in Genesis have p between 0.007 and 0.008?

Split Genesis into blocks of 100 words each. There are N/100 such blocks. For words with p between
0.007 and 0.008, we would expect to see them (one or more times) in B blocks, where Blow ≤ B ≤ Bhigh:

        Blow = N/100 * (1-binomial_prob(0,100,0.007))
        Bhigh = N/100 * (1-binomial_prob(0,100,0.008))
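
Hint: the two bounds translate directly into Python. In the sketch below, the helper name expected_blocks and the value N = 40000 are placeholders for illustration only; substitute the word count you actually measured for Genesis.

```python
from math import comb


def binomial_prob(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def expected_blocks(n_words, p, block_size=100):
    """Expected number of blocks that contain the word at least once."""
    n_blocks = n_words / block_size
    return n_blocks * (1 - binomial_prob(0, block_size, p))


N = 40000  # placeholder only; replace with your measured word count
b_low = expected_blocks(N, 0.007)
b_high = expected_blocks(N, 0.008)
print(b_low, b_high)
```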

Fill out the following table. The last two columns are N/100 * (1-binomial_prob(0,100,0.007)) and
N/100 * (1-binomial_prob(0,100,0.008)), respectively.

Note: Since the last two columns don’t depend on the word, they will be the same for all words.

Hint: This Unix command counts the number of 100-word blocks that contain “thou.” Use this value for
the observed column (but change “thou” to the appropriate word).

      tr -sc 'A-Za-z' '\n' < genesis.txt | egrep . | awk '/^thou$/ {print int(NR/100)}' | sort -u | wc -l
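
Note: if you would rather stay in Python, the pipeline above can be approximated as follows. The function name blocks_containing is an assumption. The block number mirrors awk's int(NR/100), where NR counts words from 1, so words 1-99 fall in block 0 and words 100-199 in block 1.

```python
import re


def blocks_containing(text, word, block_size=100):
    """Count distinct blocks, numbered int(NR / block_size), containing word."""
    words = re.findall(r"[A-Za-z]+", text)
    # Enumerate from 1 so that nr plays the role of awk's NR.
    blocks = {nr // block_size
              for nr, w in enumerate(words, start=1) if w == word}
    return len(blocks)


# Toy input: 250 words with "thou" at positions 5, 50 and 150
toy = ["x"] * 250
for pos in (5, 50, 150):
    toy[pos - 1] = "thou"
print(blocks_containing(" ".join(toy), "thou"))  # 2
```

For the observed column, call it on the contents of genesis.txt with the word you are testing.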

         word                     observed                      Blow                       Bhigh

Is the observed column more than expected, or less than expected? That
is, is the 2nd column larger than the last two columns (or smaller)?

End of MidTerm (Everything after this is extra credit/homework).
Homework is due at dawn before the next class (after spring break).

Why isn’t the 2nd col between the last two cols? What’s wrong with the
binomial assumption?

Question 2: Fibonacci

fib0, fib1 and fib2 are three programs that almost compute Fibonacci in R.

        fib0 = function(n) if(n <= 1) 1 else fib0(n-1) + fib0(n-2)
        fib1 = function(n) matpow(matrix(data=c(1,1,1,0), ncol=2), n)[1,1]
        fib2 = function(n) round(((1+sqrt(5))/2)^n/sqrt(5))

        # note: pow is the divide-and-conquer log n method of powering a number
        pow = function(x,n) {
           if(n==1) x
           else if(even(n)) pow(x, n/2)^2
           else x * pow(x,n-1) }

        even = function(n) n == floor(n/2) * 2

        # matpow uses the same divide-and-conquer log n method to power a matrix
        # note: %*% is matrix multiplication
        matpow = function(x,n) {
          if(n==1) x
          else if(even(n)) matsquare(matpow(x, n/2))
          else x %*% matpow(x,n-1) }

        matsquare = function(x) x %*% x

Run all three implementations of fib in R and compare their output to the truth:

        F0    F1    F2    F3    F4    F5    F6    F7    F8    F9    F10   F11   F12   F13   F14   F15   F16   F17   F18   F19   F20
        0     1     1     2     3     5     8     13    21    34    55    89    144   233   377   610   987   1597  2584  4181  6765
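
Note: the truth row above can be regenerated independently of the three R programs. The sketch below uses a plain iterative loop in Python; fib_truth is a hypothetical helper name and is not one of the programs under test.

```python
def fib_truth(count):
    """Return the first `count` Fibonacci numbers, starting from F0 = 0."""
    values = []
    a, b = 0, 1
    for _ in range(count):
        values.append(a)
        a, b = b, a + b
    return values


print(fib_truth(21))
```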

Specifically, fill out the following table.

Hint: fill in the fib0 column with for(n in 0:7) print(fib0(n)).

                  n         truth                fib0               fib1           fib2
                  0           0
                  1           1
                  2           1
                  3           2
                  4           3
                  5           5
                  6           8
                  7          13

What’s wrong with these programs?

Can you fix them? Specifically, modify all three programs so they produce the truth (at least for these 8
values of n).




How fast are these three programs? Which implementation is faster? How does time grow with n?
Answer the question both empirically and theoretically.

To show growth with n empirically, write down the elapsed time in each cell of the table below.

Hint: You can fill out the fib0 column with: for(i in 25:30) print(system.time(fib0(i)))

Hint: Some of these implementations are so fast that system.time will report 0 elapsed time. If that
happens, try system.time(for(i in 1:1000) fib1(30)) and then divide the elapsed time by 1000.
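
Note: the repeat-and-divide trick is not specific to R. A rough Python analogue is sketched below; time_per_call and fast_fib are hypothetical names, and fast_fib is only a stand-in for whichever fast implementation you are timing.

```python
import time


def time_per_call(func, arg, repeats=1000):
    """Average elapsed seconds for one call to func(arg)."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    return (time.perf_counter() - start) / repeats


def fast_fib(n):  # hypothetical stand-in for a fast implementation
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(time_per_call(fast_fib, 30))
```

Averaging over many calls pushes the total above the timer's resolution, which is why a single call can report 0 while the averaged figure does not.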

                    n           fib0                   fib1                   fib2







Theoretically, using the methods discussed in last week’s lecture, how would we expect time to grow
with n?




