A Random Text Model for the Generation of Statistical Language Invariants

Chris Biemann
University of Leipzig, Germany

HLT-NAACL 2007, Rochester, NY, USA
Monday, April 23, 2007
Outline

• Previous random text models
• Large-scale measures for text
• A novel random text model
• Comparison to natural language text
Necessary Property: Zipf's Law
• Zipf: Ordering the words in a corpus by descending frequency, the relation between the frequency of a word and its rank r is given by f(r) ∝ r^(-z), where z is the exponent of the power law, corresponding to the slope of the curve in a log-log plot. For word frequencies in natural language, z ≈ 1.
• Zipf-Mandelbrot: f(r) ∝ (r + c1)^(-(1+c2)) better approximates the low frequencies at very high ranks.
[Figure: log-log rank-frequency plot for spoken English, compared to a power law with z=1.4 and the Zipf-Mandelbrot law with c1=10, c2=0.4]
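To make this measure concrete: a minimal sketch (ours, not from the talk) of how the exponent z can be estimated from a corpus, using a least-squares fit in log-log space; the function name and the fitting choice are illustrative assumptions.

    from collections import Counter
    import math

    def zipf_exponent(tokens):
        """Estimate the Zipf exponent z via a least-squares fit of
        log f(r) against log r (illustrative sketch)."""
        freqs = sorted(Counter(tokens).values(), reverse=True)
        xs = [math.log(r) for r in range(1, len(freqs) + 1)]
        ys = [math.log(f) for f in freqs]
        mx = sum(xs) / len(xs)
        my = sum(ys) / len(ys)
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        return -slope  # f(r) ~ r^(-z), so z is the negated slope

For natural-language text, the returned value should come out near 1.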
Previous Random Text Models
B. B. Mandelbrot (1953)
• Sometimes called the "monkey at the typewriter" model
• With probability w, a word separator is generated at each step;
• with probability (1-w)/N, a letter from an alphabet of size N is generated.
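A minimal sketch of this process; the parameter value and the skipping of empty words are our assumptions.

    import random
    import string

    def monkey_text(n_words, w=0.2, alphabet=string.ascii_uppercase):
        """Mandelbrot's monkey at the typewriter: emit a word separator
        with probability w, otherwise one of the N letters uniformly."""
        words, current = [], []
        while len(words) < n_words:
            if random.random() < w:
                if current:              # skip empty words (an assumption)
                    words.append("".join(current))
                    current = []
            else:
                current.append(random.choice(alphabet))
        return words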


H. A. Simon (1955)
• No alphabet of single letters.
• At each time step, a previously unseen new word is added to the stream with probability α, whereas with probability (1-α) the next word is chosen amongst the words at previous positions.
• Yields a frequency distribution that follows a power law with exponent z = (1-α).
• Modified by Zanette and Montemurro (2002):
  - sublinear vocabulary growth for higher exponents
  - Zipf-Mandelbrot law via a maximum probability threshold
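A minimal sketch of Simon's process; the placeholder word labels w0, w1, ... are ours.

    import random

    def simon_stream(n_words, alpha=0.1):
        """Simon's process: with probability alpha emit a brand-new word,
        otherwise repeat a word drawn uniformly from all previous
        positions, so already-frequent words are repeated more often."""
        stream = ["w0"]
        for i in range(1, n_words):
            if random.random() < alpha:
                stream.append("w%d" % i)          # previously unseen word
            else:
                stream.append(random.choice(stream))  # rich-get-richer
        return stream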
Critique of Previous Models
• Mandelbrot: All words of the same length are equiprobable, because all letters are equiprobable.
  → Ferrer i Cancho and Solé (2002): Initialisation with letter probabilities obtained from natural language text solves this problem, but where do these letter frequencies come from?

• Simon: No concept of "letter" at all.

• Both:
  – no concept of a sentence
  – no word order restrictions: Simon = bag of words, Mandelbrot does not take the generated stream into account at all
Large-scale Measures for Text
• Zipf's law and lexical spectrum: the rank-frequency plot should follow a power law with z ≈ 1; the frequency spectrum (probability of frequencies) should follow a power law with z ≈ 2 (Pareto distribution).
• Word length: should be distributed as in natural language text, according to a variant of the gamma distribution (Sigurd et al. 2004); see the sketch after this list.
• Sentence length: should also be distributed as in NL, following the same kind of gamma distribution.
• Significant neighbour-based co-occurrence graph: should be similar to NL in terms of degree distribution and connectivity in random text and NL.
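As a concrete reading of the two length measures, a minimal sketch (ours) comparing an empirical length distribution with the gamma-variant curve f(x) ∝ x^a · b^x; the parameter values are taken from the word-length fit shown later in the talk and are otherwise illustrative.

    from collections import Counter

    def length_distribution(items):
        """Empirical probability of each length (words or sentences)."""
        counts = Counter(len(x) for x in items)
        total = sum(counts.values())
        return {l: c / total for l, c in sorted(counts.items())}

    def gamma_variant(x, a=1.5, b=0.45):
        """Unnormalized gamma-distribution variant f(x) ~ x^a * b^x used
        for length distributions (cf. Sigurd et al. 2004)."""
        return x ** a * b ** x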
A Novel Random Text Model
Two parts:
• Word Generator
• Sentence Generator

Both follow the principle of beaten tracks:
• Memorize what has been generated before
• Generate with higher probability what has been generated more often before

Inspired by small-world network generation, especially (Kumar et al. 1999).
Word Generator
• Initialisation:
  – Letter graph of N letters.
  – Vertices are connected to themselves with weight 1.
• Choice:
  – When generating a word, the generator chooses a letter x according to its probability P(x), computed as its normalized weight sum of outgoing edges:

    P(x) = weightsum(x) / Σ_{v∈V} weightsum(v),  with  weightsum(y) = Σ_{u∈neigh(y)} weight(y, u)

• Parameter:
  – At every position, the word ends with probability w ∈ (0,1), or the next letter is generated according to the letter production probability given above.
• Update:
  – For every letter bigram, the weight of the directed edge between the preceding and the current letter in the letter graph is increased by one.
• Effect: self-reinforcement of letter probabilities:
  – the more often a letter is generated, the higher its weight sum will be in subsequent steps, leading to an increased generation probability (see the sketch below).
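A minimal sketch of the word generator as described above; the class layout and the timing of the weight updates (during generation rather than per finished word) are our assumptions.

    import random

    class WordGenerator:
        """Beaten-tracks word generator: letters generated often before
        become more likely later (sketch of the model above)."""

        def __init__(self, n_letters=26, w=0.4):
            self.w = w                                 # word-end probability
            self.letters = [chr(ord("A") + i) for i in range(n_letters)]
            # letter graph: each vertex starts with a self-loop of weight 1
            self.weight = {a: {a: 1} for a in self.letters}

        def _pick(self):
            # P(x) proportional to x's summed outgoing edge weight
            sums = [sum(self.weight[x].values()) for x in self.letters]
            return random.choices(self.letters, weights=sums)[0]

        def word(self):
            prev = self._pick()
            letters = [prev]
            while random.random() >= self.w:           # word continues
                cur = self._pick()
                # update: edge from preceding to current letter gets +1
                self.weight[prev][cur] = self.weight[prev].get(cur, 0) + 1
                letters.append(cur)
                prev = cur
            return "".join(letters)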
Word Generator Example

[Figure: example letter graph. The small numbers next to the edges are edge weights. The probabilities for the letters at the next step are P(A)=0.4, P(B)=0.4, P(C)=0.2.]
Measures on the Word Generator

[Figure: log-log rank-frequency plot (left) and lexical spectrum (right) for the word generator with w=0.2 and the Mandelbrot model, compared to power laws with z=1 and z=2 respectively]

• The word generator fulfils the measures much better than the Mandelbrot model.
• For the other measures, we need something extra...
Sentence Generator I
• Initialisation:
  – The word graph is initialized with a begin-of-sentence (BOS) and an end-of-sentence (EOS) symbol, with an edge of weight 1 from BOS to EOS.
• Word graph (directed):
  – Vertices correspond to words,
  – edge weights correspond to the number of times two words were generated in sequence.
• Generation:
  – A random walk on the directed edges starts at the BOS vertex.
  – With probability (1-s), an existing edge is followed from the current vertex to the next vertex;
  – the probability of choosing endpoint X from the endpoints of all outgoing edges of the current vertex C is given by

    P(word = X) = weight(C, X) / Σ_{N∈neigh(C)} weight(C, N)
Sentence Generator II
• Parameter:
  – With probability s ∈ (0,1), a new word is generated by the word generator model;
  – the next word is then chosen from the word graph in proportion to its weighted indegree: the probability of choosing an existing vertex E as successor of a newly generated word is given by

    P(word = E) = indgw(E) / Σ_{v∈V} indgw(v),  with  indgw(X) = Σ_{v∈V} weight(v, X)

• Update:
  – For each sequence of two words generated, the weight of the directed edge between them is increased by 1 (see the sketch below).
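A minimal sketch of the sentence generator. It accepts any zero-argument word source (e.g. the WordGenerator sketch above); the class names, the bookkeeping, and the treatment of BOS/EOS as string markers are our assumptions.

    import random

    class SentenceGenerator:
        """Beaten-tracks sentence generator: a random walk from BOS to
        EOS over a growing word graph (sketch of the model above)."""

        BOS, EOS = "<BOS>", "<EOS>"

        def __init__(self, new_word, s=0.08):
            self.new_word = new_word       # e.g. WordGenerator(...).word
            self.s = s                     # new-word probability
            self.out = {self.BOS: {self.EOS: 1}, self.EOS: {}}
            self.indeg = {self.BOS: 0, self.EOS: 1}

        def _add_edge(self, a, b):
            # update: weight of the directed bigram edge a -> b gets +1
            self.out.setdefault(b, {})
            self.out[a][b] = self.out[a].get(b, 0) + 1
            self.indeg[b] = self.indeg.get(b, 0) + 1
            self.indeg.setdefault(a, 0)

        def _follow(self, v):
            # choose an outgoing edge in proportion to its weight
            nbrs = list(self.out[v])
            return random.choices(nbrs, weights=[self.out[v][t] for t in nbrs])[0]

        def _by_indegree(self):
            # successor of a new word: proportional to weighted indegree
            vs = list(self.indeg)
            return random.choices(vs, weights=[self.indeg[v] for v in vs])[0]

        def sentence(self):
            words, current = [], self.BOS
            while True:
                if random.random() < self.s:
                    new = self.new_word()      # word from the word generator
                    self._add_edge(current, new)
                    words.append(new)
                    nxt = self._by_indegree()
                    self._add_edge(new, nxt)
                else:
                    nxt = self._follow(current)    # follow a beaten track
                    self._add_edge(current, nxt)
                if nxt == self.EOS:
                    return words       # may be empty; omitted in the output
                words.append(nxt)
                current = nxt

Usage, with the parameter settings used in the comparison later in the talk:

    wg = WordGenerator(w=0.4)
    sg = SentenceGenerator(wg.word, s=0.08)
    sentences = [s for s in (sg.sentence() for _ in range(1000)) if s]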
Sentence Generator Example

[Figure: example word graph during generation.]
• In the last step, the second CA was generated as a new word by the word generator.
• The generation of empty sentences happens frequently. These are omitted in the output.
Comparison to Natural Language
• Corpus for comparison: the first 1 million words of the BNC, spoken English.
• 26 letters, uppercase, punctuation removed → same in the word generator
• 125,395 sentences → set s=0.08, remove the first 50K sentences
• Average sentence length: 7.975 words
• Average word length: 3.502 letters → w=0.4

Sample sentences from the preprocessed corpus:
        OOH
        OOH
        ERM
        WOULD LIKE A CUP OF THIS ER
        MM
        SORRY NOW THAT S
        NO NO I DID NT
        I KNEW THESE PEWS WERE HARD
        OOH I DID NT REALISE THEY WERE THAT BAD
        I FEEL SORRY FOR MY POOR CONGREGATION
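The averages used above to set the parameters can be computed with a small helper like this (a sketch; the tokenisation into sentence/word lists is our assumption):

    def corpus_stats(sentences):
        """Average word and sentence length for a list of sentences,
        each given as a list of word tokens; used to match the model
        parameters w and s to the corpus."""
        words = [w for sent in sentences for w in sent]
        avg_word = sum(len(w) for w in words) / len(words)
        avg_sent = len(words) / len(sentences)
        return avg_word, avg_sent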
Word Frequency

[Figure: log-log rank-frequency plot for the sentence generator and English, compared to a power law with z=1.5]

• Zipf-Mandelbrot distribution
• Smooth curve
• Similar to English
Word Length

[Figure: word-length distributions (log scale) for the sentence generator, English, and a fitted gamma distribution]

• More 1-letter words in the sentence generator
• Longer words in the sentence generator
• Curve is similar
• Gamma distribution here: f(x) ~ x^1.5 · 0.45^x
Sentence Length

[Figure: sentence-length distributions (log scale) for the sentence generator and English]

• Longer sentences in English
• More 2-word sentences in English
• Curve is similar
Neighbor-based Co-occurrence Graph

[Figure: log-log degree distribution of the neighbour-based co-occurrence graphs for the sentence generator, English, and the word generator, compared to a power law with z=2]

• Min. co-occurrence frequency = 2, min. log-likelihood ratio = 3.84 (graph construction sketched below)
• The NB-graph is a small world
• Qualitatively, English and the sentence generator are similar
• The word generator shows far fewer co-occurrences
• Factor 2 difference in clustering coefficient and number of vertices

                      English sample   sentence gen.   word gen.   random graph (ER)
# of vertices         7154             15258           3498        10000
avg. shortest path    2.933            3.147           3.601       4.964
avg. degree           9.445            6.307           3.069       7
clustering coeff.     0.2724           0.1497          0.0719      6.89E-4
z                     1.966            2.036           2.007       -
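A minimal sketch (ours) of how such a significant neighbour-based co-occurrence graph can be built, using Dunning's log-likelihood ratio with the thresholds above; the exact significance formulation used in the talk may differ.

    import math
    from collections import Counter

    def llr(k11, k12, k21, k22):
        """Dunning-style log-likelihood ratio for a 2x2 contingency table."""
        def lsum(*ks):
            return sum(k * math.log(k) for k in ks if k > 0)
        n = k11 + k12 + k21 + k22
        return 2 * (lsum(k11, k12, k21, k22)
                    - lsum(k11 + k12, k21 + k22)    # row marginals
                    - lsum(k11 + k21, k12 + k22)    # column marginals
                    + lsum(n))

    def nb_graph(tokens, min_freq=2, min_llr=3.84):
        """Edges between words that are direct neighbours significantly
        more often than chance (3.84 ~ chi-squared, 1 df, p=0.05)."""
        unigram = Counter(tokens)
        bigram = Counter(zip(tokens, tokens[1:]))
        n = sum(bigram.values())
        edges = {}
        for (a, b), k11 in bigram.items():
            if k11 < min_freq:
                continue
            k12 = unigram[a] - k11   # a followed by something else (approx.)
            k21 = unigram[b] - k11   # something else followed by b (approx.)
            score = llr(k11, k12, k21, n - k11 - k12 - k21)
            if score >= min_llr:
                edges[(a, b)] = score
        return edges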
Formation of Sentences
• The word graph grows and contains the full vocabulary used so far for generating at every time step.
• Random walks starting from BOS always end in EOS.
• Sentence length slowly increases: the random walk has more possibilities before finally arriving at the EOS vertex.
• Sentence length is influenced by both parameters of the model:
  – the word-end probability w in the word generator,
  – the new-word probability s in the sentence generator.

[Figure: log-log plot of average sentence length growth over text intervals for the parameter settings (w=0.4, s=0.08), (w=0.4, s=0.1), (w=0.17, s=0.22), (w=0.3, s=0.09), compared to x^0.25]
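This growth can be measured with a minimal sketch like the following (ours), built on the SentenceGenerator sketch above; the interval size is illustrative.

    def length_growth(sg, n_sentences=100_000, interval=10_000):
        """Average sentence length per interval of generated sentences,
        tracking the slow growth described above."""
        bucket, averages = [], []
        for i in range(1, n_sentences + 1):
            s = sg.sentence()
            if s:                       # empty sentences are omitted
                bucket.append(len(s))
            if i % interval == 0 and bucket:
                averages.append(sum(bucket) / len(bucket))
                bucket = []
        return averages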
Conclusion
Novel random text model:
• obeys Zipf's law
• obeys the word length distribution
• obeys the sentence length distribution
• shows similar neighbour-based co-occurrence data

First model that:
• produces a smooth lexical spectrum without initial letter probabilities
• incorporates the notion of a sentence
• models word order restrictions
    Sentence generator at work
Beginning: Q . U . RFXFJF . G . G . U . R . U . RFXFJF .
   XXF . RFXFJF . U . QYVHA . RFXFJF . R TCW . CV . Z U
   . G . XXF . RFXFJF . M XXF . Q . G . RFXFJF . U .
   RFXFJF . RFXFJF . Z U . G . RFXFJF . RFXFJF . M XXF
   . R . Z U .
Later: X YYOXO QO OEPUQFC T TYUP QYFA FN XX TVVJ U OCUI
   X HPTXVYPF . FVFRIK . Y TXYP VYFI QC TPS Q UYYLPCQXC
   . G QQE YQFC XQXA Z JYQPX. QRXQY VCJ XJ YAC VN PV
   VVQF C XJN JFEQ QYVHA. U VIJ Q YT JU OF DJWI QYM U
   YQVCP QOTE OD XWY AGFVFV U XA YQYF AVYPO CDQQ TY NTO
   FYF QHT T YPXRQ R GQFRVQ . MUHVJ Q VAVF YPF QPXPCY Q
   YYFRQQ. JP VGOHYY F FPYF OM SFXNJJ A VQA OGMR L QY .
   FYC T PNXTQ . R TMQCQ B QQTF J PVX YT DTYO RXJYYCGFJ
   CYFOFUMOCTM PQRYQQYC AHXZQJQ JTW O JJ VX QFYQ YTXJTY
   YTYYFXK . RFXFJF JY XY RVV J YURQ CM QOXGQ QFMVGPQ.
   OY FDXFOXC. N OYCT . L MMYMT CY YAQ XAA J YHYJ MPQ
   XAQ UYBX RW XXF O UU COF XXF CQPQ VYYY XJ YACYTF FN
   . TA KV XJP O EGV J HQY KMQ U .
                   Questions?
Danke sdf sehr gf thank fdgf you g fd tusen sd ee takk erte
  dank we u trew wel wwd muchas werwe ewr gracias
  werwe rew merci mille werew re ew ee ew grazie d fsd ffs
  df d fds spassiva fs fdsa rtre trerere rteetr trpemma eedm





				