Probabilistic Counting with Randomized Storage
Benjamin Van Durme and Ashwin Lall




Thursday, July 16, 2009
 Data Overload

                •         Lots of text (and images, audio, ...) is good.
                          More data equals better results.

                •         But how to process it all?
                          Buy/rent a data center?

                •         Approximate algorithms!
                          Make the best of what you’ve got.

     2                                               IJCAI 2009          Van Durme & Lall

 Bulky Data

     [Figure: timeline 1980 ... 1985 ... 1990 ... 1995 ... 2000 ... 2005]

 Bulky Data in Small Space

     [Figure: timeline 1980 ... 1985 ... 1990 ... 1995 ... 2000 ... 2005]

 Bulky Data in Small Space Online?

     [Figure: timeline 1980 ... 2000 ... with a "+" above each of the later, open-ended steps]

 Outline

                     •    Storing Static Counts

                     •    Counting Online

                     •    Experiments

                     •    Additional Comments




 Bloom Filters [Bloom ’70]

                     •    Records set membership.

                     •    No false negatives.

                     •    Some false positives.

                     •    Think hashtables, where you throw away the key.

     [Figure: animated Insert(x), then Lookup(y), on a Bloom filter]

 Bloom Filters ...

                     •    Bloom filters are nice when you can tolerate
                          a small rate of false positives.

                     •    And your x’s are large.

                     •    For example, Language Modeling.

     [Figure: Insert(x)]

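The set-membership behaviour described on the Bloom filter slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name, the sizes m and k, and the use of salted SHA-256 digests as the k hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k salted hash functions (assumed design)."""

    def __init__(self, m: int = 1 << 16, k: int = 3) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _positions(self, item: str):
        # Derive k positions by salting the item with the hash-function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        # Only bit positions are set; the key itself is thrown away.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def lookup(self, item: str) -> bool:
        # Always True for an inserted item (no false negatives);
        # occasionally True for an item never inserted (false positive).
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for ngram in ["the dog", "dog barked", "barked at"]:
    bf.insert(ngram)

print(bf.lookup("dog barked"))  # True: inserted items are never missed
```

Because only bit positions are stored, the memory used does not depend on how long the n-grams are, which is the appeal when the x's are large.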
 Motivation: n-grams for MT

            ...
         the dog 97
      dog barked 42
       barked at 58
            ...

     [Figure: candidate sentences queried against the counts, each marked "?":
     "The dog barked ...", "The cat barked ...", "Dog barked ..."]

 Storing Counts with Bloom Filters

     [Figure: overlaid excerpts from David Talbot and Miles Osborne,
     "Randomised Language Modelling for Statistical Machine Translation",
     ACL 2007 (School of Informatics, University of Edinburgh)]

     From the abstract:

         A Bloom filter (BF) is a randomised data structure for set membership
         queries. Its space requirements are significantly below lossless
         information-theoretic lower bounds but it produces false positives
         with some quantifiable probability. Here we explore the use of BFs
         for language modelling in statistical machine translation. We show
         how a BF containing n-grams can enable us to use much larger corpora
         and higher-order models complementing a conventional n-gram LM within
         an SMT system. ...

     From the introduction:

         ... challenges in deploying large LMs are not trivial. Increasing the
         order of an n-gram model can result in an exponential increase in the
         number of parameters; for corpora such as the English Gigaword corpus,
         for instance, there are 300 million distinct trigrams and over 1.2
         billion 5-grams. Since a LM may be queried millions of times per
         sentence, it should ideally reside locally in memory to avoid
         time-consuming remote or disk-based look-ups.
         Against this background, we consider a radically different approach
         to language modelling: instead of explicitly storing all distinct
         n-grams, we store a randomised representation. In particular, we show
         that the Bloom filter (Bloom (1970); BF), a simple space-efficient
         randomised data structure for representing sets, may be used to
         represent statistics from larger corpora and for higher-order n-grams
         to complement a conventional smoothed trigram model ...

     From Section 3.1, Log-frequency Bloom filter:

         The efficiency of our scheme for storing n-gram statistics within a
         BF relies on the Zipf-like distribution of n-gram frequencies in
         natural language corpora: most events occur an extremely small number
         of times, while a small number are very frequent.
         We quantise raw frequencies, c(x), using a logarithmic codebook as
         follows,

             $qc(x) = 1 + \lfloor \log_b c(x) \rfloor$.    (1)

         The precision of this codebook decays exponentially with the raw
         counts and the scale is determined by the base of the logarithm b; we
         examine the effect of this parameter in experiments below.
         Given the quantised count qc(x) for an n-gram x, the filter is
         trained by entering composite events consisting of the n-gram
         appended by an integer counter ...

 Storing Counts

            •      Multiple layers of Bloom filters.

            •      Store exponent, in unary.

     [Figure: stacked Bloom filter layers labeled qc(x) = 1, qc(x) = 2, qc(x) = 3, ...]

                                        $c(x) \approx b^{qc(x) - 1}$

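A small sketch may make "store exponent, in unary" concrete. It quantises a raw count to qc(x) = 1 + floor(log_b c(x)), enters one composite key per level 1..qc(x) into a Bloom-filter-style bit array, and decodes with b^(qc(x) - 1). Everything not on the slides is an assumption made for illustration: the function names, the filter size, the salted SHA-256 hashing, the "ngram|level" key format, the level cap, and the restriction to an integer base b.

```python
import hashlib

M_BITS = 1 << 16   # filter size in bits (assumed)
K_HASHES = 3       # number of hash functions (assumed)
MAX_LEVEL = 64     # safety cap on the unary walk (assumed)
bits = bytearray(M_BITS)


def positions(key: str):
    # k bit positions for a key, from salted SHA-256 digests.
    for i in range(K_HASHES):
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % M_BITS


def quantise(count: int, b: int) -> int:
    """qc(x) = 1 + floor(log_b c(x)) for count >= 1, using integer arithmetic."""
    q, threshold = 1, b
    while count >= threshold:
        q += 1
        threshold *= b
    return q


def store(ngram: str, count: int, b: int) -> None:
    # Unary storage: enter the composite events (ngram, 1), ..., (ngram, qc).
    for level in range(1, quantise(count, b) + 1):
        for pos in positions(f"{ngram}|{level}"):
            bits[pos] = 1


def estimate(ngram: str, b: int) -> int:
    # Walk up the levels until the first miss, then decode c(x) ~ b^(qc - 1).
    level = 0
    while level < MAX_LEVEL and all(bits[p] for p in positions(f"{ngram}|{level + 1}")):
        level += 1
    return b ** (level - 1) if level else 0


for ngram, count in [("the dog", 97), ("dog barked", 42), ("barked at", 58)]:
    store(ngram, count, b=2)

print(quantise(97, 2), estimate("the dog", 2))
```

With b = 2 the count 97 quantises to level 7, so the decoded value is at least 2^6 = 64. A false positive at a higher level can only push the estimate up, never down.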
 Outline

                     •    Storing Static Counts

                     •    Counting Online

                     •    Experiments

                     •    Additional Comments




 Spectral Bloom Filter

                             The Spectral Bloom Filter (SBF) replaces
                          the bit vector V with a vector of m counters, C.

                                                          [SIGMOD 2003]

 Spectral Bloom Filter [Cohen & Matias ’03]

     [Figure: animation of the counters after each operation]

          Operation        Counters
          Insert(x)        1   1   1
          Insert(x)        2   2   2
          Insert(x)        3   3   3
          Insert(x)        4   4   4
          Insert(y)        5   5   4   1
          Lookup(x)        5   5   4   1

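The counter animation can be mimicked with a short sketch. It is a simplified reading of the Spectral Bloom Filter rather than the published data structure in full: the class name, sizes, and salted SHA-256 hashing are assumptions, and lookup uses the basic minimum-selection rule (report the smallest of an item's counters).

```python
import hashlib


class SpectralBloomFilter:
    """Bloom filter with a vector of m counters in place of the bit vector (sketch)."""

    def __init__(self, m: int = 1 << 12, k: int = 3) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item: str) -> set:
        # A set, so an item whose hashes collide is not incremented twice.
        return {
            int.from_bytes(
                hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        }

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def lookup(self, item: str) -> int:
        # Minimum selection: collisions can only inflate counters,
        # so the smallest counter is the tightest available estimate.
        return min(self.counters[pos] for pos in self._positions(item))


sbf = SpectralBloomFilter()
for _ in range(4):
    sbf.insert("x")
sbf.insert("y")

print(sbf.lookup("x"), sbf.lookup("y"))
```

In the slide's example, y presumably shares two of x's three counters, giving 5, 5, 4 for x. The minimum, 4, is still x's true count, and in general the lookup never underestimates.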
 Collect Counts Online

                     •    Count in log-scale, to save space.

                     •    Robert Morris (1978) gave us a way to do this.

     [Figure: chain of counter states labeled 1, b, b^2, ..., with upward
     transition probabilities b^{-1}, b^{-2}, b^{-3}, ... and self-loop
     probabilities 1 - b^{-1}, 1 - b^{-2}, 1 - b^{-3}, ...]

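Morris-style counting can be sketched as follows. The convention assumed here is one common form: the stored value f is an exponent, each event advances it with probability b^(-f), and it is decoded as (b^f - 1)/(b - 1). The function names, the base b = 2, and the fixed random seed are choices made for the example only.

```python
import random


def morris_increment(f: int, b: float, rng: random.Random) -> int:
    """Advance the stored exponent f with probability b**-f, else leave it unchanged."""
    return f + 1 if rng.random() < b ** -f else f


def morris_estimate(f: int, b: float) -> float:
    """Decode a stored exponent into an estimated count: (b**f - 1) / (b - 1)."""
    return (b ** f - 1) / (b - 1)


rng = random.Random(0)  # fixed seed only so the demo is repeatable
f = 0
for _ in range(10_000):
    f = morris_increment(f, 2.0, rng)

# f stays small; the decoded value is a noisy estimate of the 10,000 events.
print(f, morris_estimate(f, 2.0))
```

Under this convention each event raises the decoded value by exactly 1 in expectation, since the step from f to f + 1 adds b^f and happens with probability b^(-f). The price is variance, which grows with the base b.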
 Morris Bloom Counter

                     •    Spectral Bloom Filter, but with Morris style
                          updating.

                     •    Same amount of space as Spectral Bloom Filter,

                     •    gives exponentially larger max-count,

                     •    but false positives can therefore have higher
                          relative error.

                           $c(x) \approx \frac{b^f - 1}{b - 1}$

     [Figure: Lookup(x) over the counter row 5 5 4 1, with the row 15 15 7 1 shown beneath it]

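One way to combine the two ideas is sketched below: the counter vector of a Spectral Bloom Filter holds Morris exponents instead of raw counts. The slides do not spell out the update rule, so the one used here is an assumption: flip a single coin per event with success probability b^(-f), where f is the minimum of the item's counters, and on success raise only the counters equal to that minimum. Names, sizes, hashing, and the seed are likewise illustrative.

```python
import hashlib
import random


class MorrisBloomCounter:
    """Counter vector whose cells hold Morris exponents rather than raw counts (sketch)."""

    def __init__(self, m: int = 1 << 12, k: int = 3, b: float = 2.0, seed: int = 0) -> None:
        self.m, self.k, self.b = m, k, b
        self.exponents = [0] * m
        self.rng = random.Random(seed)  # seeded only to make runs repeatable

    def _positions(self, item: str) -> set:
        return {
            int.from_bytes(
                hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        }

    def insert(self, item: str) -> None:
        positions = self._positions(item)
        f = min(self.exponents[p] for p in positions)
        # Assumed rule: one Morris coin flip per event, based on the minimum exponent.
        if self.rng.random() < self.b ** -f:
            for p in positions:
                if self.exponents[p] == f:
                    self.exponents[p] = f + 1

    def lookup(self, item: str) -> float:
        # Decode the minimum exponent with c(x) ~ (b^f - 1) / (b - 1).
        f = min(self.exponents[p] for p in self._positions(item))
        return (self.b ** f - 1) / (self.b - 1)


mbc = MorrisBloomCounter()
mbc.insert("x")
first = mbc.lookup("x")  # the first event always advances the exponent from 0 to 1
for _ in range(999):
    mbc.insert("x")

print(first, mbc.lookup("x"))
```

A cell of h bits holds exponents up to 2^h - 1, so the largest decodable count grows exponentially compared with storing the count directly. The same property means that a collision which inflates an exponent by one can roughly multiply the decoded value by b.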
 Reduce False Positive Rate

                 •        Morris Bloom Counter,

                 •        split into layers,

                 •        with different hash functions per layer.

     [Figure: animated Insert(x) across the layers]

 Talbot Osborne Morris Bloom (TOMB) Counter

                     •    Combination of Morris Bloom Counter with
                          Talbot Osborne count storage.

                     •    Stay tuned for related work by Talbot.

 Tradeoff

                 •        Trade number of
                          layers for expressivity.


                                                           M=         2   hi
                                                                               −1
                                                                  i




 Tradeoff

            •      Trade number of layers
                   for expressivity.



                                                                               h=4




                          0, 1, 2, ..., 14, 15           M = \sum_i (2^{h_i} - 1)



 Tradeoff

            •      Trade number of layers
                   for expressivity.


                                                                               h2 = 2


                                                                               h1 = 2


                          0, 1, 2, 3, 4, 5, 6            M = \sum_i (2^{h_i} - 1)



 Tradeoff

            •      Trade number of layers
                   for expressivity.
                                                                               h4 = 1

                                                                               h3 = 1
                                                                               h2 = 1
                                                                               h1 = 1


                          0, 1, 2, 3, 4                  M = \sum_i (2^{h_i} - 1)



 Tradeoff

            •      Trade number of layers
                   for expressivity.


                                                                               h2 = 3


                                                                               h1 = 1


                          1, 2, 3, ..., 7, 8             M = \sum_i (2^{h_i} - 1)
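
The tradeoff can be checked with a few lines of arithmetic. This sketch assumes h_i is the height (in bits) of layer i, so a layer of height h contributes 2^h - 1 to the maximum stored value M; the function name is illustrative.

```python
def max_stored_value(heights):
    """Largest value M a stack of layers can hold: the sum of (2**h - 1) over layer heights h."""
    return sum(2 ** h - 1 for h in heights)


# The four configurations shown on the slides, each using four bits per cell in total.
for heights in ([4], [2, 2], [1, 1, 1, 1], [1, 3]):
    print(heights, max_stored_value(heights))
```

With the same four bits, one tall layer reaches 15, while four layers of height 1 reach only 4: more layers cost expressivity.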



 “Layers”
          •    Layers are a useful
               visualization.

          •    In practice, consecutive
               layers of equal height are
               implemented as single
               vectors with sets of hash
               functions.

                                                         h5, h6, h7 = 3
                                                         h1, h2, h3, h4 = 1
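
One way to picture the grouping, as a sketch only: runs of consecutive equal-height layers collapse into one vector, and the run length is the number of hash-function sets that vector needs. The function name and the (height, count) output format are assumptions made for illustration.

```python
from itertools import groupby


def group_layers(heights):
    """Collapse runs of equal-height layers into (height, number_of_hash_function_sets) pairs."""
    return [(height, len(list(run))) for height, run in groupby(heights)]


# The slide's example: h1..h4 = 1 and h5..h7 = 3.
print(group_layers([1, 1, 1, 1, 3, 3, 3]))  # [(1, 4), (3, 3)]
```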



 Outline

                     •    Storing Static Counts

                     •    Counting Online

                     •    Experiments

                     •    Additional Comments




 Experiment: Count Accuracy
                                        Count all trigrams in Gigaword,
                                         randomly query 1,000 values,
                                              compare to truth
                          [Figure: two panels, 100 MB and 500 MB, each plotting log Frequency (0 to 8) against Rank (0 to 1000)]
 Experiment: MT

                •        Build counters with varying amounts of memory

                •        Based on system of Post & Gildea ’08

                •        Three runs per counter size

                                       >0        >1        >2        >3
                           100       86.5%     74.2%     66.1%     43.5%
                           500       26.9%      6.7%      1.8%      0.3%
                         2,000       10.9%      0.9%      0.1%      0.0%

     Table 1: False positive rates when using indicator functions
     I>0, ..., I>3. A perfect counter has a rate of 0.0% using I>0.

                     TRUE      260MB     100MB      50MB      25MB      NO LM
                     22.75     22.93     22.27     21.59     19.06     17.35
                       -       22.88     21.92     20.52     18.91       -
                       -       22.34     21.82     20.37     18.69       -
     (average)                 22.72     22.00     20.83     18.89

     Table 2: BLEU scores using language models based on true counts,
     compared to approximations using various size TOMB counters.
     Three trials for each counter are reported (recall Morris counting
     is probabilistic, and thus results may vary between similar trials).

     [Figure: bar chart over True, 260MB, 100MB, 50MB, 25MB, No LM; vertical axis 0 to 23]

     4.3              Language Models for Machine Translation
     As an example of approximate counts in practice, we follow
     Talbot and Osborne [2007] in constructing an n-gram language
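
The per-size averages of the three BLEU trials in Table 2 can be recomputed directly from the reported scores. This is only a sanity-check sketch; the dictionary layout and variable names are illustrative.

```python
from statistics import mean

# BLEU scores for the three trials at each TOMB counter size, copied from Table 2.
bleu_trials = {
    "260MB": [22.93, 22.88, 22.34],
    "100MB": [22.27, 21.92, 21.82],
    "50MB": [21.59, 20.52, 20.37],
    "25MB": [19.06, 18.91, 18.69],
}

averages = {size: round(mean(scores), 2) for size, scores in bleu_trials.items()}
for size, avg in averages.items():
    print(f"{size}: {avg:.2f}")
```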
 Outline

                     •    Storing Static Counts

                     •    Counting Online

                     •    Experiments

                     •    Additional Comments




 Related

                     •    Applies method of
                          Manku and Motwani ’02.

                     •    Track most frequent
                          elements in stream.

                     •    Rare elements
                          discarded.

                     •    Strong guarantee on
                          counts for top elements.


                           Streaming for large scale NLP: Language Modeling
                           Amit Goyal, Hal Daumé III, and Suresh Venkatasubramanian
                           University of Utah, School of Computing
                           {amitg,hal,suresh}@cs.utah.edu
                           NAACL 2009

                           Cutoff     Size      BLEU     NIST     MET
                           Exact     367.6m    28.73    7.691    56.32
 Data that is not text

                     •    Not just for Comp. Ling.

                     •    E.g., count n-grams over
                          “vocabularies” based on
                          SIFT features.




 Humans

                     •    People store large
                          amounts of information
                          in their heads,

                     •    and they do it online.

                     •    Space efficient online
                          counting provides
                          additional area for
                          interfacing with Cog. Sci.
                          community.



 Acknowledgements

                     •    Ashwin Lall (co-author)

                     •    David Talbot,
                          Miles Osborne

                     •    Matt Post,
                          Nick Morsillo,
                          Dan Gildea




 Questions?

          www.cs.rochester.edu/~vandurme



                          www.cc.gatech.edu/~alall
