LEARNING DISTRIBUTED REPRESENTATIONS
FOR STATISTICAL LANGUAGE MODELLING




                          Overview

1. Discrete data and distributed representations

2. Language modelling
  • Factored RBM language model
  • Log-bilinear language model
  • Hierarchical log-bilinear language model




                        Discrete data

• Discrete data: datapoints with discrete-valued attributes

• When such datapoints are high-dimensional, regression /
 classification / density estimation is hard:
 – Amounts to estimating entries of an exponentially large
   table
   - Attributes correspond to table dimensions
   - Attribute values correspond to indices for the dimensions
 – Data sparsity: little or no data available for most entries
 – No a priori smoothness constraint on table entries
 – No general way to generalize to new table entries




             Distributed representations
• Observation: making a model less local often improves
 generalization.
 – In a continuous space: average over datapoints near the point
   of interest.
 – In a discrete space: not clear what to average over.
   - What does “near” mean?
   - No general concept of distance / neighbourhood.

• Working with smooth functions over continuous spaces
 results in automatic smoothing.
 – Similar inputs produce similar outputs

• Idea: map discrete attributes to real-valued vectors and
 learn a smooth function that maps the vectors to the
 desired output values.
 – Learn the attribute mapping jointly with the function.
 – Automatic generalization!
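
A minimal sketch of this idea in Python with NumPy, not taken from the slides: each value of a discrete attribute indexes a row of a table of real-valued vectors, and a smooth (here linear) function maps the concatenated vectors to an output. The sizes, the names `embed` and `predict`, and the random parameters are illustrative assumptions; in a real model the table and the function weights would be learned jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 discrete attributes, each with 10 possible values,
# every value mapped to a 4-dimensional real-valued vector.
num_attributes, num_values, dim = 3, 10, 4
embeddings = rng.normal(size=(num_attributes, num_values, dim))
weights = rng.normal(size=num_attributes * dim)  # parameters of the smooth function

def embed(datapoint):
    """Look up and concatenate the vectors for the attribute values."""
    return np.concatenate([embeddings[a, v] for a, v in enumerate(datapoint)])

def predict(datapoint):
    """A smooth (linear) function of the distributed representation."""
    return float(weights @ embed(datapoint))

x = (2, 7, 5)          # one discrete datapoint
print(embed(x).shape)  # (12,)
print(predict(x))
```

Because the output depends on the attribute values only through their vectors, two values with similar vectors automatically receive similar predictions.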
            Statistical language modelling

• Goal: Model the joint distribution of words in a sentence.

• Such a model can be used to
  – predict the next word given several preceding ones
  – arrange bags of words into sentences
  – assign probabilities to documents

• Applications: speech recognition, machine translation,
 information retrieval.

• Most statistical language models are based on the Markov
 assumption:
  – The distribution of the next word depends on only n words that
    immediately precede it.
  – This assumption is clearly wrong but useful – it makes the task
    much more tractable.

                     n-gram models
• n-gram models are simply conditional probability tables
  for P(w_n | w_{1:n-1}).
   – w_n is the word to be predicted (the next word)
   – the words w_{1:n-1} = w_1, ..., w_{n-1} are called the context

• n-gram models are estimated by counting the number of
  occurrences of each possible word n-tuple and
 normalizing.
 – smoothing the estimates is essential for good performance
 – many different smoothing methods exist

• n-gram models are the most widely used statistical
 language models due to their simplicity and excellent
 performance.

• Curse of dimensionality: the number of model parameters
  is exponential in n.
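
The counting-and-normalizing estimate can be sketched in a few lines of Python. This is an unsmoothed illustration only; the function name `train_ngram` and the toy corpus are assumptions, and a usable model would add one of the smoothing methods mentioned above.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Estimate P(w_n | w_{1:n-1}) by counting n-tuples and normalizing."""
    context_counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        context_counts[context][next_word] += 1
    model = {}
    for context, counts in context_counts.items():
        total = sum(counts.values())
        model[context] = {w: c / total for w, c in counts.items()}
    return model

tokens = "the cat sat on the mat the cat ran".split()
model = train_ngram(tokens, n=2)
print(model[("the",)])  # "cat" follows "the" twice, "mat" once
```

Any word that never followed a given context gets probability zero here, which is exactly the data-sparsity problem that smoothing addresses.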
                Neural language models

• Several neural probabilistic language models based on
 distributed representations have been proposed.

• Common approach:
  – Represent each word with a real-valued feature vector
  – Represent the context by the sequence of the context word
    feature vectors
  – Train a neural network to output the distribution for the next
    word from the context representation
  – Learn word feature vectors jointly with other neural net
    parameters

• Neural language models can outperform n-gram language
 models, especially when little training data is available.

• Main drawback: very long training and testing times.

          Conditional RBM language model

• Use a restricted Boltzmann machine to model P(w_n | w_{1:n-1})
  – Capture the interaction between w_n and w_{1:n-1} through a vector
    of latent variables.
  – Represent words using low-dimensional real-valued vectors.
    - R_w is the feature vector for word w.

• Energy function:

      E(w_n, h; w_{1:n-1}) = -\sum_{i=1}^{n} R_{w_i} W_i h

  – h is the vector of latent variables
  – W_i is the interaction matrix between the feature vector for w_i
    and the latent variables.
  – Normalization is done only over w_n.

• Both inference and prediction take time linear in the
 number of latent variables.
                   Log-bilinear model

• The log-bilinear (LBL) model is perhaps the simplest
 neural language model.

• Given the context w_{1:n-1}, the LBL model first predicts the
  representation for the next word w_n by linearly combining
  the representations of the context words:

      \hat{r} = \sum_{i=1}^{n-1} C_i r_{w_i}

  – r_w is the real-valued vector representing word w
• Then the distribution for the next word is computed based
  on the similarity between the predicted representation and
  the representations of all words in the vocabulary:

      P(w_n = w | w_{1:n-1}) = \frac{\exp(\hat{r}^T r_w)}{\sum_j \exp(\hat{r}^T r_j)}
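
The two steps above can be sketched with NumPy as follows. This is an illustrative forward pass only, with random parameters instead of trained ones; treating each C_i as a square matrix per context position, the sizes, and the name `lbl_next_word_distribution` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, context_size = 50, 8, 3   # illustrative sizes

R = rng.normal(scale=0.1, size=(vocab_size, dim))         # row w is r_w
C = rng.normal(scale=0.1, size=(context_size, dim, dim))  # C_i, one per context position

def lbl_next_word_distribution(context):
    """Return P(w_n = w | w_{1:n-1}) for every word w in the vocabulary."""
    # Predicted representation: r_hat = sum_i C_i r_{w_i}
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context))
    scores = R @ r_hat          # r_hat^T r_w for every word w
    scores -= scores.max()      # numerical stability only; the result is unchanged
    probs = np.exp(scores)
    return probs / probs.sum()

p = lbl_next_word_distribution([4, 17, 42])
print(p.shape, p.sum())
```

Note that the normalization touches every row of `R`, so the cost of one prediction grows linearly with the vocabulary size.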

  Faster models through structured vocabulary

• Computing the probability of the given next word
  requires considering all N words in the vocabulary.
  – Need to consider all words because the word space is
    unstructured.

• Idea: Organize words in the vocabulary into a binary tree
 and exploit its structure to speed up normalization (Morin
 and Bengio, 2005).
 – Construct a binary tree over words
   - words are associated with leaf nodes
   - one word per leaf
 – Replace the N -way decision by a sequence of O(log N ) binary
   decisions for predicting the next word.
   - Can achieve an exponential speedup if the tree is balanced!
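
As a rough illustration of the speedup (a sketch, not from the slides), the depth of a balanced binary tree gives the number of binary decisions needed per word. The vocabulary size used here is an illustrative assumption.

```python
import math

vocab_size = 18000                           # illustrative vocabulary size N
depth = math.ceil(math.log2(vocab_size))     # binary decisions in a balanced tree
print(depth)                                 # 15
print(vocab_size / depth)                    # ratio of N-way work to tree work
```

With these numbers a prediction needs 15 binary decisions instead of a normalization over 18000 words.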



                 Tree-based factorization




• To define a distribution over leaf nodes:
  – Specify the probability of taking the left branch at each non-leaf
    node.
  – The probability of a leaf node is the product of probabilities of
    the left/right decisions that lead from the root node to the leaf
    node.
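
A small worked example of this factorization, assuming a hypothetical four-word tree with made-up branch probabilities (the node names, words, and numbers are illustrative only):

```python
# Each non-leaf node stores the probability of taking its LEFT branch.
p_left = {"root": 0.7, "L": 0.4, "R": 0.9}

# Path from the root to each word: (node, decision), where 1 = left, 0 = right.
paths = {
    "cat": [("root", 1), ("L", 1)],
    "dog": [("root", 1), ("L", 0)],
    "car": [("root", 0), ("R", 1)],
    "bus": [("root", 0), ("R", 0)],
}

def leaf_probability(word):
    """Product of the left/right decision probabilities along the path."""
    prob = 1.0
    for node, go_left in paths[word]:
        prob *= p_left[node] if go_left else 1.0 - p_left[node]
    return prob

probs = {w: leaf_probability(w) for w in paths}
print(probs)
print(sum(probs.values()))
```

The leaf probabilities sum to one (up to floating-point rounding) without any explicit normalization over the vocabulary, which is what makes the tree-based model fast.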



            Constructing trees over words
• The approach of Morin and Bengio:
  – Start with the WordNet IS-A hierarchy (which is a DAG)
  – Manually select one parent node per word
  – Use clustering to make the resulting tree binary
  – Use the Neural Probabilistic Language Model for making the
    left/right decisions

• Drawbacks:
  – Tree construction process uses expert knowledge
  – The resulting model does not work as well as its
    non-hierarchical counterpart

• Our approach:
  – Construct the word tree from data alone (no experts needed)
  – Allow each word to occur more than once in the tree
  – Use the simplified log-bilinear language model for making the
    left/right decisions
            Hierarchical log-bilinear model

• Let d be the binary code that encodes the sequence of
  left-right decisions in the tree that lead to word w.

• Each non-leaf node in the tree is given a feature vector.
  – Used for discriminating the words in the left subtree from those
    in the right subtree.

• The probability of taking the left branch at the i-th node in the
  sequence is

      P(d_i = 1 | q_i, w_{1:n-1}) = \sigma(\hat{r}^T q_i),

  – \hat{r} is computed as in the LBL model
  – q_i is the feature vector for the node

• The probability of w being the next word is

      P(w_n = w | w_{1:n-1}) = \prod_i P(d_i | q_i, w_{1:n-1}).
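
A minimal sketch of this computation with NumPy, assuming the predicted representation and the node feature vectors are already available (here they are random, and the function name `hlbl_word_probability` is hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hlbl_word_probability(r_hat, code, node_vectors):
    """P(w_n = w | context) as a product of binary decisions along w's path.

    r_hat        : predicted representation, computed as in the LBL model
    code         : decisions d_i on the path to w (1 = left, 0 = right)
    node_vectors : feature vectors q_i of the non-leaf nodes on that path
    """
    prob = 1.0
    for d, q in zip(code, node_vectors):
        p_left = sigmoid(r_hat @ q)              # P(d_i = 1 | q_i, context)
        prob *= p_left if d == 1 else 1.0 - p_left
    return prob

rng = np.random.default_rng(0)
dim = 8
r_hat = rng.normal(size=dim)
q_root, q_child = rng.normal(size=dim), rng.normal(size=dim)

# Two sibling words under the root's left child.
p_a = hlbl_word_probability(r_hat, [1, 1], [q_root, q_child])
p_b = hlbl_word_probability(r_hat, [1, 0], [q_root, q_child])
print(p_a + p_b, sigmoid(r_hat @ q_root))  # the two values agree
```

The siblings' probabilities add up to the probability of reaching their parent, so the distribution over all leaves is normalized by construction.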

             Data-driven tree construction

• We would like to cluster words based on the distribution
 of contexts in which they occur.

• This distribution is hard to estimate and work with due to
 the high dimensionality of the space of contexts.
  – same difficulties as with estimating n-gram models

• To avoid this problem, we represent contexts using
 distributed representations and cluster words based on
 their expected predicted representation.

• Constructing a tree over words:
  1. Train a model using a (balanced) random tree over words.
  2. Extract the word representations from the trained model.
  3. Perform hierarchical clustering on the extracted
     representations.

                 Hierarchical clustering

• Hierarchical top-down clustering of feature vectors:
  – At each level, fit a mixture of two Gaussians with spherical
    covariances using EM to the current group of word
    representations.
  – Assign words to mixture components based on the component
    responsibilities.

• We considered several splitting rules:
  – BALANCED: Sort the responsibilities and make the split to
    ensure a balanced tree.
  – ADAPTIVE: Assign the word to the component with the
    greater responsibility.
  – ADAPTIVE(ε): Assign the word to a component if its
    responsibility for the word is at least 0.5 − ε.
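
The three splitting rules can be sketched as follows, assuming the EM fit has already produced, for each word, the responsibility `resp0` of the first mixture component (the second component's responsibility is `1 - resp0`). The function names, the tie-breaking choice, and the example values are assumptions made for illustration.

```python
import numpy as np

def split_balanced(resp0):
    """Sort by responsibility and cut the sorted list in half."""
    order = np.argsort(-resp0)            # highest responsibility for component 0 first
    half = len(resp0) // 2
    return set(order[:half].tolist()), set(order[half:].tolist())

def split_adaptive(resp0):
    """Assign each word to the component with the greater responsibility."""
    left = set(np.flatnonzero(resp0 >= 0.5).tolist())   # ties go left (arbitrary choice)
    right = set(np.flatnonzero(resp0 < 0.5).tolist())
    return left, right

def split_adaptive_eps(resp0, eps):
    """Assign a word to every component whose responsibility is >= 0.5 - eps."""
    left = set(np.flatnonzero(resp0 >= 0.5 - eps).tolist())
    right = set(np.flatnonzero(1.0 - resp0 >= 0.5 - eps).tolist())
    return left, right

resp0 = np.array([0.95, 0.80, 0.60, 0.45, 0.30, 0.05])  # made-up responsibilities
print(split_balanced(resp0))
print(split_adaptive(resp0))
print(split_adaptive_eps(resp0, eps=0.25))
```

With ε = 0.25, the words at indices 2, 3 and 4 land in both groups, which is how a word can come to occur more than once in the tree.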



                 Dataset and evaluation

• APNews dataset:
  – collection of Associated Press news stories (16 million words)

• Preprocessing (Bengio et al.):
  – convert all words to lower case
  – map all rare words and proper nouns to special symbols
  – just under 18000 words in the vocabulary

• Models were compared based on the perplexity they
 assigned to the test set.
• Perplexity is the geometric average of 1 / P(w_n | w_{1:n-1}).
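
A short sketch of this definition, assuming `probs` holds the probabilities a model assigned to each word of the test set (the function name and example values are illustrative):

```python
import math

def perplexity(probs):
    """Geometric mean of 1 / P(w_n | w_{1:n-1}) over the test predictions."""
    avg_neg_log = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_neg_log)

print(perplexity([0.1, 0.1, 0.1]))   # a model that always assigns 0.1
print(perplexity([0.5, 0.125]))      # geometric mean of 2 and 8
```

Averaging in log space avoids underflow when the product of many small probabilities would otherwise be taken directly. Lower perplexity means the model assigned higher probability to the test words.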




                    Model evaluation (I)

• Preliminary comparison:
 – 10M training set, 0.5M validation set, 0.5M test set
 – Feature-based models have 100D feature vectors.
 – FRBMs have 1000 hidden units.
 – KNn is a Kneser-Ney back-off n-gram model.

  Model type      Context size   Model test perplexity   Mixture test perplexity
  FRBM            2              169.4                   110.6
  Temporal FRBM   2              127.3                    95.6
  Log-bilinear    2              132.9                   102.2
  Log-bilinear    5              124.7                    96.5
  Back-off GT3    2              135.3                   –
  Back-off KN3    2              124.3                   –
  Back-off GT6    5              124.4                   –
  Back-off KN6    5              116.2                   –



                   Model evaluation (II)

• Final comparison:
 – 14M training set, 1M validation set, 1M test set
 – (H)LBL used 100D feature vectors and a context size of 5.
 – KNn is an interpolated Kneser-Ney n-gram model.

  Model   Tree-generating      Test         Mixture      Fitted mixture   Minutes
  type    algorithm            perplexity   perplexity   perplexity       per epoch
  HLBL    RANDOM               151.2        107.2        106.0            4
  HLBL    BALANCED             131.3        99.9         99.7             4
  HLBL    ADAPTIVE             127.0        98.3         98.2             4
  HLBL    ADAPTIVE(0.25)       124.4        97.5         97.4             6
  HLBL    ADAPTIVE(0.4)        123.3        97.2         97.1             7
  HLBL    ADAPTIVE(0.4) × 2    115.7        95.3         95.3             16
  HLBL    ADAPTIVE(0.4) × 4    112.1        94.4         94.3             32
  LBL     –                    117.0        94.0         94.0             6420
  KN2     –                    174.2        –            –                –
  KN3     –                    125.6        –            –                –
  KN6     –                    119.2        –            –                –

                  The effect of the context size

[Figure: test perplexity (y-axis, 95 to 130) as a function of context
size (x-axis, 2 to 20) for the HLBL and KNn models.]

• The HLBL models were based on the ADAPTIVE(0.4) × 4 tree.
• KNn is an interpolated modified Kneser-Ney n-gram model.
THE END




     Log-prob contributions: 5-gram vs. LBL (I)

[Figure: number of predictions (y-axis, 0 to 3 × 10^5) in each bin
(x-axis, bins 1 to 8) for the KN5 and LBL5 models.]

Number of predictions (P(w_n | w_{1:n-1})) on the test set as a function of
their magnitude. Bin i (for i = 1, ..., 7) contains predictions
between 10^{-i} and 10^{-i+1}. Bin 8 contains predictions smaller than
10^{-7}.
    Log-prob contributions: 5-gram vs. LBL (II)

[Figure: sum of −log P (y-axis, 0 to 5 × 10^5) in each bin
(x-axis, bins 1 to 8) for the KN5 and LBL5 models.]

Contribution to the negative log-probability of the test set as a
function of the prediction magnitude. Bin i (for i = 1, ..., 7) contains
predictions between 10^{-i} and 10^{-i+1}. Bin 8 contains predictions
smaller than 10^{-7}.
    t-SNE embedding of LBL feature vectors (I)

[Figure: two-dimensional t-SNE scatter plot of word labels. Groups of
words that appear close together in the plot include:
  – officials, sources, prosecutors, investigators, authorities, police
  – director, chairman, chief, deputy
  – leader, president, prime_minister, governor
  – republicans, democrats
  – a.m., p.m.
  – europe, britain, germany, france, japan, russia
  – washington, chicago, new_york, los_angeles, boston
  – texas, florida, california, arizona, ohio
  – russian, french, chinese, british, japanese, german, american]

A fragment of a t-SNE embedding of the feature vectors (learned by
an LBL model) of the most frequent 1000 words.

   t-SNE embedding of LBL feature vectors (II)

[Figure: two-dimensional t-SNE scatter plot of word labels. Groups of
words that appear close together in the plot include:
  – fijian, ghanaian, burundian, romanian
  – egyptians, nigerians, hindus, brazilians, dominicans, kashmiris
  – mobs, rioters, mutineers, gunners, mobsters
  – hikers, campers, climbers, loggers
  – mosquitoes, pests, primates, bats, bees, mammals, butterflies
  – mount_everest, mongolia, everest, tasmania, lake_victoria, sarawak]

A fragment of a t-SNE embedding of the feature vectors (learned by
an LBL model) of the least frequent 1000 words.

  t-SNE embedding of HLBL feature vectors (I)
[Figure: two-dimensional t-SNE scatter plot of word labels. Groups of
words that appear close together in the plot include:
  – news_agency, news, newspaper, magazine, radio, television, tv
  – court, committee, supreme_court, judge, jury
  – father, mother, son, husband, daughter, brother, wife
  – leader, prime_minister, president, chairman, director, minister
  – village, town, city, island, state, capital, region, district]
                                                  county                nation  america
                                                                                 europe sarajevo                                              lawyer
                                                                                                                                              attorney
                                                                             mexico bosnia
                  presidential                california                       japan                             age
               gop congressional                                      united_states
              republican
            democratic senate
                          house            new_hampshire united_nations
                                              florida
                                             arizona
                                                 ohio
                                                                                germany
                                                                                 russia
                                                                             britainisrael
                                                                              france                   dollar
                                            iowa
                                               texas                     iraq
                                                                                 cuba                     fall                                       spokesman
                                                                                                                                                     spokeswoman
                                                                            china jordan
                                                                          beijing                          summer
                                                                  jerusalem   taiwan                                  hour hours
                                                                 moscow
                                                              washington                                        day          days
                                                                                                             night            months
                                                                                                                                years
                                                        los_angeles
                                                         new_york tokyo
                                                            chicago                                         morning year weeks
                                                                                                                    month
                                                                london                                           week
                                                                                                           weekend
                                                        boston
                                                                                                                today
                                                                                                                 tuesday
                                                                                                               thursday
                                                                                                                    wednesday
                                                                                                                  friday
                                   north
                                  south                                                                        mondaysaturday
                                 west
                               east                                                                              sunday


                                                                                            #n
                                                                                              november
                                                                                                january
                                                                                                 february
                                                                                               december

                                                                                               july
                                                                                                  march
                                                                                              june
                                                                                                 april
                                                                                                   jan


A fragment of a t-SNE embedding of the feature vectors (learned by
an HLBL model) of the 1000 most frequent words.

                                                                               25
  t-SNE embedding of HLBL feature vectors (II)
[Figure: t-SNE scatter plot of word feature vectors. Groupings are
looser than for frequent words, but some related words still appear
close together, for example:
 - surnames and given names (harlow, gillespie, ortega, rothschild, prescott)
 - foods (sausage, soda, margarine, onion, walnut, olive, yeast)
 - medical terms (ovarian, outpatient, intestinal, emphysema, rabies)
 - nationality adjectives (romanian, ghanaian)
 - materials (emerald, mahogany, gilt, porcelain)]

A fragment of a t-SNE embedding of the feature vectors (learned by
an HLBL model) of the 1000 least frequent words.
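
The two t-SNE plots differ only in which words are embedded: the most
frequent ones or the least frequent ones. The sketch below shows one
way such a selection could be made before running t-SNE. It is an
illustration only, not the procedure used for the slides: the helper
names (split_by_frequency, rows_for), the toy corpus and the toy
feature matrix are all assumptions, and ties in frequency are broken
alphabetically purely to make the result deterministic.

```python
from collections import Counter


def split_by_frequency(tokens, k):
    """Return (k most frequent words, k least frequent words).

    Hypothetical helper: ties are broken alphabetically.
    """
    counts = Counter(tokens)
    # Rank by descending count, then alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    # max(...) keeps the slice valid when k is 0 or exceeds the vocabulary.
    return ranked[:k], ranked[max(len(ranked) - k, 0):]


def rows_for(words, vocab_index, feature_matrix):
    """Collect the feature vector (one row per word) of each selected word."""
    return [feature_matrix[vocab_index[w]] for w in words]


# Toy corpus and toy 2-d feature vectors (both made up for illustration;
# in the slides the vectors are learned by an HLBL model).
tokens = "the cat sat on the mat the cat ran".split()
vocab_index = {w: i for i, w in enumerate(sorted(set(tokens)))}
feature_matrix = [[float(i), float(-i)] for i in range(len(vocab_index))]

most, least = split_by_frequency(tokens, k=2)
most_vectors = rows_for(most, vocab_index, feature_matrix)
least_vectors = rows_for(least, vocab_index, feature_matrix)
```

With k = 1000 and real feature vectors, most_vectors or least_vectors
could then be passed to any t-SNE implementation (for instance
scikit-learn's sklearn.manifold.TSNE) to obtain 2-d coordinates for
plotting.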

                                                                                   26

				