# LEARNING DISTRIBUTED REPRESENTATIONS FOR STATISTICAL LANGUAGE MODELLING


```
LEARNING DISTRIBUTED REPRESENTATIONS
FOR STATISTICAL LANGUAGE MODELLING

Overview

1. Discrete data and distributed representations

2. Language modelling
• Factored RBM language model
• Log-bilinear language model
• Hierarchical log-bilinear language model

Discrete data

• Discrete data: datapoints with discrete-valued attributes

• When such datapoints are high-dimensional, regression /
classification / density estimation is hard:
– Amounts to estimating entries of an exponentially large
table
- Attributes correspond to table dimensions
- Attribute values correspond to indices for the dimensions
– Data sparsity: little or no data available for most entries
– No a priori smoothness constraint on table entries
– No general way to generalize to new table entries

Distributed representations
• Observation: making a model less local often improves
generalization.
– In a continuous space: average over datapoints near the point
of interest.
– In a discrete space: not clear what to average over.
- What does “near” mean?
- No general concept of distance / neighbourhood.

• Working with smooth functions over continuous spaces
results in automatic smoothing.
– Similar inputs produce similar outputs

• Idea: map discrete attributes to real-valued vectors and
learn a smooth function that maps the vectors to the
desired output values.
– Learn the attribute mapping jointly with the function.
– Automatic generalization!

Statistical language modelling

• Goal: Model the joint distribution of words in a sentence.

• Such a model can be used to
– predict the next word given several preceding ones
– arrange bags of words into sentences
– assign probabilities to documents

• Applications: speech recognition, machine translation,
information retrieval.

• Most statistical language models are based on the Markov
assumption:
– The distribution of the next word depends on only n words that
immediately precede it.
– This assumption is clearly wrong but useful – it makes the task
much more tractable.

n-gram models
• n-gram models are simply conditional probability tables
for P(w_n | w_{1:n-1}).
– w_n is the word to be predicted (the next word)
– words w_{1:n-1} = w_1, ..., w_{n-1} are called the context

• n-gram models are estimated by counting the number of
occurrences of each possible word n-tuple and
normalizing.
– smoothing the estimates is essential for good performance
– many different smoothing methods exist

• n-gram models are the most widely used statistical
language models due to their simplicity and excellent
performance.

• Curse of dimensionality: the number of model parameters
is exponential in n.

Neural language models

• Several neural probabilistic language models based on
distributed representations have been proposed.

• Common approach:
– Represent each word with a real-valued feature vector
– Represent the context by the sequence of the context word
feature vectors
– Train a neural network to output the distribution for the next
word from the context representation
– Learn word feature vectors jointly with other neural net
parameters

• Neural language models can outperform n-gram language
models, especially when little training data is available.

• Main drawback: very long training and testing times.

Conditional RBM language model

• Use a restricted Boltzmann machine to model P(w_n | w_{1:n-1})
– Capture the interaction between w_n and w_{1:n-1} through a vector
of latent variables.
– Represent words using low-dimensional real-valued vectors.
- R_w is the feature vector for word w.

• Energy function:
    E(w_n, h; w_{1:n-1}) = -\sum_{i=1}^{n} R_{w_i} W_i h
– h is the vector of latent variables
– W_i is the interaction matrix between the feature vector for w_i
and the latent variables.
– Normalization is done only over w_n.

• Both inference and prediction take time linear in the
number of latent variables.

Log-bilinear model

• The log-bilinear (LBL) model is perhaps the simplest
neural language model.

• Given the context w_{1:n-1}, the LBL model first predicts the
representation for the next word w_n by linearly combining
the representations of the context words:
    \hat{r} = \sum_{i=1}^{n-1} C_i r_{w_i}
– r_w is the real-valued vector representing word w
• Then the distribution for the next word is computed based
on the similarity between the predicted representation and
the representations of all words in the vocabulary:
    P(w_n = w | w_{1:n-1}) = \frac{\exp(\hat{r}^T r_w)}{\sum_j \exp(\hat{r}^T r_j)}.

Faster models through structured vocabulary

• Computing the probability of the given next word
requires considering all N words in the vocabulary.
– Need to consider all words because the word space is
unstructured.

• Idea: Organize words in the vocabulary into a binary tree
and exploit its structure to speed up normalization (Morin
and Bengio, 2005).
– Construct a binary tree over words
- words are associated with leaf nodes
- one word per leaf
– Replace the N-way decision by a sequence of O(log N) binary
decisions for predicting the next word.
- Can achieve an exponential speedup if the tree is balanced!

Tree-based factorization

• To define a distribution over leaf nodes:
– Specify the probability of taking the left branch at each non-leaf
node.
– The probability of a leaf node is the product of probabilities of
the left/right decisions that lead from the root node to the leaf
node.

Constructing trees over words
• The approach of Morin and Bengio:
– Manually select one parent node per word
– Use clustering to make the resulting tree binary
– Use the Neural Probabilistic Language Model for making the
left/right decisions

• Drawbacks:
– Tree construction process uses expert knowledge
– The resulting model does not work as well as its
non-hierarchical counterpart

• Our approach:
– Construct the word tree from data alone (no experts needed)
– Allow each word to occur more than once in the tree
– Use the simplified log-bilinear language model for making the
left/right decisions

Hierarchical log-bilinear model

• Let d be the binary code that encodes the sequence of
left-right decisions in the tree that lead to word w.

• Each non-leaf node in the tree is given a feature vector.
– Used for discriminating the words in the left subtree from those
in the right subtree.

• The probability of taking the left branch at the i-th node in the
sequence is
    P(d_i = 1 | q_i, w_{1:n-1}) = \sigma(\hat{r}^T q_i),
– \hat{r} is computed as in the LBL model
– q_i is the feature vector for the node

• The probability of w being the next word is
    P(w_n = w | w_{1:n-1}) = \prod_i P(d_i | q_i, w_{1:n-1}).

Data-driven tree construction

• We would like to cluster words based on the distribution
of contexts in which they occur.

• This distribution is hard to estimate and work with due to
the high dimensionality of the space of contexts.
– same difficulties as with estimating n-gram models

• To avoid this problem, we represent contexts using
distributed representations and cluster words based on
their expected predicted representation.

• Constructing a tree over words:
1. Train a model using a (balanced) random tree over words.
2. Extract the word representations from the trained model.
3. Perform hierarchical clustering on the extracted
representations.

Hierarchical clustering

• Hierarchical top-down clustering of feature vectors:
– At each level, fit a mixture of two Gaussians with spherical
covariances using EM to the current group of word
representations.
– Assign words to mixture components based on the component
responsibilities.

• We considered several splitting rules:
– BALANCED: Sort the responsibilities and make the split to
ensure a balanced tree.
– ADAPTIVE: Assign the word to the component with the
greater responsibility.
– ADAPTIVE(ε): Assign the word to a component if its
responsibility for the word is at least 0.5 - ε.

Dataset and evaluation

• APNews dataset:
– collection of Associated Press news stories (16 million words)

• Preprocessing (Bengio et al.):
– convert all words to lower case
– map all rare words and proper nouns to special symbols
– just under 18000 words in the vocabulary

• Models were compared based on the perplexity they
assigned to the test set.
• Perplexity is the geometric average of 1 / P(w_n | w_{1:n-1}).

Model evaluation (I)

• Preliminary comparison:
– 10M training set, 0.5M validation set, 0.5M test set
– Feature-based models have 100D feature vectors.
– FRBMs have 1000 hidden units.
– KNn is a Kneser-Ney back-off n-gram model.

Model            Context   Model test   Mixture test
type             size      perplexity   perplexity
FRBM             2         169.4        110.6
Temporal FRBM    2         127.3         95.6
Log-bilinear     2         132.9        102.2
Log-bilinear     5         124.7         96.5
Back-off GT3     2         135.3            –
Back-off KN3     2         124.3            –
Back-off GT6     5         124.4            –
Back-off KN6     5         116.2            –

Model evaluation (II)

• Final comparison:
– 14M training set, 1M validation set, 1M test set
– (H)LBL used 100D feature vectors and a context size of 5.
– KNn is an interpolated Kneser-Ney n-gram model.

Model   Tree generating      Test         Mixture      Fitted mix.   Minutes
type    algorithm            perplexity   perplexity   perplexity    per epoch
HLBL    RANDOM               151.2        107.2        106.0            4
HLBL    BALANCED             131.3         99.9         99.7            4
HLBL    ADAPTIVE             127.0         98.3         98.2            4
HLBL    ADAPTIVE(0.25)       124.4         97.5         97.4            6
HLBL    ADAPTIVE(0.4)        123.3         97.2         97.1            7
HLBL    ADAPTIVE(0.4) × 2    115.7         95.3         95.3           16
HLBL    ADAPTIVE(0.4) × 4    112.1         94.4         94.3           32
LBL     –                    117.0         94.0         94.0         6420
KN2     –                    174.2            –            –            –
KN3     –                    125.6            –            –            –
KN6     –                    119.2            –            –            –

The effect of the context size
[Figure: test perplexity (y-axis, 95 to 130) versus context size (x-axis, 2 to 20) for the HLBL and KNn models.]

• The HLBL models were based on the ADAPTIVE(0.4) × 4 tree.
• KNn is an interpolated modified Kneser-Ney n-gram model.

THE END

Log-prob contributions: 5-gram vs. LBL (I)
[Figure: number of predictions (y-axis, 0 to 3 x 10^5) per bin (x-axis, bins 1 to 8) for KN5 and LBL5.]

Number of predictions (P(w_n | w_{1:n-1})) on the test set as a function of
their magnitude. Bin i (for i = 1, ..., 7) contains predictions
between 10^{-i} and 10^{-i+1}. Bin 8 contains predictions smaller than
10^{-7}.

Log-prob contributions: 5-gram vs. LBL (II)
[Figure: sum of -log P (y-axis, 0 to 5 x 10^5) per bin (x-axis, bins 1 to 8) for KN5 and LBL5.]

Contribution to the negative log-probability of the test set as a
function of the prediction magnitude. Bin i (for i = 1, ..., 7) contains
predictions between 10^{-i} and 10^{-i+1}. Bin 8 contains predictions
smaller than 10^{-7}.

t-SNE embedding of LBL feature vectors (I)

[Figure: 2-D t-SNE scatter plot of word labels. Labels in this fragment include officials and roles (police, authorities, president, governor), groups of people (workers, soldiers, voters, candidates), countries and nationalities (france, japan, russia, chinese, british), and cities and U.S. states (tokyo, london, texas, florida).]

A fragment of a t-SNE embedding of the feature vectors (learned by
an LBL model) of the most frequent 1000 words.

t-SNE embedding of LBL feature vectors (II)

[Figure: 2-D t-SNE scatter plot of word labels. Labels in this fragment include machetes, gunboats, mobsters, rioters, egyptians, brazilians, hikers, campers, climbers, mosquitoes, bees, butterflies and mammals.]

A fragment of a t-SNE embedding of the feature vectors (learned by
an LBL model) of the least frequent 1000 words.

t-SNE embedding of HLBL feature vectors (I)

[Figure: 2-D t-SNE scatter plot of word labels. Labels in this fragment include days of the week (tuesday, thursday, friday, sunday), months (november, january, july, april), time units (day, week, month, year), family terms (father, mother, son, daughter, wife), and countries and cities (germany, russia, france, tokyo, london).]

A fragment of a t-SNE embedding of the feature vectors (learned by
an HLBL model) of the most frequent 1000 words.

t-SNE embedding of HLBL feature vectors (II)

[Figure: 2-D t-SNE scatter plot of word labels. Labels in this fragment include picasso, cezanne, galileo, aristotle, marx, ovarian, intestinal, emphysema, rabies, charcoal, porcelain, mahogany and emerald.]

A fragment of a t-SNE embedding of the feature vectors (learned by
an HLBL model) of the least frequent 1000 words.


```
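The two steps on the "Log-bilinear model" slide (predict a representation from the context, then score it against every word) can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function name `lbl_next_word_distribution`, the toy sizes, and the random parameters are all assumptions made for the example, and nothing here is trained.

```python
import math
import random


def lbl_next_word_distribution(context, R, C):
    """Return P(w_n = w | context) for every word w under a log-bilinear model.

    context: word indices w_1, ..., w_{n-1}
    R:       word feature vectors, R[w] is r_w
    C:       one square matrix per context position, C[i] is a list of rows
    """
    dim = len(R[0])
    # Predicted representation: r_hat = sum_i C_i r_{w_i}
    r_hat = [0.0] * dim
    for C_i, w in zip(C, context):
        r_w = R[w]
        for row in range(dim):
            r_hat[row] += sum(C_i[row][col] * r_w[col] for col in range(dim))
    # Similarity score r_hat^T r_w for every word in the vocabulary
    scores = [sum(a * b for a, b in zip(r_hat, r_w)) for r_w in R]
    # Softmax over the scores; subtracting the maximum avoids overflow
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical toy setup: 6 words, 3-D features, context of 2 words
rng = random.Random(0)
vocab_size, dim, context_size = 6, 3, 2
R = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
C = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
     for _ in range(context_size)]
probs = lbl_next_word_distribution([4, 1], R, C)
print(probs)
```

The normalization loops over the whole vocabulary, which is the cost the tree-structured model on the later slides is designed to avoid.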
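The "Tree-based factorization" and "Hierarchical log-bilinear model" slides define the probability of a word as a product of left/right decisions along its path. The sketch below shows that product on a hypothetical four-word balanced tree. The node vectors, the predicted representation `r_hat`, and the word list are made-up values for illustration, and `d_i = 1` is taken to mean "left", as on the slide.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hlbl_word_probability(code, path_vectors, r_hat):
    """Product over the path of P(d_i | q_i, context).

    code:         binary decisions d_i from root to leaf (1 = left branch)
    path_vectors: feature vectors q_i of the non-leaf nodes visited
    r_hat:        predicted representation, computed as in the LBL model
    """
    prob = 1.0
    for d_i, q_i in zip(code, path_vectors):
        p_left = sigmoid(dot(r_hat, q_i))
        prob *= p_left if d_i == 1 else 1.0 - p_left
    return prob


# Hypothetical balanced tree: a root with two internal children, four leaves
q_root, q_left, q_right = [0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]
tree = {
    "cat": ([1, 1], [q_root, q_left]),
    "dog": ([1, 0], [q_root, q_left]),
    "car": ([0, 1], [q_root, q_right]),
    "bus": ([0, 0], [q_root, q_right]),
}
r_hat = [0.7, -0.4]  # assumed output of the context-combination step
probs = {w: hlbl_word_probability(code, path, r_hat)
         for w, (code, path) in tree.items()}
total = sum(probs.values())
print(probs, total)
```

Each word needs only two sigmoid evaluations here instead of a four-way normalization, and the leaf probabilities still sum to one because the two branches at every node share a single probability.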
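The three splitting rules on the "Hierarchical clustering" slide differ only in how the mixture responsibilities are turned into a left/right assignment. The following sketch assumes the responsibilities of one component have already been computed (the EM fit itself is not shown), and the function name `split_words` and the example values are hypothetical.

```python
def split_words(resp, rule="ADAPTIVE", eps=0.0):
    """Split words between two mixture components.

    resp: dict mapping word -> responsibility of component 0 for that word
          (component 1 has responsibility 1 - resp[word]).
    Returns (words sent to component 0, words sent to component 1).
    """
    words = list(resp)
    if rule == "BALANCED":
        # Sort by responsibility and cut the ranking in half
        ranked = sorted(words, key=lambda w: resp[w], reverse=True)
        half = (len(ranked) + 1) // 2
        return ranked[:half], ranked[half:]
    if rule == "ADAPTIVE":
        # Each word goes to the component with the greater responsibility
        left = [w for w in words if resp[w] >= 0.5]
        right = [w for w in words if resp[w] < 0.5]
        return left, right
    # ADAPTIVE(eps): a word joins every component whose responsibility
    # is at least 0.5 - eps, so ambiguous words can land on both sides
    threshold = 0.5 - eps
    left = [w for w in words if resp[w] >= threshold]
    right = [w for w in words if 1.0 - resp[w] >= threshold]
    return left, right


# Made-up responsibilities for six words
resp = {"monday": 0.95, "tuesday": 0.9, "bank": 0.45,
        "paris": 0.1, "london": 0.05, "tokyo": 0.2}
balanced = split_words(resp, "BALANCED")
adaptive = split_words(resp, "ADAPTIVE")
adaptive_eps = split_words(resp, "ADAPTIVE_EPS", eps=0.1)
print(balanced, adaptive, adaptive_eps, sep="\n")
```

With `eps=0.1` the ambiguous word `bank` is placed in both subtrees, which is how a word can occur more than once in the tree.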
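Perplexity, as defined on the "Dataset and evaluation" slide, is the geometric average of the inverse next-word probabilities. A small sketch of that calculation follows, computed in log space for numerical stability. The helper name `perplexity` and the three probabilities are illustrative assumptions, not values from the experiments.

```python
import math


def perplexity(probabilities):
    """Geometric average of 1 / p over the model's next-word probabilities."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


# Inverse probabilities are 4, 2 and 8; their geometric mean is 4
# (up to floating-point rounding).
ppl = perplexity([0.25, 0.5, 0.125])
print(ppl)
```

Lower is better: a model that assigned probability 1 to every test word would have a perplexity of 1.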