# INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Massimo Poesio

LECTURE 16: Unsupervised methods, IR, and lexical acquisition
FEATURE-DEFINED DATA SPACE

[Figure: unlabeled data points scattered in a feature-defined space]
UNSUPERVISED MACHINE LEARNING
• In many cases, what we want to learn is not a
target function from examples to classes, but
what the classes are
– I.e., learn without being told
EXAMPLE: TEXT CLASSIFICATION
• Consider clustering a large set of computer
science documents

[Figure: document clusters labeled Arch., Graphics, Theory, NLP, AI]
CLUSTERING
• Partition unlabeled examples into disjoint
subsets of clusters, such that:
– Examples within a cluster are very similar
– Examples in different clusters are very different
• Discover new categories in an unsupervised
manner (no sample category labels
provided).

Deciding what a new doc is about
• Check which region the new doc falls into
– can output “softer” decisions as well.

[Figure: the same cluster regions; a new document falls into the AI region, so it is labeled AI]
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy
(dendrogram) from a set of unlabeled examples.
animal
├─ vertebrate: fish, reptile, amphib., mammal
└─ invertebrate: worm, insect, crustacean

• Recursive application of a standard clustering
algorithm can produce a hierarchical clustering.

Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up) methods start
with each example in its own cluster and
iteratively combine them to form larger and
larger clusters.
• Divisive (partitional, top-down) methods
separate all examples into clusters from the outset.

Direct Clustering Method
• Direct clustering methods require a
specification of the number of clusters, k,
desired.
• A clustering evaluation function assigns a
real-valued quality measure to a clustering.
• The number of clusters can be determined
automatically by explicitly generating
clusterings for multiple values of k and
choosing the best result according to a
clustering evaluation function.

Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining
the similarity of two instances.
• Starts with all instances in a separate cluster
and then repeatedly joins the two clusters that
are most similar until there is only one cluster.
• The history of merging forms a binary tree or
hierarchy.

Cluster Similarity
• Assume a similarity function that determines the
similarity of two instances: sim(x,y).
– Cosine similarity of document vectors.
• How to compute similarity of two clusters each
possibly containing multiple instances?
– Single Link: Similarity of two most similar members.
– Complete Link: Similarity of two least similar members.
– Group Average: Average similarity between members.
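The HAC loop and the three linkage options above can be sketched as follows. This is a minimal pure-Python sketch, not a production implementation: cosine similarity over dense vectors stands in for whatever instance-level sim(x, y) is chosen, and the function names are illustrative.

```python
import math

def cosine(x, y):
    # Cosine similarity of two equal-length dense vectors.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def cluster_sim(c1, c2, linkage="single"):
    # Similarity of two clusters, each a list of instances.
    sims = [cosine(x, y) for x in c1 for y in c2]
    if linkage == "single":       # similarity of the two MOST similar members
        return max(sims)
    if linkage == "complete":     # similarity of the two LEAST similar members
        return min(sims)
    return sum(sims) / len(sims)  # group average over all member pairs

def hac(instances, linkage="single"):
    # Start with every instance in its own cluster, then repeatedly merge
    # the two most similar clusters. The merge history forms a binary tree.
    clusters = [[x] for x in instances]
    history = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: cluster_sim(clusters[p[0]],
                                                    clusters[p[1]], linkage))
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history
```

The O(n^2) pair scan per merge is the naive formulation; real systems cache the similarity matrix.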

Non-Hierarchical Clustering
• Typically must provide the number of desired
clusters, k.
• Randomly choose k instances as seeds, one per
cluster.
• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to different
clusters to improve the overall clustering.
• Stop when clustering converges or after a fixed
number of iterations.
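The iterative reallocation scheme above is essentially k-means. A minimal pure-Python sketch, assuming Euclidean distance and mean centroids; the random seeding and exact-convergence test are simplifications, not the only possible choices:

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # 1. Randomly choose k instances as seeds, one per cluster.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # 2. Reallocate each instance to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[i].append(p)
        # 3. Recompute centroids; stop when they no longer move.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids
```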

CLUSTERING METHODS IN NLP
• Unsupervised techniques are heavily used in:
– Text classification
– Information retrieval
– Lexical acquisition
Feature-based lexical semantics

• Very old idea in lexical semantics: the meaning of a word can be
specified in terms of the values of certain 'features'
('DECOMPOSITIONAL SEMANTICS')
– dog: ANIMATE=+, EAT=MEAT, SOCIAL=+
– horse: ANIMATE=+, EAT=GRASS, SOCIAL=+
– cat: ANIMATE=+, EAT=MEAT, SOCIAL=-

2004/05                    ANLE                   16
FEATURE-BASED REPRESENTATIONS IN PSYCHOLOGY

• Feature-based concept representations assumed by many
cognitive psychology theories (Smith and Medin, 1981, McRae
et al, 1997)
• Underpin development of prototype theory (Rosch et al)
• Used, e.g., to account for semantic priming (McRae et al,
1997; Plaut, 1995)
• Underlie much work on category-specific defects (Warrington
and Shallice, 1984; Caramazza and Shelton, 1998; Tyler et al,
2000; Vinson and Vigliocco, 2004)

Feb 21st                 Cog/Comp Neuroscience               17
SPEAKER-GENERATED FEATURES (VINSON AND
VIGLIOCCO)

Vector-based lexical semantics

• If we think of the features as DIMENSIONS we
can view these meanings as VECTORS in a
FEATURE SPACE
– (An idea introduced by Salton in Information
Retrieval, see below)

Vector-based lexical semantics
[Figure: DOG, CAT, and HORSE plotted as vectors in feature space]
General characterization of vector-based semantics (from Charniak)
• Vectors as models of concepts
• The CLUSTERING approach to lexical semantics:
1. Define properties one cares about, and give values to each
property (generally, numerical)
2. Create a vector of length n for each item to be classified
3. Viewing the n-dimensional vector as a point in n-space,
cluster points that are near one another
• What changes between models:
1. The properties used in the vector
2. The distance metric used to decide if two points are
'close'
3. The algorithm used to cluster

Using words as features in a vector-based semantics
•    The old decompositional semantics approach requires
i.    Specifying the features
ii.   Characterizing the value of these features for each lexeme
•    Simpler approach: use as features the WORDS that occur in the proximity of that
word / lexical entry
–     Intuition: “You can tell a word’s meaning from the company it keeps”
•    More specifically, you can use as 'values' of these features
–     The FREQUENCIES with which these words occur near the words whose meaning we
are defining
–     Or perhaps the PROBABILITIES that these words occur next to each other
•    Alternative: use the DOCUMENTS in which these words occur (e.g., LSA)
•    Some psychological results support this view. Lund, Burgess, et al (1995, 1997):
lexical associations learned this way correlate very well with priming experiments.
Landauer et al: good correlation on a variety of topics, including human
categorization & vocabulary tests.

Using neighboring words to specify lexical meanings
Learning the meaning of DOG from text
The lexicon we acquire
Meanings in word space
Acquiring lexical vectors from a corpus
(Schuetze, 1991; Burgess and Lund, 1997)
• To construct vectors C(w) for each word w:
1. Scan a text
2. Whenever a word w is encountered, increment all cells of
C(w) corresponding to the words v that occur in the
vicinity of w, typically within a window of fixed size
• Differences among methods:
– Size of window
– Weighted or not
– Whether every word in the vocabulary counts as a
dimension (including function words such as the or and)
or whether instead only some specially chosen words are
used (typically, the m most common content words in the
corpus; or perhaps modifiers only). The words chosen as
dimensions are often called CONTEXT WORDS
– Whether dimensionality reduction methods are applied
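The scan-and-increment loop above can be sketched as follows. This is a minimal sketch of one set of choices: unweighted counts, a fixed symmetric window, and every vocabulary word kept as a dimension; the function name is illustrative.

```python
from collections import defaultdict

def cooccurrence_vectors(tokens, window=2):
    # For each occurrence of w, increment C(w) for every word v that
    # occurs within `window` positions to the left or right of w.
    vectors = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[w][tokens[j]] += 1
    return vectors
```

The variations listed above (window size, weighting, restricting dimensions to chosen context words, dimensionality reduction) would all slot into this loop.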
Variant: using probabilities (e.g., Dagan et al, 1997)

• E.g., for house
• Context vector (using probabilities):
–   0.001394 0.016212 0.003169 0.000734 0.001460 0.002901 0.004725 0.000598 0
0 0.008993 0.008322 0.000164 0.010771 0.012098 0.002799 0.002064 0.007697
0 0 0.001693 0.000624 0.001624 0.000458 0.002449 0.002732 0 0.008483
0.007929 0 0.001101 0.001806 0 0.005537 0.000726 0.011563 0.010487 0
0.001809 0.010601 0.000348 0.000759 0.000807 0.000302 0.002331 0.002715
0.020845 0.000860 0.000497 0.002317 0.003938 0.001505 0.035262 0.002090
0.004811 0.001248 0.000920 0.001164 0.003577 0.001337 0.000259 0.002470
0.001793 0.003582 0.005228 0.008356 0.005771 0.001810 0 0.001127 0.001225
0 0.008904 0.001544 0.003223 0

Variant: using modifiers to specify the meaning of words
• …. The Soviet cosmonaut …. The American astronaut
…. The red American car …. The old red truck … the
spacewalking cosmonaut … the full Moon …

              cosmonaut  astronaut  moon  car  truck
Soviet            1          0        0    1     1
American          0          1        0    1     1
spacewalking      1          1        0    0     0
red               0          0        0    1     1
full              0          0        1    0     0
old               0          0        0    1     1
Another variant: word / document matrices

             d1  d2  d3  d4  d5  d6
cosmonaut     1   0   1   0   0   0
astronaut     0   1   0   0   0   0
moon          1   1   0   0   0   0
car           1   0   0   1   1   0
truck         0   0   0   1   0   1
Measures of semantic similarity
• Euclidean distance: $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Cosine: $\cos(\theta) = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\;\sqrt{\sum_{i=1}^{n} y_i^2}}$
• Manhattan metric: $d = \sum_{i=1}^{n} |x_i - y_i|$
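In code, the three measures can be sketched over equal-length dense vectors (a minimal pure-Python sketch; vectorized libraries would normally be used instead):

```python
import math

def euclidean(x, y):
    # Root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    # Dot product normalized by the two vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0
```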

SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE

[Figure: angle θ between document vector d_j and query vector q_k]

$\cos(\theta) = \dfrac{\vec{d}_j \cdot \vec{q}_k}{|\vec{d}_j|\,|\vec{q}_k|}$

$sim(q_k, d_j) = \dfrac{\sum_{i=1}^{N} w_{k,i}\, w_{j,i}}{\sqrt{\sum_{i=1}^{N} w_{k,i}^2}\;\sqrt{\sum_{i=1}^{N} w_{j,i}^2}}$
EVALUATION
• Synonymy identification
• Text coherence
• Semantic priming
SYNONYMY: THE TOEFL TEST
TOEFL TEST: RESULTS
Some psychological evidence for
vector-space representations
• Burgess and Lund (1996, 1997): the clusters found with HAL
correlate well with those observed using semantic priming
experiments.
• Landauer, Foltz, and Laham (1997): scores overlap with
those of humans on standard vocabulary and topic tests;
mimic human scores on category judgments; etc.
• Evidence about 'prototype theory' (Rosch et al, 1976)
– Posner and Keele, 1968
• subjects presented with patterns of dots that had been obtained by
variations from a single pattern (the 'prototype')
• Later, they recalled prototypes better than samples they had actually
seen
– Rosch et al, 1976: 'basic level' categories (apple, orange, potato,
carrot) have higher 'cue validity' than elements higher in the
hierarchy (fruit, vegetable) or lower (red delicious, cox)

The HAL model
(Burgess and Lund, 1995, 1996, 1997)
• A 160-million-word corpus of articles
extracted from all newsgroups containing
English dialogue
• Context words: the 70,000 most frequently
occurring symbols within the corpus
• Window size: 10 words to the left and the
right of the word
• Measure of similarity: cosine

HAL AND SEMANTIC PRIMING
INFORMATION RETRIEVAL
• GOAL: Find the documents most relevant to a
certain QUERY
• Latest development: WEB SEARCH
– Use the Web as the collection of documents
• Related:
– DOCUMENT CLASSIFICATION
DOCUMENTS AS BAGS OF WORDS

DOCUMENT:
broad tech stock rally may signal
technology stocks rallied on tuesday, with gains scored
amid what some traders called a recovery from recent doldrums.

INDEX:
may, rallied, signal, stock, tech, technology, trend
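Reducing a document to a bag of index terms can be sketched as: lowercase the text, drop stopwords, and keep the remaining terms with their counts, discarding word order. The stopword list here is a hypothetical short list for illustration; real indexers use larger lists and usually stem or lemmatize.

```python
import re
from collections import Counter

# Illustrative stopword list, not a standard one.
STOPWORDS = {"a", "amid", "from", "on", "some", "the", "what", "with"}

def bag_of_words(text):
    # Tokenize on runs of letters, lowercase, filter stopwords, count.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)
```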
THE VECTOR SPACE MODEL
• Query and documents represented as vectors
of index terms, assigned non-binary WEIGHTS
• Similarity calculated using vector algebra:
COSINE (cf. lexical similarity models)
– RANKED similarity
• Most popular of all models (cf. Salton and
Lesk’s SMART)
TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE

$tfidf_{i,k} = f_{i,k} \cdot \log\left(\dfrac{N}{df_i}\right)$

where $f_{i,k}$ is the frequency of term i in document k, N is the
total number of documents, and $df_i$ is the number of documents
containing term i.
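The tf.idf weighting above can be sketched directly, with documents as token lists; `tfidf_weights` is an illustrative name, and this plain variant omits the smoothing that practical implementations often add.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    # docs: list of token lists. Returns one {term: weight} dict per
    # document, with weight = f_ik * log(N / df_i).
    N = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))          # df_i: documents containing term i
    return [
        {t: f * math.log(N / df[t]) for t, f in Counter(d).items()}
        for d in docs
    ]
```

Note that a term occurring in every document gets weight log(N/N) = 0: it cannot discriminate between documents.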
VECTOR-SPACE MODELS WITH SYNTACTIC INFORMATION
• Pereira and Tishby, 1992: two words are similar if
they occur as objects of the same verbs
– John ate POPCORN
– John ate BANANAS
• C(w) is the distribution of verbs for which w
served as direct object.
– First approximation: just counts
– In fact: probabilities
• Similarity: RELATIVE ENTROPY
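Relative entropy (KL divergence) between two such verb distributions can be sketched as follows; this assumes q(v) > 0 wherever p(v) > 0, since the measure is undefined otherwise, and it is asymmetric in p and q.

```python
import math

def relative_entropy(p, q):
    # D(p || q) = sum_v p(v) * log(p(v) / q(v)), over the support of p.
    # p and q are {verb: probability} dicts.
    return sum(pv * math.log(pv / q[v]) for v, pv in p.items() if pv > 0)
```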
(SYNTACTIC) RELATION-BASED VECTOR MODELS
[Figure: dependency parse of "the red fox attacked the lazy dog", with arcs
subj(attacked, fox), obj(attacked, dog), and det/mod arcs to "the", "red", "lazy"]

attacked: <subj,fox> <obj,dog>
fox:      <det,the>  <mod,red>
dog:      <det,the>  <mod,lazy>

E.g., Grefenstette, 1994; Lin, 1998; Curran and Moens, 2002

SEXTANT (Grefenstette, 1992)
It was concluded that the carcinoembryonic antigens represent cellular
constituents which are repressed during the course of differentiation of
the normal digestive system epithelium and reappear in the
corresponding malignant cells by a process of derepressive
dedifferentiation.

antigen repress-DOBJ
antigen represent-SUBJ
constituent represent-DOBJ
course repress-IOBJ
...

SEXTANT: Similarity measure
DOG                         CAT
dog pet-DOBJ                cat pet-DOBJ
dog eat-SUBJ                cat pet-DOBJ
dog leash-NN

Jaccard: Count(attributes shared by A and B) /
         Count(unique attributes possessed by either A or B)

Count{leash-NN, pet-DOBJ} /
Count{brown-ADJ, eat-SUBJ, hairy-ADJ, leash-NN, pet-DOBJ, shaggy-ADJ} = 2/6

MULTIDIMENSIONAL SCALING
• Many models (including HAL) apply techniques
for REDUCING the number of dimensions
• Intuition: many features express a similar
property / topic
Latent Semantic Analysis (LSA)
(Landauer et al, 1997)
• Goal: extract relations of expected contextual
usage from passages
• Three steps:
1. Build a word / document cooccurrence matrix
2. 'Weigh' each cell
3. Perform a DIMENSIONALITY REDUCTION
• Argued to correlate well with humans on a
number of tests
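The dimensionality-reduction step is a truncated singular value decomposition: keeping only the k largest singular values of the weighted word/document matrix A gives its best rank-k approximation in the least-squares sense.

```latex
A = U \Sigma V^{T}, \qquad A_k = U_k \Sigma_k V_k^{T}
```

Word (or document) similarities are then computed between rows (or columns) of the reduced matrix rather than the raw counts, so words that never cooccur can still come out similar if they occur in similar documents.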

LSA: the method, 1

LSA: Singular Value Decomposition

LSA: Reconstructed matrix

Topic correlations in 'raw' and 'reconstructed' data

Some caveats
• Two senses of 'similarity'
– Schuetze: two words are similar if one can replace the other
– Brown et al: two words are similar if they occur in similar
contexts
• What notion of 'meaning' is learned here?
– “One might consider LSA’s maximal knowledge of the world to
be analogous to a well-read nun’s knowledge of sex, a level of
knowledge often deemed a sufficient basis for advising the
young” (Landauer et al, 1997)
• Can one do semantics with these representations?
– Our own experience: using HAL-style vectors for resolving
bridging references
– Very limited success
– Applying dimensionality reduction didn’t seem to help

REMAINING LECTURES
DAY          HOUR    TOPIC
Wed 25/11    12-14   Text classification with Artificial Neural Nets
Tue 1/12     10-12   Lab: Supervised ML with Weka
Fri 4/12     10-12   Unsupervised methods & their application in lexical acq and IR
Wed 9/12     10-12   Lexical acquisition by clustering
Thu 10/12    10-12   Psychological evidence on learning
Fri 11/12    10-12   Psychological evidence on language processing
Mon 14/12    10-12   Intro to NLP
REMAINING LECTURES
DAY          HOUR    TOPIC
Tue 15/12    10-12   Machine learning for anaphora
Tue 15/12    14-16   Lab: Clustering
Wed 16/12    14-16   Lab: BART
Thu 17/12    10-12   Ling. & psychological evidence on anaphora
Fri 18/12    10-12   Corpora for anaphora
Mon 21/12    10-12   Lexical & commons. knowledge for anaphora
Tue 22/12    10-12   Salience
Tue 22/12    14-16   Discourse new detection
ACKNOWLEDGMENTS
• Some of the slides come from
– Ray Mooney’s UTexas AI course
– Marco Baroni
