Using lexical and relational similarity to classify semantic relations

Diarmuid Ó Séaghdha (do242@cl.cam.ac.uk) and Ann Copestake (aac10@cl.cam.ac.uk)
Computer Laboratory, University of Cambridge
15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom

Abstract

Many methods are available for computing semantic similarity between individual words, but certain NLP tasks require the comparison of word pairs. This paper presents a kernel-based framework for application to relational reasoning tasks of this kind. The model presented here combines information about two distinct types of word pair similarity: lexical similarity and relational similarity. We present an efficient and flexible technique for implementing relational similarity and show the effectiveness of combining lexical and relational models by demonstrating state-of-the-art results on a compound noun interpretation task.

1 Introduction

The problem of modelling semantic similarity between words has long attracted the interest of researchers in Natural Language Processing and has been shown to be important for numerous applications. For some tasks, however, it is more appropriate to consider the problem of modelling similarity between pairs of words. This is the case when dealing with tasks involving relational or analogical reasoning. In such tasks, the challenge is to compare pairs of words on the basis of the semantic relation(s) holding between the members of each pair. For example, the noun pairs (steel, knife) and (paper, cup) are similar because in both cases the relation "N2 is made of N1" frequently holds between their members. Analogical tasks are distinct from (but not unrelated to) other kinds of "relation extraction" tasks where each data item is tied to a specific sentence context (e.g., Girju et al. (2007)).

One such relational reasoning task is the problem of compound noun interpretation, which has received a great deal of attention in recent years (Girju et al., 2005; Turney, 2006; Butnariu and Veale, 2008). In English (and other languages), the process of producing new lexical items through compounding is very frequent and very productive. Furthermore, the noun-noun relation expressed by a given compound is not explicit in its surface form: a steel knife may be a knife made from steel, but a kitchen knife is most likely to be a knife used in a kitchen, not a knife made from a kitchen. The assumption made by similarity-based interpretation methods is that the likely meaning of a novel compound can be predicted by comparing it to previously seen compounds whose meanings are known. This is a natural framework for computational techniques; there is also empirical evidence for similarity-based interpretation in human compound processing (Ryder, 1994; Devereux and Costello, 2007).

This paper presents an approach to relational reasoning based on combining information about two kinds of similarity between word pairs: lexical similarity and relational similarity. The assumptions underlying these two models of similarity are sketched in Section 2. In Section 3 we describe how these models can be implemented for statistical machine learning with kernel methods. We present a new flexible and efficient kernel-based framework for classification with relational similarity. In Sections 4 and 5 we apply our methods to a compound noun interpretation task and demonstrate that combining models of lexical and relational similarity gives state-of-the-art results, surpassing the performance attained by either model taken alone. We then discuss previous research on relational similarity, and show that some previously proposed models can be implemented in our framework as special cases. Given the good performance achieved for compound interpretation, it seems likely that the methods presented in this paper can also be applied successfully to other relational reasoning tasks; we suggest some directions for future research in Section 7.

2 Two models of word pair similarity

While there is a long tradition of NLP research on methods for calculating semantic similarity between words, calculating similarity between pairs (or n-tuples) of words is a less well-understood problem. In fact, the problem has rarely been stated explicitly, though it is implicitly addressed by most work on compound noun interpretation and semantic relation extraction. This section describes two complementary approaches for using distributional information extracted from corpora to calculate noun pair similarity.

The first model of pair similarity is based on standard methods for computing semantic similarity between individual words. According to this lexical similarity model, word pairs (w1, w2) and (w3, w4) are judged similar if w1 is similar to w3 and w2 is similar to w4. Given a measure wsim of word-word similarity, a measure of pair similarity psim can be derived as a linear combination of pairwise lexical similarities:

    $psim((w_1, w_2), (w_3, w_4)) = \alpha[wsim(w_1, w_3)] + \beta[wsim(w_2, w_4)]$    (1)

A great number of methods for lexical semantic similarity have been proposed in the NLP literature. The most common paradigm for corpus-based methods, and the one adopted here, is based on the distributional hypothesis: that two words are semantically similar if they have similar patterns of co-occurrence with other words in some set of contexts. Curran (2004) gives a comprehensive overview of distributional methods.

The second model of pair similarity rests on the assumption that when the members of a word pair are mentioned in the same context, that context is likely to yield information about the relations holding between the words' referents. For example, the members of the pair (bear, forest) may tend to co-occur in contexts containing patterns such as "w1 lives in the w2" and "in the w2 ... a w1", suggesting that a LOCATED IN or LIVES IN relation frequently holds between bears and forests. If the contexts in which fish and reef co-occur are similar to those found for bear and forest, this is evidence that the same semantic relation tends to hold between the members of each pair. A relational distributional hypothesis therefore states that two word pairs are semantically similar if their members appear together in similar contexts.

The distinction between lexical and relational similarity for word pair comparison is recognised by Turney (2006), who calls the former attributional similarity, though the methods he presents focus on relational similarity. Ó Séaghdha and Copestake's (2007) classification of information sources for noun compound interpretation also includes a description of lexical and relational similarity. Approaches to compound noun interpretation have tended to use either lexical or relational similarity, though rarely both (see Section 6 below).

3 Kernel methods for pair similarity

3.1 Kernel methods

The kernel framework for machine learning is a natural choice for similarity-based classification (Shawe-Taylor and Cristianini, 2004). The central concept in this framework is the kernel function, which can be viewed as a measure of similarity between data items. Valid kernels must satisfy the mathematical condition of positive semi-definiteness; this is equivalent to requiring that the kernel function equate to an inner product in some vector space. The kernel can be expressed in terms of a mapping function φ from the input space X to a feature space F:

    $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_F$    (2)

where $\langle \cdot, \cdot \rangle_F$ is the inner product associated with F. X and F need not have the same dimensionality or be of the same type. F is by definition an inner product space, but the elements of X need not even be vectorial, so long as a suitable mapping function φ can be found. Furthermore, it is often possible to calculate kernel values without explicitly representing the elements of F; this allows the use of implicit feature spaces with a very high or even infinite dimensionality.

Kernel functions have received significant attention in recent years, most notably due to the successful application of Support Vector Machines (Cortes and Vapnik, 1995) to many problems. The SVM algorithm learns a decision boundary between two data classes that maximises the minimum distance, or margin, from the training points in each class to the boundary. The geometry of the space in which this boundary is set depends on the
kernel function used to compare data items. By tailoring the choice of kernel to the task at hand, the user can use prior knowledge and intuition to improve classification performance.

One useful property of kernels is that any sum or linear combination of kernel functions is itself a valid kernel. Theoretical analyses (Cristianini et al., 2001; Joachims et al., 2001) and empirical investigations (e.g., Gliozzo et al. (2005)) have shown that combining kernels in this way can have a beneficial effect when the component kernels capture different "views" of the data while individually attaining similar levels of discriminative performance. In the experiments described below, we make use of this insight to integrate lexical and relational information for semantic classification of compound nouns.

3.2 Lexical kernels

Ó Séaghdha and Copestake (2008) demonstrate how standard techniques for distributional similarity can be implemented in a kernel framework. In particular, kernels for comparing probability distributions can be derived from standard probabilistic distance measures through simple transformations. These distributional kernels are suited to a data representation where each word w is identified with a vector of conditional probabilities $(P(c_1|w), \ldots, P(c_{|C|}|w))$ that defines a distribution over other terms c co-occurring with w. For example, the following positive semi-definite kernel between words can be derived from the well-known Jensen-Shannon divergence:

    $k_{jsd}(w_1, w_2) = -\sum_c \left[ P(c|w_1) \log_2 \frac{P(c|w_1)}{P(c|w_1) + P(c|w_2)} + P(c|w_2) \log_2 \frac{P(c|w_2)}{P(c|w_1) + P(c|w_2)} \right]$    (3)

A straightforward method of extending this model to word pairs is to represent each pair (w1, w2) as the concatenation of the co-occurrence probability vectors for w1 and w2. Taking kjsd as a measure of word similarity and introducing parameters α and β to scale the contributions of w1 and w2 respectively, we retrieve the lexical model of pair similarity defined above in (1). Without prior knowledge of the relative importance of each pair constituent, it is natural to set both scaling parameters to 0.5, and this is done in the experiments below.

3.3 String embedding functions

The necessary starting point for our implementation of relational similarity is a means of comparing contexts. Contexts can be represented in a variety of ways, from unordered bags of words to rich syntactic structures. The context representation adopted here is based on strings, which preserve useful information about the order of words in the context yet can be processed and compared quite efficiently. String kernels are a family of kernels that compare strings s, t by mapping them into feature vectors $\phi_{String}(s), \phi_{String}(t)$ whose non-zero elements index the subsequences contained in each string.

A string is defined as a finite sequence $s = (s_1, \ldots, s_l)$ of symbols belonging to an alphabet Σ. $\Sigma^l$ is the set of all strings of length l, and $\Sigma^*$ is the set of all strings, or the language. A subsequence u of s is defined by a sequence of indices $\mathbf{i} = (i_1, \ldots, i_{|u|})$ such that $1 \leq i_1 < \cdots < i_{|u|} \leq |s|$, where |s| is the length of s. $len(\mathbf{i}) = i_{|u|} - i_1 + 1$ is the length of the subsequence in s. An embedding $\phi_{String}: \Sigma^* \to \mathbb{R}^{|\Sigma|^l}$ is a function that maps a string s onto a vector of positive "counts" that correspond to subsequences contained in s.

One example of an embedding function is a gap-weighted embedding, defined as

    $\phi_{gap_l}(s) = \left[ \sum_{\mathbf{i}: s[\mathbf{i}] = u} \lambda^{len(\mathbf{i})} \right]_{u \in \Sigma^l}$    (4)

λ is a decay parameter between 0 and 1; the smaller its value, the more the influence of a discontinuous subsequence is reduced. When l = 1 this corresponds to a "bag-of-words" embedding. Gap-weighted string kernels implicitly compute the similarity between two strings s, t as an inner product $\langle \phi(s), \phi(t) \rangle$. Lodhi et al. (2002) present an efficient dynamic programming algorithm that evaluates this kernel in O(l|s||t|) time without explicitly representing the feature vectors φ(s), φ(t).

An alternative embedding is that used by Turney (2008) in his PairClass system (see Section 6). For the PairClass embedding φPC, an n-word context

    [0–1 words] N1|2 [0–3 words] N1|2 [0–1 words]

containing target words N1, N2 is mapped onto the $2^{n-2}$ patterns produced by substituting zero or more of the context words with a wildcard ∗. Unlike the patterns used by the gap-weighted embedding, these are not truly discontinuous, as each wildcard must match exactly one word.
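To make the gap-weighted embedding concrete, the following sketch computes it by direct enumeration of index tuples. This is illustrative only: the function names are ours, and a practical implementation would use the dynamic programming algorithm of Lodhi et al. (2002) rather than this brute-force version, whose cost grows combinatorially with the string length and l.

```python
from itertools import combinations
from collections import defaultdict

def gap_weighted_embedding(s, l, lam=0.5):
    """Brute-force version of the embedding in (4): map the token
    sequence s to a sparse vector whose entry for each length-l
    subsequence u sums lambda**len(i) over the index tuples i that
    select u, where len(i) = i_l - i_1 + 1 is the span of i."""
    phi = defaultdict(float)
    for i in combinations(range(len(s)), l):
        span = i[-1] - i[0] + 1
        phi[tuple(s[j] for j in i)] += lam ** span
    return phi

def gap_weighted_kernel(s, t, l, lam=0.5):
    """Inner product <phi(s), phi(t)> of the two sparse embeddings."""
    phi_s = gap_weighted_embedding(s, l, lam)
    phi_t = gap_weighted_embedding(t, l, lam)
    return sum(w * phi_t[u] for u, w in phi_s.items() if u in phi_t)
```

For s = (a, b, c) and l = 2, the contiguous subsequences (a, b) and (b, c) each receive weight λ², while the discontinuous (a, c) spans three positions and is penalised with λ³; with l = 1 the embedding reduces to a λ-scaled bag of words.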
3.4 Kernels on sets

String kernels afford a way of comparing individual contexts. In order to compute the relational similarity of two pairs, however, we do not want to associate each pair with a single context but rather with the set of contexts in which they appear together. In this section, we use string embeddings to define kernels on sets of strings.

One natural way of defining a kernel over sets is to take the average of the pairwise basic kernel values between members of the two sets A and B. Let k0 be a kernel on a set X, and let A, B ⊆ X be sets of cardinality |A| and |B| respectively. The averaged kernel is defined as

    $k_{ave}(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} k_0(a, b)$    (5)

This kernel was introduced by Gärtner et al. (2002) in the context of multiple instance learning. It was first used for computing relational similarity by Ó Séaghdha and Copestake (2007). The efficiency of the kernel computation is dominated by the |A| × |B| basic kernel calculations. When each basic kernel calculation k0(a, b) has significant complexity, as is the case with string kernels, calculating kave can be slow.

A second perspective views each set as corresponding to a probability distribution, and takes the members of that set as observed samples from that distribution. In this way a kernel on distributions can be cast as a kernel on sets. In the case of sets whose members are strings, a string embedding φString can be used to estimate a probability distribution over subsequences for each set by taking the normalised sum of the feature mappings of its members:

    $\phi_{Set}(A) = \frac{1}{Z} \sum_{s \in A} \phi_{String}(s)$    (6)

where Z is a normalisation factor. Different choices of φString yield different relational similarity models. In this paper we primarily use the gap-weighted embedding φgapl; we also discuss the PairClass embedding φPC for comparison.

Once the embedding φSet has been calculated, any suitable inner product can be applied to the resulting vectors, e.g. the linear kernel (dot product) or the Jensen-Shannon kernel defined in (3). In the latter case, which we term kjsd below, the natural choice for normalisation is the sum of the entries in $\sum_{s \in A} \phi_{String}(s)$, ensuring that φSet(A) has unit L1 norm and defines a probability distribution. Furthermore, scaling φSet(A) by 1/|A|, applying L2 vector normalisation and applying the linear kernel retrieves the averaged set kernel kave(A, B) as a special case of the distributional framework for sets of strings.

Instead of requiring |A||B| basic kernel evaluations for each pair of sets, distributional set kernels only require the embedding φSet(A) to be computed once for each set and then a single vector inner product for each pair of sets. This is generally far more efficient than the kernel averaging method. The significant drawback is that representing the feature vector for each set demands a large amount of memory; for the gap-weighted embedding with subsequence length l, each vector potentially contains up to $|A| \binom{|s_{max}|}{l}$ entries, where smax is the longest string in A. In practice, however, the vector length will be lower due to subsequences occurring more than once and many strings being shorter than smax.

One way to reduce the memory load is to reduce the lengths of the strings used, either by retaining just the part of each string expected to be informative or by discarding all strings longer than an acceptable maximum. The PairClass embedding function implicitly restricts the contexts considered by only applying to strings where no more than three words occur between the targets, and by ignoring all non-intervening words except single ones adjacent to the targets. A further technique is to trade off time efficiency for space efficiency by computing the set kernel matrix in a blockwise fashion. To do this, the input data is divided into blocks of roughly equal size; the size that is relevant here is the sum of the cardinalities of the sets in a given block. Larger block sizes b therefore allow faster computation, but they require more memory. In the experiments described below, b was set to 5,000 for embeddings of length l = 1 and l = 2, and to 3,000 for l = 3.

4 Experimental setup for compound noun interpretation

4.1 Dataset

The dataset used in our experiments is Ó Séaghdha and Copestake's (2007) set of 1,443 compound nouns extracted from the British National Corpus (BNC); the data are available from http://www.cl.cam.ac.uk/~do242/resources.html. Each compound is annotated with one of
six semantic relations: BE, HAVE, IN, AGENT, INSTRUMENT and ABOUT. For example, air disaster is labelled IN (a disaster in the air) and freight train is labelled INSTRUMENT (a train that carries freight). The best previous classification result on this dataset was reported by Ó Séaghdha and Copestake (2008), who achieved 61.0% accuracy and 58.8% F-score with a purely lexical model of compound similarity.

4.2 General Methodology

All experiments were run using the LIBSVM Support Vector Machine library (http://www.csie.ntu.edu.tw/~cjlin/libsvm). The one-versus-all method was used to decompose the multiclass task into six binary classification tasks. Performance was evaluated using five-fold cross-validation. For each fold the SVM cost parameter was optimised in the range $(2^{-6}, 2^{-4}, \ldots, 2^{12})$ through cross-validation on the training set.

All kernel matrices were precomputed on near-identical machines with 2.4 GHz 64-bit processors and 8 GB of memory. The kernel matrix computation is trivial to parallelise, as each cell is independent. Spreading the computational load across multiple processors is a simple way to reduce the real time cost of the procedure.

4.3 Lexical features

Our implementation of the lexical similarity model uses the same feature set as Ó Séaghdha and Copestake (2008). Two corpora were used to extract co-occurrence information: the written component of the BNC (Burnard, 1995) and the Google Web 1T 5-Gram Corpus (Brants and Franz, 2006). For each noun appearing as a compound constituent in the dataset, we estimate a co-occurrence distribution based on the nouns in coordinative constructions. Conjunctions are identified in the BNC by first parsing the corpus with RASP (Briscoe et al., 2006) and extracting instances of the conj grammatical relation. As the 5-Gram corpus does not contain full sentences it cannot be parsed, so regular expressions were used to extract coordinations. In each corpus, the set of co-occurring terms is restricted to the 10,000 most frequent conjuncts in that corpus, so that each constituent distribution is represented with a 10,000-dimensional vector. The probability vector for the compound is created by appending the two constituent vectors, each scaled by 0.5 to weight both constituents equally and ensure that the new vector sums to 1. To perform classification with these features we use the Jensen-Shannon kernel (3). (Ó Séaghdha and Copestake (2008) achieve their single best result with a different kernel, the Jensen-Shannon RBF kernel, but the Jensen-Shannon linear kernel used here generally achieves equivalent performance and presents one fewer parameter to optimise.)

4.4 Relational features

To extract data for computing relational similarity, we searched a large corpus for sentences in which both constituents of a compound co-occur. The corpora used here are the written BNC, containing 90 million words of British English balanced across genre and text type, and the English Gigaword Corpus, 2nd Edition (Graff et al., 2005), containing 2.3 billion words of newswire text. Extraction from the Gigaword Corpus was performed at the paragraph level as the corpus is not annotated for sentence boundaries, and a dictionary of plural forms and American English variants was used to expand the coverage of the corpus trawl.

The extracted contexts were split into sentences, tagged and lemmatised with RASP. Duplicate sentences were discarded, as were sentences in which the compound head and modifier were more than 10 words apart. Punctuation and tokens containing non-alphanumeric characters were removed. The compound modifier and head were replaced with placeholder tokens M:n and H:n in each sentence to ensure that the classifier would learn from relational information only and not from lexical information about the constituents. Finally, all tokens more than five words to the left of the leftmost constituent or more than five words to the right of the rightmost constituent were discarded; this has the effect of speeding up the kernel computations and should also focus the classifier on the most informative parts of the context sentences. Examples of the context strings extracted for the modifier-head pair (history, book) are the:a 1957:m pulitizer:n prize-winning:j H:n describe:v event:n in:i american:j M:n when:c elect:v official:n take:v principle:v and he:p read:v constantly:r usually:r H:n about:i american:j M:n or:c biography:n.

This extraction procedure resulted in a corpus of 1,472,798 strings. There was significant variation in the number of context strings extracted for each compound: 288 compounds were associated with 1,000 or more sentences, while 191 were associated with 10 or fewer, and no sentences were found for 45 constituent pairs. The largest context sets were predominantly associated with political or economic topics (e.g., government official, oil price), reflecting the journalistic sources of the Gigaword sentences.

Our implementation of relational similarity applies the two set kernels kave and kjsd defined in Section 3.4 to these context sets. For each kernel we tested gap-weighted embedding functions with subsequence length values l in the range 1, 2, 3, as well as summed kernels for all combinations of values in this range. The decay parameter λ for the subsequence feature embedding was set to 0.5 throughout, in line with previous recommendations (e.g., Cancedda et al. (2003)). To investigate the effects of varying set sizes, we ran experiments with context sets of maximal cardinality q ∈ {50, 250, 1000}. These sets were randomly sampled for each compound; for compounds associated with fewer strings than the maximal cardinality, all associated strings were used. For q = 50 we average results over five runs in order to reduce sampling variation. We also report some results with the PairClass embedding φPC. The restricted representative power of this embedding brings greater efficiency and we were able to use q = 5,000; for all but 22 compounds, this allowed the use of all contexts for which the φPC embedding was defined.

              kjsd            kave
    Length    Acc    F        Acc    F
    1         47.9   45.8     43.6   40.4
    2         51.7   49.5     49.7   48.3
    3         50.7   48.4     50.1   48.6
    Σ12       51.5   49.6     48.3   46.8
    Σ23       52.1   49.9     50.9   49.5
    Σ123      51.3   49.0     50.5   49.1
    φPC       44.9   43.3     40.9   40.0

Table 1: Results for combinations of embedding functions and set kernels

5 Results

Table 1 presents results for classification with relational set kernels, using q = 1,000 for the gap-weighted embedding. In general, there is little difference between the performance of kjsd and kave; the clearest exceptions are l = 1 and the summed kernels kΣ12 = kl=1 + kl=2. The best performance of 52.1% accuracy, 49.9% F-score is obtained with the Jensen-Shannon kernel kjsd computed on the summed feature embeddings of length 2 and 3. This is significantly lower than the performance achieved by Ó Séaghdha and Copestake (2008) with their lexical similarity model, but it is well above the majority-class baseline (21.3%). Results for the PairClass embedding are much lower than for the gap-weighted embedding; the superiority of φgapl is statistically significant in all cases except l = 1.

Results for combinations of lexical co-occurrence kernels and (gap-weighted) relational set kernels are given in Table 2. With the exception of some combinations of the length-1 set kernel, these results are clearly better than the best results obtained with either the lexical or the relational model taken alone. The best result is obtained by combining the lexical kernel computed on BNC conjunction features with the summed Jensen-Shannon set kernel kΣ23; this combination achieves 63.1% accuracy and 61.6% F-score, a statistically significant improvement (at the p < 0.01 level) over the lexical kernel alone and the best result yet reported for this dataset. Also, the benefit of combining set kernels of different subsequence lengths l is evident: of the 12 combinations presented in Table 2 that include summed set kernels, nine lead to statistically significant improvements over the corresponding lexical kernels taken alone (the remaining three are also close to significance).

Our experiments also show that the distributional implementation of set kernels (6) is much more efficient than the averaging implementation (5). The time behaviour of the two methods with increasing set cardinality q and subsequence length l is illustrated in Figure 1. At the largest tested values of q and l (1,000 and 3, respectively), the averaging method takes over 33 days of CPU time, while the distributional method takes just over one day. In theory, kave scales quadratically as q increases; this was not observed because for many constituent pairs there are not enough context strings available to keep adding as q grows large, but the dependence is certainly superlinear. The time taken by kjsd is theoretically linear in q,
with φgapl ; the only statistically significant differ-   but again scales less dramatically in practice. On
ences (at p < 0.05, using paired t-tests) are be-        the other hand kave is linear in l, while kjsd scales
tween the kernels kl=1 with subsequence length           exponentially. This exponential dependence may
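The decay parameter λ = 0.5 and the subsequence lengths l ∈ {1, 2, 3} used above can be illustrated concretely. The brute-force sketch below is for exposition only and uses our own naming, not the paper's code; a real implementation would use the dynamic-programming recursion of Lodhi et al. (2002):

```python
from itertools import combinations

def gap_weighted_features(tokens, l, lam=0.5):
    """Map a token sequence to its gap-weighted subsequence features:
    each subsequence u of length l receives weight lam ** span, summed
    over all of its (possibly non-contiguous) occurrences, where span
    is the number of positions the occurrence covers."""
    phi = {}
    for idx in combinations(range(len(tokens)), l):
        u = tuple(tokens[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # gappy occurrences are down-weighted
        phi[u] = phi.get(u, 0.0) + lam ** span
    return phi
```

For example, in the context string "knife is used to cut", the length-2 feature ("knife", "cut") spans five positions and so receives weight 0.5 ** 5 = 0.03125; contiguous pairs such as ("used", "to") receive the maximal weight 0.25.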
                  kjsd                                kave
          BNC            5-Gram             BNC              5-Gram
Length    Acc     F      Acc     F          Acc     F        Acc      F
1         60.6    58.6   60.3    58.1       59.5    57.6     59.1     56.5
2         61.9*   60.4*  62.6    60.8       62.0    60.5*    61.3     59.1
3         62.5*   60.8*  61.7    59.9       62.8*   61.2**   62.3**   60.8**
Σ12       62.6*   61.0** 62.3*   60.6*      62.0*   60.3*    61.5     59.2
Σ23       63.1**  61.6** 62.3*   60.5*      62.2*   60.7*    62.0     60.3
Σ123      62.9**  61.3** 62.6    60.8*      61.9*   60.4*    62.4*    60.6*
No Set    59.9    57.8   60.2    58.1       59.9    57.8     60.2     58.1

Table 2: Results for set kernel and lexical kernel combination. */** indicate significant improvement at
the 0.05/0.01 level over the corresponding lexical kernel alone, estimated by paired t-tests.
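The combined classifiers in Table 2 add a lexical kernel to a relational set kernel. The paper defines its combination scheme earlier; shown here only as a hedged sketch, a common recipe for combining two kernels of different scales is to cosine-normalize each Gram matrix and sum them (the sum of two positive semi-definite matrices is again positive semi-definite, hence a valid kernel):

```python
import numpy as np

def cosine_normalize(K):
    """Normalize a Gram matrix so every item has unit self-similarity:
    K'[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j])."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def combine_kernels(K_lex, K_rel):
    """Sum the normalized lexical and relational Gram matrices; the
    result can be passed directly to a kernelized learner such as an
    SVM with a precomputed kernel."""
    return cosine_normalize(K_lex) + cosine_normalize(K_rel)
```

After normalization, both kernels contribute on the same scale, so neither similarity model dominates the combination by virtue of its raw magnitude alone.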
[Figure 1 appears here: three log-scaled panels, (a) l = 1, (b) l = 2, (c) l = 3, each plotting computation time (time/s) against maximal set cardinality q ∈ {50, 250, 1000} for kave and kjsd.]
     Figure 1: Timing results (in seconds, log-scaled) for averaged and Jensen-Shannon set kernels
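The timing gap in Figure 1 follows directly from the two formulations: the averaging kernel evaluates the base string kernel for every pair of strings drawn from the two sets, while the distributional kernel embeds each string once, aggregates per set, and makes a single comparison. The contrast can be sketched as follows; this is our own simplification, with unigram counts standing in for the gap-weighted embedding and the Jensen-Shannon kernel taken as negative JSD (one form used in distributional-kernel work, which may differ in detail from the paper's):

```python
from math import log

def embed(s):
    """Toy stand-in for the string-kernel feature embedding: unigram counts."""
    phi = {}
    for tok in s.split():
        phi[tok] = phi.get(tok, 0.0) + 1.0
    return phi

def k_ave(A, B, base_k):
    """Averaging set kernel: mean base-kernel value over all string
    pairs -- len(A) * len(B) base-kernel evaluations, hence quadratic
    in the set cardinality q."""
    return sum(base_k(a, b) for a in A for b in B) / (len(A) * len(B))

def set_distribution(strings):
    """Sum the embedded vectors of a set and normalize to a multinomial
    distribution over features -- a single pass over the set."""
    total = {}
    for s in strings:
        for f, w in embed(s).items():
            total[f] = total.get(f, 0.0) + w
    z = sum(total.values())
    return {f: w / z for f, w in total.items()}

def k_jsd(A, B):
    """Distributional set kernel: one negative-JSD comparison of the two
    aggregated distributions (base-2 logs, so values lie in [-1, 0],
    with 0 for identical sets) -- linear in the set cardinality q."""
    P, Q = set_distribution(A), set_distribution(B)
    M = {f: 0.5 * (P.get(f, 0.0) + Q.get(f, 0.0)) for f in set(P) | set(Q)}
    def kl(X):
        return sum(w * log(w / M[f], 2) for f, w in X.items() if w > 0)
    return -(0.5 * kl(P) + 0.5 * kl(Q))
```

Under this formulation the number of per-string embedding computations grows linearly with q for k_jsd but the number of base-kernel evaluations grows quadratically for k_ave, mirroring the curves in Figure 1.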

seem worrying, but in practice only short subsequence lengths are used with string kernels. In situations where set sizes are small but long subsequence features are desired, the averaging approach may be more appropriate. However, it seems likely that many applications will be similar to the task considered here, where short subsequences are sufficient and it is desirable to use as much data as possible to represent each set. We note that calculating the PairClass embedding, which counts far fewer patterns, took just 1h21m. For optimal efficiency, it seems best to use a gap-weighted embedding with small set cardinality; averaged across five runs, kjsd with q = 50 and l = Σ123 took 26m to calculate and still achieved 47.6% accuracy and 45.1% F-score.

6   Related work

Turney et al. (2003) suggest combining various information sources for solving SAT analogy problems. However, previous work on compound interpretation has generally used either lexical similarity or relational similarity but not both in combination. Previously proposed lexical models include the WordNet-based methods of Kim and Baldwin (2005) and Girju et al. (2005), and the distributional model of Ó Séaghdha and Copestake (2008). The idea of using relational similarity to understand compounds goes back at least as far as Lebowitz's (1988) RESEARCHER system, which processed patent abstracts in an incremental fashion and associated an unseen compound with the relation expressed in a context where the constituents previously occurred.

Turney (2006) describes a method (Latent Relational Analysis) that extracts subsequence patterns for noun pairs from a large corpus, using query expansion to increase the recall of the search and feature selection and dimensionality reduction to reduce the complexity of the feature space. LRA performs well on analogical tasks including compound interpretation, but has very substantial resource requirements. Turney (2008) has recently proposed a simpler SVM-based algorithm for analogical classification called PairClass. While it does not adopt a set-based or distributional model of relational similarity, we have noted above that PairClass implicitly uses a feature representation similar to the one presented above as (6) by extracting subsequence patterns from observed co-occurrences of word pair members. Indeed, PairClass can be viewed as a special case of our framework; the differences from the model we have used consist in the use of a different embedding function φPC and a more restricted notion of context, a frequency cutoff to eliminate less common subsequences, and the Gaussian kernel to compare vectors. While we cannot compare methods directly as we do not possess the large corpus of 5 × 10^10 words used by Turney, we have tested the impact of each of these modifications on our model.4 None improve performance with our set kernels, but the only statistically significant effect is that of changing the embedding model, as reported in Section 5. Implementing the full PairClass algorithm on our corpus yields 46.2% accuracy and 44.9% F-score, which is again significantly worse than all results for the gap-weighted model with l > 1.

In NLP, there has not been widespread use of set representations for data items, and hence set classification techniques have received little attention. Notable exceptions include Rosario and Hearst (2005) and Bunescu and Mooney (2007), who tackle relation classification and extraction tasks by considering the set of contexts in which the members of a candidate relation argument pair co-occur. While this gives a set representation for each pair, both sets of authors apply classification methods at the level of individual set members rather than directly comparing sets. There is also a close connection between the multinomial probability model we have proposed and the pervasive bag of words (or bag of n-grams) representation. Distributional kernels based on a gap-weighted feature embedding extend these models by using bags of discontinuous n-grams and down-weighting gappy subsequences.

A number of set kernels other than those discussed here have been proposed in the machine learning literature, though none of these proposals have explicitly addressed the problem of comparing sets of strings or other structured objects, and many are suitable only for comparing sets of small cardinality. Kondor and Jebara (2003) take a distributional approach similar to ours, fitting multivariate normal distributions to the feature space mappings of sets A and B and comparing the mappings with the Bhattacharyya vector inner product. The model described above in (6) implicitly fits multinomial distributions in the feature space F; this seems more intuitive for string kernel embeddings that map strings onto vectors of positive-valued "counts". Experiments with Kondor and Jebara's Bhattacharyya kernel indicate that it can in fact come close to the performances reported in Section 5, but it has significantly greater computational requirements due to the need to perform costly matrix manipulations.

7   Conclusion and future directions

In this paper we have presented a combined model of lexical and relational similarity for relational reasoning tasks. We have developed an efficient and flexible kernel-based framework for comparing sets of contexts using the feature embedding associated with a string kernel.5 By choosing a particular embedding function and a particular inner product on subsequence vectors, the previously proposed set-averaging and PairClass algorithms for relational similarity can be retrieved as special cases. Applying our methods to the task of compound noun interpretation, we have shown that combining lexical and relational similarity is a very effective approach that surpasses either similarity model taken individually.

Turney (2008) argues that many NLP tasks can be formulated in terms of analogical reasoning, and he applies his PairClass algorithm to a number of problems including SAT verbal analogy tests, synonym/antonym classification and distinction between semantically similar and semantically associated words. Our future research plans include investigating the application of our combined similarity model to analogical tasks other than compound noun interpretation. A second promising direction is to investigate relational models for unsupervised semantic analysis of noun compounds. The range of semantic relations that can be expressed by compounds is the subject of some controversy (Ryder, 1994), and unsupervised learning methods offer a data-driven means of discovering relational classes.

Acknowledgements

We are grateful to Peter Turney, Andreas Vlachos and the anonymous EACL reviewers for their helpful comments. This work was supported in part by EPSRC grant EP/C010035/1.

4 Turney (p.c.) reports that the full PairClass model achieves 50.0% accuracy, 49.3% F-score.
5 The treatment presented here has used a string representation of context, but the method could be extended to other structural representations for which substructure embeddings exist, such as syntactic trees (Collins and Duffy, 2001).
References

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Corpus Version 1.1. Linguistic Data Consortium.
Ted Briscoe, John Carroll, and Rebecca Watson. 2006. The second release of the RASP system. In Proceedings of the ACL-06 Interactive Presentation Sessions.
Razvan C. Bunescu and Raymond J. Mooney. 2007. Learning to extract relations from the Web using minimal supervision. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-07).
Lou Burnard. 1995. Users' Guide for the British National Corpus. British National Corpus Consortium.
Cristina Butnariu and Tony Veale. 2008. A concept-centered approach to noun-compound interpretation. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08).
Nicola Cancedda, Eric Gaussier, Cyril Goutte, and Jean-Michel Renders. 2003. Word-sequence kernels. Journal of Machine Learning Research, 3:1059–1082.
Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Proceedings of the 15th Conference on Neural Information Processing Systems (NIPS-01).
Corinna Cortes and Vladimir Vapnik. 1995. Support vector networks. Machine Learning, 20(3):273–297.
Nello Cristianini, Jaz Kandola, Andre Elisseeff, and John Shawe-Taylor. 2001. On kernel target alignment. Technical Report NC-TR-01-087, NeuroCOLT.
James Curran. 2004. From Distributional to Semantic Similarity. Ph.D. thesis, School of Informatics, University of Edinburgh.
Barry Devereux and Fintan Costello. 2007. Learning to interpret novel noun-noun compounds: Evidence from a category learning experiment. In Proceedings of the ACL-07 Workshop on Cognitive Aspects of Computational Language Acquisition.
Thomas Gärtner, Peter A. Flach, Adam Kowalczyk, and Alex J. Smola. 2002. Multi-instance kernels. In Proceedings of the 19th International Conference on Machine Learning (ICML-02).
Roxana Girju, Dan Moldovan, Marta Tatu, and Daniel Antohe. 2005. On the semantics of noun compounds. Computer Speech and Language, 19(4):479–496.
Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, and Deniz Yuret. 2007. SemEval-2007 Task 04: Classification of semantic relations between nominals. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-07).
Alfio Gliozzo, Claudio Giuliano, and Carlo Strapparava. 2005. Domain kernels for word sense disambiguation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05).
David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2005. English Gigaword Corpus, 2nd Edition. Linguistic Data Consortium.
Thorsten Joachims, Nello Cristianini, and John Shawe-Taylor. 2001. Composite kernels for hypertext categorisation. In Proceedings of the 18th International Conference on Machine Learning (ICML-01).
Su Nam Kim and Timothy Baldwin. 2005. Automatic interpretation of noun compounds using WordNet similarity. In Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05).
Risi Kondor and Tony Jebara. 2003. A kernel between sets of vectors. In Proceedings of the 20th International Conference on Machine Learning (ICML-03).
Michael Lebowitz. 1988. The use of memory in text processing. Communications of the ACM, 31(12):1483–1502.
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444.
Diarmuid Ó Séaghdha and Ann Copestake. 2007. Co-occurrence contexts for noun compound interpretation. In Proceedings of the ACL-07 Workshop on A Broader Perspective on Multiword Expressions.
Diarmuid Ó Séaghdha and Ann Copestake. 2008. Semantic classification with distributional kernels. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08).
Barbara Rosario and Marti A. Hearst. 2005. Multi-way relation classification: Application to protein-protein interactions. In Proceedings of the 2005 Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP-05).
Mary Ellen Ryder. 1994. Ordered Chaos: The Interpretation of English Noun-Noun Compounds. University of California Press, Berkeley, CA.
John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge.
Peter D. Turney, Michael L. Littman, Jeffrey Bigham, and Victor Shnayder. 2003. Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings of the 2003 International Conference on Recent Advances in Natural Language Processing (RANLP-03).
Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379–416.
Peter D. Turney. 2008. A uniform approach to analogies, synonyms, antonyms, and associations. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08).