					    Free construction of a
free dictionary of synonyms
   using computer science


 Viggo Kann and Magnus Rosell
        KTH, Stockholm
  Talk given by Viggo at Amherst College
            November 11, 2006
Examples of English synonyms
Smith: A Dictionary of Synonymous Words in the English Language [1889]
CLASS.
  Order. Rank. Degree. Classification. Grade.

Webster’s Dictionary of Synonyms [1942]
classify.
  Alphabetize, pigeonhole, assort, sort.
  Ana. Order, arrange, systematize, methodize, marshal.
Goals

 To construct a Swedish dictionary of
  synonyms as a list of synonymous pairs
 I don’t want to work a lot
 I don’t want to pay anyone to work
 The resulting list should be free
Ideas

 Automatically construct a large set of
  word pairs that might be synonyms
 Use tens of thousands of people, who are
  each willing to make a small
  contribution without payment, to check
  the word pairs
More ideas

 Use the Lexin on-line Swedish-English
  dictionary web site, which had 9 million
  lookups each month (now 17 million)
 Users visit Lexin to translate words, and
  are thus probably motivated to help me
 Each time a user makes a lookup, give
  her the opportunity to decide whether
  two words are synonyms or not
My plan

1.   Construct lots of possible synonyms
2.   Sort out bad synonym pairs
     automatically
3.   Ask lots of users if the rest of the pairs
     are good synonyms
4.   Analyze the gradings done by the
     users and decide which pairs to keep
Step 1:
Construct lots of possible synonyms

 If we have access to a Swedish-English
  dictionary SE and an English-Swedish
  dictionary ES, try to translate each word
  to English and back again to Swedish
 {(w,v) : ∃y : y ∈ SE(w) ∧ v ∈ ES(y)} or
  {(w,v) : ∃y : y ∈ SE(w) ∧ y ∈ SE(v)}
 616 000 word pairs were generated
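The two set definitions above can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy dictionaries `SE` and `ES` are invented, the function names are hypothetical, and dropping identical pairs and treating (w, v) and (v, w) as the same pair are choices made here, not something the slides specify.

```python
from collections import defaultdict

# Toy dictionaries, invented for illustration (not taken from Lexin).
SE = {  # Swedish word -> set of English translations
    "klass": {"class", "grade"},
    "rang": {"rank", "grade"},
    "sort": {"kind", "sort"},
}
ES = {  # English word -> set of Swedish translations
    "class": {"klass"},
    "grade": {"klass", "rang", "årskurs"},
    "rank": {"rang"},
}

def round_trip_pairs(se, es):
    """Pairs (w, v) such that some y is in SE(w) and v is in ES(y)."""
    pairs = set()
    for w, translations in se.items():
        for y in translations:
            for v in es.get(y, ()):
                if v != w:  # assumption: a word is not its own synonym
                    pairs.add(tuple(sorted((w, v))))
    return pairs

def shared_translation_pairs(se):
    """Pairs (w, v) such that some y is in both SE(w) and SE(v)."""
    by_english = defaultdict(set)
    for w, translations in se.items():
        for y in translations:
            by_english[y].add(w)
    pairs = set()
    for words in by_english.values():
        for w in words:
            for v in words:
                if w < v:  # keep each unordered pair once
                    pairs.add((w, v))
    return pairs

candidates = round_trip_pairs(SE, ES) | shared_translation_pairs(SE)
```

Both constructions deliberately over-generate: any English word with several senses links Swedish words that are not synonyms, which is why the later filtering steps are needed.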
Step 2: Sort out bad synonym pairs
automatically

 Use RI (Random Indexing)
  [Kanerva, Kristoferson, Holst 2000]
  to measure the distance between words
  represented in a large vector space
 Keep pairs that have small enough
  distance in the vector space
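As a rough sketch of this filtering step, the code below keeps a pair when the cosine between its two word vectors reaches a threshold. A small distance corresponds to a high cosine. The function names, the toy vectors and the threshold value are illustrative assumptions, and dropping pairs that have no vector is also an assumption; the slides do not say how such pairs were handled.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def filter_pairs(pairs, vectors, threshold):
    """Keep pairs whose vectors have cosine >= threshold.

    Assumption: pairs with a word missing from `vectors` are dropped.
    """
    kept = []
    for w, v in pairs:
        if w in vectors and v in vectors:
            if cosine(vectors[w], vectors[v]) >= threshold:
                kept.append((w, v))
    return kept

# Toy two-dimensional vectors, invented for illustration.
vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
kept = filter_pairs([("a", "b"), ("a", "c"), ("a", "d")], vectors, threshold=0.1)
```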
Random Indexing

 Each word w is assigned a random
  label vector Lw of about a thousand elements
 For each word w construct a context
  vector Cw by adding the random vectors
  for the words appearing in the context of
  each occurrence of w in a large corpus
Random Indexing settings

 Context:
  4 words to the left and 4 to the right
  Stop words were removed
 Dimensionality: 1800
 5 corpora from different domains were
  used, for example newspapers and
  medical texts
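A minimal sketch of Random Indexing with the settings above (window of 4 words on each side, stop words removed, 1800 dimensions). Several details are assumptions rather than facts from the talk: the label vectors here are sparse with a few +1 and -1 entries, which is a common convention for Random Indexing; the parameter `nonzero` and the function names are hypothetical; and stop words are removed before the window is applied.

```python
import random

def make_label(dim, nonzero, rng):
    """Sparse random label vector: a few +1/-1 entries, zeros elsewhere."""
    label = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        label[pos] = 1 if i % 2 == 0 else -1
    return label

def context_vectors(tokens, stop_words, dim=1800, window=4, nonzero=8, seed=0):
    """Return (labels, context): label vector L_w and context vector C_w per word."""
    rng = random.Random(seed)
    # Assumption: stop words are dropped before the context window is applied.
    words = [t for t in tokens if t not in stop_words]
    labels = {w: make_label(dim, nonzero, rng) for w in sorted(set(words))}
    context = {w: [0] * dim for w in labels}
    for i, w in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Add the neighbour's label vector to the context vector of w.
            label = labels[words[j]]
            vec = context[w]
            for k in range(dim):
                vec[k] += label[k]
    return labels, context

tokens = "the cat sat on the mat".split()
labels, context = context_vectors(tokens, stop_words={"the", "on"}, dim=50, nonzero=4)
```

Words that occur in similar contexts accumulate similar sums of label vectors, so the cosine between their context vectors tends to be high.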
Number of pairs for different cos thresholds
(435 000 of 616 000 pairs occurred in corpus)

 [Chart: number of pairs (0 to 450 000) against cosine threshold (-0.05 to 0.4); the 0.1 tick is marked with an asterisk]
Step 3: Ask lots of users if the rest of
the pairs are good synonyms

When a user has sent a word to the Lexin
 dictionary he receives the translation
 followed by a question like:
Are 'spread' and 'lengthen' synonyms?
 Answer using a scale from 0 to 5 where 0
 means 'I don’t agree' and 5 means
 'I do fully agree', or answer 'I don’t know'
After answering the user may

 grade a new randomly chosen word pair
 look up a word in the synonym dictionary
 suggest a new synonymous word pair
 download the synonym dictionary in XML
Step 4: Analyzing the gradings
done by the users
 1.2 million gradings were made in less
  than 2 months
 Grading statistics were analyzed on
  several occasions
 Some users sent comments
Keeping the users happy!

 Many users said that there were too
  many bad pairs
 Lots of pairs were graded 0 (not at all
  synonyms) by all users. After some
  weeks 25 000 such pairs were
  removed. Later 60 000 more pairs were
  removed, improving the quality of the
  remaining pairs considerably.
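The removal of pairs graded 0 by all users could look roughly like the sketch below. The data layout (a mapping from pair to a list of grades, with `None` standing for "I don't know"), the function name and the `min_gradings` parameter are all assumptions made for illustration; the slides do not say how many gradings a pair needed before it was removed.

```python
def unanimous_zero_pairs(gradings, min_gradings=3):
    """Return pairs with at least min_gradings numeric grades, all equal to 0."""
    bad = set()
    for pair, grades in gradings.items():
        numeric = [g for g in grades if g is not None]  # None = "I don't know"
        if len(numeric) >= min_gradings and all(g == 0 for g in numeric):
            bad.add(pair)
    return bad

# Toy gradings; the word pairs are taken from the klass example in this talk.
gradings = {
    ("klass", "uppdrag"): [0, 0, 0, None],
    ("klass", "rang"): [5, 4, 5],
    ("klass", "utbilda"): [0, 0],        # too few gradings to decide yet
    ("klass", "poäng"): [0, 0, 3, 0],    # not unanimous
}
to_remove = unanimous_zero_pairs(gradings)
```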
User gradings first two months

 [Bar chart: share of gradings (0% to 60%) for each grade 0 to 5 and "don't know"]
More interesting gradings 2006

 [Bar chart: share of gradings (0% to 60%) for each grade 0 to 5 and "don't know", comparing 2005 with 2006]
Distribution of mean gradings of
word pairs after two months
 [Chart: share of word pairs (0% to 40%) by mean grading, 0 to 5 in steps of 0.5; 217 000 pairs]
Distribution of mean gradings of
word pairs 2006
 [Chart: share of word pairs (0% to 40%) by mean grading, 0 to 5 in steps of 0.5, comparing 2005 with 2006]
Analysis of the pairs graded 0
Distance (cosine) in RI space

 [Chart: share of pairs (0% to 50%) by cosine in RI space, 0.1 to 0.6, comparing pairs graded 0 with all pairs]
Some statistics (November 2006)

 2.5 M user gradings done
 67 000 pairs (graded ≥ 2) in dictionary
 90 000 pairs suggested by users
 50 000 unique pairs suggested
 14 000 of them have been accepted
Example: Synonyms to klass (class)
5: rang (grade), rank (rank), slag (kind)
4: kategori (category), stånd (social class), årskurs (grade)
3: fack (sphere), grad (degree), grupp (group), kvalitet (quality),
   nivå (level), ordning (order), skikt (layer), sort (sort),
   standard (standard), stil (style)
2: storleksordning (magnitude), typ (type)
1: poäng (point), stadga (stability)
0: uppdrag (mission), utbilda (educate)
How to prevent abuse?
 Many gradings of a word pair are
  needed before it’s considered to be
  good
 The pair to be graded is randomly
  picked from a very large list
 Word pairs suggested by users are spell
  checked before they are added to the
  very large list
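Two of these safeguards, random selection of the pair to grade and requiring many gradings before a pair counts as good, can be sketched as below. The function names and the `min_gradings` value are hypothetical. The default cut-off of 2 follows the "graded ≥ 2" figure in the statistics, but applying it to the mean of the numeric grades is an assumption.

```python
import random
from statistics import mean

def pick_pair(candidates, rng=random):
    """Pick the next pair to show uniformly at random, so a user cannot choose it."""
    return rng.choice(candidates)

def is_accepted(grades, min_gradings=5, min_mean=2.0):
    """Accept a pair only after enough numeric grades with a high enough mean.

    None stands for "I don't know" and is ignored.
    """
    numeric = [g for g in grades if g is not None]
    if len(numeric) < max(1, min_gradings):
        return False
    return mean(numeric) >= min_mean

candidates = [("klass", "rang"), ("klass", "sort"), ("klass", "typ")]
shown = pick_pair(candidates, rng=random.Random(0))
```

With a very large candidate list, a single user has little chance of being shown the same pair repeatedly, and a handful of extreme grades cannot move a pair into the dictionary alone.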
People's definition of synonymy

 Exact meaning of 'synonym' wasn’t
  defined
 Users will grade using their intuitive
  understanding of the concept of
  synonymy and the words in the pair
 The produced dictionary will use the
  people's own definition of synonymy.
  Hopefully this is exactly what they want!
The people’s synonym dictionary
on the web

http://lexin.nada.kth.se/cgi-bin/synlex
Lessons learned

 The list of suggested synonyms should
  be huge
 Try to improve the quality of the list
  automatically as much as possible.
  Random indexing is useful for this; also
  try tagging and using other dictionaries
 Use the 0 answers early to remove bad
  pairs that only irritate the users