2005 Hulet, Steven R
Document Sample


Hulet, Steve
Synonymity of Memorized Patterns:
Considering Patterns and Word Order in Unsupervised
Language Learning
Faculty Mentor: Sean Warnick, Computer Science
Despite the ever-increasing abilities of computers, natural language analysis is still a challenge. The
intricacies of natural language are far too many to enumerate, giving rise to automated algorithms
which learn how the language is used from large text corpora. Many current methods use complex sta-
tistical approaches involving multi-dimensional vectors and factor analysis, and admittedly fail to take
into account valuable contextual information such as word order. This project continued to develop
a novel, simple, unsupervised learning approach for determining word similarity based on context in
reoccurring word sequences; previous work confirms that similar words are used in similar contexts.
The algorithm reads raw, untagged text, recording repeated word-use patterns and their frequency.
These patterns preserve the context of words by remembering where each word appears in relation to
those surrounding it. By examining the context of a target word, other words or phrases with similar
meaning can be found within similar contexts.
Our learning model was developed in cooperation with Sandia National Laboratories. The underlying
algorithm is known as S-learning; applied to natural language processing the method is known as
Synonymity of Memorized Patterns (SMP). While our implementation of the process was complete
and preliminary results received, we lacked formal methods to compare the performance of our learner
with previously exiting learners. Developing such methods was the goal of this project. We pro-
posed to develop a mathematical formulation of SMP to allow for easier critique, comprehension, and
comparison of the algorithm. Our plans to compare head-to-head SMP with Latent Semantic Analy-
sis (LSA), a well known statistically based word-similarity prediction algorithm, were foiled when
the BYU Computer Science Department removed critical software from their lab machines halfway
through the development process.
We formulated the synonym detection problem thus: given a large unparsed text corpus and a target
word, automatically find synonyms of the target word and order them according to relevancy. The
concept of word similarity is somewhat subjective, but testing methods do exist, and in this applica-
tion intuition is often a reasonable guide.
We model natural language as a Markov process. Each state S1 , S2 , ..., Sn represents a unique sequence
of k words; thus for a language of w words there are wk possible states. Each edge from Si to Sj
has transition probability pi (j) = P (Sj |Si ) and generates the word wj when taken. The sequence of
words a state represents serves to remember the k words most recently seen prior to arriving at that
state. With such a model it is possible to calculate the probability that any set of words will appear
in any given sequence. To do this we find a path following the edges which generate the desired word
sequence, and multiply the transition probabilities. For the string wi , wi+1 , . . . , wi+n , we first find the
state Sj which represents the first k words of the string. We then follow the edges which generate
1
the remaining words in the string, multiplying the transition probabilities, until we end in the state
representing the last k words of the string. A string of n words is represented by n − k + 1 states.
Thus the probability of any arbitrary path [Sj , Sj+1 , . . . , Sj+(n−k) ] can be expressed by
i+n−1
P [Sj , Sj+1 , . . . , Sj+(n−k) ] = P (wi+k , wi+(k+1) , . . . , wi+n |Sj ) = pk (k + 1) .
k=j
One common measure of determining word similarity is the degree to which one word can be sub-
stituted for another, that is, the extent to which two words can be found in identical contexts. A
context is defined as the states in the path on either side of a word. For convenience in notation,
the state immediately preceding the word of interest is labeled the initial state, S0 , and the state
immediately following the word of interest is labeled the final state, Sf . The set of words that can be
substituted into a given context can be found by identifying alternate paths with nonzero path prob-
abilities between the initial and final states. Including the initial and final states, the length of the
path between S0 and Sf is k + 2 states. Thus, for a context S0 and Sf , where Sf = wf1 , wf2 , . . . , wfk ,
the “replacement set” of words is given by words for which the path probability P > 0 and is given by
P (wj , wf1 , . . . , wfk |Si ). That is, we find probability of that phrase occurring for every possible word wj .
Synonyms for a specific word ws are obtained by finding all contexts in which ws occurs, calculating
the replacement set for each of those contexts, and combining the replacement sets into a synonym
set. Factoring in the frequency with which both ws and each replacement word appears in a context
allows for ordering of the synonym set. All words ranking high in such an ordering may be said to be
similar to each other.
When k = 1 the Markov model becomes first order. This model has the property that when calculating
the occurrence probabilities of state paths that differ by a single word wj , the paths transversing the
state space are identical through wj−1 and are again identical starting with wj+1 . First order models
allow only a limited sense of context; only the current word influences the decision of what word will
follow. To allow more of the context to affect each transition, we increase the order of the model.
Each transition probability in an nth order model is influenced by the preceding n words. Since word
selection in natural language is not based solely on the word immediately preceding, but on a larger
context, a model with a larger k seems more appropriate.
Given such a model, knowing more context around a target word ws will help us find better syn-
onyms of ws . Suppose we are looking for synonyms of wb in the string wa , wb , wc . We may learn
that P (wa , wi , wc ) = P (wa , wj , wc ), and P (wa , wi , wc ) > P (wa , wk , wc ) for all k = i, j, thus providing
no information as to which is the best synonym. If we are then able to learn that the string we are
interested in, wa , wb , wc , comes from the greater context wz , wa , wb , wc , wd , this provides additional
information. If P (wz , wa , wi , wc , wd ) = P (wz , wa , wj , wc , wd ) then one of the candidates can be judged
a better synonym than the other. If P (wz , wa , wi , wc , wd ) is high and P (wz , wa , wj , wc , wd ) is low, than
wi is the clear winner. This simply indicates that the substring wa , wj , wc is common within some
other context. In the case P (wz , wa , wi , wc , wd ) = P (wz , wa , wj , wc , wd ), then this model is affirming
the equal probability of wi and wj .
This model is still incomplete and suffers from a number of flaws. For example, there are problems
when trying to analyze strings with fewer than k words—such a sequence looses probabilistic meaning
in this model. A large question remains in determining the optimal order to use when modeling natural
language. To circumvent this, as well as to optimally use all available information in any situation, an
interesting line of future research may include multi-order Markov models, which compute and then
combine results from n j th order Markov models, 1 ≤ j ≤ n.
2
Related docs
Get documents about "