Making the Nearest Neighbor Meaningful
Daniel Tunkelang, Endeca
The nearest-neighbor problem arises in clustering and other applications. It requires us to define a function to
measure differences among items in a data set, and then to compute the closest items to a query point with respect to
this measure. Recent work suggests that the conventional Euclidean measure does not adequately model high-
dimensional data. We present a new, data-driven difference measure for categorical data for which the difference
between two data points is based on the frequency of the categories or combinations of categories that they have in
common. This measure addresses the main flaw of the Euclidean distance measure—namely, that it treats each
dimension independently. We then provide both brute-force algorithms and an efficient, but approximate,
probabilistic algorithm to compute the nearest neighbors of a query point with respect to this measure. Finally, we
illustrate a practical application of our approach in a recommendation engine built for the Tower Records online video
and DVD catalog.
1 Background
The data clustering problem is that of partitioning a set of items into clusters so that two items in the same cluster are
more similar than two items in different clusters. As such, it presumes a solution to another problem—namely, the
problem of measuring the differences among items in a data set. In this work, we focus on the problem of defining a
difference measure and efficiently computing the nearest neighbors of a query point with respect to this measure.
Conventionally, this difference measure has been defined in terms of distance in Euclidean space—that is, the L2
norm. This difference measure is intuitive because it appeals to a geometric metaphor, for which nearest neighbors
are literally that.
The Euclidean difference measure is the crux of most approaches to the nearest-neighbor problem—that is, the
problem of computing the data points most similar to a given query point. Originally posed for two-dimensional data
[1], the Euclidean nearest-neighbor problem has been solved for low-dimensional spaces using multidimensional
index structures [2].
Unfortunately, these indexing methods do not perform well for the nearest-neighbor problem in high-dimensional
spaces [3, 4, 5]. As a result, it may be more effective to scan the entire data set rather than use sophisticated data
structures, if only for the sake of simplicity [6].
Moreover, recent work suggests that Euclidean distance is not an adequate model for high-dimensional data. Under
a broad set of conditions, as dimensionality increases, the expected value of the Euclidean distance from a query
point to its nearest neighbor approaches the expected distance from the query point to its farthest neighbor, thereby
making the notion of a nearest neighbor meaningless [3].
Hinneburg et al. address this high-dimensional modeling problem by projecting the data set onto a low-dimensional
subspace whose dimensions are chosen based on the local neighborhood of the query point [7]. They develop a
quality criterion for this subspace using kernel density estimation [8], and then attempt to optimize with respect to
that criterion using a heuristic that combines greedy and genetic algorithms.
Even this last method does not fully address the nearest-neighbor problem. Most significantly, it assumes that the
data points are, or can be modeled as, points in $\mathbb{R}^n$ for some set of n real-valued dimensions. While this
representation makes sense for real-valued data, it is cumbersome for categorical data that naturally map to points in
$\{0,1\}^n$. Categorical data can be, and often are, treated as points in $\mathbb{R}^n$, but this representation can result in a
high-dimensional real-valued representation that fails to exploit the sparsity of the data with respect to those dimensions.
There has been other work on non-Euclidean measures for categorical data, most notably the Value Difference Metric
(VDM) of Stanfill and Waltz [9]. The VDM assumes that each data point is a vector of attribute-value pairs, and
measures the distance between values for a given attribute. This inter-attribute measure is used to derive a distance
measure for data points.
2 Assumptions
We assume that each data point is a set of categories, in particular a subset of the universe of n categories
$\{c_1, c_2, \ldots, c_n\}$. A category is simply a symbolic data element.
We rely on the frequency distribution of categories in the data set in order to determine the relative importance of the
categories in computing the differences among items in the data set. Our approach contrasts sharply with
approaches that editorially assign weights to categories.
Finally, we assume that all of the categories are relevant to determining the similarity among items in the data set. In
fact, this assumption is a corollary of the previous ones. In most practical applications, our approach may require a
pre-processing phase that standardizes the data representation and eliminates irrelevant categories.
3 The Difference Measure
Let us consider, as an example, a database of university alumni that associates alumni with demographics and
summaries of their studies. For simplicity, we will assume that each of the alumni is associated with his or her gender,
nationality, degree, and field of study. The table in the Appendix shows a sample of rows from such a database.
We can immediately make several observations about the categories. The two gender categories do not provide us
with much information: roughly half the alumni are of each gender. Nationality is much less uniform: although most
of the alumni are from the United States, other countries are represented as minorities. Similarly, most of the degrees
are Bachelor’s degrees, with fewer Master’s degrees and even fewer PhDs.
From these observations about the category frequencies, we can make a first cut at the relative importance of the
various categories in determining differences among the data points. The gender categories are fairly unimportant:
knowing that two alumni are of the same gender tells us little when half of the alumni in the database are of that
gender. Nationality tells us little for alumni from the United States; it is more interesting, however, when two alumni
are from the same country other than the United States. Similarly, it is more interesting when two alumni are both
PhDs than when they both have Bachelor’s degrees.
Information theory tells us that a category that occurs with probability p conveys $\log_2(1/p)$ bits of information.
Accordingly, infrequent categories convey more information than frequent ones, and a category that occurs with
probability 1 conveys no information at all. We thus assert that, all else equal, two alumni with an infrequent, high-
information category in common are more similar than two alumni with a frequent, low-information category in
common. For example, given no further information, two alumni with PhDs are more similar than two alumni with
Bachelor’s degrees, or than two alumni who are both male.
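As a minimal sketch of this calculation, the following Python snippet estimates each category's probability from its frequency in a toy data set and converts it to bits of information. The function name `category_information` and the eight sample records are illustrative assumptions; they are not taken from the alumni database.

```python
import math
from collections import Counter

def category_information(data):
    """Return {category: bits}, where bits = log2(1/p) and p is the
    fraction of data points that carry the category."""
    counts = Counter(c for point in data for c in set(point))
    n = len(data)
    return {c: math.log2(n / k) for c, k in counts.items()}

# Hypothetical toy data: each data point is a set of categories.
alumni = [
    {"Male", "BA"}, {"Male", "BA"}, {"Male", "MA"}, {"Male", "PhD"},
    {"Female", "BA"}, {"Female", "BA"}, {"Female", "BA"}, {"Female", "MA"},
]
bits = category_information(alumni)
print(bits["Male"])  # 1.0 bit: half of the records are male
print(bits["PhD"])   # 3.0 bits: one record in eight has a PhD
```

In this toy sample the frequent BA category carries less than one bit, so sharing it says less about two alumni than sharing a gender, and far less than sharing a PhD.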
The single-category probability distributions, however, only give a rough first approximation to the joint probability
distribution of categories. As we can see from the data in the Appendix, the category distributions are by no means
independent. For example, alumni who studied business administration all have Master’s degrees. The
dependencies among categories can be subtle: alumni from the United States who studied computer science are
disproportionately male, but the gender ratio is even for alumni from India.
The dependencies among categories prevent us from obtaining a meaningful difference measure if we treat each
category independently. Even if we attempt to normalize the categories by assigning weights to them (i.e., higher
weights to categories that occur less frequently), we still fail to capture the dependencies. For example, Female and
Computer Science are both frequent categories, but their combination is far less frequent than one would expect if
the two categories were independent. It is here that the Euclidean difference measure suffers, because it implicitly
assumes that the categories can be treated independently of one another.
Our categorical model allows us to take a simpler and cleaner approach. We first define, for a given data set D and
category set s, $D_s$ to be the subset of D associated with all of the categories in s. For example, $D_\emptyset = D$, since the
empty category set Ø does not filter out any data points in D. In the alumni database, $D_{\{\text{Male}\}}$ is the set of all data
points corresponding to male alumni, and $D_{\{\text{Math}, \text{PhD}\}}$ is the set of all data points corresponding to alumni who
obtained PhDs in Math.
We then define the difference measure ∆ to determine the difference between data points x and y:
$$\Delta(x, y) = \frac{\log_2 |D_{x \cap y}|}{\log_2 |D|}$$
A few examples from the alumni database illustrate how this difference measure works.
$\Delta(x, x) = 0$, assuming that there is no data point y distinct from x associated with all of the categories that are
associated with x (we can satisfy this assumption by assigning a unique identifying category to each data point).
Hence, $D_{x \cap x} = D_x = \{x\}$ and $\log_2 |D_x| = 0$.
If $x \cap y = \emptyset$, then $\Delta(x, y) = 1$, since $D_\emptyset = D$. In other words, if two data points have no categories in common, then
they are maximally different.
If $x \cap y = \{\text{Male}\}$ (that is, x and y are male but have nothing else in common) and half of the alumni are male, then
$\Delta(x, y) = \log_2 |D_{\{\text{Male}\}}| / \log_2 |D| = \log_2 (|D|/2) / \log_2 |D| = 1 - 1 / \log_2 |D|$. For |D| = 1024, $\Delta(x, y) = 0.9$.
If $x \cap y = \{\text{Female}, \text{PhD}, \text{Computer Science}\}$ (that is, x and y are female computer science PhDs from different
countries) and only 1/512 of the alumni are associated with all three of these categories, then $\Delta(x, y) = 1 - 9 / \log_2 |D|$.
For |D| = 1024, $\Delta(x, y) = 0.1$.
In summary, the difference measure ∆ provides a data-driven measure of the difference between two data points,
ranging from 0 to 1. We note that ∆ is a measure but not a metric, as it does not satisfy the triangle inequality. It is
easy to find x, y, z such that ∆ (x, z) > ∆ (x, y) + ∆ (y, z). Nonetheless, this measure captures a notion of similarity that
is driven by the data distribution.
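The following Python sketch evaluates the difference measure by direct counting and exhibits one violation of the triangle inequality. It is an illustration under stated assumptions: the helper names `matching` and `delta` and the 16-point data set are hypothetical, every data point carries a unique identifying category, and both arguments of `delta` are assumed to be members of the data set.

```python
import math

def matching(data, s):
    """Return D_s: the data points that carry every category in s."""
    return [point for point in data if s <= point]

def delta(data, x, y):
    """Difference measure: log2 |D_{x∩y}| / log2 |D|.

    Assumes x and y are members of data, so D_{x∩y} is never empty.
    """
    return math.log2(len(matching(data, x & y))) / math.log2(len(data))

# Hypothetical data set of 16 points, each with a unique identifying category.
data = [frozenset({f"id{i}", "c"}) for i in range(16)]
data[0] = frozenset({"id0", "a"})
data[1] = frozenset({"id1", "a", "b"})
data[2] = frozenset({"id2", "b"})
x, y, z = data[0], data[1], data[2]

print(delta(data, x, x))  # 0.0: only x carries all of x's categories
print(delta(data, x, y))  # 0.25: two points share "a", log2(2) / log2(16)
print(delta(data, y, z))  # 0.25: two points share "b"
print(delta(data, x, z))  # 1.0: nothing in common, so D_{x∩z} = D
```

Here the difference between x and z is 1, which exceeds the sum 0.25 + 0.25 of the differences through y, so the triangle inequality fails for this triple.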
What is the cost of computing $\Delta(x, y)$? Since $\log_2 |D|$ is constant for a given data set D, we only need to compute
$|D_{x \cap y}|$. More generally, we are concerned with computing $D_s$ for a given category set s.
We assume that D is represented using a suitable index that allows us to obtain the sorted list of data points
associated with a given category c in constant time. Given this data structure, we can compute $D_s$, and hence $|D_s|$,
in $O(|s| \cdot \min_{c_i \in s} |D_{\{c_i\}}|)$ time by intersecting the sorted lists; that is, in time proportional to the product of the number of
categories in s and the number of data points associated with the least frequent category in s.
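One way to realize such an index is an inverted index that maps each category to a sorted list of data point ids. The Python sketch below is an assumption-laden illustration rather than the data structure used in practice: the names `build_index`, `contains`, and `points_with_all` are hypothetical, and membership is tested by binary search, which adds a logarithmic factor to the bound stated above.

```python
from bisect import bisect_left

def build_index(data):
    """Map each category to the sorted list of ids of the points that carry it."""
    index = {}
    for point_id, categories in enumerate(data):
        for c in categories:
            index.setdefault(c, []).append(point_id)  # ids arrive in increasing order
    return index

def contains(sorted_ids, target):
    """Binary search for target in a sorted list of ids."""
    i = bisect_left(sorted_ids, target)
    return i < len(sorted_ids) and sorted_ids[i] == target

def points_with_all(index, s, num_points):
    """Return the ids of D_s by scanning the shortest list and probing the others."""
    if not s:
        return list(range(num_points))  # the empty category set filters nothing
    lists = sorted((index.get(c, []) for c in s), key=len)
    shortest, rest = lists[0], lists[1:]
    return [i for i in shortest if all(contains(other, i) for other in rest)]

# Hypothetical four-record data set.
records = [{"M", "PhD"}, {"F", "PhD", "CS"}, {"F", "BA"}, {"F", "PhD", "CS"}]
index = build_index(records)
print(points_with_all(index, {"F", "PhD"}, len(records)))  # [1, 3]
```

Driving the scan from the least frequent category keeps the work proportional to the size of the smallest list, which is the property the cost estimate relies on.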
4 Finding Nearest Neighbors by Brute Force
We now present a few brute force approaches to computing the nearest neighbors of a data point x with respect to
the difference measure ∆ (the next section discusses a more efficient, but approximate, approach).
The simplest approach is a sequential scan through the data. For each data point y in D, we compute ∆ (x, y) and then
sort in increasing order of the ∆ values. This approach requires |D| computations of the ∆ function, and is only
practical for moderately sized data sets (e.g., fewer than 10,000 data points) or for applications that do not require
interactive response times.
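A sequential scan can be sketched in a few lines of Python. This is a naive illustration: the function name `nearest_by_scan` is hypothetical, each data point is assumed to be a frozenset of categories, and each cardinality is obtained by counting over the whole data set (with a small cache) rather than through an index.

```python
import math

def nearest_by_scan(data, x, k):
    """Rank every data point by its difference from x and return the k nearest
    as (difference, index) pairs. Assumes each point is a frozenset of categories."""
    log_n = math.log2(len(data))
    sizes = {}  # cache of |D_s| keyed by the shared category set s
    ranked = []
    for i, y in enumerate(data):
        s = x & y
        if s not in sizes:
            sizes[s] = sum(1 for point in data if s <= point)
        ranked.append((math.log2(sizes[s]) / log_n, i))
    ranked.sort()
    return ranked[:k]

# Hypothetical data set of 16 points, each with a unique identifying category.
data = [frozenset({f"id{i}", "c"}) for i in range(16)]
data[0] = frozenset({"id0", "a"})
data[1] = frozenset({"id1", "a", "b"})
data[2] = frozenset({"id2", "b"})

print(nearest_by_scan(data, data[0], 3))  # [(0.0, 0), (0.25, 1), (1.0, 2)]
```

The query point itself always comes first with difference 0, so a caller interested in neighbors would typically skip that entry.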
Another approach is to enumerate the subsets of the categories associated with x. For each of the $2^{|x|}$
category subsets s, we compute $D_s$, that is, the set of data points in D that are associated with all the categories in s.
For each data point y in D, $\Delta(x, y)$ is determined by the cardinality of the smallest set of data points that contains y and
corresponds to one of the $2^{|x|}$ enumerated category subsets. Every data point y in D is in at least one of these $2^{|x|}$
sets, since $D_\emptyset = D$. This approach requires $2^{|x|}$ computations of $D_s$, and outperforms the sequential
scan when $2^{|x|}$ is less than |D|.
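The subset enumeration can be sketched as follows. The function name `deltas_by_subset_enumeration` is hypothetical, categories are assumed to be strings so that they can be sorted, and each set of matching data points is found by a plain scan rather than through an index.

```python
import math
from itertools import combinations

def deltas_by_subset_enumeration(data, x):
    """Return {index of y: difference between x and y} by visiting every subset of x."""
    smallest = {}  # index of y -> size of the smallest enumerated set containing y
    categories = sorted(x)
    for r in range(len(categories) + 1):
        for combo in combinations(categories, r):
            s = frozenset(combo)
            members = [i for i, point in enumerate(data) if s <= point]
            for i in members:
                if i not in smallest or len(members) < smallest[i]:
                    smallest[i] = len(members)
    log_n = math.log2(len(data))
    return {i: math.log2(size) / log_n for i, size in smallest.items()}

# Hypothetical data set of 16 points, each with a unique identifying category.
data = [frozenset({f"id{i}", "c"}) for i in range(16)]
data[0] = frozenset({"id0", "a"})
data[1] = frozenset({"id1", "a", "b"})
data[2] = frozenset({"id2", "b"})

deltas = deltas_by_subset_enumeration(data, data[0])
print(deltas[0], deltas[1], deltas[2])  # 0.0 0.25 1.0
```

The smallest enumerated set that contains y is the one for the subset of categories that x and y share, which is why the result agrees with evaluating the difference measure directly.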
A refinement of the latter approach is to use a priority queue to enumerate the subsets in increasing order of their
difference from x. We take advantage of the fact that, if $s_1$ is a subset of $s_2$, then $|D_{s_1}| \geq |D_{s_2}|$. Accordingly, we can
use a priority queue to enumerate the subsets in increasing order of the cardinality of the corresponding sets of data
points.
The priority queue initially contains the single subset x with priority |Dx| = 1. On each iteration, we remove from the
queue the subset s with the smallest priority value and compute all subsets that can be obtained by removing a single
category from s and that we have not already dequeued. For each of these subsets s, we compute its priority $|D_s|$ and
place it on the queue with that priority.
This process will dequeue all of the subsets of x in increasing order of the cardinality of the corresponding sets of
data points. If we are only concerned with the nearest neighbors of x, then we may be able to compute significantly
fewer subsets than if we enumerated through all of the subsets of x.
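The priority-queue refinement can be sketched with Python's `heapq` module. This is an illustration under assumptions: the function name `nearest_by_priority_queue` is hypothetical, categories are assumed to be strings, data points are frozensets, and subsets are marked as seen when they are enqueued (rather than when they are dequeued) so that no subset enters the queue twice.

```python
import heapq

def nearest_by_priority_queue(data, x, k):
    """Return up to k indices of data points other than x, nearest first."""
    def members(s):
        return [i for i, point in enumerate(data) if s <= point]

    start = frozenset(x)
    # Heap entries are (|D_s|, sorted categories of s); the list breaks ties.
    heap = [(len(members(start)), sorted(start))]
    seen = {start}
    found = []
    while heap and len(found) < k:
        _, categories = heapq.heappop(heap)
        s = frozenset(categories)
        for i in members(s):
            if data[i] != x and i not in found:
                found.append(i)
        for c in categories:  # generalize s by dropping one category
            child = s - {c}
            if child not in seen:
                seen.add(child)
                heapq.heappush(heap, (len(members(child)), sorted(child)))
    return found[:k]

# Hypothetical data set of 16 points, each with a unique identifying category.
data = [frozenset({f"id{i}", "c"}) for i in range(16)]
data[0] = frozenset({"id0", "a"})
data[1] = frozenset({"id1", "a", "b"})
data[2] = frozenset({"id2", "b"})

print(nearest_by_priority_queue(data, data[0], 1))  # [1]
```

With k = 1 the loop stops after dequeuing three of the four subsets of the query point, which is the kind of saving the refinement is after.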
These brute-force approaches are appropriate for data sets where either |D| is small enough to allow a sequential scan
or the category sets associated with data points are small enough to allow an enumeration of subsets. The subset
enumeration approaches do not require that the universe of categories be small, but only that each data point be
associated with a small number of them.
For large data sets where each data point is associated with even a moderate number of categories (e.g., |x| = 20),
however, these approaches become impractically slow. In fact, for very sparse data sets, the priority queue approach
can suffer from a peculiar form of the curse of dimensionality, in which most subsets s of x have the property that
$D_s = D_x = \{x\}$. In this situation, the priority queue approach will churn through most of the $2^{|x|}$ category subsets in
search of a single neighbor distinct from x.
In fact, the problem of determining a maximal subset s of x such that |Ds| > 1 can be shown to be an NP-hard
optimization problem by reduction to the minimum set cover problem [10], since it is equivalent to finding a minimal
subset s of x such that |Ds| = 1.
5 Finding Nearest Neighbors using Random Walks
Given the apparent intractability of an exact nearest-neighbor algorithm for large data sets where each data point is
associated with a large number of categories, we turn to a probabilistic approach that performs random walks on the
data set D.
This random-walk process starts out at a maximally general state—namely, the entire data set D—and then
progressively narrows towards the query point x by sequentially specifying categories in x. At some intermediate
point in this narrowing process, however, it randomly picks any data point that is associated with all of the categories
that have been specified thus far.
The following algorithm implements this approach:
1. Initialize z, the state of the random walk, to be Ø.
2. If Dz = {x}, then return x and terminate.
3. With probability p, return a randomly chosen data point from Dz and terminate.
4. Otherwise, randomly choose a category in x - z, add it to z, and return to step 2.
Assuming that we choose a non-zero value for the constant p, this process will return each element of D with non-
zero probability. Moreover, this process will favor the data points closer to x, since they are more likely to stay in Dz
as we add categories to z. Unfortunately, this process may return x itself—which is certainly close to x but not a
useful result!
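The four steps can be sketched in Python as follows. This is a minimal illustration, not the production implementation: the function name `random_walk` and the `rng` parameter are hypothetical, data points are assumed to be frozensets of string categories, x is assumed to be a member of the data set, and a guard is added for the case where x is not unique in the data.

```python
import random

def random_walk(data, x, p, rng=random):
    """One run of the random walk toward x with stopping probability p."""
    z = set()                                  # step 1: start at the empty category set
    while True:
        d_z = [point for point in data if z <= point]
        if d_z == [x]:                         # step 2: only x is left
            return x
        if rng.random() < p:                   # step 3: stop and sample from D_z
            return rng.choice(d_z)
        remaining = sorted(x - z)
        if not remaining:                      # guard: x is not unique in data
            return rng.choice(d_z)
        z.add(rng.choice(remaining))           # step 4: specify one more category

# Hypothetical data set of 16 points, each with a unique identifying category.
data = [frozenset({f"id{i}", "c"}) for i in range(16)]
data[0] = frozenset({"id0", "a"})
data[1] = frozenset({"id1", "a", "b"})
data[2] = frozenset({"id2", "b"})

neighbor = random_walk(data, data[0], 0.5, random.Random(0))
```

Passing a seeded `random.Random` instance makes a run repeatable, which is convenient when comparing settings of p.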
Our choice of p will determine the variance of the random-walk process. Setting p to be 0 will result in a variance of 0,
since the process will always return x. Setting p to be 1 will result in a variance of $\log_2 |D|$, since the process will
terminate the first time it reaches step 3 with z = Ø. Any choice of p between 0 and 1 will result in a non-trivial
random variable that returns data points with probabilities correlated to their distance from x.
While this algorithm is appealingly simple, a few variations make it more practical.
If we are only interested in neighbors y such that $\Delta(x, y) \leq r$ for a given value r, then we can modify step 3 to not
terminate unless $\log_2 |D_z| \leq r \log_2 |D|$.
To avoid the situation where the random-walk process returns x, we can modify step 4 to avoid choosing categories
in x - z that result in Dz = {x}. Doing so, however, is not always feasible; z may be such that the addition of any
category results in Dz = {x}. We can also modify step 3 to always choose a data point other than x from Dz.
To speed up the algorithm, we can bias the choice of categories in step 4 to favor categories that significantly reduce
the cardinality of $D_z$. For example, we can make the probability of picking a category c inversely proportional to
$|D_{z \cup \{c\}}|$. Doing so reduces the variance of the random walk, making it rapidly narrow in on the neighborhood of x.
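The biased choice of categories might be sketched as below. The function names `category_weights` and `pick_category` are hypothetical, and the sketch assumes that z is a subset of x, that x is a member of the data set, and that at least one category of x remains unspecified.

```python
import random

def category_weights(data, x, z):
    """Weight each unspecified category c of x by 1 / |D_{z ∪ {c}}|."""
    weights = {}
    for c in sorted(x - z):
        size = sum(1 for point in data if (z | {c}) <= point)
        weights[c] = 1.0 / size  # size >= 1 because x itself is in data
    return weights

def pick_category(data, x, z, rng=random):
    """Step 4 with a bias toward categories that shrink D_z the most."""
    weights = category_weights(data, x, z)
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]

# Hypothetical data set of 16 points, each with a unique identifying category.
data = [frozenset({f"id{i}", "c"}) for i in range(16)]
data[0] = frozenset({"id0", "a"})
data[1] = frozenset({"id1", "a", "b"})
data[2] = frozenset({"id2", "b"})

print(category_weights(data, data[0], frozenset()))  # {'a': 0.5, 'id0': 1.0}
```

`random.choices` normalizes the weights, so they only need to be proportional to the intended probabilities.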
Finally, for practical applications where we want to collect a number of nearby neighbors, we may prefer to replace
the selection of a single data point in step 3 with the selection of some or even all of the data points. This variation
makes most sense when we combine it with the constraint that we only terminate when |Dz| ≤ r for an appropriately
small value r.
The random-walk process acts as a black box that outputs each element of D with a probability that is correlated to
that element’s difference from x. Ideally, the probability would be monotonic in the difference; that is, for all $y_1$, $y_2$, if
$\Delta(x, y_1) < \Delta(x, y_2)$, then we would like the random walk to return $y_1$ with higher probability than $y_2$. While the
random walk seems to exhibit this property in practice, the problem of proving exact or approximate properties about
the relationship between the probability distribution produced by the random walk and the ∆ function remains for
future work.
The main benefit of the random-walk approach is that its running time (measured in the number of times we compute
$|D_s|$) is linear, rather than exponential, in |x|. Even though we may have to perform the random-walk algorithm
repeatedly in order to obtain a set of neighbors, the number of iterations will hardly be exponential in |x|. In the
absence of a better analysis of the probability distribution produced by the random walk, we cannot make theoretical
claims about the necessary number of iterations. In practice, however, we make the number of iterations proportional
to the number of neighbors desired for the given query, using a small constant of proportionality (e.g., 10).
6 Practical Application: A Recommendation Engine
As discussed earlier, we assume that D is represented using an index that allows us to compute $D_s$ in
$O(|s| \cdot \min_{c_i \in s} |D_{\{c_i\}}|)$ time. We can use a standard relational database management system (RDBMS) to implement such a
representation. Indeed, we only need a system that supports efficient read-only conjunctive queries to implement
either the brute-force or random-walk algorithms.
Endeca has implemented a variation of the random-walk algorithm using proprietary data structures that also support
efficient search and navigation of the data [11]. Endeca has used this implementation to provide a recommendation
engine for the roughly 40,000 movie titles in the Tower Records video and DVD catalog [12].
The recommendation engine associates each movie with its director, genres, and themes—often over twenty
categories in all. The number of categories per movie ruled out subset enumeration, while the number of titles made
sequential scan unacceptable. The engine, which is incorporated into the Tower.com website, is able to compute
recommendations in well under a second using commodity hardware.
Figure 1: Tower Records Recommendation Engine
Figure 1 shows an example of the recommendation engine’s output for the movie Regarding Henry. Two of the
movies recommended, Postcards from the Edge and Who’s Afraid of Virginia Woolf, have the same director as
Regarding Henry, Mike Nichols. Since only 16 titles in the catalog have this director, the director carries very high
information content.
Rain Man, in contrast, has a different director, but overlaps Regarding Henry in a rare combination of genres and
themes—in fact, these are the only two titles in the catalog that are associated with both the “Drama > Family Ties”
genre and the “Wise Fools” theme.
We have built similar recommendation engines for data sets representing books, music, wine, and financial data.
Most recommendation engines for online catalogs rely on collaborative filtering, which correlates user
preferences from site traffic or purchasing history. In contrast, the present approach is data-driven, and does not
depend on user history, which can be unreliable for infrequently purchased items and may not even be available
for catalogs with frequently changing inventory.
Recommendation engines are only one of many applications of our difference measure and nearest-neighbor
algorithms. Other applications include dynamic merchandising, data discovery, data cleansing, matching, and
clustering.
7 Conclusions and Future Work
We have presented a new way of addressing the problem of measuring the differences among items in a data set.
Our difference measure is data-driven, thus avoiding the problems that conventional measures like Euclidean
distance encounter because they treat each dimension of the data independently. We have presented brute-force
algorithms to compute nearest neighbors with respect to this measure, as well as a more efficient random-walk
algorithm that provides approximate solutions. The following are some of the future directions for this work:
Empirical validation. This new approach to the nearest-neighbor problem requires extensive empirical study to
validate both the meaningfulness of the measure and the performance of the proposed algorithms.
Theoretical analysis. While the random-walk algorithm is intuitively appealing and practically effective, our inability
to analyze the probability distribution of its output is frustrating.
Handling noisy data. Noise can have an unpredictable effect on the difference measure. Specifically, noise in the
data can lead to the presence of incorrect categories or category combinations that appear to convey high information
because of their rarity. For example, repeated misspellings of common words may result in strings that occur only a
few times in the data set. To some extent, we can address this problem through data cleansing, but there remains a
need for work to determine which categories should be ignored for the purpose of computing the ∆ function. We are
considering an extension to the present approach in which we estimate the relevance of a category by estimating the
extent to which other categories depend on it. A category that is independent of all other categories and category
combinations is likely to be random noise.
Handling non-categorical data. We mentioned earlier that our approach could be used for partially ordered or even
numerical categories. For discrete, partially-ordered categories, we can simply augment our representation of data
points by also associating each data point with all of the ancestors of its categories in the partial order. Continuous
numerical dimensions pose more of a challenge. To address numerical data, we need to bucket ranges in each
numerical dimension, using overlapping ranges to avoid horizon effects. The problem of selecting ranges is
essentially a one-dimensional clustering problem, for which there are many published algorithms. We feel that doing
so is a straightforward extension to the present approach.
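The augmentation for partially ordered categories can be sketched as a closure over a parent relation. The function name `with_ancestors`, the `parents` mapping, and the small genre hierarchy are hypothetical illustrations of the idea, not part of the approach as implemented.

```python
def with_ancestors(point, parents):
    """Return the categories of point plus every ancestor reachable via parents."""
    result = set(point)
    stack = list(point)
    while stack:
        category = stack.pop()
        for parent in parents.get(category, ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

# Hypothetical partial order: each category maps to its immediate parents.
parents = {"Family Ties": ["Drama"], "Drama": ["Fiction"], "Comedy": ["Fiction"]}
print(sorted(with_ancestors({"Family Ties"}, parents)))  # ['Drama', 'Family Ties', 'Fiction']
```

After this step, two data points in different but related categories share their common ancestors, so the difference measure can credit the partial match.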
Matching and clustering. Algorithms for matching and clustering represent, either implicitly or explicitly, the matrix
of distances among the data points. Our present approach applies directly to combinatorial approaches such as
bipartite matching algorithms, since we can use our difference measure to assign edge weights. Geometric
approaches like k-means, however, rely on being able to compute the centroid of a set of data points. Arguably, the
concept of a centroid simply does not make sense for categorical data. Nonetheless, it seems worthwhile to explore
how the present approach interacts with the various matching and clustering algorithms in common use.
Acknowledgements
The author would like to thank the anonymous reviewers for their useful comments and suggestions. The author
also thanks his colleagues at Endeca for supporting this work, and for providing insightful feedback on early drafts.
References
[1] M. Shamos and D. Hoey: “Closest-point problems”, Proceedings of the 16th Annual IEEE Symposium on the
Foundations of Computer Science, 1975, pages 151-162.
[2] V. Gaede and O. Gunther: “Multidimensional Access Methods”, ACM Computing Surveys, Vol. 30, No. 2, 1998,
pages 170-231.
[3] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft: “When Is `Nearest Neighbor' Meaningful?”, Proceedings of
the 7th International Conference on Database Theory, Jerusalem, Israel, 1999, pages 217-235.
[4] R. Weber, H.-J. Schek, and S. Blott: “A Quantitative Analysis and Performance Study for Similarity-Search
Methods in High-Dimensional Spaces”, Proceedings of the 24th International Conference on Very Large Data
Bases, New York, 1998, pages 194-205.
[5] S. Berchtold, D. Keim, and H.-P. Kriegel: “The X-Tree: An Index Structure for High-Dimensional Data”,
Proceedings of the 22nd International Conference on Very Large Data Bases, Bombay, India, 1996, pages 28-39.
[6] U. Shaft, J. Goldstein, and K. Beyer: Nearest Neighbor Query Performance for Unstable Distributions, Technical
Report TR 1388, Department of Computer Science, University of Wisconsin at Madison.
[7] A. Hinneburg, C. Aggarwal, and D. Keim: “What is the nearest neighbor in high dimensional spaces?”,
Proceedings of the 26th International Conference on Very Large Data Bases, Cairo, Egypt, 2000, pages 506-515.
[8] B. Silverman: Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, 1985.
[9] C. Stanfill and D. Waltz: “Towards Memory-Based Reasoning”, Communications of the ACM, Vol. 29, No. 12,
1986, pages 1213-1228.
[10] M.R. Garey and D.S. Johnson: Computers and Intractability, W.H. Freeman, New York, 1979.
[11] Endeca: Interactive Navigation of Large Data Sources, 2001, available at http://www.endeca.com.
[12] Tower Records online catalog: http://www.towerrecords.com.
Appendix: Sample Rows from a University Alumni Database
ID Number  Gender  Nationality  Degree  Field of Study
292163166  F       Canada       BA      Music and Theatre Arts
298812843  F       US           BA      Philosophy
305462520  M       India        PhD     Computer Science
312112197  M       India        BA      Computer Science
318761874  F       US           MA      Philosophy
325411551  F       US           BA      Philosophy
332061228  F       India        MA      Computer Science
338710905  F       India        BA      Computer Science
345360582  M       US           MA      Philosophy
352010259  F       US           MA      Computer Science
358659936  F       India        PhD     Computer Science
364545078  F       US           BA      Mathematics
365309613  F       India        BA      Biology
371959290  M       US           MA      Computer Science
378608967  M       US           BA      Computer Science
385258644  F       India        MA      Mathematics
391908321  M       India        BA      Biology
398557998  M       US           PhD     Computer Science
405207675  M       US           PhD     Computer Science
411857352  F       US           PhD     Biology
418507029  M       US           BA      Biology
425156706  M       US           MA      Computer Science
431806383  F       US           MA      Business Administration
438456060  M       US           PhD     Biology
445105737  F       US           BA      Music and Theatre Arts
451755414  F       Germany      MA      Business Administration
458405091  M       US           MA      Business Administration
465054768  M       US           MA      Music and Theatre Arts
471704445  M       US           BA      Planetary Sciences
478354122  M       Canada       MA      Business Administration
485003799  F       Canada       MA      Business Administration
491653476  F       US           BA      Planetary Sciences
498303153  F       US           MA      Planetary Sciences
504952830  M       US           MA      Business Administration
511602507  M       US           MA      Business Administration
518252184  M       US           BA      Planetary Sciences
524901861  M       US           BA      Planetary Sciences
531551538  F       US           BA      Mathematics
538201215  M       Germany      PhD     Mathematics
544850892  M       India        BA      Planetary Sciences
551500569  F       Canada       MA      Planetary Sciences
558150246  M       US           MA      Mathematics
564799923  F       India        BA      Mathematics
571449600  F       US           BA      Planetary Sciences
578099277  F       Germany      BA      Planetary Sciences
584748954  M       Germany      BA      Philosophy
591398631  F       India        PhD     Philosophy
598048308  F       Germany      BA      Computer Science
604697985  M       Germany      MA      Computer Science
611347662  M       US           BA      Economics
617997339  F       US           BA      Economics
624647016  F       US           BA      Computer Science
631296693  F       Germany      BA      Philosophy
637946370  F       India        BA      Computer Science
644596047  M       Canada       MA      Economics
651245724  M       India        BA      Computer Science
657895401  F       US           BA      Mathematics
664545078  M       Germany      BA      Mathematics
671194755  M       US           BA      Biology