Approximation Algorithms for the Metric k-Median

Paper type: Research [ ] Critical survey [ ]
Main area(s): Theory of Computation

Abstract—A short overview of recent results in clustering-related algorithms. Exact algorithms and approximations for k-median of various quality are discussed, as are variations of this problem for different input and space/time requirements. Natural alternative formulations of k-median are covered, as are related problems. Also discussed is the application of such solutions to various information-centric uses. This paper concludes with a discussion about clustering metrics as a whole.

Index Terms—clustering, survey paper, k-median, streaming algorithm, sampling algorithm, approximation algorithm, linear programming, data mining, operations research

I. INTRODUCTION

The past decade has seen significant advances in approximating several prominent NP-Hard problems from the field of operations research. At the center of this is the k-median problem, in which an appropriate partitioning of the data set into k representative clusters is sought. This is one of several metrics available to quantify the quality of a clustering.

In this survey, I give an overview of many of the applications of k-median and related problems, as well as several forms of algorithmic solutions to natural formulations of the task. In section 2, I provide a formal definition of the k-median problem as well as many variations of it. In section 3, I explain applications of k-median: why solutions to the problem are useful in practice. In section 4, I overview more than a decade's worth of results related to the problem. In section 5, k-median under alternate forms of input is explored. Section 6 discusses alternate forms of the problem. Section 7 discusses problems that are similar to k-median. I conclude with a discussion of alternate notions of clustering.

II. PROBLEM STATEMENT

Given N, a set of points in some metric space, and some integer k, our goal is to select k points, K ⊆ N. Once the points are selected, the solution cost is the sum of all N's points' distances to their nearest element of K; a higher-quality solution corresponds to a lower cost. The set of data points that have a given m ∈ K as their closest is known as m's cluster. Point m is the cluster's center, or median.

The k-median problem is NP-Hard and can be approximated both by solution cost and by the number of medians produced by the solution. In general, an [α, β]-approximation for k-median guarantees a cost of at most αOPT and uses at most βk medians. If an approximation factor is listed as a single expression, the algorithm uses exactly k medians and the expression is the cost ratio to OPT.

A. Variations of k-median

In addition to k-median, as described above, there are some naturally similar problems that arise, either due to the manner by which the data was collected or the manner by which the data will be used. Variations of k-median include:
• Online k-median. In online algorithms, some part of the input is unknown in advance. In the case of online k-median, the full metric space is known, but k is not. The goal is to determine a permutation of the points such that the first k serve as a good solution to k-median for that k, for all k. This can be thought of as asking for k in an online fashion. In related formulations, demand points arrive online and k can be increased for a cost. The latter model is more common for facility location.
• k-median with diameter, in which the maximum distance between two points in N is ∆ and the minimum distance between two is δ. This effectively limits the amount by which a point can affect the optimal cost.
• Prize-collecting k-median, in which some points may be excluded from the solution cost, either up to a limit or by paying an arbitrary cost to not assign these to any of the k medians. This is sometimes known as k-median with penalties. This puts an upper bound on the cost by which any point can affect the optimum cost.
• k-median with outliers, in which some fraction η of the points doesn't count towards the cost of the solution; this prevents small, distant clusters from significantly affecting the approximation guarantee.
• Capacitated k-median, in which each median can have at most C data points in its cluster. This is sometimes known as robust k-median.

B. Similar problems to k-median

Similar problems to k-median include k-means, k-center, and facility location. Approaches from these can be used to assist results in k-median and vice versa, and as such, are worth considering when studying k-median.
• In k-center, the goal is to select k points so as to minimize the maximal distance from any given point to its cluster center. This can be seen as setting up Wireless Access Points and minimizing the broadcast radius for all served points.
• In k-means, the objective function to minimize is the sum of squared distances from each point to its center. In Euclidean space, the 1-mean corresponds to the center of mass, although the 1-median does not have such a closed form in two or more dimensional space.
• Facility Location is similar in goal to k-median, but without an explicit bound on the number of medians that may be selected. Instead, each point that isn't a median pays a cost to connect to its nearest median, as before, and any point may be designated as a median by paying a given cost. Facility Location is used as a subroutine in many k-median solutions.
  The total cost of a facility location solution is split into the service cost, paid to connect points to medians, and the facility cost, incurred by opening new medians (facilities).
  It is worth comparing, briefly, the similarities of k-median and facility location with uniform facility costs. An α-approximation for k-median is itself an α-approximation for that version of facility location, as a search can be done for the best value of k. However, even an exact algorithm for that version of facility location does not give a solution for k-median. In a sense, this makes k-median the harder problem.

III. APPLICATIONS OF k-MEDIAN

The need for efficiently locating sources of supply creates demand for accurate solutions to the k-median problem and its variants. Efficient location permits quick and cheap transportation to customers from warehouses and improved customer relations due to speed of delivery . The solution is of concern to the firm(s) in question, as it is both a guide for maximizing profits and is instructive for infrastructure investment . It can also be used by analysts, both internal and external, as a measure of the efficiency of the firm or of the industry as a whole . Minimizing the distance to facilities from serving locations can increase the speed with which inventory can be replaced at sales locations, and thus drive up sales .

Facility Location is similarly used, as it models the fall in shipping cost associated with increased warehousing while balancing total cost . This is convenient, as most geographical systems can be represented with these two basic elements, and it is quite apparent when a situation is described in economic terms .

For some input data, such as customer transaction records, a k-median solution can be interpreted as k "typical" customers' information. This is useful for targeted advertising (by discovering market segments) or a stockholder summary. Similar applications exist for large data sets, such as web pages or phone records. Phone record clustering is useful in detecting telephone fraud . Capacitated k-median can be used to determine locations for setting up proxy servers on the web.

Problems in the form of k-center are particularly important for cases when each distance from a distribution center, rather than total distance, causes additional expenses or hardships. Perishable goods, such as fresh milk, require refrigeration to transport over distance .

Clustering is also useful in machine learning and classification, as similar cases can be grouped either into an epitomical sample or into groupings of the same, permitting quick classification of future items . Clustering can also be used to answer database queries for which traditional relational models lack the expressive power to ask in the first place, or for which it is difficult for a user to pose a specific query. Fraud detection and event cataloging fall into this category. It can also be used to alleviate the so-called "write-only memory" phenomenon that occurs when large amounts of data are gathered, making it difficult for humans - and most algorithms - to read and understand .

IV. RESULTS ON k-MEDIAN

The k-median problem has been studied in many forms. In 1992, the first polynomial-time approximation algorithm with provable performance guarantees was discovered. Nine years later, the first constant approximation was published.

A. Hardness Results

The non-metric k-median problem is NP-Hard and is as hard to approximate as Set Cover; as such, no approximation better than O(log n) may exist unless P = NP. The same is true for non-metric facility location .

The metric k-median problem is NP-Hard and Max-SNP Hard. Approximating k-median within ε while maintaining exactly k medians is also NP-hard.

Online solutions in the model of k arriving online cannot achieve better than a (2 − 2/(n−1))-competitive ratio, due to the star graph with unit-cost edges: the center must be chosen as the first point, otherwise 1-median would be n-competitive on that graph, yet the optimal solution for (n−1)-median involves selecting everything except the center piece .

The k-median problem in general graphs is (1 + 2/e) ≈ 1.736-hard to approximate . Facility location over the same is 1.463-hard to approximate .

B. Optimal Solutions on Special Metrics

Tamir  gives an O(n²k) dynamic programming algorithm to solve k-median on tree metrics. The algorithm proceeds from the leaves of the tree up to the root. He observes that it retains the same time bound when applied to forest metrics. He cites a previous result showing an O(kn)-time algorithm for instances in which the input graph is a path. In each of these three cases, the solution is optimal and solvable in polynomial time.

The usefulness of tree metric solutions isn't limited to cases where the metric is given as a tree. A common application of this result is to form approximations for general metrics by embedding those metrics into trees and then solving the problem optimally on the tree, or on the probability distribution over trees produced. This results in an approximation factor of exactly the distortion produced by the embedding. In 1996, Bartal  provided a method to embed any metric space into a probability distribution over trees in such a way that the expected distortion is O(log² n). That is, if d(x, y) is the distance between x and y in the original metric, the distance between them in the newly formed tree metric is expected to be O(log² n) · d(x, y). The trees formed are k-hierarchically well-structured trees ("k-HSTs"). They are
rooted, weighted trees formed in such a way that, for all nodes, the weight of the edge to any given child is the same and is a decrease of at least a factor of k from the edge that extends from its parent to it.

In 1998, Bartal  improved this to O(log n log log n) distortion and also provided a deterministic version of the result. This was done by instead computing hierarchically partitionable metrics ("HPMs"), in which the graph is given lengths such that, on a cut in the graph, the lengths crossing the cut are at least as much as the diameter, and each partition formed by the cut obeys this rule as well, with a (naturally) smaller diameter. The HPMs are then translated to trees.

There exist metrics for which any tree embedding must have Ω(log n) distortion, and for a few years the gap remained open in general graphs. In 2003, Fakcharoenphol, Rao, and Talwar  closed the gap by providing embeddings of general metrics into trees with O(log n) distortion. Rather than constructing the set of trees directly, as Bartal had done, they decompose the graph by growing clusters' diameters, cutting crossed edges with probability based on the length of the edge relative to the diameter. Once the graph is decomposed in this manner, the tree distribution can be grown from the decomposition.

Before the gap was closed, Charikar et al  gave an approximation by derandomizing the process that produced the tree distribution. They did this by solving the linear program relaxation of the original problem and building a set of trees such that the fractional solution is exact on the distribution of trees produced. The rounded solution from here produces an O(log n log log n) approximation, which can be improved to O(log k log log k) with an idea from . As this was achieved prior to the O(log n) embedding of , there is no explicit claim as to whether or not O(log k) is possible through this approach. I believe it is possible, although I leave it unproven, as solutions with far better factors than O(log k) exist for k-median.

C. Arbitrary Precision within Special Metrics

Arora et al  give the first arbitrary-precision approximation for planar k-median in the form of a PTAS. Their solution takes time n^(O(1+1/ε)) and has an expected approximation ratio of 1 + ε with constant probability. Their algorithm works by dividing the plane into a point-quadtree and using a dynamic programming approach on that result, limiting how assignments may be made outside of their individual sectors within the quadtree. This is a generalization of the approach Arora used to solve Euclidean TSP to arbitrary precision.

In the event of a small value of k, a slightly different approach may be used. Instead of restricting association out of a segment of the quadtree, the median locations themselves may be approximated. This results in a (1 + ε)-approximation with runtime n(O(log(n/ε)))^(2k), a notable improvement when k is small.

D. Bicriteria Approximations

The first polynomial-time approximation algorithm with a provable performance guarantee is due to Lin and Vitter. Their algorithm provides a [1 + ε, (1 + 1/ε)(ln n + 1)] approximation on k-median. They also have a [2(1 + ε), 1 + 1/ε] approximation. The latter requires additional input in the form of a bound on the optimal cost. While this approach uses linear program relaxation as a starting point, it does not employ traditional rounding; for example, some variables may be set to zero in the relaxed linear program solution but be set to 1 by the algorithm before the solution settles. They do this by filtering the results, guaranteeing that each vertex is assigned to a center in its approximate neighborhood.

Korupolu et al  provide an analysis of k-median solved by a local search heuristic. Local search is a technique that starts with some plausible solution, perhaps obtained by another approximation algorithm or by an arbitrary starting point, and incrementally improves the solution until no immediate improvement can be found. In the case of k-median, any subset of N of size k is a possible solution, although not necessarily a good one. It is possible to improve the solution by taking some p medians (1 ≤ p ≤ k) and p non-medians and "swapping" them, making the medians into non-medians and vice versa. All n data points are then reassigned to their nearest of the new set of medians and the solution cost is updated. If any of the possible improvements leads to an improved solution cost, that becomes the new temporary solution; in the event that multiple lead to improvement, the best improvement is chosen. If none lead to improvement, the solution whose cost could not be improved upon is the final result. Note that this finds a local optimum, which is not necessarily the global optimum. The approximation guarantee is based on the worst-possible local optimum as compared to the global optimum. This is known as the locality gap.

Korupolu et al consider local search with one swap and provide a guarantee of the user's choice of [1 + ε, 10 + 17/ε] or [Θ(1/ε³), 5 + 1/ε]. The guarantee is formed by showing that, for any solution state sufficiently more expensive than the optimal, there is a state one swap away that has a lower cost. This essentially proves a bound on the locality gap, thus showing that upon termination, local search is within that bound of optimal. This also helps to show that it takes polynomial time to run, as there are a polynomial number of swaps possible at each stage and polynomially many can happen before convergence. The chosen bound β on the number of medians determines the starting position of the local search: any set of βk points.
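The generic swap-based local search just described can be made concrete. Below is a minimal single-swap (p = 1) sketch in Python; it illustrates the generic heuristic rather than the exact procedure of Korupolu et al, and the toy point list and distance function are made up:

```python
def cost(points, medians, dist):
    """k-median objective: sum of each point's distance to its nearest median."""
    return sum(min(dist(p, m) for m in medians) for p in points)

def local_search_single_swap(points, k, dist):
    """Start from an arbitrary k-subset and repeatedly apply an improving
    swap of one median for one non-median until no swap improves the cost."""
    current = list(points[:k])            # arbitrary starting solution
    best = cost(points, current, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in points:
                if cand in current:
                    continue
                trial = current[:i] + [cand] + current[i + 1:]
                c = cost(points, trial, dist)
                if c < best:              # found an improving swap
                    current, best = trial, c
                    improved = True
                    break
            if improved:
                break                     # rescan from the new solution
    return current, best

# Toy instance on the real line: two natural clusters.
pts = [0.0, 1.0, 2.0, 10.0, 11.0]
sol, c = local_search_single_swap(pts, 2, lambda x, y: abs(x - y))
```

On this instance the search happens to reach the optimal medians; in general it only guarantees a local optimum, which is exactly what the locality-gap analysis above bounds.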
Indyk  gives an algorithm that produces a [(1 + γ)3(2 + α), 2β] approximation with probability Ω(γ), for some confidence parameter γ > 0 and some [α, β] approximation for k-median that runs in time Õ(n²), such as any mentioned in this paper with such a guarantee. The error probability can be reduced to a constant with O(1/γ) repetitions of the algorithm, taking the smallest-cost solution produced. The runtime of a single iteration is Õ(nk). It works by taking a single, sufficiently large (O(√(nk))) sample of points in N and running the provided [α, β] approximation algorithm on the sample. From the first sample, the fraction of points with the largest distances to their medians is separated, and the black-box algorithm is run again on these. The solution is the output of both runs of the black-box algorithm, for a total of 2βk medians.

E. Constant-factor Approximations

The first constant-factor approximation that is exact on the number of medians used is due to Charikar et al  in 2001. The algorithm produces a 6⅔-approximation by linear program relaxation. From the linear program solution, a collection of trees is built and the problem is then solved optimally on the newly formed forest metric.

More specifically, they begin by consolidating nearby data points, modifying the demands so as to not change the linear program solution. Without changing the cost of the fractional solution, they are able to form a semi-sparse data set, with all positive demand points well apart from one another. They then modify the linear program solution on these to consolidate nearby fractional centers. The rounding, viewed as a solution over a forest metric, produces the approximate solution.

Jain and Vazirani  present a 6-approximation. This, on its own, is not the key contribution of their paper, as Arya et al gave a (3 + ε)-approximation the same year. Jain and Vazirani, however, introduce the use of the primal-dual schema in the algorithm, instead of relying on solving a linear program explicitly and forming an integral solution from there. In addition to giving an improved runtime, this technique also permits improvements based on the structure of the problem - an advantage absent in traditional linear-program-based solutions. This leads to both more efficient solutions and solutions that are strictly combinatorial - eliminating the linear-program-based portions of the algorithm. This technique can apply to other problems as well and is thus relevant outside of k-median.

Another technique is to use facility location as a subroutine. For illustration purposes, let us first consider how this would work if we were to attempt to solve the problem exactly. Our goal is to find k medians, and facility location does find medians, but it does not allow us to specify an explicit limit on the number of medians. Instead, it allows us to specify the cost per facility. We can use a solution to facility location as a black box and do a binary search on the cost-per-facility input in order to find a cost that yields exactly k facilities. For any given attempt at a cost per facility, if fewer than k facilities are opened, we can conclude that the facilities were overpriced and lower the cost for the next iteration (using a classic principle from economics: to sell more of something, lower the price). Alternately, if facility location reports too many medians, then the price of facilities is too low. Eventually, assuming facilities may have any real cost, we will converge on a solution in which k facilities are opened; this is our solution.

Note that each time we call facility location, we have essentially relaxed the requirement that exactly k medians be opened by moving the requirement into the objective function as a cost. This is the essence of Lagrangian Relaxation, introduced by Jain and Vazirani , and applied to produce an approximate solution. The realization for this came by observing that the k-median algorithm of Lin and Vitter uses ideas from constant-factor approximations for facility location. The Lagrangian Relaxation technique formalizes the relationship between these problems.

A problem arises in this, however, because facility location isn't solved exactly by our subroutine: if we were going to spend the time to solve it exactly, we could just solve k-median itself directly. Furthermore, it isn't immediately obvious that an approximate solution to facility location on the new cost metric is a bounded approximation on k-median. We need to be more careful. In this case, if a solution with exactly k medians isn't found, either due to precision or a deliberate decision to limit the number of iterations, we will still have two surrounding cases, one with more medians and one with fewer. Accepting a small loss in the approximation factor, the two can be combined via a convex combination and randomized rounding. In general, the Lagrangian Relaxation technique, as described by Jain and Vazirani, is to take any hard constraint, such as the number of medians, and move it from the linear programming constraints to the objective function, adding it (times some multiplier) to any objective function already there. The problem is then solved as a subroutine, either explicitly or using a known result for the relaxed problem, and the solution is interpreted as a solution to the original.

Lagrangian Relaxation can also be applied to solve problems where two related problems are joined by a k-median-type constraint. Jain and Vazirani  give the example of needing two types of facilities (schools and hospitals) and clustering by connecting points to one of each of these, but only being able to build k total facilities. The Lagrangian Relaxation of this problem becomes two facility location problems.

Charikar and Guha  improve the result of  to give a 4-approximation in Õ(n³) time. After the small and large solutions are found, an additional augmentation step is performed by moving medians from the large solution to the small. This is made possible by their observation that the dual variables are continuous and small changes in the facility cost are bounded in their effect on the dual variables.

A better approximation was found later the same year by Arya et al  using local search, improving over the previous local search heuristic of . Arya et al show that if p swaps are allowed at each stage, the approximation guarantee is 3 + 2/p. With only one swap permitted at each stage, this is a 5-approximation, with additional swaps taking additional time but providing an improvement on the solution quality. Any number of swaps per stage yields an improved solution quality over the 6⅔-approximation of .
The proof of the single-swap case revolves around the idea of "capturing" an optimal median. An optimal median is captured by some element in some solution if more than half of the points assigned to the optimal median in the optimal solution are being serviced by the capturing point. Note that the capturing point is bad for us. We are fortunate in that optimal facilities can be swapped in exactly once, and any point that captures exactly one optimal point will be swapped out for the optimal point by local search. Furthermore, if a point captures two or more optimal points, it will not be swapped out by a single swap - it is these points that cause the locality gap. Using these facts, an upper bound on the cost increase by swaps made or not made can be proven. Combining this with the limit on how many swaps may occur during a run of local search gives us the locality gap.

For the case of multiple swaps, the notion of capture is extended to any subset of the current solution state. A subset captures a set of optimal points: namely, it captures any optimal point for which the subset covers half or more of the points that the optimal point covers in the optimal solution. Any subset making a capture is termed bad, as before. Note that for any subset of size one, the definitions of capture are identical. Because each iteration of the local search is able to remove a bad median from the current solution state, and the number of times a median can be swapped in or out continues to be bounded, the approximation guarantee of 3 + 2/p can be shown.

V. ALTERNATE FORMS OF INPUT

As k-median is important to large-scale data mining, it has also been studied as an online problem, as streaming, and as sampling, all of which are necessary due to the exponential growth in available data.

A. Online solutions

Mettu and Plaxton  provided a constant approximation for the online k-median problem. The algorithm runs in time O(n² + nl), where l is the number of bits necessary to represent the distance between two points in the metric space. They show that a traditional greedy algorithm, selecting the next median so as to minimize the objective cost given the previous picks, has an unbounded competitive ratio. Theirs, instead, is hierarchically greedy: it picks a point greedily, but rather than settle on that point, it confines its next search to that point's neighborhood. It repeats this process until there is only one point left to pick; this takes O(log(∆/δ)) repetitions. While the constant-competitiveness of this algorithm is useful in its own right, it is also of particular use in that it can be run once and then queried for multiple values of k without reprocessing the input.

B. Streaming-based solutions

In many cases, it is not practical to store all the data and read it repeatedly. For instance, if customer data, such as purchase information, were to be clustered and the then-current k medians output nightly, reading all the previous data before clustering becomes prohibitively expensive and redundant after a short period of time. Streaming algorithms maintain an approximate solution state and can output a partial solution along the way as requested. In addition to being useful for information that may literally be streaming in real time, these algorithms are also useful for cases where random access to data is expensive or non-existent, or where multiple reads are prohibitively expensive. This also applies to database-related cases when the query to be executed references more data than can fit in main memory .

The streaming algorithm of Guha et al  reads the data m points at a time, clustering them using some [α, β] bicriteria approximation algorithm, and repeats, remembering the weighted medians as "level-1 medians." When m weighted medians are stored at level i, the weighted points are clustered into βk level-(i+1) medians. When the program is done reading all of the input - or we wish to cluster based on what has been seen so far - all medians seen so far are clustered into k final medians. The result is a constant approximation.

The paper also gives a one-pass, O(nk log n)-time, O(n^ε)-space algorithm. It uses the local-search algorithm of Arya et al  as a subroutine and produces a constant approximation ratio for small values of k.

Charikar et al  improve on Guha's streaming result. Their streaming algorithm uses O(k polylog n) space and achieves a constant approximation with high probability. It uses online facility location, due to  and described later in this paper, as a subroutine. By bounding the facility cost and the number of medians (facilities) produced by the online facility location algorithm, they are able to get performance guarantees for their streaming algorithm. To guarantee a constant-factor approximation on cost from that algorithm, O(k log n) medians are selected instead, as they cannot shuffle the input to get a constant approximation. When the data stream has been processed in its entirety, the O(k log n) weighted medians are consolidated into exactly k. The algorithm takes one pass over the data, provides a constant approximation with probability 1 − 1/poly(n), and uses only the space necessary to represent O(k log² n) points from the stream.

Charikar et al also describe a solution to the asymmetric k-median problem, in which it is not guaranteed that d(x, y) = d(y, x), although reflexivity and the directed triangle inequality continue to apply. Their non-streaming solution is essentially to grow radii among points to cover additional uncovered points at the cheapest cost. This is similar to a greedy approach used for k-center and works only if ∆ is known, in order to give a step size by which to iterate. This provides an [O(1), 2k log ∆]-approximation.

Their streaming result for the same is obtained by showing that the non-streaming algorithm can be modified to take less memory and proceed over O(log ∆) passes. The new algorithm uses O(k² log ∆) space.
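The level-wise scheme of Guha et al described earlier in this subsection can be sketched as follows. The weighted clustering subroutine below is a brute-force stand-in for the [α, β] bicriteria black box (with β = 1 for simplicity), and the chunk size m and toy stream are made up:

```python
from itertools import combinations

def weighted_kmedian(wpoints, k, dist):
    """Stand-in clustering subroutine: brute-force k-median on weighted
    points; each returned median carries the weight it absorbed."""
    def wcost(K):
        return sum(w * min(dist(p, m) for m in K) for p, w in wpoints)
    K = min(combinations([p for p, _ in wpoints], k), key=wcost)
    weight = {m: 0 for m in K}
    for p, w in wpoints:
        weight[min(K, key=lambda m: dist(p, m))] += w
    return list(weight.items())

def stream_kmedian(stream, k, m, dist):
    """Guha et al-style streaming: buffer m points, reduce them to k
    weighted level-1 medians, and whenever a level accumulates m medians,
    reduce those to k medians one level up."""
    levels, buf = [[]], []
    for p in stream:
        buf.append((p, 1))
        if len(buf) == m:
            levels[0].extend(weighted_kmedian(buf, k, dist))
            buf = []
            i = 0
            while len(levels[i]) >= m:
                if i + 1 == len(levels):
                    levels.append([])
                levels[i + 1].extend(weighted_kmedian(levels[i], k, dist))
                levels[i] = []
                i += 1
    # Final clustering of everything retained so far into k medians.
    leftovers = buf + [wm for level in levels for wm in level]
    return [p for p, _ in weighted_kmedian(leftovers, k, dist)]

medians = stream_kmedian([0, 1, 2, 10, 11, 12] * 2, k=2, m=6,
                         dist=lambda x, y: abs(x - y))
```

The point of the design is that only O(m) weighted medians per level ever live in memory, which is what makes the one-pass, small-space guarantees possible.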
Streaming solutions are also applied to cases where old data needs to expire and is no longer considered relevant. Babcock et al  present an algorithm for k-median over sliding windows, in which only the R most recent points are relevant, can be chosen as medians, and affect the optimal solution. Any data points before the R are deemed to have expired. Furthermore, the memory requirements of streaming remain in effect. Their initial solution takes O((k/τ⁴) R^(2τ) log R) memory and produces a [2^(O(1/τ)), 2]-solution for τ < 1/2, chosen by the user for trade-off purposes. This can be extended to produce exactly k medians with the same approximation factor and memory requirement. As in the algorithm of Guha et al, medians at intermediate levels are maintained, as are buckets ordered from oldest to newest for expiration purposes. Each bucket is split into the different levels, as per Guha's algorithm. An overestimate of the clustering cost of the bucket is also maintained. Each median must also track approximately how many active points are in its cluster. When groups of medians are clustered, they are weighted according to the number of active points each contains.

For the case of streaming data that arrives in an online fashion, this can be extended with the same memory requirement and O(k) time to update at each additional arriving point.

C. Sampling-based solutions

With the rapid growth of stored data, sometimes even linear-time algorithms are impractical. As such, it becomes important to condense data for consumption and analysis. Sampling-based algorithms solve this by guaranteeing an approximation ratio in time sublinear in the size of the data. As this prohibits even a full reading of the data, assumptions about it must be made, such as those about the quantity and importance of outliers as well as the distribution of the relevant data.

For any sampling solution, the following result is useful:

    Let S ⊆ N and K ⊆ S be the optimal k-median solution on S. K is a 2-approximation of what the k-median solution on S would be if any point from N, rather than only the points in S, could be selected as medians when solving k-median on S.

A version of this fact is cited in  and in . The fact is useful in any sampling solution to k-median for two reasons: First, if we consider N to be a subset of an infinite metric space - which it is - then any solution on N is only a factor of two away from an unconstrained solution and has a much more confined search space. For example, if we are trying to

assumptions. First, the assumption must be made that data points can be selected uniformly at random without needing to read the entire input in the process; if this assumption does not hold, the algorithm might as well be a streaming algorithm, as reading the data on its own will take linear time. Also, if the sample is not obtained by randomness, an adversary could deliver a misleading sample that ignores even big clusters. It is also necessary to assume that the data is either clustered by OPT into reasonably large clusters (Ω(n/k) in size) or that the problem being solved is actually k-median with outliers. Alternately, the diameter of the problem could be small. If these are not the case, then a small cluster of outliers could hide from the sampling, and a linear scan of the data would be needed to detect this, again defeating the benefits of sampling. This assumption, in any form, essentially prohibits a small subset of the points from unduly affecting the optimal solution. Finally, in any sampling-based solution, the k medians themselves must be an acceptable output of the program, as outputting an explicit mapping N → K would take O(n) time, even if the solution were magically produced.

Mishra et al  were the first to apply this technique to get an approximation algorithm for k-median that runs in sublinear time. Specifically, they show that, given an α-approximation to k-median (such as, say, any algorithm from Section IV-E of this paper), they can produce a (2α + ε)-approximation with probability arbitrarily close to one. If the metric is Euclidean space (ℝ^d), the approximation factor is α + ε.

The algorithms they present also assume that the diameter of the space is known, although this can be estimated if it is not known a priori. Because any sample of sufficient size will behave like the full set for some classes of functions, and because they are able to show that computing the diameter is such a function, computing the diameter of a sufficiently large sample provides an estimate, if one is needed.

To prove the approximation bounds, they show that the ratio of the average distance of a point to the closest approximate median in the sample converges quickly to the same ratio over the full set.
ﬁnd a set of k customers that are indicative of our customer In order to make these guarantees, however, several trade-
data set, we are not signiﬁcantly worse by insisting that our offs for runtime, accuracy, and probability of guarantee are
output be real customers rather than a ﬁctional customer that necessary. A larger diameter necessitates a larger sample, to
is somehow more average. prevent outliers from hiding. Increased accuracy or probability
Also, if we do a sampling-based algorithm for applications of success also require more elements to be sampled. The
where reading the entire input is infeasible or intractable, we running time is directly dependent on the size of the sample,
can restrict our attention to the sample for that portion of the so increases in these directly lead to increased runtime, just
program, rather than the entire input, which would defeat the as additional input size does.
point of sampling in the ﬁrst place. Czumaj and Sohler  improve on this result by removing
Furthermore, statistics research has shown that sufﬁciently the dependence on the input size and reducing the dependence
large samples of data converge to behaving the same as the on the diameter in the sampling. Using a smaller sample,
full data set under some characteristic conditions. In essence, they are able to provide a 2(α + ε) approximation to the
this permits that the sample studied need only represent the problem in a signiﬁcantly improved runtime. They also extend
behavior of the entire data set rather than represent the entire the improvement to the Euclidean space metric, providing the
data set . This makes selecting and demonstrating the same α + ε guarantee from the original.
validity of the sample much easier. Meyerson et al  present a sampling-based algorithm to
It is worth noting that any sampling-based (or other solve k-median, within a constant factor, with high probability.
sub-linear) algorithm to solve k-median must make several They will ﬁrst generate several sample sets of size Ω( k log k),
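The sample-restriction fact quoted above is easy to check numerically. The following sketch is ours and purely illustrative: a tiny one-dimensional instance solved by brute force, comparing the optimal medians drawn only from a sample S against optimal medians drawn from all of N, with both costs measured on S.

```python
import itertools
import random

def cost(points, medians):
    # k-median objective: each point pays the distance to its nearest median
    return sum(min(abs(p - m) for m in medians) for p in points)

random.seed(0)
N = [random.uniform(0, 100) for _ in range(12)]  # full data set (1-D metric)
S = random.sample(N, 6)                          # uniform random sample
k = 2

# Optimal k-median on S with medians restricted to S
best_in_S = min(itertools.combinations(S, k), key=lambda K: cost(S, K))
# Optimal k-median on S if any point of N may serve as a median
best_in_N = min(itertools.combinations(N, k), key=lambda K: cost(S, K))

ratio = cost(S, best_in_S) / cost(S, best_in_N)
print(ratio)  # by the triangle inequality, always between 1 and 2
```

However the random draws fall, the ratio never exceeds 2, matching the stated bound.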
Meyerson et al [] present a sampling-based algorithm to solve k-median, within a constant factor, with high probability. They first generate several sample sets of size Ω(k log k), uniformly at random, from the data and approximate the solution to each sample with the algorithm of Arya et al [], choosing the best sample to output. Of course, the decision of which is best will itself be approximate: determining the actual cost would take Ω(n) time. Instead, they approximate it within a constant factor by checking each candidate set against several more random samples, showing that the "best" k medians chosen this way are a constant approximation of the true best among the choices. This works well on data with large clusters, and if the algorithm of Charikar et al [] is used as a subroutine instead of the Arya algorithm, the result works for k-median with outliers. They also show the trade-offs between sample size and minimum cluster size; if larger samples are allowed, then the required minimum cluster size in OPT may grow as well.

VI. ALTERNATE VERSIONS OF THE PROBLEM

A. k-Median with Outliers

The k-median problem with outliers is a natural formulation for situations where a small fraction of the data consists of errors of some sort - noise or otherwise corrupted data - and these errors can disproportionately affect the optimal solution cost. If the solution identifies the outliers as well, they can be subject to manual intervention: removal if the data point is noise, investigation if it is an anomaly, et cetera. The problem is also worthy of theoretical study as an example of a k-median problem with multiple global constraints.

Charikar et al [] provide an algorithm for k-median with outliers, giving a (1 + ε)-approximation on the number of outliers and a 4(1 + 1/ε)-approximation on cost. Their algorithm is based on the use of their 4-approximation to prize-collecting k-median. They are able to approximate the optimal cost C* to within a factor of ε; by setting the penalty in prize-collecting k-median according to this estimate, they obtain a solution to k-median with outliers with the given bounds.

Chen [] presents the first polynomial-time constant approximation for k-median with outliers. Chen's algorithm uses Lagrangian relaxation, as described in []. In this case, the Lagrangian relaxation of the problem is facility location with outliers. Two solutions are found in the Lagrangian relaxation phase: one with too few centers and another with too many. If the one with too many has at least k + 2 centers, he cannot use the merge step of [] and must use his own greedy algorithm.

If exactly k + 1 centers are used by the larger solution, then the smaller solution no longer has a provable cost bound, so a solution starting from the larger solution must be found. Lagrangian relaxation can be applied here again: consider k-median with outliers extended with the ability to exclude more than the pre-specified number of outliers by paying some penalty. This is the same problem if the penalty is infinite, but a Lagrangian relaxation otherwise. Running a local search as a subroutine, starting from the larger solution, and gradually increasing the penalty for additional medians yields the desired approximation. The idea of the local search is to find a set of successor states such that a constant approximation is guaranteed to be among them. Because the penalty is increased slowly, the local search tracks the optimal value to the end. At each stage of the penalty value, the previous solution is used as a starting point.

Charikar et al [] present a one-pass solution for k-median with outliers that is exact on the number of medians. To produce this result, they draw a sufficiently large sample, run an [α, β]-approximation on that sample, and interpret the solution on the sample as a solution to the full problem. This is based on the observation that a sufficiently large random sample of N will cluster about the same way that N does. It uses O(k log n) space and, with high probability, produces an approximation on N without a significant increase in cost, number of medians, or number of outliers over the bounds produced by the subroutine.

B. Prize-Collecting k-median

Charikar et al [] are able to produce a 4-approximation to prize-collecting k-median through Lagrangian relaxation, by using their algorithm for facility location with penalties and adding centers from the larger solution to the smaller solution through various selection rules. This is similar to the approach used by many algorithms described earlier in the paper.

They also note that the PTAS for the Euclidean version can be extended to Euclidean prize-collecting k-median, as per the results of [].

C. Capacitated k-median

The result of Korupolu et al [] can be applied to capacitated k-median with splittable demands, providing bicriteria approximations of [1 + ε, 12 + 17/ε] and [Θ(ε³), 5 + 1/ε]. In both cases the capacity requirement is never violated. As with uncapacitated k-median, the desired bound on the number of medians determines the starting point of the search as well as the approximation guarantee.

The result of Arora et al [] can be extended to Euclidean capacitated k-median by adding a capacities dimension to the dynamic programming table. Through this approach, Euclidean capacitated k-median may be approximated to 1 + ε in n^O(log ε) time.

Pál et al [] give a local search solution to facility location with nonuniform hard capacities and achieve a 9-approximation. The hard capacity limit means that the capacity cannot be exceeded by any amount - approximations on that factor are not acceptable. The local search is similar in its starting point to that used for k-median; however, additional moves are allowed between states. In addition to swapping a facility, adding or removing a facility is permitted. The algorithm can still be shown to terminate in polynomial time.
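The flavor of such a local search - repeatedly trying add, remove, and swap moves until none improves the objective - can be sketched for the simpler uncapacitated case. This is an illustrative toy under our own simplifying assumptions (a one-dimensional metric, uniform opening costs), not the algorithm of Pál et al:

```python
import random

def total_cost(points, facilities, open_cost):
    # opening cost plus each point's connection cost to its nearest open facility
    return (open_cost * len(facilities)
            + sum(min(abs(p - f) for f in facilities) for p in points))

def local_search(points, open_cost):
    current = frozenset([points[0]])  # start with one arbitrary facility open
    while True:
        moves = []
        for p in points:
            moves.append(current | {p})                # add a facility
        for f in current:
            if len(current) > 1:
                moves.append(current - {f})            # remove a facility
            for p in points:
                moves.append((current - {f}) | {p})    # swap a facility
        best = min(moves, key=lambda F: total_cost(points, F, open_cost))
        if total_cost(points, best, open_cost) >= total_cost(points, current, open_cost):
            return current                             # local optimum reached
        current = best

random.seed(1)
pts = [random.uniform(0, 100) for _ in range(15)]
sol = local_search(pts, open_cost=10.0)
```

Each iteration strictly lowers the objective, so the search terminates at a local optimum; the capacitated analysis of Pál et al shows that, with hard-capacity-aware moves, such a local optimum is within a constant factor of optimal.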
VII. RELATED PROBLEMS

A. k-Center

In k-center, the goal is to select k points so as to minimize the maximal distance from any given point to its cluster center. A good way to think of this is as setting up wireless access points and minimizing the broadcast radius needed to serve all points. It is NP-Hard to approximate k-center to a constant factor better than two [], and thus this result is tight. For the optimization version, minimizing the maximum radius is NP-Hard [].

Dyer and Frieze [] give a simple heuristic for k-center. They take the point with the heaviest weight as the first center, and each successive center is chosen greedily as the point whose distance to its closest center is maximal. This achieves an approximation ratio of 3, and if the ratio of the largest weight to the smallest weight is less than two, the approximation factor is one plus the weight ratio. Note that this algorithm also works if k-center is cast as an online problem with k arriving online, as with online k-median [].

Hochbaum and Shmoys [] provide the best possible constant-factor approximation to the metric k-center problem. Theirs is a 2-approximation on the radius, uses an exact number of centers, and is based on an observation from linear programming duality. The observation is that a dominating set of size k in the square of the graph corresponds to a k-center solution in the original. This, too, is NP-Hard to compute, but a feasible dominating set can be computed from a strong stable set, which can be approximated in a greedy fashion.

Charikar et al [] present a one-pass solution for k-center with outliers that, with high probability, produces an α-approximation, approximating the number of outliers to within a factor of (1 + ε)². It is based on the same ideas as their solution to k-median with outliers (see above). It requires, as input, an algorithm that can solve k-center with outliers to within a factor of α while being exact on the number of outliers; such an algorithm is given in [] and is described above. The advantage of this algorithm over the one used as input is that it is faster, with a loss in approximation only in the quantity of centers used.

Another paper of Charikar et al [] gives an O(n³)-time 3-approximation to k-center with outliers. The k-center problem with outliers cannot be approximated to better than 3 unless P = NP.

B. k-means

The k-means problem is very similar to k-median, except that the objective function to minimize is the sum of squared distances from each point to its center.

By far the most common technique used in practice to solve k-means is due to Lloyd in 1982 []. The technique is similar in spirit to local search: k means are selected randomly and the appropriate clustering for that state is computed. Rather than searching all possible swaps, however, the next solution state is the set of centers of mass of the k clusters, taking advantage of the closed-form nature of the k-means objective. Those centers are then the basis for a re-clustering, and the process repeats until the solution stabilizes. The solution could, in principle, take a great while to converge, but in practice it runs very quickly.

It is perhaps confusing that Lloyd's algorithm - referred to generally as just k-means - remains in practice despite having no provable guarantees on either the running time (beyond the k^n possible states) or the solution quality (there are examples on which it can be arbitrarily bad) []. Initial attempts by the theory community to resolve this included exact algorithms [] and arbitrarily accurate approximations [], although k-means remained in regular use.

Recently, two independent results attempted to resolve this impasse through small modifications to k-means that provide it with provable solution quality guarantees and faster runtimes, both provably and in practice.

The first is due to Ostrovsky et al []. They show that k-means does well when a good clustering exists for the parameters given: that is, when the data naturally has a clustering for the given value of k. They give a method for seeding the Lloyd algorithm that is similar to the k-center solution of [] and show that only one iteration of local search is then necessary to provide a constant bound on solution quality for such data. This is naturally faster than Lloyd's algorithm, as no additional iterations are necessary.

The second, an algorithm named k-means++, is due to Arthur and Vassilvitskii []. Here the starting solution state is chosen carefully, rather than randomly as in k-means, before traditional k-means is run for the rest of the algorithm. The first center is chosen uniformly at random, and each successive center is chosen with probability proportional to the square of its distance to its nearest chosen center. This is known as D² weighting. D² weighting alone - without the local search step - already guarantees an approximation factor of Θ(log k), although the practical solution quality is improved by the local search. It can be shown that if this is run 2k times, the best solution found is likely to be within a constant factor of optimal. Their empirical results on both synthetic and natural data sets showed k-means++ to be a clear improvement in both running time and solution quality over k-means.

C. Facility Location

Facility location is similar in goal to k-median, but there is no explicit bound on the number of medians that may be selected. Instead, each point that is not a median pays a cost to connect to its nearest median, as before, and any point may be designated as a median by paying a given cost. This cost may or may not be uniform across all points. The problem can be thought of as placing warehouses for distributing goods to a chain of stores.

Mettu and Plaxton [] give an offline facility location solution using the hierarchically greedy model that was used in their solution to online k-median. Theirs is a 3-approximation and runs in time O(n²).
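For concreteness, the facility location objective that these algorithms approximate can be evaluated by brute force on a tiny instance. The instance and its uniform opening cost below are our own, and real algorithms avoid this exponential enumeration:

```python
import itertools

def fl_cost(coords, open_idx, open_cost):
    # total cost = opening costs plus each point's distance to its nearest open facility
    return (sum(open_cost[i] for i in open_idx)
            + sum(min(abs(p - coords[i]) for i in open_idx) for p in coords))

coords = [0.0, 1.0, 2.0, 10.0, 11.0, 30.0]        # three natural clusters
open_cost = {i: 5.0 for i in range(len(coords))}  # uniform opening cost

# exhaustive search over all nonempty facility sets
best = min(
    (set(s)
     for r in range(1, len(coords) + 1)
     for s in itertools.combinations(range(len(coords)), r)),
    key=lambda s: fl_cost(coords, s, open_cost),
)
print(sorted(coords[i] for i in best), fl_cost(coords, best, open_cost))
# → [1.0, 10.0, 30.0] 18.0
```

The optimum opens one facility per natural cluster: three openings at cost 5 each, plus total connection cost 3.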
Charikar and Guha [] provide a local-search-based solution that approximates facility location within 2.414 + ε in Õ(n²) time, or gives an approximation tradeoff of (1 + γ, 1 + 2/γ) between facility cost and service cost. This is near-optimal for tradeoffs, as no algorithm can achieve a better tradeoff than (1 + γ, 1 + 2/γ). Combining [] with another previous result and adding their own observation about scaling yields an approximation ratio of 1.728. The combination is important, because no algorithm based solely on the primal-dual technique of [] can do better than 3 - ε.

The current best-known approximation for facility location is a factor of 1.519, due to Mahdian et al []. The approach used is an integration of many approaches used for both k-median and facility location.

Meyerson [] provides online algorithms for facility location under different models. If the facilities have uniform costs, then the arrival of each point prompts a decision: shall we build a facility here? He proposes randomized decision making for this case. If it would be more expensive to connect this point to any currently open facility than to open a facility here, then we open one here; otherwise, we open one here with probability equal to the ratio of the cost of connecting to the nearest facility to the cost of opening a facility here. If the points arrive in a random order, this is constant-competitive. However, if the adversary may choose the order in which the points arrive, the algorithm is O(log n)-competitive.

If facilities can have non-uniform costs, the above algorithm no longer works, as the current location may be arbitrarily expensive. Instead, the metric space and associated facility costs are given in advance, and when we choose to open a facility, we may do so anywhere. We must still remain competitive on the points seen so far, so we cannot simply solve the offline version and call it an online solution. By rounding the facility costs to powers of two and considering multiple choices of where to open the next facility, including not opening one at all, he gets a constant-competitive algorithm for randomized arrival order and O(log n) for the adversarial case. In both cases, it can be shown that no online algorithm can be constant-competitive for adversarial order.

The online solution is also useful as a solution to the offline problem. On its own, it can provide a constant approximation in O(n²) time by shuffling the input points and then processing each point in the new order as though it were an online arrival. While the approximation factor delivered is worse than the guarantees made by other algorithms, this is much faster. Furthermore, it can be used as a subroutine in local-search-based solutions in order to start the algorithm at a good solution. When applied to [], described above, the same 1 + √2 + ε guarantee can be made in O(n) time.

Fotakis [] expanded on this result to show that no randomized algorithm can achieve a competitive ratio better than Ω(log n / log log n) on any metric space. He also gave a deterministic version of Meyerson's algorithm and showed that it achieves a competitive ratio of O(log n / log log n).

VIII. OTHER NOTIONS OF CLUSTERING

All of the algorithmic approaches discussed thus far use a mathematical formula to evaluate the quality of the clustering solutions produced. These work well in general and for many types of data, although each has examples for which the cost metric is easily fooled; that is, there is a natural and correct way to group the example data - that is the point of clustering, after all - which the problems, even in their exact solutions, fail to find.

Recently, Kannan et al [] proposed a new bicriteria formulation to determine the quality of a clustering solution. Here the input becomes a similarity graph for the data, with higher edge weights corresponding to vertices that are more similar. A k-clustering partitions the graph into k components, ideally in such a way that the most similar points are in the same connected components. In this view, we can judge the quality of a clustering by the cuts it creates; we want to measure the size of each cut relative to the sizes of the pieces it creates. The quality of a clustering is the minimum quality of the clusters it creates.

This causes a problem in that all vertices are equally important. We would rather give importance to vertices that have many strong similarities. As such, we measure the conductance of a cut as the ratio of the cost of the cut to the similarity that the members on one side of the cut share with the graph.

As with k-median, a few outliers can cause problems. Instead of explicitly choosing to exclude a fraction a priori, we judge a clustering by a second criterion: the fraction of the total weight not covered by the clusters. This allows us to exclude outliers by paying a cost, with a smaller cost corresponding to data that is farther from our desired clusters, and thus more of an outlier.

The problem statement is then: given the similarity graph and one of the two parameters, either minimize the outlier cost while retaining the desired quality, or maximize the quality without excluding more than the given budget for outliers.

ACKNOWLEDGMENTS

I am grateful to Adam Meyerson for providing the starting seeds of papers to read. From these, I was able to find some initial results and additional related papers. I am also grateful for his feedback on successive drafts of this paper.

REFERENCES

Arora, S., Raghavan, P., and Rao, S. Approximation schemes for Euclidean k-median and related problems. Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 106-113. 1998.
Arthur, D. and Vassilvitskii, S. k-means++: The Advantages of Careful Seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. 2007.
Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., and Pandit, V. Local Search Heuristics for k-Median and Facility Location Problems. Proceedings of the 33rd Annual ACM Symposium on Theory of Computing. 2001.
Babcock, B., Datar, M., Motwani, R., and O'Callaghan, L. Maintaining Variance and k-Medians over Data Stream Windows. Proceedings of the 2003 ACM Symposium on Principles of Database Systems (PODS). 2003.
Bartal, Y. Probabilistic approximations of metric spaces and its algorithmic applications. IEEE Symposium on Foundations of Computer Science, pages 184-193. 1996.
Bartal, Y. On approximating arbitrary metrics by tree metrics. Proceedings of the 30th Annual ACM Symposium on Theory of Computing. 1998.
Baumol, W.J. and Wolfe, P. A warehouse location problem. Operations Research. 1958.
Bradley, P.S., Fayyad, U.M., and Mangasarian, O.L. Mathematical Programming for Data Mining: Formulations and Challenges. INFORMS Journal on Computing, special issue on Data Mining. 1998.
Charikar, M., Chekuri, C., Goel, A., and Guha, S. Rounding via Trees: Deterministic Approximation Algorithms for Group Steiner Trees and k-median. Proceedings of the 30th Annual ACM Symposium on Theory of Computing. 1998.
Charikar, M. and Guha, S. Improved Combinatorial Algorithms for the Facility Location and k-Median Problems. Proceedings of the 40th Annual IEEE Conference on Foundations of Computer Science. 1999.
Charikar, M., Guha, S., Tardos, É., and Shmoys, D. A Constant Factor Approximation Algorithm for the k-Median Problem. Journal of Computer and System Sciences. 2002. Originally appeared in Proceedings of the 31st Annual ACM Symposium on Theory of Computing. 1999.
Charikar, M., Khuller, S., Mount, D., and Narasimhan, G. Algorithms for Facility Location Problems with Outliers. Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms. 2001.
Charikar, M., O'Callaghan, L., and Panigrahy, R. Better Streaming Algorithms for Clustering Problems. Proceedings of the 35th Annual ACM Symposium on Theory of Computing. 2003.
Chen, K. A constant factor approximation algorithm for k-median clustering with outliers. Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms. 2008.
Czumaj, A. and Sohler, C. Sublinear-Time Approximation Algorithms for Clustering via Random Sampling. Random Structures and Algorithms. 2007.
Dyer, M. and Frieze, A.M. A simple heuristic for the p-center problem. Operations Research Letters. 1985.
Fakcharoenphol, J., Rao, S., and Talwar, K. A tight bound on approximating arbitrary metrics by tree metrics. Proceedings of the 35th Annual ACM Symposium on Theory of Computing. 2003.
Feldman, E., Lehrer, F.A., and Ray, T.L. Warehouse location under continuous economies of scale. Management Science. 1966.
Fotakis, D. On the Competitive Ratio for Online Facility Location. International Colloquium on Automata, Languages and Programming. 2003.
Guha, S., Mishra, N., Motwani, R., and O'Callaghan, L. Clustering Data Streams. Proceedings of the 41st Annual IEEE Conference on Foundations of Computer Science. 2000.
Har-Peled, S. and Mazumdar, S. Coresets for k-means and k-median clustering and their applications. Proceedings of the 36th Annual ACM Symposium on Theory of Computing. 2004.
Hochbaum, D.S. and Shmoys, D.B. A Best Possible Heuristic for the k-Center Problem. Mathematics of Operations Research. 1985.
Inaba, M., Katoh, N., and Imai, H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. SCG '94: Proceedings of the Tenth Annual Symposium on Computational Geometry. 1994.
Indyk, P. Sublinear Time Algorithms for Metric Space Problems. Proceedings of the 31st Annual ACM Symposium on Theory of Computing. 1999.
Jain, K. and Vazirani, V. Approximation Algorithms for Metric Facility Location and k-Median Problems Using the Primal-Dual Schema and Lagrangian Relaxation. Journal of the ACM. 2001.
Jain, K. and Vazirani, V. Primal-dual approximation algorithms for metric facility location and k-median problems. Proceedings of the 40th Annual Symposium on Foundations of Computer Science. 1999.
Kannan, R., Vempala, S., and Vetta, A. On Clusterings: Good, Bad, and Spectral. Proceedings of the 41st Annual IEEE Conference on Foundations of Computer Science. 2000.
Kuehn, A.A. and Hamburger, M.J. A heuristic program for locating warehouses. Management Science. 1963.
Korupolu, M., Plaxton, C., and Rajaraman, R. Analysis of a local search heuristic for facility location problems. Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms. 1998.
Lloyd, S.P. Least squares quantization in PCM. IEEE Transactions on Information Theory. 1982.
Lin, J. and Vitter, J.S. ε-Approximations with Minimum Packing Constraint Violation. Proceedings of the 24th Annual ACM Symposium on Theory of Computing. 1992.
Lin, J. and Vitter, J.S. Approximation Algorithms for Geometric Median Problems. Information Processing Letters. 1992.
Mahdian, M., Ye, Y., and Zhang, J. Approximation Algorithms for Metric Facility Location Problems. SIAM Journal on Computing. 2006.
Manne, A.S. Plant location under economies-of-scale: decentralization and computation. Management Science. 1964.
Mettu, R.R. and Plaxton, C.G. The Online Median Problem. Proceedings of the 41st Annual Symposium on Foundations of Computer Science. 2000.
Meyerson, A. Online Facility Location. IEEE Symposium on Foundations of Computer Science. 2001.
Meyerson, A., O'Callaghan, L., and Plotkin, S. A k-median Algorithm with Running Time Independent of Data Size. Machine Learning, Special Issue on Theoretical Advances in Data Clustering. 2004.
Mishra, N., Oblinger, D., and Pitt, L. Sublinear Time Approximate Clustering. Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms. 2001.
Ostrovsky, R., Rabani, Y., Schulman, L., and Swamy, C. The effectiveness of Lloyd-type methods for the k-Means problem. Proceedings of the 47th Annual Symposium on Foundations of Computer Science. 2006.
Pál, M., Tardos, É., and Wexler, T. Facility location with nonuniform hard capacities. Proceedings of the 42nd Annual Symposium on Foundations of Computer Science. 2001.
Scott, A.J. Operational analysis of nodal hierarchies in network systems. Operations Research Quarterly. 1971.
Shmoys, D., Tardos, É., and Aardal, K. Approximation algorithms for facility location problems. Proceedings of the 29th Annual ACM Symposium on Theory of Computing. 1997.
Stollsteimer, J.F. A working model for plant numbers and locations. Journal of Farm Economics. 1963.
Tamir, A. An O(pn²) algorithm for the p-median and related problems on tree graphs. Operations Research Letters. 1996.