Approximation Algorithms for the Metric k-Median Problem
Michael Shindler

Paper type: Research [√]  Critical survey [ ]
Main area(s): Theory of Computation

Abstract—A short overview of recent results in clustering-related algorithms. Exact algorithms and approximations for k-median of various quality are discussed, as are variations of this problem for different input and space/time requirements. Natural alternative formulations of k-median are covered, as are related problems. Also discussed is the application of such solutions to various information-centric uses. This paper concludes with a discussion about clustering metrics as a whole.

Index Terms—clustering, survey paper, k-median, streaming algorithm, sampling algorithm, approximation algorithm, linear programming, data mining, operations research

I. INTRODUCTION

The past decade has seen significant advances in approximating several prominent NP-Hard problems from the field of operations research. At the center of this is the k-median problem, in which an appropriate partitioning of the data set into k representative clusters is sought. This is one of several metrics available to quantify the quality of a clustering.

In this survey, I give an overview of many of the applications of k-median and related problems, as well as several forms of algorithmic solutions to natural formulations of the task. In section 2, I provide a formal definition of the k-median problem as well as many variations of it. In section 3, I explain applications of k-median: why solutions to the problem are useful in practice. In section 4, I overview more than a decade's worth of results related to the problem. In section 5, k-median under alternate forms of input is explored. Section 6 discusses alternate forms of the problem. Section 7 discusses problems that are similar to k-median. I conclude with a discussion of alternate notions of clustering.

II. PROBLEM STATEMENT

Given N, a set of points in some metric space, and some integer k, our goal is to select k points, K ⊆ N. Once the points are selected, the solution quality is the sum of all N's points' distances to their nearest element of K; a higher-quality solution corresponds to a lower cost. The set of data points that have a given m ∈ K as their closest is known as m's cluster. Point m is the cluster's center, or median.

The k-median problem is NP-Hard and can be approximated both by solution cost and by the number of medians produced by the solution. In general, an [α, β]-approximation for k-median guarantees a cost of at most αOPT and uses at most βk medians. If an approximation factor is listed as a single expression, the algorithm uses exactly k medians and the expression is the cost ratio to OPT.
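To make the objective concrete, the cost function just defined can be sketched in a few lines of Python. The toy instance and the function name are mine, purely for illustration, and no attempt at efficiency is made:

```python
import math

def kmedian_cost(points, medians):
    """Sum, over all points, of the distance to the nearest chosen median."""
    return sum(min(math.dist(p, m) for m in medians) for p in points)

# Toy instance: two well-separated clusters in the Euclidean plane.
N = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [(0, 0), (10, 10)]   # one median per cluster
bad = [(0, 0), (0, 1)]      # both medians wasted on one cluster
```

Every algorithm surveyed below is, in one way or another, a strategy for choosing K so that this quantity is close to OPT without examining all C(n, k) candidate sets.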
A. Variations of k-median

In addition to k-median, as described above, there are some naturally similar problems that arise, either due to the manner by which the data was collected or the manner by which the data will be used. Variations of k-median include:
• Online k-median. In online algorithms, some part of the input is unknown in advance. In the case of online k-median, the full metric space is known, but k is not. The goal is to determine a permutation of the points such that the first k serve as a good solution to k-median for that k, for all k. This can be thought of as asking for k in an online fashion. In related formulations, demand points arrive online and k can be increased for a cost. The latter model is more common for facility location.
• k-median with diameter, in which the maximum distance between two points in N is ∆ and the minimum distance between two is δ. This effectively limits the amount by which a point can affect the optimal cost.
• Prize-collecting k-median, in which some points may be excluded from the solution cost, either up to a limit or by paying an arbitrary cost to not assign them to any of the k medians. This is sometimes known as k-median with penalties. This puts an upper bound on the cost by which any point can affect the optimum cost.
• k-median with outliers, in which some fraction η of the points doesn't count towards the cost of the solution; this prevents small, distant clusters from significantly affecting the approximation guarantee. This is sometimes known as robust k-median.
• Capacitated k-median, in which each median can have at most C data points in its cluster.

B. Similar problems to k-median

Similar problems to k-median include k-means, k-center, and facility location. Approaches from these can be used to assist results in k-median and vice versa, and as such, they are worth considering when studying k-median.
• In k-center, the goal is to select k points so as to minimize the maximal distance from any given point to its cluster center. This can be seen as setting up Wireless Access Points and minimizing the broadcast radius for all served points.
• In k-means, the objective function to minimize is the sum of squares of the distances from each point to its center. In Euclidean space, the 1-mean corresponds to the center of mass, although the 1-median does not have such a closed form [21] in two or more dimensions.
• Facility Location is similar in goal to k-median, but without an explicit bound on the number of medians that may be selected. Instead, each point that isn't a median pays a cost to connect to its nearest median, as before, and any point may be designated as a median by paying a given cost. Facility Location is used as a subroutine in many k-median solutions.

The total cost of a facility location solution is split into the service cost, paid to connect points to medians, and the facility cost, incurred by opening new medians (facilities).

It is worth comparing, briefly, the similarities of k-median and facility location with uniform facility costs. An α-approximation for k-median is itself an α-approximation for that version of facility location, as a search can be done for the best value of k. However, even an exact algorithm for that version of facility location does not give a solution for k-median. In a sense, this makes k-median the harder problem.
III. APPLICATIONS OF k-MEDIAN

The need for efficiently locating sources of supply causes demand for accurate solutions to the k-median problem and its variants. Efficient location permits quick and cheap transportation to customers from warehouses and improved customer relations due to speed of delivery [28], [34]. The solution is of concern to the firm(s) in question, as it is both a guide for maximizing profits and instructional for infrastructural investment [43]. It can also be used by analysts, both interior and exterior, as a measure of the efficiency of the firm or of the industry as a whole [43]. Minimizing the distance to facilities from serving locations can increase the speed with which inventory can be replaced at sales locations, and thus drive up sales [7].

Facility Location is similarly used, as it models the fall in shipping cost associated with increased warehousing while balancing total cost [18]. This is convenient, as most geographical systems can be represented as these two basic elements, and it is quite apparent when a situation is described in economical terms [41].

For some input data, such as customer transaction records, a k-median solution can be interpreted as k "typical" customers' information. This is useful for targeted advertising (by discovering market segments) or a stockholder summary. Similar applications exist for large data sets, such as web pages or phone records. Phone record clustering is useful in detecting telephone fraud [4]. Capacitated k-median can be used to determine locations for setting up proxy servers on the web.

Problems in the form of k-center are particularly important for cases when each distance from a distribution center, rather than total distance, causes additional expenses or hardships. Perishable goods, such as fresh milk, require refrigeration to transport over distance [43].

Clustering is also useful in machine learning and classification, as similar cases can be grouped either into an epitomical sample or into groupings of the same, permitting quick classification of future items [26]. Clustering can also be used to answer database queries for which traditional relational models lack the expressive power to ask in the first place, or for which it is difficult for a user to pose a specific query. Fraud detection and event cataloging fall into this category. It can also be used to alleviate the so-called "write-only memory" phenomenon that occurs when large amounts of data are gathered, making it difficult for humans, and most algorithms, to read and understand [8].

IV. RESULTS ON k-MEDIAN

The k-median problem has been studied in many forms. In 1992, the first polynomial-time approximation algorithm with provable performance guarantees was discovered. Nine years later, the first constant approximation was published.

A. Hardness Results

The non-metric k-median problem is NP-Hard and is as hard to approximate as Set Cover; as such, no approximation better than O(log n) may exist unless P = NP. The same is true for non-metric facility location [25].

The metric k-median problem is NP-Hard and Max-SNP hard. Approximating k-median within a factor of 1 + ε, for sufficiently small ε, while maintaining exactly k medians is also NP-hard.

Online solutions in the model of k arriving online cannot achieve better than a (2 − 2/(n−1))-competitive ratio, due to the star graph with unit-cost edges. The center must be chosen as the first point, as otherwise 1-median would be n-competitive on that graph, while the optimal solution for (n−1)-median involves selecting everything except the center [35].

The k-median problem in general graphs is (1 + 2/e) ≈ 1.736-hard to approximate [37]. Facility location over the same is 1.463-hard to approximate [20].

B. Optimal Solutions on Special Metrics

Tamir [44] gives an O(n²k) dynamic programming algorithm to solve k-median on tree metrics. The algorithm proceeds from the leaves of the tree up to the root. He observes that it attains the same time bound when applied to forest metrics. He cites a previous result showing an O(kn)-time algorithm for instances in which the input graph is a path. In each of these three cases, the solution is optimal and computed in polynomial time.
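The path case admits a particularly transparent dynamic program: medians on a line serve contiguous runs of sorted points. The sketch below is the straightforward O(kn²) version of that idea (the O(kn) bound cited above needs an additional speed-up not reproduced here); all function and variable names are mine.

```python
def kmedian_on_line(xs, k):
    """Exact k-median on a path metric by dynamic programming.

    xs must be sorted. Each cluster is a contiguous run of points, served
    optimally by the middle point of the run. Runs in O(k n^2).
    """
    n = len(xs)
    pre = [0.0] * (n + 1)                      # prefix sums of coordinates
    for i, x in enumerate(xs):
        pre[i + 1] = pre[i] + x

    def one_median_cost(i, j):
        # Cost of serving xs[i..j] with one median at the middle point.
        m = (i + j) // 2
        left = xs[m] * (m - i) - (pre[m] - pre[i])
        right = (pre[j + 1] - pre[m + 1]) - xs[m] * (j - m)
        return left + right

    INF = float("inf")
    # dp[m][j]: optimal cost of serving the first j points with m medians.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(j):                 # xs[i..j-1] is the last cluster
                c = dp[m - 1][i] + one_median_cost(i, j - 1)
                if c < dp[m][j]:
                    dp[m][j] = c
    return dp[k][n]
```

The tree-metric algorithm follows the same contract, with the recurrence running leaves-to-root instead of left-to-right.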
The usefulness of tree metric solutions isn't limited to cases where the metric is given as a tree. A common application of this result is to form approximations for general metrics by embedding those metrics into trees, and then solving the problem optimally on the tree, or on the probability distribution over trees, produced. This results in an approximation factor of exactly the distortion produced by the embedding. In 1996, Bartal [5] provided a method to embed any metric space into a probability distribution over trees in such a way that the expected distortion is O(log² n). That is, if d(x, y) is the distance between x and y in the original metric, the distance between them in the newly formed tree metric is expected to be O(log² n) · d(x, y). The trees formed are k-hierarchically well-separated trees ("k-HSTs"). They are rooted, weighted trees formed in such a way that, for each node, the weight of the edge to any given child is the same and is a decrease of at least a factor of k from the edge that extends from its parent to it.

In 1998, Bartal [6] improved this to O(log n log log n) distortion and also provided a deterministic version of the result. This was done by instead computing hierarchically partitionable metrics ("HPMs"), in which the graph is given lengths such that, on a cut in the graph, the lengths crossing the cut are at least as much as the diameter, and each partition formed by the cut obeys this rule as well, with a (naturally) smaller diameter. The HPMs are then translated to HSTs.

There exist metrics for which any tree embedding must have Ω(log n) distortion, and for a few years the gap remained open in general graphs. In 2003, Fakcharoenphol, Rao, and Talwar [17] closed the gap by providing embeddings of general metrics into trees with O(log n) distortion. Rather than constructing the set of trees directly, as Bartal had done, they decompose the graph by growing clusters' diameters, cutting crossed edges with probability based on the length of the edge relative to the diameter. Once the graph is decomposed in this manner, the tree distribution can be grown from the components formed.

Before the gap was closed, Charikar et al [9] gave an approximation by derandomizing the process that produced the tree distribution. They did this by solving the linear program relaxation of the original problem and building a set of trees such that the fractional solution is exact on the distribution of trees produced. The rounded solution from here produces an O(log n log log n) approximation, which can be improved to O(log k log log k) with an idea from [31]. As this was achieved prior to the O(log n) embedding of [17], there is no explicit claim as to whether or not O(log k) is possible through this approach. I believe it is possible, although I leave it unproven, as solutions with far better factors than O(log k) exist for k-median.
C. Arbitrary Precision within Special Metrics

Arora et al [1] give the first arbitrary-precision approximation for planar k-median in the form of a PTAS. Their solution takes time n^O(1+1/ε) and achieves an approximation ratio of 1 + ε with constant probability. Their algorithm works by dividing the plane into a point-quadtree and using a dynamic programming approach on that result, limiting how assignments may be made outside of their individual sectors within the quadtree. This is a generalization of the approach Arora used to solve Euclidean TSP to arbitrary precision.

In the event of a small value of k, a slightly different approach may be used. Instead of restricting association out of a segment of the quadtree, the median locations themselves may be approximated. This results in a (1 + ε)-approximation with runtime n · (O(log(n)/ε))^2k, a notable improvement when k is small.
D. Bicriteria Approximations

The first polynomial-time approximation algorithm with a provable performance guarantee is due to Lin and Vitter [31], [32]. Their algorithm provides a [1 + ε, (1 + 1/ε)(ln n + 1)] approximation on k-median. They also have a [2(1 + ε), 1 + 1/ε] approximation. The latter requires additional input in the form of a bound on the optimal cost. While this approach is based on linear program relaxation as a starting point, it does not employ traditional rounding; for example, some variables may be set to zero in the relaxed linear program solution but be set to 1 by the algorithm before the solution settles. They do this by filtering the results, guaranteeing that each vertex is assigned to a center in its approximate neighborhood.

Korupolu et al [29] provide an analysis of k-median solved by a local search heuristic. Local search is a technique that starts with some plausible solution, perhaps obtained by another approximation algorithm or by an arbitrary starting point, and incrementally improves the solution until no immediate improvement can be found. In the case of k-median, any subset of N of size k is a possible solution, although it is not necessarily a good one. It is possible to improve the solution by taking some p medians (1 ≤ p ≤ k) and p non-medians and "swapping" them, making the medians into non-medians and vice versa. All n data points are then reassigned to their nearest of the new set of medians, and the solution cost is updated. If any of the possible swaps leads to an improved solution cost, that becomes the new temporary solution; in the event that multiple lead to improvement, the best improvement is chosen. If none lead to improvement, the solution whose cost could not be improved upon is the final result. Note that this finds a local optimum, which is not necessarily the global optimum. The approximation guarantee is based on the worst-possible local optimum as compared to the global optimum; this is known as the locality gap.

Korupolu et al consider local search with one swap and provide a guarantee of the user's choice of [1 + ε, 10 + 17/ε] or [Θ(ε³), 5 + 1/ε]. The guarantee is formed by showing that for any solution state sufficiently more expensive than the optimal, there is a state one swap away that has a lower cost. This essentially proves a bound on the locality gap, thus showing that upon termination, local search is within that bound of optimal. This also helps to show that it takes polynomial time to run, as there are a polynomial number of swaps possible at each stage and polynomially many that can happen before convergence. The chosen bound β on the number of medians determines the starting position of the local search: any set of βk points.

Indyk [24] gives an algorithm that produces a [(1 + γ)3(2 + α), 2β] approximation with probability Ω(γ), for some confidence parameter γ > 0 and some [α, β] approximation for k-median that runs in time Õ(n²), such as any mentioned in this paper with such a guarantee. The error probability can be reduced to a constant with O(1/γ) repetitions of the algorithm, taking the smallest-cost solution produced. The runtime of a single iteration is Õ(kn/δ²). It works by taking a single, sufficiently large (Õ(√(nk))) sample of points in N and running the provided [α, β] approximation algorithm on the sample. From the first sample, the fraction with the largest distances to their medians is separated, and the black-box algorithm is run again on these. The solution is the output of both runs of the black-box algorithm, for a total of 2βk medians.
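The swap heuristic described above is short enough to state directly. The following Python sketch implements the single-swap (p = 1) variant with best-improvement selection; it is a minimal illustration rather than the analyzed implementations of [29] or [3], and the starting solution is simply the first k points.

```python
import math
from itertools import product

def local_search_kmedian(points, k):
    """Best-improvement single-swap local search for k-median.

    Repeatedly applies the most beneficial (median out, non-median in)
    swap until no swap improves the cost, i.e. until a local optimum.
    """
    def cost(medians):
        return sum(min(math.dist(p, m) for m in medians) for p in points)

    medians = frozenset(points[:k])            # arbitrary starting solution
    best = cost(medians)
    while True:
        candidates = [medians - {out} | {into}
                      for out, into in product(medians, points)
                      if into not in medians]
        challenger = min(candidates, key=cost)
        if cost(challenger) >= best:           # no improving swap remains
            return set(medians), best
        medians, best = challenger, cost(challenger)
```

On the two-cluster toy instance from Section II, starting with both medians in one cluster, the first round already swaps one median across to the other cluster, after which no single swap improves the cost. In general, of course, the returned local optimum is only guaranteed to be within the locality gap of OPT.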
E. Constant-factor Approximations

The first constant-factor approximation that is exact on the number of medians used is due to Charikar et al [11] in 2001. The algorithm produces a 6 2/3-approximation by linear program relaxation. From the linear program solution, a collection of trees is built, and the problem is then solved optimally on the newly formed forest metric.

More specifically, they begin by consolidating nearby data points, modifying the demands so as to not change the linear program solution. Without changing the cost of the fractional solution, they are able to form a semi-sparse data set, with all positive demand points far apart from one another. They then modify the linear program solution on these to consolidate nearby fractional centers. The rounding, viewed as a solution over a forest metric, produces the approximate solution.

Jain and Vazirani [25] present a 6-approximation. This, on its own, is not the key contribution of their paper, as Arya et al gave a (3 + ε)-approximation the same year. Jain and Vazirani, however, introduce the use of the primal-dual schema, instead of relying on solving a linear program explicitly and forming an integral solution from there. In addition to giving an improved runtime, this technique also permits improvements based on the structure of the problem, an advantage absent in traditional linear-program-based solutions. This leads both to more efficient solutions and to solutions that are strictly combinatorial, eliminating the linear-program-based portions of the algorithm. This technique can apply to other problems as well and is thus relevant outside of k-median.

Another technique may well be to use facility location as a subroutine. For illustration purposes, let us first consider how this would work if we were to attempt to solve the problem exactly. Our goal is to find k medians, and facility location does find medians, but it does not allow us to specify an explicit limit on the number of medians. Instead, it allows us to specify the cost per facility. We can use a solution to facility location as a black box and do a binary search on the cost per facility in order to find a cost that yields exactly k facilities. For any given attempt at cost per facility, if fewer than k facilities are opened, we can conclude that the facilities were overpriced and lower the cost for the next iteration (using a classic principle from economics: to sell more of something, lower the price). Alternately, if facility location reports too many medians, then the price of facilities is too low. Eventually, assuming facilities may have any real cost, we will converge on a solution in which k facilities are opened; this is our solution.
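The binary search just described can be sketched as follows. Because the search itself is the point, the facility location "black box" here is a brute-force exact solver, affordable only on toy instances; the names and the price bounds are mine.

```python
import math
from itertools import combinations

def facility_location(points, z):
    """Brute-force uniform facility location: any point may be opened as a
    facility for cost z. A toy stand-in for the black box in the text."""
    best, best_cost = None, float("inf")
    for r in range(1, len(points) + 1):
        for opened in combinations(points, r):
            c = r * z + sum(min(math.dist(p, f) for f in opened)
                            for p in points)
            if c < best_cost:
                best, best_cost = opened, c
    return best

def kmedian_via_price_search(points, k, lo=0.0, hi=1e6, rounds=60):
    """Binary search on the facility price until exactly k facilities open."""
    for _ in range(rounds):
        z = (lo + hi) / 2
        opened = facility_location(points, z)
        if len(opened) == k:
            break
        if len(opened) > k:
            lo = z          # too many facilities: they are underpriced
        else:
            hi = z          # too few facilities: they are overpriced
    return opened
```

With an exact subroutine and a price that happens to yield exactly k facilities, the opened facilities are an optimal k-median solution; the discussion that follows is about what survives of this picture when the subroutine is only approximate, or when no single price yields exactly k.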
Note that each time we call facility location, we have essentially relaxed the requirement that exactly k medians be opened by moving the requirement into the objective function as a cost. This is the essence of Lagrangian Relaxation, introduced by Jain and Vazirani [25] and applied to produce an approximate solution. The realization for this came by observing that the k-median algorithm of Lin and Vitter uses ideas from constant-factor approximations for facility location. The Lagrangian Relaxation technique formalizes the relationship between these two problems.

A problem arises in this, however, because facility location isn't solved exactly by our subroutine: if we were going to spend the time to solve it exactly, we could just solve k-median itself directly. Furthermore, it isn't immediately obvious that an approximate solution to facility location on the new cost metric is a bounded approximation on k-median. We need to be more careful. In this case, if a solution with exactly k medians isn't found, either due to precision or a deliberate decision to limit the number of iterations, we will still have two surrounding cases, one with more medians and one with fewer. Accepting a small loss in the approximation factor, the two can be combined via a convex combination and randomized rounding.

In general, the Lagrangian Relaxation technique, as described by Jain and Vazirani, is to take any hard constraint, such as the number of medians, and move it from the linear programming constraints to the objective function, adding it (times some multiplier) to any objective function already there. The problem is then solved as a subroutine, either explicitly or using a known result for the relaxed problem, and the solution is interpreted as a solution to the original.

Lagrangian Relaxation can also be applied to solve problems where two related problems are joined by a k-median-type constraint. Jain and Vazirani [25] give the example of needing two types of facilities (schools and hospitals) and clustering by connecting points to one of each of these, but only being able to build k total facilities. The Lagrangian Relaxation of this problem becomes two facility location problems.

Charikar and Guha [10] improve the result of [25] to give a 4-approximation in Õ(n³) time. After the small and large solutions are found, an additional augmentation step is performed by moving medians from the large solution to the small. This is made possible by their observation that the dual variables are continuous, and small changes in the facility cost are bounded in their effect on the dual variables.

A better approximation was found later the same year by Arya et al [3] using local search, improving over the previous local search heuristic of [29]. Arya et al show that if p swaps are allowed at each stage, the approximation guarantee is 3 + 2/p. With only one swap permitted at each stage, this is a 5-approximation, with additional swaps taking additional time but providing an improvement in solution quality. Any number of swaps per stage yields an improved solution quality over the 6 2/3-approximation of [11].
The proof of the single-swap case revolves around the idea of "capturing" an optimal median. An optimal median is captured by some element of some solution if more than half of the points assigned to the optimal median in the optimal solution are being serviced by the capturing point. Note that the capturing point is bad for us. We are fortunate in that optimal facilities can be swapped in exactly once, and any point that captures exactly one optimal point will be swapped out for the optimal point by local search. Furthermore, if a point captures two or more optimal points, it will not be swapped out by a single swap; it is these points that cause the locality gap. Using these facts, an upper bound on the cost increase by swaps made or not made can be proven. Combining this with the limit on how many swaps may occur during a run of local search gives us the locality gap.

For the case of multiple swaps, the notion of capture is extended to any subset of the current solution state. A subset captures a set of optimal points: namely, it captures any optimal point for which the subset covers half or more of the points that the optimal point covers in the optimal solution. Any subset making a capture is termed bad, as before. Note that for any subset of size one, the definitions of capture are identical. Because each iteration of the local search is able to remove a bad median from the current solution state, and the number of times a median can be swapped in or out continues to be bounded, the approximation guarantee of 3 + 2/p can be shown.
V. ALTERNATE FORMS OF INPUT

As k-median is important to large-scale data mining, it has also been studied as an online problem, as streaming, and as sampling, all of which are necessary due to the exponential growth in available data.

A. Online solutions

Mettu and Plaxton [35] provided a constant approximation for the online k-median problem. The algorithm runs in time O(n² + nl), where l is the number of bits necessary to represent the distance between two points in the metric space. They show that a traditional greedy algorithm, selecting the next median so as to minimize the objective cost given the previous pieces, has an unbounded competitive ratio. Theirs, instead, is hierarchically greedy: it picks a point greedily, but rather than settle on that point, it confines its next search to that point's neighborhood. It repeats this process until there is only one point left to pick; this takes O(log(∆/δ)) repetitions. While the constant-competitiveness of this algorithm is useful in its own right, it is also of particular use in that it can be run once and then queried for multiple values of k without reprocessing the input.
B. Streaming-based solutions

In many cases, it is not practical to store all the data and read it repeatedly. For instance, if customer data, such as purchase information, were to be clustered and the then-current k medians output nightly, reading all the previous data before clustering becomes prohibitively expensive, and redundant after a short period of time. Streaming algorithms maintain an approximate solution state and can output a partial solution along the way as requested. In addition to being useful for information that may literally be streaming in real time, these algorithms are also useful for cases where random access to data is expensive or non-existent, or where multiple reads are prohibitively expensive. This also applies to database-related cases when the query to be executed references more data than can fit in main memory [4].

The streaming algorithm of Guha et al [20] reads the data m points at a time, clustering them using some [α, β] bicriteria approximation algorithm, and repeats, remembering the weighted medians as "level-1 medians." When m weighted medians are stored at level i, the weighted points are clustered into βk level-(i+1) medians. When the program is done reading all of the input, or when we wish to cluster based on what has been seen so far, all medians seen so far are clustered into k final medians. The result is a constant approximation.

The paper also gives a one-pass, O(nk log n)-time and O(n^ε)-space algorithm. It uses the local search algorithm of Arya et al [3] as a subroutine and produces a constant approximation ratio for small values of k.

Charikar et al [13] improve on Guha's streaming result. Their streaming algorithm uses O(k polylog n) space and is a constant approximation with high probability. It uses online facility location, due to [36] and described later in this paper, as a subroutine. By bounding the facility cost and the quantity of medians (facilities) produced by the online facility location algorithm, they are able to get performance guarantees for their streaming algorithm. To guarantee a constant-factor approximation on cost from the algorithm, O(k log n) medians are selected instead, as they cannot shuffle the input to get a constant approximation. When the data stream has been processed in its entirety, the O(k log n) weighted medians are consolidated into exactly k. The algorithm takes one pass over the data, provides a constant approximation with probability 1 − 1/poly(n), and uses only the space necessary to represent O(k log² n) points from the stream.

Charikar et al also describe a solution to the asymmetric k-median problem, in which it is not guaranteed that d(x, y) = d(y, x), although reflexivity and the directed triangle inequality continue to apply. Their non-streaming solution is essentially to grow radii among points to cover additional uncovered points at the cheapest cost. This is similar to a greedy approach used for k-center and works only if ∆/δ is known, in order to give a size by which to iterate. This provides an [O(1), 2k log(∆/δ)]-approximation. Their streaming result for the same is done by showing that the non-streaming algorithm can be modified to take less memory and proceed over O(log(∆/δ)) passes. The new algorithm uses O(k² log(∆/δ)) space.
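One level of the divide-and-merge scheme of Guha et al can be sketched as follows. For the in-memory subroutine, this illustration substitutes a brute-force weighted k-median for the [α, β] bicriteria black box; the block size and all names are mine.

```python
import math
from itertools import combinations, islice

def weighted_kmedian(weighted_points, k):
    """Brute-force weighted k-median: a toy stand-in for the in-memory
    clustering subroutine used by the divide-and-merge scheme."""
    pts = list(weighted_points)
    def cost(medians):
        return sum(w * min(math.dist(p, m) for m in medians) for p, w in pts)
    best = min(combinations([p for p, _ in pts], k), key=cost)
    weight = {m: 0.0 for m in best}
    for p, w in pts:   # a median's new weight: total weight assigned to it
        weight[min(best, key=lambda m: math.dist(p, m))] += w
    return list(weight.items())

def stream_kmedian(stream, k, block=8):
    """One level of divide and merge: cluster each block of the stream into
    k weighted medians, then cluster the retained medians at the end."""
    medians = []
    it = iter(stream)
    while True:
        chunk = list(islice(it, block))
        if not chunk:
            break
        medians += weighted_kmedian([(p, 1.0) for p in chunk], k)
    return [m for m, _ in weighted_kmedian(medians, k)]
```

The full scheme simply repeats this collapse whenever m weighted medians accumulate at a level, which is what keeps the memory footprint small regardless of stream length.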
Each deliver a misleading sample that ignores even big clusters. It bucket is split into the different levels, as per Guha’s algorithm. is also necessary to assume that the data is either clustered by An overestimate of the clustering cost of the bucket is also OPT to be reasonably large (Ω( n ) in size) or that the problem k maintained. Each median must also track approximately how being solved is actually k-median with outliers. Alternately, many active points are in its cluster. When groups of medians the diameter of the problem could be small. If these are not are clustered, they are weighted according to the number of the case, then a small cluster of outliers could hide from the active points each contains. sampling and a linear scan of the data would need to detect For the case of streaming data that arrives in an online fash- this, again defeating the beneﬁts of sampling. This assumption, ion, this can be extended with the same memory requirement in any form, essentially prohibits a small subset of the points ˜ and O(k) time to update at each additional arrival point. from unduly affecting the optimal solution. Finally, in any sampling-based solution, the k medians themselves must be C. Sampling-based solutions an acceptable output to the program, as outputting an explicit With the rapid growth of stored data, sometimes even linear- mapping N → K would take O(n) time, even if the solution time algorithms are impractical. As such, it becomes important were magically produced. to condense data for consumption and analysis. Sampling- Mishra et al [38] were the ﬁrst to apply this technique to get based algorithms solve this by guaranteeing an approximation an approximation algorithm for k-median that runs in sublinear ratio in time sublinear in the size of the data. As this prohibits time. 
Speciﬁcally, they show that, given an α-approximation even a full reading of the data, assumptions about it must to k-median (such as, say, any algorithm from Section IV-E be made, such as those about the quantity and importance of of this paper), they can produce a 2α + ε-approximation with outliers as well as the distribution of the relevant data. probability arbitrarily close to one. If the metric is Euclidean For any sampling solution, the following result is useful: space ( d ), the approximation factor is α + ε. Let S ⊆ N and K ⊆ S be the optimal k-median The algorithms they present assume also that the diameter solution on S. K is a 2-approximation on what the of the space is known, although this can be estimated if it k-median solution on S would be if any point from is not known a priori. Because any sample of sufﬁcient size N , rather than only the points in S, could be selected will behave like the full set for some classes of functions, and as medians when solving k-median on S. because they are able to show that computing the diameter is A version of this fact is cited in [37] and in [38]. The fact is such a function, computing the diameter of a sufﬁciently large useful in any sampling solution to k-median for two reasons: sample provides an estimate, if such is needed. First, if we consider N to be a subset of an inﬁnite metric To prove the approximation bounds, they show that the ratio space - which it is - then any solution on N is only a factor of the average distance of a point to the closest approximate of two away from an unconstrained solution and has a much median in the sample converges quickly to the same ratio over more conﬁned search space. For example, if we are trying to the full set. 
ﬁnd a set of k customers that are indicative of our customer In order to make these guarantees, however, several trade- data set, we are not signiﬁcantly worse by insisting that our offs for runtime, accuracy, and probability of guarantee are output be real customers rather than a ﬁctional customer that necessary. A larger diameter necessitates a larger sample, to is somehow more average. prevent outliers from hiding. Increased accuracy or probability Also, if we do a sampling-based algorithm for applications of success also require more elements to be sampled. The where reading the entire input is infeasible or intractable, we running time is directly dependent on the size of the sample, can restrict our attention to the sample for that portion of the so increases in these directly lead to increased runtime, just program, rather than the entire input, which would defeat the as additional input size does. point of sampling in the ﬁrst place. Czumaj and Sohler [15] improve on this result by removing Furthermore, statistics research has shown that sufﬁciently the dependence on the input size and reducing the dependence large samples of data converge to behaving the same as the on the diameter in the sampling. Using a smaller sample, full data set under some characteristic conditions. In essence, they are able to provide a 2(α + ε) approximation to the this permits that the sample studied need only represent the problem in a signiﬁcantly improved runtime. They also extend behavior of the entire data set rather than represent the entire the improvement to the Euclidean space metric, providing the data set [38]. This makes selecting and demonstrating the same α + ε guarantee from the original. validity of the sample much easier. Meyerson et al [37] present a sampling-based algorithm to It is worth noting that any sampling-based (or other solve k-median, within a constant factor, with high probability. 
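Their approach follows the sample-then-cluster pattern shared by the algorithms in this section. The sketch below is an illustration of that pattern only, not the cited construction: the function names are my own, and a naive greedy stands in for the constant-factor subroutine of [3].

```python
import random

def kmedian_cost(medians, points, dist):
    """Sum of distances from each point to its nearest median."""
    return sum(min(dist(p, m) for m in medians) for p in points)

def greedy_kmedian(points, k, dist):
    """Illustrative stand-in for a constant-factor k-median subroutine:
    greedily add the point that most reduces the cost on `points`."""
    medians = []
    for _ in range(k):
        best = min((p for p in points if p not in medians),
                   key=lambda c: kmedian_cost(medians + [c], points, dist))
        medians.append(best)
    return medians

def sample_then_cluster(data, k, sample_size, trials, dist, seed=0):
    """Draw several uniform samples, cluster each, and keep the
    candidate that looks cheapest on a fresh evaluation sample."""
    rng = random.Random(seed)
    candidates = [greedy_kmedian(rng.sample(data, sample_size), k, dist)
                  for _ in range(trials)]
    # Score candidates on a fresh sample; an exact evaluation on all
    # of `data` would already take linear time.
    eval_sample = rng.sample(data, sample_size)
    return min(candidates,
               key=lambda c: kmedian_cost(c, eval_sample, dist))
```

Comparing the candidates on a fresh random sample, rather than on the full input, is what keeps the whole procedure sublinear.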
They will first generate several sample sets of size Ω((k/ε) log k), uniformly at random, from the data and will approximate the solution to each sample with the algorithm of Arya et al [3], choosing the best sample to output. Of course, the decision of best will also be approximate: determining the actual cost would take Ω(n) time. Instead, they approximate it within a constant factor by checking each set against several more random samples, showing that the "best" k medians chosen this way are a constant approximation of the true best among the choices. This works well on data with large clusters, and if the algorithm of Charikar et al [12] is used as a subroutine instead of the Arya algorithm, the result will work for k-median with outliers. They also show the trade-offs between sample size and minimum cluster size; if larger samples are allowed, then the required minimum cluster size in OPT may grow as well.

Charikar et al [13] present a one-pass solution for k-median with outliers that is exact on the number of medians. To produce this result, they draw a sufficiently large sample, run an [α, β]-approximation on that sample, and interpret the solution to the sample as a solution to the problem. This is based on the observation that a sufficiently large random sample of N will cluster about the same way that N does. It uses O(k log n) space and, with high probability, it produces an approximation on N without a significant increase in cost, the number of medians, or the number of outliers over the bounds produced by the subroutine.

VI. ALTERNATE VERSIONS OF THE PROBLEM

A. k-Median with Outliers

The k-median problem with outliers is a natural formulation for solving k-median in situations where a small fraction of the data is errors of some sort - noise or otherwise corrupted data - and these errors can disproportionately affect the optimal solution cost. If the solution identifies the outliers as well, they can be subject to manual intervention: removal if the data point is noise, investigation if it is an anomaly, et cetera. The problem is also worthy of theoretical study as an issue of when k-median problems have multiple global constraints.

Charikar et al [12] provide an algorithm for k-median with outliers, giving a (1 + ε)-approximation on the number of outliers and a 4(1 + 1/ε)-approximation on cost. Their algorithm is based on the use of their 4-approximation to prize-collecting k-median. They are able to approximate the optimal cost C* to within a factor of ε. By setting the penalty in prize-collecting k-median to C*/(εη), they are able to get a solution to k-median with outliers with the given bounds.

Chen [14] presents the first polynomial time constant approximation for k-median with outliers. Chen's algorithm uses Lagrangian relaxation, as described in [25]. In this case, the Lagrangian relaxation of the problem is facility location with outliers. Two solutions will be found in the Lagrangian relaxation phase: one with too few centers and another with too many. If the one with too many has at least k + 2 centers, he cannot use [25]'s merge step and must use his own greedy algorithm. If exactly k + 1 centers are used by the larger solution, then the smaller solution no longer has a provable cost bound, so a solution starting from the larger solution must be found. Lagrangian relaxation can be applied here again: consider k-median with outliers together with the ability to exclude more than the pre-specified number of outliers by paying some penalty. This is the same problem if the penalty is infinite, but is a Lagrangian relaxation otherwise. Computing a local search as a subroutine, starting from the larger solution, and gradually increasing the penalty for additional medians will yield the desired approximation. The idea of the local search is to find a set of successor states in such a way as to have a constant approximation among them. Because the penalty is increased slowly, the local search tracks the optimal value to the end. At each stage of penalty value, the previous solution is used as a starting point.

B. Prize-Collecting k-median

Charikar et al [12] are able to produce a 4-approximation to prize-collecting k-median through Lagrangian relaxation, by using their algorithm for facility location with penalties and adding centers from the larger solution to the smaller solution through various selection rules. This is similar to the approach used by many algorithms described earlier in the paper. They also note that the PTAS for the Euclidean version can be extended to Euclidean prize-collecting k-median, as per the results of [1].

C. Capacitated k-median

The result of Korupolu et al [29] can be applied to capacitated k-median with splittable demands, providing bicriteria approximations of [1 + ε, 12 + 17/ε] and [Θ(ε³), 5 + 1/ε]. In both cases the capacity requirement is not broken. As with uncapacitated k-median, the desired bound on medians determines the starting point of the search as well as the approximation guarantee.

The result of Arora et al [1] can be extended to Euclidean capacitated k-median by adding a capacities dimension to the dynamic programming table. Through this approach, Euclidean capacitated k-median may be approximated to 1 + ε in n^{O(log n/ε²)} time.

Pal et al [40] give a local search solution to facility location with nonuniform hard capacities and achieve a 9-approximation. The hard capacity limit means that the capacity cannot be exceeded by any amount - approximations on that factor are not acceptable. The local search is similar to that used for k-median in its starting point; however, there are additional steps allowed between states. In addition to a swap of facilities, adding or removing a facility is permitted. The algorithm can still be shown to terminate in polynomial time.
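Several of the results above, including Pal et al's, build on swap-based local search. Below is a minimal single-swap sketch for uncapacitated k-median in the spirit of Arya et al [3]; note that the actual analysis only accepts swaps that improve the cost by a significant factor in order to guarantee polynomial-time termination, whereas this illustration accepts any improvement.

```python
import itertools

def cost(medians, points, dist):
    """Total k-median cost: each point pays distance to its nearest median."""
    return sum(min(dist(p, m) for m in medians) for p in points)

def local_search_kmedian(points, k, dist, eps=1e-9):
    medians = list(points[:k])  # arbitrary initial solution
    improved = True
    while improved:
        improved = False
        current = cost(medians, points, dist)
        # Try swapping each median for each non-median point.
        for i, candidate in itertools.product(range(k), points):
            if candidate in medians:
                continue
            trial = medians[:i] + [candidate] + medians[i + 1:]
            trial_cost = cost(trial, points, dist)
            if trial_cost < current - eps:
                medians, current = trial, trial_cost
                improved = True
    return medians
```

With one facility allowed per swap, Arya et al show that such a local optimum is a 5-approximation; allowing p simultaneous swaps improves this to 3 + 2/p.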
VII. RELATED PROBLEMS

A. k-Center

In k-center, the goal is to select k points so as to minimize the maximal distance from any given point to its cluster center. A good way to think of this is as setting up wireless access points and minimizing the broadcast radius for all served points. It is NP-Hard to approximate k-center to a constant factor better than two [22], and thus this result is tight. For the optimization version, minimizing the maximum radius is NP-Hard [38].

Dyer and Frieze [16] give a simple heuristic for k-center. They take the point with the heaviest weight as the first center, and each successive center is chosen greedily as the point whose distance to its closest center is maximal. This achieves an approximation ratio of 3, and if the ratio of the largest weight to the smallest weight is less than two, the approximation factor is one plus the weight ratio. Note that this algorithm also works if k-center is cast as an online problem with k arriving online, as with online k-median [35].

Hochbaum and Shmoys [22] provide the best possible constant-factor approximation to the metric k-center problem. Theirs is a 2-approximation on the radius, uses an exact number of centers, and is based on an observation from linear programming duality. The observation is that a dominating set of size k in the square of the graph corresponds to a k-center solution in the original. This, too, is NP-Hard to compute, but a feasible dominating set can be computed from a strong stable set, which can be approximated in a greedy fashion.

Charikar et al [13] present a one-pass solution for k-center with outliers that, with high probability, produces an α-approximation on k-center with outliers, approximating the number of outliers by a factor of (1 + ε)². It is based on the same ideas as their solution to k-median with outliers (see above). It requires, as input, an algorithm that can solve k-center with outliers to within a factor of α while being exact on the number of outliers; such an algorithm is given in [12] and is described above. The advantage of using this algorithm instead of the one taken as input is that it is faster, with a loss in approximation factor only in the quantity of centers used.

Another paper of Charikar et al [12] gives an O(n³) time 3-approximation to k-center with outliers. The k-center problem with outliers cannot be approximated to better than 3 unless P=NP.

B. k-means

The k-means problem is very similar to k-median, except the objective function to minimize is the sum of squares of the distances from each point to its center.

By far, the most common technique used in practice to solve k-means is due to Lloyd in 1982 [30]. The technique is similar in idea to local search in that k means are selected randomly and the appropriate clustering for the state is computed. Rather than search all possible swaps, however, the next solution state is given by the centers of mass of the k clusters, taking advantage of the closed form nature of the k-means problem. Those centers are then the basis for a re-clustering, and the process repeats itself until the solution stabilizes. The solution could, in principle, take a great while to converge, but in practice it runs very quickly.

It is perhaps confusing that Lloyd's algorithm - referred to generally as just k-means - remains in practice despite no provable guarantees on either the running time (beyond the k^n possible states) or on solution quality (as there are examples on which it can be arbitrarily bad) [2]. Initial attempts by the Theory community to resolve this included exact algorithms [23] and arbitrarily accurate approximations [21], although k-means remained in regular use.

Recently, two independent results attempted to resolve this impasse by small modifications to k-means in order to provide it with provable solution quality guarantees and faster runtimes, both provably and in practice.

The first is due to Ostrovsky et al [39]. They show that k-means does well when a good clustering exists for the parameters given: that is, the data naturally has a clustering for the given value of k. They give a method for seeding the Lloyd algorithm that is similar to the k-center solution of [16] and show that only one iteration of local search is necessary to provide a constant bound on solution quality for such data. This is naturally faster than Lloyd's algorithm, as no additional iterations are necessary.

The second, an algorithm named k-means++, is due to Arthur and Vassilvitskii [2]. In this, the starting solution state is chosen carefully, rather than randomly as in k-means, before using traditional k-means for the rest of the algorithm. The first center is chosen uniformly at random, and each successive center is chosen with probability proportional to the square of its distance to its nearest center. This is known as D² weighting. D² weighting alone - without the local search step - guarantees an approximation factor of Θ(log k), although the practical solution quality is improved by the local search. It can be shown that if this is run 2k times, it is likely that the best solution found is within a constant factor of the optimal. Their empirical results on both synthetic and natural datasets showed k-means++ is a good improvement in both running time and solution quality over k-means.

C. Facility Location

Facility location is similar in goal to k-median, but without an explicit bound on the number of medians that may be selected. Instead, each point that isn't a median pays a cost to connect to its nearest median, as before, and any point may be designated as a median by paying a given cost. This cost may or may not be uniform across all points. This can be thought of as opening warehouses for distributing goods to a chain of stores.

Mettu and Plaxton [35] give an offline facility location solution using the hierarchically greedy model that was used in their solution to online k-median. Theirs is a 3-approximation and runs in time O(n²).

Charikar and Guha [10] provide a local-search based solution to approximate facility location within 2.414 + ε in Õ(n²/ε) time, or to get an approximation tradeoff of (1 + γ, 1 + 2/γ) on facility cost versus service cost. This is near optimal for tradeoffs, as no algorithm can achieve a better tradeoff than (1 + γ, 1 + 1/γ). Combining [25] with another previous result and adding their own observation about scaling yields an approximation ratio of 1.728. The combination is important, because no algorithm based solely on the primal-dual technique of [25] can do better than 3 - ε.

The current best-known approximation for facility location is a 1.519 factor due to Mahdian et al [33]. The approach used is an integration of many approaches used for both k-median and facility location.

Meyerson [36] provides online algorithms for facility location under different models. If the facilities have uniform costs, then the arrival of each point prompts a decision: shall we build a facility here? He proposes randomized decision making for this case. If it would be more expensive to connect this point to any currently open facility than to open a facility here, then we open one here. Otherwise, we open one here with probability equal to the ratio of the cost to connect to the nearest facility to the cost to open a facility here. If the points arrive in a random order, this is constant competitive. However, if the adversary may choose the order in which the points arrive, the algorithm is O(log n)-competitive.

If facilities can have non-uniform costs, the above algorithm will no longer work, as the current location may be arbitrarily expensive. Instead, the metric space and associated facility costs are given in advance, and when we choose to open a facility, we may do so anywhere. We must still remain competitive on the points seen so far, so we cannot simply solve the offline version and call it an online solution. By rounding the facility costs to powers of two and considering multiple choices of where to open the next facility, including not doing so, he gets a constant competitive algorithm for randomized arrival order and O(log n) for the adversarial case. For both cases, it can be shown that no online algorithm can be constant-competitive for adversarial order.

The online solution is also useful as a solution to the offline problem. On its own, it can provide a constant approximation in O(n²) time by shuffling the input points and then processing each point in the new order as though it were an online arrival. While the approximation factor delivered is worse than the guarantees made by other algorithms, this is much faster. Furthermore, it can be used as a subroutine to local search-based solutions in order to start the algorithm at a good solution. When applied to [10], described above, the same 1 + √2 + ε guarantee can be made in Õ(n²/ε) time.

Fotakis [19] expanded on this result to show that no randomized algorithm can do better than Ω(log n / log log n) on any metric space. He also gave a deterministic version of Meyerson's algorithm and showed that it achieves a competitive ratio of O(log n / log log n).

VIII. OTHER NOTIONS OF CLUSTERING

All of the algorithmic approaches discussed thus far use a mathematical formula to evaluate the quality of the clustering solutions produced. These work well in general and for many types of data, although each has examples for which the cost metric is easily fooled; that is, there is a natural and correct way to group the example data - that is the point of clustering, after all - which the problems, even in their exact solutions, fail to find.

Recently, Kannan et al [27] proposed a new bicriteria formulation to determine the quality of a clustering solution. Now, the input becomes a similarity graph for the data, with higher edge weights corresponding to vertices that are more similar. A k-clustering partitions the graph into k components, ideally in such a way that the most similar points are in the same connected components. In this, we can judge the quality of a clustering by the cuts it creates; we want to measure the size of the cut relative to the sizes of the pieces it creates. The quality of a clustering is the minimum quality of the clusters it creates.

This causes a problem in that all vertices are equally important. We would rather give importance to vertices that have many group similarities. As such, we measure the conductance of the cut as the ratio of the cost of the cut to the similarity that the members on one side of the cut share with the graph.

As with k-median, a few outliers can cause problems. Instead of explicitly choosing to exclude a fraction a priori, we judge a clustering by a second criterion: the fraction of the total weight not covered by the clusters. This allows us to exclude outliers by paying a cost, with a smaller cost corresponding to data that is farther from our desired clusters, and thus more of an outlier.

The problem statement is then to be given the similarity graph and one of the two parameters: either minimize the outlier cost while retaining the desired quality, or maximize the quality without excluding more than the given budget for outliers.

ACKNOWLEDGMENTS

I am grateful to Adam Meyerson for providing the starting seeds of papers to read. From these, I was able to find some initial results and additional related papers. I am also grateful for his feedback on successive drafts of this paper.

REFERENCES

[1] Arora, S., Raghavan, P., and Rao, S. Approximation schemes for Euclidean k-median and related problems. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 106-113. 1998.
[2] Arthur, D. and Vassilvitskii, S. k-means++: The Advantages of Careful Seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. 2007.
[3] Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., and Pandit, V. Local Search Heuristics for k-Median and Facility Location Problems. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing. 2001.
[4] Babcock, B., Datar, M., Motwani, R., and O'Callaghan, L. Maintaining Variance and k-Medians Over Data Stream Windows. PODS. 2003.
[5] Bartal, Y. Probabilistic approximations of metric spaces and its algorithmic applications. In IEEE Symposium on Foundations of Computer Science, pages 184-193. 1996.
[6] Bartal, Y. On approximating arbitrary metrics by tree metrics. Proceedings of the 30th Annual ACM Symposium on Theory of Computing. 1998.
[7] Baumol, W.J. and Wolfe, P. A warehouse location problem. Operations Research. 1958.
[8] Bradley, P.S., Fayyad, U.M., and Mangasarian, O.L. Mathematical Programming for Data Mining: Formulations and Challenges. INFORMS Journal on Computing, special issue on Data Mining. 1998.
[9] Charikar, M., Chekuri, C., Goel, A., and Guha, S. Rounding via Trees: Deterministic Approximation Algorithms for Group Steiner Trees and k-median. Proceedings of the 30th Annual ACM Symposium on Theory of Computing. 1998.
[10] Charikar, M. and Guha, S. Improved Combinatorial Algorithms for the Facility Location and k-Median Problems. Proceedings of the 40th Annual IEEE Conference on Foundations of Computer Science. 1999.
[11] Charikar, M., Guha, S., Tardos, É., and Shmoys, D. A Constant Factor Approximation Algorithm for the k-Median Problem. Journal of Computer and System Sciences. 2002. Originally appeared in Proceedings of the 31st Annual ACM Symposium on Theory of Computing. 1999.
[12] Charikar, M., Khuller, S., Mount, D., and Narasimhan, G. Algorithms for Facility Location Problems with Outliers. In Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms. 2001.
[13] Charikar, M., O'Callaghan, L., and Panigrahy, R. Better Streaming Algorithms for Clustering Problems. Proceedings of the 35th Annual ACM Symposium on Theory of Computing. 2003.
[14] Chen, K. A constant factor approximation algorithm for k-median clustering with outliers. Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms. 2008.
[15] Czumaj, A. and Sohler, C. Sublinear-Time Approximation Algorithms for Clustering via Random Sampling. Random Structures and Algorithms. 2007.
[16] Dyer, M. and Frieze, A.M. A simple heuristic for the p-center problem. Operations Research Letters. 1985.
[17] Fakcharoenphol, J., Rao, S., and Talwar, K. A tight bound on approximating arbitrary metrics by tree metrics. Proceedings of the 35th Annual ACM Symposium on Theory of Computing. 2003.
[18] Feldman, E., Lehrer, F.A., and Ray, T.L. Warehouse location under continuous economies of scale. Management Science. 1966.
[19] Fotakis, D. On the Competitive Ratio for Online Facility Location. International Colloquium on Automata, Languages and Programming. 2003.
[20] Guha, S., Mishra, N., Motwani, R., and O'Callaghan, L. Clustering Data Streams. Proceedings of the 41st Annual IEEE Conference on Foundations of Computer Science. 2000.
[21] Har-Peled, S. and Mazumdar, S. Coresets for k-means and k-median clustering and their applications. Proceedings of the 36th Annual ACM Symposium on Theory of Computing. 2004.
[22] Hochbaum, D.S. and Shmoys, D.B. A Best Possible Heuristic for the k-Center Problem. Mathematics of Operations Research. 1985.
[23] Inaba, M., Katoh, N., and Imai, H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. SCG '94: Proceedings of the tenth annual symposium on Computational geometry. 1994.
[24] Indyk, P. Sublinear Time Algorithms for Metric Space Problems. Proceedings of the 31st Annual ACM Symposium on Theory of Computing. 1999.
[25] Jain, K. and Vazirani, V. Approximation Algorithms for Metric Facility Location and k-Median Problems Using the Primal-Dual Schema and Lagrangian Relaxation. Journal of the ACM. 2001.
[26] Jain, K. and Vazirani, V. Primal-dual approximation algorithms for metric facility location and k-median problems. Proceedings of the 40th Annual Symposium on Foundations of Computer Science. 1999.
[27] Kannan, R., Vempala, S., and Vetta, A. On Clusterings: Good, Bad, and Spectral. Proceedings of the 41st Annual IEEE Conference on Foundations of Computer Science. 2000.
[28] Kuehn, A.A. and Hamburger, M.J. A heuristic program for locating warehouses. Management Science. 1963.
[29] Korupolu, M., Plaxton, C., and Rajaraman, R. Analysis of a local search heuristic for facility location problems. Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms. 1998.
[30] Lloyd, S.P. Least squares quantization in PCM. IEEE Transactions on Information Theory. 1982.
[31] Lin, J. and Vitter, J.S. ε-Approximations with Minimum Packing Constraint Violation. Proceedings of the 24th Annual ACM Symposium on Theory of Computing. 1992.
[32] Lin, J. and Vitter, J.S. Approximation Algorithms for Geometric Median Problems. Information Processing Letters. 1992.
[33] Mahdian, M., Ye, Y., and Zhang, J. Approximation Algorithms for Metric Facility Location Problems. SIAM Journal on Computing. 2006.
[34] Manne, A.S. Plant location under economies-of-scale - decentralization and computation. Management Science. 1964.
[35] Mettu, R.R. and Plaxton, C.G. The Online Median Problem. Proceedings of the 41st Annual Symposium on Foundations of Computer Science. 2000.
[36] Meyerson, A. Online Facility Location. IEEE Symposium on Foundations of Computer Science. 2001.
[37] Meyerson, A., O'Callaghan, L., and Plotkin, S. A k-median Algorithm with Running Time Independent of Data Size. Machine Learning, Special Issue on Theoretical Advances in Data Clustering. 2004.
[38] Mishra, N., Oblinger, D., and Pitt, L. Sublinear Time Approximate Clustering. Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms. 2001.
[39] Ostrovsky, R., Rabani, Y., Schulman, L., and Swamy, C. The effectiveness of Lloyd-type methods for the k-Means problem. Proceedings of the 47th Annual Symposium on Foundations of Computer Science. 2006.
[40] Pal, M., Tardos, É., and Wexler, T. Facility location with nonuniform hard capacities. Proceedings of the 42nd Annual Symposium on Foundations of Computer Science. 2001.
[41] Scott, A.J. Operational analysis of nodal hierarchies in network systems. Operations Research Quarterly. 1971.
[42] Shmoys, D., Tardos, É., and Aardal, K. Approximation algorithms for facility location problems. Proceedings of the 29th Annual ACM Symposium on Theory of Computing. 1997.
[43] Stollsteimer, J.F. A working model for plant numbers and locations. Journal of Farm Economics. 1963.
[44] Tamir, A. An O(pn²) algorithm for the p-median and related problems on tree graphs. Operations Research Letters. 1996.
