Posted on: 10/5/2011
Mining Rank-Correlated Sets of Numerical Attributes

Index

Introduction
– Need for new measures for mining numerical datasets
– Problems with previous association-rule approaches based on discretization
– Discretization example and its disadvantages
Proposed methods and concepts: rank and rank correlation measures
– What are rank and rank correlation?
– Introducing the new rank correlation measures
– Observations
– Defining the support for the new measures
– Example dataset
– Calculating the rank correlation measure values on the example dataset
– Properties of the new rank correlation measures
Categorical attributes combined with numerical attributes
– Explaining the new measures on categorical and mixed data
Defining confidence and association rules with the new support measures
– Example and explanation
Finding ranks with attribute-value ties in the dataset
– Defining the new support measures with average rank values
Algorithms for finding frequent itemsets using the new rank-based support measures
– Explicit-pair method for converting the dataset to a normal one
– Using algorithms with spatial indices
– Explaining the search traversal techniques Range-tree and Kd-tree
Experimental evaluation
– Results of applying the new support measures on meteorological data
– Performance evaluation of the new support measures

Introduction

The motivation for the research reported upon in this paper is an application where we want to mine frequently occurring patterns in a meteorological dataset containing measurements from various weather stations in Belgium over the past few years. Each record contains a set of measurements (such as temperature or pressure) taken in a given station at a given time point, together with extra information about the stations (such as location or altitude). The classical association rule framework [1], however, is not adequate to deal with numerical data directly.
Most previous approaches to association rule mining for numerical attributes were based on discretization. For mining rank-correlated numerical datasets, three new support measures are proposed. Their application to a real-world meteorological dataset and the corresponding performance evaluations are also presented.

Discretization and Concept Hierarchies

Discretization
• Reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

Concept hierarchies
• Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute Age) by higher-level concepts (such as young, middle-aged, or senior).

E.g., given an attribute Age with values (22, 23, 24, 35, 36, 37, 55, 56, 57), applying discretization and breaking the range into intervals gives Age (Young (20…35), MiddleAge (35…55), Old (55…80)). Here, Young contains (22, 23, 24), MiddleAge (35, 36, 37), and Old (55, 56, 57).

Discretization Example

Table: People

RecordID  Age  Married  NumCars
100       23   No       1
200       25   Yes      1
300       29   No       0
400       34   Yes      2
500       38   Yes      2

Minimum support = 40%, minimum confidence = 50%

Rules                                             Support  Confidence
(Age: 30…39) and (Married: Yes) => (NumCars: 2)   40%      100%
(NumCars: 0…1) => (Married: No)                   40%      66.66%

This simple (discretization-based) approach to a quantitative database has two major drawbacks:

“MinSup”: If the number of intervals for a quantitative attribute is large, the support for any single interval can be low. Hence, without using larger intervals, some rules involving this attribute may not be found because they lack minimum support.

“MinConf”: Some information is lost whenever we partition values into intervals. Some rules may have minimum confidence only when an item in the antecedent consists of a single value (or a small interval). This information loss increases as the interval sizes become larger.
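The support and confidence figures in the table can be checked with a short Python sketch. The People data is from the table above; the helper `support_confidence` is ours, for illustration only:

```python
# Support and confidence of discretized rules on the People table.
people = [
    # (RecordID, Age, Married, NumCars)
    (100, 23, "No", 1),
    (200, 25, "Yes", 1),
    (300, 29, "No", 0),
    (400, 34, "Yes", 2),
    (500, 38, "Yes", 2),
]

def support_confidence(antecedent, consequent, rows):
    """supp = |antecedent and consequent| / |rows|; conf relative to antecedent."""
    n_ante = sum(1 for r in rows if antecedent(r))
    n_both = sum(1 for r in rows if antecedent(r) and consequent(r))
    return n_both / len(rows), n_both / n_ante

# Rule 1: (Age: 30..39) and (Married: Yes) => (NumCars: 2)
s1, c1 = support_confidence(
    lambda r: 30 <= r[1] <= 39 and r[2] == "Yes",
    lambda r: r[3] == 2,
    people,
)
# Rule 2: (NumCars: 0..1) => (Married: No)
s2, c2 = support_confidence(
    lambda r: 0 <= r[3] <= 1,
    lambda r: r[2] == "No",
    people,
)
print(s1, c1)  # 0.4 1.0
print(s2, c2)  # 0.4 0.666...
```

Both rules reproduce the 40% support and the 100% / 66.66% confidence values from the table.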
For example, in Figure 2, the rule “(NumCars: 0) => (Married: No)” has 100% confidence and 20% support. But if we had partitioned the attribute NumCars into intervals such that 0 and 1 cars end up in the same partition, then the closest rule is “(NumCars: 0…1) => (Married: No)”, which has only 66.67% confidence and 60% support.

Conclusion: As the interval size increases, the confidence decreases and the support increases, and vice versa. This is a “catch-22” situation: if the intervals are too large, some rules may not have minimum confidence; if they are too small, some rules may not have minimum support.

Disadvantages of Discretization

The disadvantages of discretization can be listed as follows:
• First of all, it always incurs an information loss, since values falling into the same bucket become indistinguishable and small differences in attribute values become unnoticeable.
• On the other hand, very small changes in values close to a discretization border may cause unjustifiably large changes in the set of active rules.
• If there are too many discretization intervals, discovered rules are replicated in each interval, making the overall trends hard to spot.
• It is also possible that rules will fail to meet the minimum support criterion when they are split among many narrow intervals.

What are Rank and Rank Correlation Measures?

Rank: Let D be the database and H its header, i.e., its set of attributes. The rank r(t.A) of the value of a numerical attribute A in tuple t is the index of that value when the database is ordered ascending w.r.t. A. That is, the smallest value of A gets rank 1, etc. Consider the following table:

Sample table

Person  Height  Weight
A       5.5     53
B       5.7     55
C       5.8     45
D       5.9     50
E       6.0     58
F       6.1     65
G       6.2     68
H       6.3     60

The same table represented with rank values:

Person          A  B  C  D  E  F  G  H
Rank by height  1  2  3  4  5  6  7  8
Rank by weight  3  4  1  2  5  7  8  6

Rank Correlation: “Rank correlation” is the study of relationships between different rankings on the same set of items.
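The rank assignment in the sample table can be sketched with a minimal Python helper of our own (assuming all values are distinct, as in the example):

```python
# Rank of each value within an attribute: 1 for the smallest, and so on.
# Assumes all values are distinct.
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

# Height and weight columns for persons A..H from the sample table.
height = [5.5, 5.7, 5.8, 5.9, 6.0, 6.1, 6.2, 6.3]
weight = [53, 55, 45, 50, 58, 65, 68, 60]

print(ranks(height))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(ranks(weight))  # [3, 4, 1, 2, 5, 7, 8, 6]
```

The output matches the rank rows of the table above.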
It deals with measuring correspondence between two rankings, and assessing the significance of this correspondence. Calculating the rank correlation coefficients is described in the next section.

Assumption: For most of this paper it is assumed that all values of the numerical attributes are distinct.

Rank Correlation Measures

• Kendall’s τ
• Spearman’s Footrule F
• Spearman’s ρ

1. Observations for ρ and F: When two attributes are highly positively correlated, the ranks of their values will be close within most tuples. Hence the sums Σ_{t∈D} (r(t.A) − r(t.B))² and Σ_{t∈D} |r(t.A) − r(t.B)| will be small if A and B are correlated. On the other hand, if A and B are uncorrelated, the expected absolute difference between the ranks of A and B is (|D| − 1)/2, resulting in very high values for these sums. The formulas are chosen in such a way that they are 1 for total positive correlation (e.g., A = B) and −1 for total negative correlation (e.g., A and B are in reverse order).

2. Observations for τ: The formula gives the probability that, for a randomly chosen pair of tuples (s, t), s.A < t.A and s.B < t.B. In this case too, the formula is chosen such that it is 1 for total positive correlation and −1 for total negative correlation. Due to this connection to a probability, this formula is more natural than the other two.

Measuring the Support for Numerical Attributes

Using the above coefficients, the following support measures are proposed, where I is a set of numerical attributes:
• supp_ρ and supp_F: aggregate, per tuple, the difference between the maximum and minimum rank values over the attributes of I.
• supp_τ: all pairs of tuples (s, t) are taken, and the number of pairs such that s comes before t in all attributes of I is counted. This number is then divided by the maximal number of such possible pairs.

Example Weather Dataset

Consider the following example weather dataset, where W = wind speed, P = pressure, T = temperature, and H = humidity. The numbers in brackets indicate the rank of each value within its attribute.
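For concreteness, here is a sketch of the three coefficients applied to the height/weight ranks of the earlier sample table. The Spearman ρ and Kendall τ formulas are the standard ones; the footrule normalization used here, 1 − 3·Σ|d| / (n² − 1), is one common choice and may differ from the paper's exact formula:

```python
from itertools import combinations

def spearman_rho(r1, r2):
    """Spearman's rho: 1 - 6*sum(d^2) / (n*(n^2 - 1)), d = rank differences."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(r1, r2):
    """Kendall's tau: (concordant - discordant) / (n choose 2)."""
    pairs = list(combinations(range(len(r1)), 2))
    conc = sum(1 for i, j in pairs if (r1[i] - r1[j]) * (r2[i] - r2[j]) > 0)
    disc = sum(1 for i, j in pairs if (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0)
    return (conc - disc) / len(pairs)

def footrule(r1, r2):
    """Spearman's footrule, normalized here as 1 - 3*sum|d| / (n^2 - 1).
    Note: the paper's exact normalization may differ."""
    n = len(r1)
    d = sum(abs(a - b) for a, b in zip(r1, r2))
    return 1 - 3 * d / (n * n - 1)

rank_h = [1, 2, 3, 4, 5, 6, 7, 8]  # rank by height
rank_w = [3, 4, 1, 2, 5, 7, 8, 6]  # rank by weight
print(round(spearman_rho(rank_h, rank_w), 3))  # 0.738
print(round(kendall_tau(rank_h, rank_w), 3))   # 0.571
print(round(footrule(rank_h, rank_w), 3))      # 0.429
```

All three values are well above 0, reflecting the strong positive height/weight correlation in the sample table.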
In order to compute supp_τ({T, P, H}), we need to count the number of pairs of tuples such that the first tuple is smaller than the second tuple in T, P, and H. E.g., the pair (t5, t1) is such a pair, and (t2, t3) is not. As there are 6 pairs of tuples that are in order, among them (t5, t3), (t5, t2), and (t5, t1), supp_τ({T, P, H}) = 6/10.

• supp_τ is very complicated to compute.

Properties

1. Average supp_τ: Let A1, A2, …, Ak be k statistically independent numerical attributes. Then the expected (average) value of supp_τ is 1/2^(k−1): under independence and distinctness, each of the two orderings of a random pair agrees with all k attributes with probability (1/2)^k, giving 2 · (1/2)^k in total.

2. The support by Kendall’s formula is never greater than the support calculated with Spearman’s Footrule.

The property “the support of a superset cannot be greater than that of its subset” holds for these support measures too. For J ⊇ I and measure m, it can be stated as:

0 <= supp_m(J) <= supp_m(I) <= 1

This property can be used to calculate an upper bound for supp_τ, because in contrast to supp_τ itself, supp_F is easy to compute.

Categorical Attributes

The above definitions of support are also applicable to ordinal attributes, as their domains are ordered.

Integrating categorical attributes with numerical attributes: Categorical attributes are used to define the context in which the support of the numerical attributes is computed.

Example: We have attributes related to weather data. A sample rule is then: “In measurements of Antwerp and Brussels, the temperature and wind speed in Antwerp are higher than in Brussels in 70% of the cases.”

Association Rules

As we have defined the support for categorical attributes, we can easily define the confidence for a rule.
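The pair-counting definition of supp_τ can be sketched as follows. The rank table below is hypothetical (the slide's actual W/P/T/H values are not reproduced in the text); it is chosen so that exactly 6 of the 10 pairs are in order, matching the example above:

```python
from itertools import combinations

def supp_tau(rows, attrs):
    """supp_tau(I): fraction of tuple pairs ordered consistently on every
    attribute in I, out of all n*(n-1)/2 pairs."""
    pairs = list(combinations(rows, 2))
    def ordered(s, t):  # s strictly before t on all attributes of I
        return all(s[a] < t[a] for a in attrs)
    in_order = sum(1 for s, t in pairs if ordered(s, t) or ordered(t, s))
    return in_order / len(pairs)

# Hypothetical rank columns T, P, H for tuples t1..t5.
rows = [
    {"T": 5, "P": 4, "H": 3},  # t1
    {"T": 4, "P": 5, "H": 4},  # t2
    {"T": 3, "P": 3, "H": 5},  # t3
    {"T": 2, "P": 2, "H": 1},  # t4
    {"T": 1, "P": 1, "H": 2},  # t5
]
print(supp_tau(rows, ["T", "P", "H"]))  # 0.6 (= 6/10)
```

With this data, (t5, t1) is in order on all three attributes while (t2, t3) is not, as in the text.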
The interpretation of such an association rule with confidence c is: for all pairs of tuples (t1, t2) with t1 satisfying J1 and t2 satisfying J2, if t1 is smaller than t2 in I1, then there is a chance with confidence c that t1 is also smaller than t2 in I2.

Example: Consider the rule: “In Antwerp, if the wind speed in one measurement is greater than in another measurement, then there is a probability of 65% that the cloudiness will be higher as well.”

Measuring Ranks with Ties

At the start of this presentation we made the assumption that all numerical values of an attribute are distinct. That won’t be the case in real-world data. This happens with ordinal data too, e.g., an attribute Cloudiness that can take the values (Low, Medium, High); for this attribute at least one third of the pairs will be tied.

Modifying the support measures:

supp_τ: For supp_τ we can simply keep the original definition; that is, if two tuples have a tie on an attribute of I, they do not contribute to supp_τ(I).

supp_ρ and supp_F: Suppose a value t.A is repeated m times in attribute A, and there are p tuples with an A-value smaller than t.A. Then the rank of the value t.A can be taken as:
1. Any value in the range p + 1, …, p + m.
2. The average rank: ar(t.A) = p + (m + 1)/2.
3. The maximal rank: r(t.A) = p + m.

Assigning the average rank to tied values is the most natural choice; with the minimal or maximal rank, the rank sums respectively decrease or increase, introducing a bias. supp_ρ(I) and supp_F(I) are therefore redefined using average ranks, while the definition of supp_τ does not change. For rankings without ties these formulas reduce to the original ones.

Algorithms

1. As only ranks are compared instead of the actual values in the database, we first transform the database by replacing every attribute value by its rank w.r.t. its attribute, which can be done in O(|H| |D| log(|D|)) time.
2. Then all sets of attributes satisfying the minimum support threshold are generated.
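The average-rank assignment for ties can be sketched with a minimal helper of our own (the encoding Low = 1, Medium = 2, High = 3 is an assumption for illustration):

```python
# Average ranks for an attribute with ties: a value repeated m times and
# preceded by p strictly smaller values gets rank p + (m + 1) / 2.
def average_ranks(values):
    sorted_vals = sorted(values)
    first = {}   # p: number of strictly smaller values
    count = {}   # m: multiplicity of the value
    for i, v in enumerate(sorted_vals):
        first.setdefault(v, i)
        count[v] = count.get(v, 0) + 1
    return [first[v] + (count[v] + 1) / 2 for v in values]

# Ordinal attribute Cloudiness encoded as Low=1, Medium=2, High=3.
cloudiness = [1, 3, 2, 2, 3, 1, 2]
print(average_ranks(cloudiness))  # [1.5, 6.5, 4.0, 4.0, 6.5, 1.5, 4.0]
```

E.g., the two Low values occupy ranks 1 and 2 and both receive the average rank 1.5.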
For supp_ρ or supp_F, this can be done by the standard level-wise search, in which candidate itemsets are generated only when all of their subsets are known to be frequent. Counting the supports requires only a single scan through the data in every iteration.

When supp_τ is used as the support measure, the mining task becomes inherently more difficult, since the number of pairs of transactions is quadratic in the size of the database.

Explicit Creation of Pairs of Transactions

A brute-force solution to the presented mining problem is to explicitly combine all pairs of transactions from the original database D, creating a Boolean database D' with one record per pair, in which an attribute is true iff the first transaction of the pair is smaller than the second in that attribute. With this new database we can apply the general methods for finding frequent itemsets; the problem is thus reduced to standard frequent itemset mining on D'.

• Disadvantages:
1. The size of the new database D' is quadratic in the size of the original database.
2. The time needed to create it.
3. Hence it can be applied only to small databases.

• The advantage: it is most of the time very efficient in combination with depth-first frequent itemset mining algorithms, because all pairs that do not support an itemset are removed from its branch. Such a fine level of pruning is not possible in the other approaches.

Using Spatial Indices

Spatial indices are data structures especially designed for searching in multidimensional data, allowing searches in several dimensions simultaneously. When the database is too large and the brute-force method is not applicable, we need an efficient mechanism to count supp_τ for all candidate itemsets. For a given attribute set I, this comes down to a nested loop: for every tuple t, count the tuples s that are smaller than t in all attributes of I.

Disadvantage: The double loop is not feasible, as it performs a quadratic operation for each candidate set. If we can replace the inner loop by a search technique that finds the tuples smaller in all attributes of I, this becomes an efficient technique.
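The explicit-pair construction can be sketched as follows. This is our own reading of the description: a pair-record's Boolean attribute is true iff the first transaction of the pair is smaller in that attribute, so an itemset I is supported by a pair-record iff all attributes of I are true:

```python
from itertools import combinations

def pairs_database(db, attrs):
    """Build the Boolean pair database D' from D (both orderings per pair)."""
    dprime = []
    for s, t in combinations(db, 2):
        for a, b in ((s, t), (t, s)):
            dprime.append({A: a[A] < b[A] for A in attrs})
    return dprime

def supp_on_pairs(dprime, itemset):
    """supp_tau via ordinary support counting on D'."""
    n_pairs = len(dprime) // 2  # each unordered pair produced 2 records
    return sum(all(rec[A] for A in itemset) for rec in dprime) / n_pairs

# e.g. on a tiny table with attributes X and Y:
db = [{"X": 1, "Y": 2}, {"X": 2, "Y": 1}, {"X": 3, "Y": 3}]
dp = pairs_database(db, ["X", "Y"])
print(supp_on_pairs(dp, ["X", "Y"]))  # 0.666... (2 of 3 pairs in order)
```

The quadratic blow-up is visible immediately: a database of n transactions yields n(n − 1) pair-records, which is why this method only works for small databases.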
Optimization Using the Relation Between supp_τ and supp_F

Recall the property relating the new support measures: supp_τ(I) <= supp_F(I). As supp_F can be computed a lot more efficiently than supp_τ, we first compute it for every candidate itemset. If it gives a value under the minimum support threshold, we can already prune the itemset from the search space without computing the more expensive supp_τ. After generating a candidate itemset, it is thus pruned as soon as it falls outside the support constraints.

Generating all candidate itemsets in “reverse depth-first” order gives the best result, because it guarantees that all subsets of a given candidate itemset are generated before the candidate itself. E.g., the subsets of {a, b, c, d} are generated in the order:

{d}, {c}, {c, d}, {b}, {b, d}, {b, c}, {b, c, d}, {a}, {a, d}, {a, c}, {a, c, d}, {a, b}, {a, b, d}, {a, b, c}, {a, b, c, d}

Kd-trees and Range-trees are the two search traversal techniques used for the above process.

Experimental Evaluation

Rules found in the weather data: In the table, the attributes are the altitude of the sun (solar alt), the amount of precipitation (precip), the amount of cloudiness (cloud), and the wind speed (w speed). The association rules were found in the weather data of one station in Brussels.

• The first rule simply states that in 66% of the cases, the temperature is higher if the altitude of the sun is higher. The second rule indicates that if it is raining harder, then there is a probability of 64% that the cloudiness is higher as well. Intuitively, one would expect these confidences to be higher.

• The last three rules indicate that higher cloudiness or higher wind speed positively influences the amount of rain, and apparently, with both higher cloudiness and wind speed this effect becomes even stronger. Indeed, for two randomly selected tuples, the probability of having a higher precipitation is only 32%.
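The reverse depth-first enumeration can be sketched via bitmask counting (our own implementation; the observation that this order coincides with counting in binary, with the first item on the most significant bit, is ours):

```python
from itertools import combinations

def reverse_depth_first(items):
    """Enumerate all non-empty subsets of items in reverse depth-first
    order. A subset's bitmask is numerically smaller than any of its
    supersets' masks, so every subset is visited before its supersets."""
    n = len(items)
    for mask in range(1, 1 << n):
        yield frozenset(items[i] for i in range(n) if mask & (1 << (n - 1 - i)))

order = list(reverse_depth_first(["a", "b", "c", "d"]))
print(["".join(sorted(s)) for s in order[:7]])
# ['d', 'c', 'cd', 'b', 'bd', 'bc', 'bcd']
```

Because subsets always come first, supp_F of every subset of a candidate is already known by the time the candidate itself is reached, which is exactly what the pruning step above needs.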
• When, however, we know that the cloudiness in the first tuple is higher, or the wind speed in the first tuple is higher, or both, the probability of the first tuple having a higher precipitation increases to 48%, 44%, and 57% respectively.

Performance Evaluation of supp_τ

Data and resources used:
1. A real-life meteorological dataset and a number of benchmark datasets.
2. All algorithms were implemented in the C programming language.
3. Tests were run on a 2.2 GHz Opteron system with 2 GB of RAM.
4. supp_τ was evaluated more extensively than the other two measures.

Observations:
1. The explicit-pairs implementation ran out of memory at 1000 records.
2. Kd-trees incur almost no memory overhead and fit in main memory.
3. Range-trees use a lot of memory and scaled to hundreds of thousands of records, depending on the minsup parameter.

Conclusions:
1. Overall, the Range-tree method is the most efficient as long as the minimal support is not too low; for very low minimal support it generates a lot of itemsets and runs out of memory.
2. For 1% minimum support, Kd-trees achieve the best performance because they incur almost no memory overhead.
3. The explicit-pairs method cannot handle data of 10 000 records at all, because the generated 49 995 000 new records do not fit into memory.

Performance of supp_ρ and supp_F

Mining patterns with both of these measures is much more efficient than with supp_τ, but both are very sensitive to the minimum support level chosen. E.g., for supp_ρ the value of the minimum support must be very low, in the range of tenths of millionths of a percent. This shows that the theoretical maximum value is rarely achieved.

Conclusions

1. Three new support measures for continuous attributes, based on rank methods, are introduced.
2. The discretization previously required for numerical data is no longer needed.
3. Algorithms are presented for mining frequent itemsets based on those definitions.
4. Test results on real-world meteorological data are presented, and the efficiency of the algorithms has been verified on large databases.

Questions