Compressing the Native XML Database via Association Mining

					XML

(eXtensible Markup Language XM) XML XML XML XML XML (frequent character XML XML

(frequent tag data sets) data sets) XML

XML XML 40 XML

75% XML

Compressing the Native XML Database via Association Mining
Chin-Feng Lee, Chaoyang University of Technology
Chi-Ming Tang, Chaoyang University of Technology


Abstract
XML has become a standard for enterprise data exchange, allowing transactional processes to interoperate. However, existing database systems such as relational databases provide inadequate facilities for managing the nested and ordered structures in XML documents. Two important issues therefore arise: the storage capacity required for huge XML documents, and the complexity of mapping between relational databases and an XML repository. A native XML database addresses the mapping issue by storing and retrieving XML documents as basic units, without complicated transformation, and database compression relieves the storage burden. We therefore use association mining techniques to compress a native XML database. Frequent character data sets and frequent tag sets are mined and used to establish a set of database compression rules. The proposed method also applies dynamic mining techniques to maintain the compression rules when the database is updated, so that the whole database does not have to be periodically decompressed and mined again. The approach thus contributes to the native XML database in two ways: it extracts hidden information and it provides lossless compression. The experimental results show that static compression reaches a compression ratio of about 75%, and that the proposed dynamic mining techniques save 40 seconds of database compression time.

Keywords: XML, Data Mining, Data Compression, Native XML Database

XML(eXtensible Markup Language) XML DTD(Document Type Definition) XML Schema

W3C

XML (Cannane et al., 2000; Lee et al., 2001; Strobel, 2002; WWW Consortium, 2005) XML XML XML (Bertino et al., 2001; Cannane et al., 2000)

Electronic Commerce Studies 85

XML XML (Florescu and Kossmann, 1999; Fong et al., 2003)

XML

XML XML

(Huffman Coding) (Dictionary Coding) (Arithmetic Encoding) (Knowledge Discovery in Databases)

XML (Agrawal and Srikant, 1994; Agrawal et al., 1993; Han and Kamber, 2001) XML (frequent tag data sets) (frequent character data sets) XML

XML XML

XML

XML


XML(eXtensible Markup Language)
XML DTD(Document Type Definition) (tag) W3C (character data) XML

1 “ ” “ “020301” ” “ ” “ “ 2” ”

XML “ ” ”

“ “

”

“

” (character data) ” “ ” “ “ ”

1

XML

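An association rule over a set of items (Item) (Agrawal and Srikant, 1994) has the form X → Y [Support, Confidence], where X and Y are sets of items. The support (Support) of the rule is Probability(X ∪ Y), the fraction of transactions that contain both X and Y. The confidence (Confidence) is Probability(Y|X), the fraction of the transactions containing X that also contain Y. A rule is kept only if its support reaches the minimum support (Minimum Support) and its confidence reaches the minimum confidence (Minimum Confidence).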
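The Support and Confidence measures can be illustrated with a short Python sketch. The transactions are hypothetical: the item names bread, coke cola and cheese are borrowed from the paper's later examples, and milk is invented for illustration. The function names are assumptions and not part of the proposed method.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(wanted <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Estimate Probability(Y | X) as support(X together with Y) / support(X)."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0  # X never occurs, so there is no evidence for the rule
    return support(transactions, set(x) | set(y)) / support_x


# Hypothetical transactions, one set of items per transaction.
transactions = [
    {"bread", "coke cola", "cheese"},
    {"bread", "coke cola"},
    {"milk"},
    {"bread", "cheese"},
]

rule_support = support(transactions, {"bread", "coke cola"})
rule_confidence = confidence(transactions, {"bread"}, {"coke cola"})
```

With these four transactions the rule bread → coke cola has support 0.5 and confidence 2/3, so it would pass a minimum support of 0.5 but fail a minimum confidence of 0.7.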


Class Inheritance Tree(CIT) Apriori (Goh et al. 1998; Lee et al., 2006)

2001

(2001) CIT CIT

CIT (Equivalence Class)

Step1 (1)

X={X1, X2, …, Xn} Xi (Equivalence Class) CIT CIT EC={EC1, EC2, …, ECn} ECi

CIT X

(2) Step2

(Decision Equivalence Class) DECi (Condition Equivalence Class) CECi DEC2, …, DECn } CIT CIT CECi DECi Step3 (1) Xi Xi = CEC=EC-{DEC1,

Xi

t (X1, X2, X3, …, Xi-1, , Xi+1, …, Xn ) t’ (X1, X2, X3, …, Xi-1 , Xi+1, …, Xn ) X1, X2, X3, …, Xi-1 , Xi+1, …, Xn i R*B( ) R Xi = EC B( ) Byte (2)


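$t = (X_1, \ldots, X_{j_1-1}, a_1, X_{j_1+1}, \ldots, X_{j_2-1}, a_2, X_{j_2+1}, \ldots, X_{j_k-1}, a_k, X_{j_k+1}, \ldots, X_n)$

$t' = (X_1, \ldots, X_{j_1-1}, X_{j_1+1}, \ldots, X_{j_2-1}, X_{j_2+1}, \ldots, X_{j_k-1}, X_{j_k+1}, \ldots, X_n)$

over the remaining attributes $X_1, \ldots, X_{j_1-1}, X_{j_1+1}, \ldots, X_{j_2-1}, X_{j_2+1}, \ldots, X_{j_k-1}, X_{j_k+1}, \ldots, X_n$

$R \times \sum_{i=1}^{k} B(a_i)$, where $B(a_i)$ is the size of $a_i$ in bytes, for $i = 1$ to $k$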

XML 2 XML Collection XML XML Collections DTD (Parse) XML (2001) XML Collection XML (Frequent Character Data Sets) (Frequent Tag Sets) XML XML Collection Collection XML


2

DTD
XML

XML
(Hierarchical n-ary

Structure) XML ( Document Tree) (Sub-element) (Parent-Child Nodes) XDB XML (Native XML Database) XDB = {C1, C2, …, Cn} XDB n Colletions Collection Ci XML X={X1, X2, …, Xm} XML DTD DTD Collection Ci (Depth-First-Search DFS) XML Xi Di-Tree (Root Node) (Tag Node) (Character Data Node) 1. 2. q Case 1. x.y.z1 q r2 XML q x.y m x.y.z2

p

p r1

x

{r1, r2, …, rm} Ym


x.y.zm Case 2. q 3(a) 3(b) ” ” XML “ ”

x.y x.y DTD X1 X1 D1-Tree “1” DTD “1” “ X1 “ ” Case 1 “ ” ” “ ” Case 1 “ “ ” “ “1.1.1” D1-Tree XML “ ” Collection “1” “1” XML XML 5 “ ” ” XML XML “ (Content) “ “ ”

“ “1” “

“ ” XML ” ”

“1.1” DTD “1” “1.1.1” 4 “ ”

” XML

“2”

Figure 3: (a) DTD; (b) XML document


Figure 4: XML document x1

Figure 5: D1-Tree
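The dotted positions used throughout the examples (such as "1", "1.1" and "1.1.1") can be sketched as a depth-first traversal in Python. This is a minimal illustration under several assumptions, not the paper's implementation: the root of document number i is labelled "i", the m-th child of a node x.y is labelled x.y.m, and a character data value is recorded at the position of the element that contains it. The sample document follows the Sale DTD used in the experiments; its name and note values are invented, and `label_positions` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the Sale DTD.
SAMPLE = """<Sale>
  <Header><SID>020301</SID><name>demo</name><note>none</note></Header>
  <itemset><item>bread</item><item>coke cola</item><item>cheese</item></itemset>
</Sale>"""


def label_positions(root, doc_number):
    """Depth-first walk that maps dotted positions to tags and character data."""
    tags, character_data = {}, {}

    def visit(element, position):
        tags[position] = element.tag
        text = (element.text or "").strip()
        if text:  # only elements that hold character data
            character_data[position] = text
        for index, child in enumerate(element, start=1):
            visit(child, f"{position}.{index}")

    visit(root, str(doc_number))
    return tags, character_data


tags, character_data = label_positions(ET.fromstring(SAMPLE), doc_number=1)
```

Under these assumptions the item "bread" lands at position 1.2.1, "coke cola" at 1.2.2 and "cheese" at 1.2.3, which agrees with the first-document positions listed for these values in the CF1 Table.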

XML XML = { i1, { 1, 2, …,
i2,

…, ik} h} EC

{ 1, XML XML

2,

…,

h}

Definition 1 Character Data)

(Equivalence Class of

EC = EC< {<

i1

,

i2

, ...,

i1 ,

i2, …,

ik >

=

i k >|

ij

ij

XML

for j = 1, 2, …, k} EC< 3.2.3>} 3 “3.2.2” XML “150”

150> = {<2.2.3, 2.2.4>, <3.2.2, “ ” “150” 2 “ ” “2.2.3” “2.2.4” “3.2.3”


Definition 2
k (k =1)
1

k
k (k 1 =< 1 >
2

2)

1 < 80>

=<

80> <

> 2

Definition 3
k EC

(support)
XML |EC | = < 80 > = {<1.2.2, 1.2.3>, <3.2.1, 3.2.2>} 80 > XML >| =2 min-sup

EC =EC< |EC<
80

2

Minimum-Support Threshold,

3.1 (min-sup) CF1 CS1 CF1 Table (1) Content (2) List CS1 Table (1) Content (2) List Apriori 2) CFk Table (1) Content (2) List 1 CF1 Table

1 XML

1 k CFk Table 2) k(k p q 2) p q Content (k

k(k


1 1 EC
2

2 2 2

1

2

1 EC
1

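The support of a candidate is computed from the Most Significant Digit (msd) of the positions in the XML database. For two frequent sets, here written c1 and c2, the entries of EC<c1> are grouped by msd into G1, G2, …, Gm1, where |Gj| (j = 1, 2, …, m1) is the number of entries in Gj, and the entries of EC<c2> are grouped by msd into G1', G2', …, Gm2'. If the groups that share an msd are msd(G1) = msd(G1'), …, msd(Gh) = msd(Gh') with h ≤ m1 and h ≤ m2, then

|EC<c1, c2>| = min{|G1|, |G1'|} + min{|G2|, |G2'|} + … + min{|Gh|, |Gh'|}

If |EC<c1, c2>| ≥ min-sup, then <c1, c2> is a frequent 2-character data set. Frequent k-character data sets (k ≥ 2) are obtained in the same way.

For example, let min-sup = 2 and take the Content values c1 = <bread> and c2 = <coke cola>. Grouping EC<c1> by msd gives G1 = {1.2.1}, G2 = {2.2.3}, G3 = {4.2.2} (m1 = 3), and grouping EC<c2> by msd gives G1' = {1.2.2}, G2' = {2.2.1} (m2 = 2). The most significant digits are msd(G1) = 1, msd(G2) = 2, msd(G3) = 4, msd(G1') = 1, msd(G2') = 2, so msd(G1) = msd(G1') and msd(G2) = msd(G2'). Hence |EC<c1, c2>| = |EC<bread, coke cola>| = min{|G1|, |G1'|} + min{|G2|, |G2'|} = 1 + 1 = 2. Bread and coke cola therefore occur together in 2 XML documents, which meets min-sup = 2, and the CF2 Table records the Content "<bread, coke cola>" with the List "<1.2.1, 1.2.2>, <2.2.3, 2.2.1>".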
1

1

Table 1: CF1 Table

Content      List
bread        1.2.1, 2.2.3, 4.2.2
coke cola    1.2.2, 2.2.1
cheese       1.2.3, 4.2.1

Table 2: CF2 Table

Content               List
<bread, coke cola>    <1.2.1, 1.2.2>, <2.2.3, 2.2.1>
<bread, cheese>       <1.2.1, 1.2.3>, <4.2.2, 4.2.1>


2

3

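As a further example, let min-sup = 2 and take the Content values c1 = <bread, coke cola> and c2 = <bread, cheese>. Grouping EC<c1> by msd gives G1 = {<1.2.1, 1.2.2>}, G2 = {<2.2.3, 2.2.1>} (m1 = 2), and grouping EC<c2> by msd gives G1' = {<1.2.1, 1.2.3>}, G2' = {<4.2.2, 4.2.1>}. The most significant digits are msd(G1) = 1, msd(G2) = 2, msd(G1') = 1, msd(G2') = 4, so only msd(G1) = msd(G1'). Hence |EC<c1, c2>| = |EC<bread, coke cola, cheese>| = min{|G1|, |G1'|} = 1. Because <bread, coke cola, cheese> occurs in only 1 XML document, which is below min-sup, it is not a frequent 3-character data set.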
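The msd-based support count in the bread, coke cola and cheese examples can be reproduced with a small Python sketch. It is an illustration under assumptions, not the authors' implementation: every List entry is stored as a tuple of positions, the msd is taken to be the first dot-separated component of the first position, and entries inside a shared msd group are paired in order. The names `group_by_msd` and `join_lists` are hypothetical.

```python
from collections import defaultdict


def group_by_msd(entries):
    """Group position tuples by the first component of their first position."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry[0].split(".")[0]].append(entry)
    return groups


def join_lists(list1, list2):
    """Pair entries that share an msd; len(result) is the joined support."""
    groups1, groups2 = group_by_msd(list1), group_by_msd(list2)
    joined = []
    for digit, group1 in groups1.items():
        # zip stops at the shorter group, so it yields min(|G|, |G'|) pairs
        for left, right in zip(group1, groups2.get(digit, [])):
            joined.append(left + tuple(p for p in right if p not in left))
    return joined


# Position lists for bread, coke cola and cheese, each wrapped in a 1-tuple.
bread = [("1.2.1",), ("2.2.3",), ("4.2.2",)]
coke_cola = [("1.2.2",), ("2.2.1",)]
cheese = [("1.2.3",), ("4.2.1",)]

bread_coke = join_lists(bread, coke_cola)
bread_cheese = join_lists(bread, cheese)
all_three = join_lists(bread_coke, bread_cheese)
```

Here `len(bread_coke)` is 2, matching |EC<bread, coke cola>| = 2, while `len(all_three)` is 1, so the 3-set falls below min-sup = 2.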

XML XML

={ { 1,
2,

i1 ,

i2 ,

…,

ik }

{ 1, XML XML

2,

…,

h}

…, EC

h}

Definition 4
EC XML

(Equivalence Class of Tag)
= EC<
i1, i2, …, ik>{<

i1 ,

i2 , …, ik >|

}

Definition 5
1

1
1 (k 2)

Definition 6
1

(support)
EC XML |EC |


(2001)

Metarule C

Metarule T

(k

CFk Table Content k 1) t (P1, …, Pj1-1, 1, Pj1+1, …, Pj2-1, 2, Pj2+1, …, Pjk-1, n, Pjk+1, …, Pn) t (P1, …, Pj1-1, Pj1+1, …, Pj2-1, Pj2+1, …, Pjk-1, Pjk+1, …, Pn) List P1, …, Pj1-1, Pj1+1, …, Pj2-1, Pj2+1, …Pj k-1, …Pjk+1, Pn
i

DTD

i List
i

List
k

|List|* B( i)
i1

B( i) Byte

|CD|

i

i

for i =1, 2, ..., n

XML

t1 , , , {1.1.1, 1.2.1}, {3.1.1, 3.2.1} ” “ ” “ ” “ ” “ ” ” “ ” “ ” “ ” “{1.1.1, 1.2.1}, {3.1.1, 3.2.1}”

t1’

, “ ” “ ” “ “

“ ” ”

“ “ ”

List

TFk Table (k 1) t (P1, …, Pj1-1,
1,

Content
2,

k Pj2+1, …, Pjk-1,
n,

Pj1+1, …, Pj2-1,

Pjk+1, …, Pn)


t (P1, …, Pj1-1, Pj1+1, …, Pj2-1, Pj2+1, …, Pjk-1, Pjk+1, …, Pn) List P1, …, Pj1-1, Pj1+1, …, Pj2-1, Pj2+1, …Pj k-1, …Pjk+1, Pn
i

DTD

List
k

|List|* List B( i)
i1

B( i) Byte

|List|

i

i

for i =1, 2, ..., n

(Heuristic Compression Method)

Apriori

3 “ ” “ “{1.1.1, 1.2.1}, {3.1.1, 3.2.1}” “ ” “ ” {3.1.1, 3.2.3}, {4.1.1, 4.2.1}” “ ” “1.1.1” “3.1.1” “RID 1” “RID 2” “RID 2” CD “{4.1.1, 4.2.1}” ” CD “{1.1.1, 1.2.2}, 1 3

“1.1.1”

“3.1.1” “RID 2”

Table 3

Compression Rules    List
t1 → t1'             {1.1.1, 1.2.1}, {3.1.1, 3.2.1}
t2 → t2'             {1.1.1, 1.2.2}, {3.1.1, 3.2.3}, {4.1.1, 4.2.1}

1. Least Penalty Heuristic Method (LPHM)

2. Most Penalty Heuristic Method (MPHM)

3. Maximal Compression Spaces Heuristic Method (MaxCSHM)

4. Minimum Compression Spaces Heuristic Method (MinCSHM)

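The method was implemented in Java (J2SDK 1.4.1) on an Intel P4-2.4G machine with 1.5GB of memory running Microsoft Windows 2000 Professional. The test XML documents follow the DTDs in Figures 6 and 7 and were generated with Assoc.gen from the IBM Almaden Research Center (2005); Table 4 lists the Assoc.gen parameters. Compression results are compared with RAR and ZIP, and the Least Penalty Heuristic Method (LPHM) is compared with the Non-Heuristic Method.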


Table 4: Assoc.gen parameters

Symbol    Parameter
T         L_TLEN
D         NTRANS
N         L_NITEMS
I         PATLEN
R         CORR

<!ELEMENT transaction (itemset)>
<!ELEMENT itemset (item+)>
<!ELEMENT item (#PCDATA)>

Figure 6: DTD

<!ELEMENT Sale (Header, itemset)>
<!ELEMENT Header (SID, name, note)>
<!ELEMENT SID (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT note (#PCDATA)>
<!ELEMENT itemset (item+)>
<!ELEMENT item (#PCDATA)>

Figure 7: DTD

8 DTD

1000 1

XML

10% 20% 30% T20. I10. N100. R1. 8

90% 6

1 80% 90% XML 30% 30% 90% 1 74.18 %


( % ) (%)

8 1000

XML

5 6 1 DTD

bytes 72,768 bytes 122,252 bytes 60,661 bytes 782,230 bytes (709,462 + 72,768) 782,230 / (846,826 + 122,252 + 60,661) = 75.96% 10000 XML 75%

XML XML 2500 5000 7500 10000 T20. I10. N0.1k. R1 30% 2500 XML 709,462 bytes 137,364 709,462 bytes

5000

7500


Table 5

Number of XML documents    2500       5000        7500        10000
(bytes)                    846,826    1,683,026   2,524,412   3,371,170
(bytes)                    709,462    1,410,092   2,114,996   2,824,351
(bytes)                    137,364    272,934     409,416     546,819
(bytes)                    709,462    1,410,092   2,114,996   2,824,351
(bytes)                    72,768     140,466     210,612     289,971
(bytes)                    782,230    1,550,558   2,325,608   3,114,332
(bytes)                    122,252    255,522     383,270     537,114
(bytes)                    60,661     122,927     184,305     265,827
(%)                        75.96      75.22       75.21       74.61
(%)                        75.25
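The percentages in Table 5 can be rechecked with a few lines of Python. The sketch below evaluates, for every column, the same expression the text gives for the 2500-document case, (709,462 + 72,768) / (846,826 + 122,252 + 60,661). The row labels of the table did not survive, so the arguments are named only by row position. Treating the final 75.25 as the mean of the four percentages is an assumption.

```python
# Rows 1, 2, 5, 7 and 8 of Table 5 (bytes), keyed by number of XML documents.
TABLE_5 = {
    2500: (846_826, 709_462, 72_768, 122_252, 60_661),
    5000: (1_683_026, 1_410_092, 140_466, 255_522, 122_927),
    7500: (2_524_412, 2_114_996, 210_612, 383_270, 184_305),
    10000: (3_371_170, 2_824_351, 289_971, 537_114, 265_827),
}


def ratio_percent(row1, row2, row5, row7, row8):
    """Evaluate (row2 + row5) / (row1 + row7 + row8) as a percentage."""
    return 100 * (row2 + row5) / (row1 + row7 + row8)


# Percentage per column, rounded to two decimals as in the table.
ratios = {docs: round(ratio_percent(*rows), 2) for docs, rows in TABLE_5.items()}

# Mean of the unrounded percentages (assumed meaning of the last table value).
average = round(
    sum(ratio_percent(*rows) for rows in TABLE_5.values()) / len(TABLE_5), 2
)
```

The sketch reproduces 75.96, 75.22, 75.21 and 74.61, with a mean of 75.25. For the 10000-document column the computed numerator is 3,114,322, whereas the table prints 3,114,332; the percentage rounds to 74.61 either way.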

RAR
9 R1. XML 6 1 1 ZIP

ZIP
30% RAR DTD A B 2 T20. I10. N100. ZIP

RAR
80 75 70 65 60 ( % 55 50 45 40

9 10 7

)

2500

5000 XML

7500

10000

DTD 30% DTD

RAR

ZIP T20. I10. N100. R1.

XML

RAR


ZIP 1 1 ZIP RAR 9 ZIP B 10

A

2 RAR

Figure 10: DTD, RAR and ZIP (vertical axis: 40 to 75; horizontal axis: XML documents, 2500 to 10000)

11 R1. XML DTD k(k 2)

30% 1

T20. I10. N100. 7

10000 74.61%

210

5


Figure 11 (vertical axis: 0 to 240; horizontal axis: XML documents, 1000 to 10000)

12 30% k(k 7 1000 LPHM
k ( b y t e s ) 1000 2500

T20. I10. N100. R1. 2)

DTD (LPHM) k(k 2500 5000 7500

2) 10000 XML

5000 XML

7500

10000

12


XML

(Lossless Compression) XML

< “ ” 75%

>

“ XML RAR

”

ZIP

(2001) 2001 —

R. Agrawal, T. Imielinski, and A. Swami (1993), “Mining Association Rules between Sets of Items in Large Databases,” Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, D.C., 207-216.

R. Agrawal and R. Srikant (1994), “Fast Algorithms for Mining Association Rules,” Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94), 487-499.

E. Bertino and B. Catania (2001), “Integrating XML and Databases,” IEEE Internet Computing, 5(4), 84-88.

A. Cannane and H. E. Williams (2000), “A Compression Scheme for Large Databases,” Proceedings of the Australian Database Conference (ADC'2000), 22(2), 6-11.

S. W. Changchien and T. C. Lu (2001), “A New Efficient Association Rules Mining Method Using Class Inheritance Tree (CIT),” Proceedings of the 12th International Conference on Information Management (ICIM 2001), Taiwan.

D. Florescu and D. Kossmann (1999), “Storing and Querying XML Data Using an RDBMS,” IEEE Data Engineering Bulletin, 22(3), 27-34.

J. Fong, H. K. Wong, and Z. Cheng (2003), “Converting Relational Database into XML Documents with DOM,” Information and Software Technology, 45, 335-355.

C. L. Goh, K. Aisaka, M. Tsukamoto, K. Harumoto, and S. Nishio (1998), “Database Compression with Data Mining Methods,” Proceedings of the 5th International Conference on Foundations of Data Organization (FODO'98), 97-106.

J. Han and M. Kamber (2001), Data Mining: Concepts and Techniques, Morgan Kaufmann.

J. W. Lee, K. Lee, and W. Kim (2001), “Preparations for Semantics-Based XML Mining,” Proceedings of the IEEE International Conference on Data Mining, 345-352.

C. F. Lee, S. W. Changchien, W. T. Wang, and J. J. Shen (2006), “A Data Mining Approach to Database Compression,” Information Systems Frontiers, 8(3), 147-161.

M. Strobel (2002), “An XML Schema Representation for the Communication Design of Electronic Negotiations,” Computer Networks, 39, 661-680.

IBM Almaden Research Center (2005), Quest Synthetic Data Generation, http://www.almaden.ibm.com/software/quest/Resources/dataset/syndata.html.

World Wide Web Consortium (2005), “Extensible Markup Language (XML) Version 1.1,” http://www.w3.org/TR/2004/REC-xml11-20040204/.

lcf@cyut.edu.t
s9214612@cyut.edu.tw