Data Mining: How Hard is it to find the Nuggets?
© Edward J. Wegman

Complexity
The Huber Taxonomy of Data Set Sizes

Descriptor   Data Set Size in Bytes   Storage Mode
Tiny         10^2                     Piece of Paper
Small        10^4                     A Few Pieces of Paper
Medium       10^6                     A Floppy Disk
Large        10^8                     Hard Disk
Huge         10^10                    Multiple Hard Disks, e.g. RAID Storage
Massive      10^12                    Robotic Magnetic Tape
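The taxonomy can be read as a lookup on the order of magnitude of the data set. The sketch below is one possible reading, not part of the slides: the function name huber_category and the rule "pick the descriptor whose nominal size is nearest on a log10 scale" are assumptions, since the slide lists only one representative size per descriptor and no boundaries.

```python
import math

# Nominal sizes in bytes, taken from the Huber taxonomy table.
HUBER_SIZES = {
    "tiny": 1e2,
    "small": 1e4,
    "medium": 1e6,
    "large": 1e8,
    "huge": 1e10,
    "massive": 1e12,
}


def huber_category(n_bytes: float) -> str:
    """Return the descriptor whose nominal size is nearest on a log10 scale.

    The nearest-power-of-ten rule is an assumption made for illustration;
    the slide does not define boundaries between categories.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    log_n = math.log10(n_bytes)
    return min(
        HUBER_SIZES,
        key=lambda name: abs(math.log10(HUBER_SIZES[name]) - log_n),
    )


if __name__ == "__main__":
    # A 1.44 MB floppy image falls nearest to the "medium" size.
    print(huber_category(1_440_000))
```

Under this rule a 1.44 MB floppy image is labeled medium, which agrees with the storage mode the table gives for that row.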

Complexity
O(n^(1/2))    Plot a scatter plot
O(n)          Calculate means, variances
O(n log(n))   Calculate fast Fourier transforms
O(nc)         Solve multiple linear regression
O(n^2)        Solve most clustering algorithms

Complexity
Number of Operations for Algorithms of Various Computational Complexities and Various Data Set Sizes

n        n^(1/2)   n       n log(n)   n^(3/2)   n^2
tiny     10        10^2    2x10^2     10^3      10^4
small    10^2      10^4    4x10^4     10^6      10^8
medium   10^3      10^6    6x10^6     10^9      10^12
large    10^4      10^8    8x10^8     10^12     10^16
huge     10^5      10^10   10^11      10^15     10^20
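The operation counts in this table follow directly from the data set sizes of the Huber taxonomy. A minimal sketch that recomputes them is shown below; the names SIZES, COMPLEXITIES, and operation_counts are illustrative, and the base-10 logarithm is inferred from the table entries (for example, 2x10^2 for n = 10^2) rather than stated on the slide.

```python
import math

# Data set sizes n for each descriptor, from the Huber taxonomy.
SIZES = {
    "tiny": 10**2,
    "small": 10**4,
    "medium": 10**6,
    "large": 10**8,
    "huge": 10**10,
}

# Operation count as a function of n for each complexity class.
# log is taken base 10, which is what the table entries imply.
COMPLEXITIES = {
    "n^(1/2)": lambda n: math.sqrt(n),
    "n": lambda n: float(n),
    "n log(n)": lambda n: n * math.log10(n),
    "n^(3/2)": lambda n: float(n) ** 1.5,
    "n^2": lambda n: float(n) ** 2,
}


def operation_counts():
    """Return {size: {complexity: operations}} for every table cell."""
    return {
        size: {name: f(n) for name, f in COMPLEXITIES.items()}
        for size, n in SIZES.items()
    }


if __name__ == "__main__":
    print("n".ljust(8) + "".join(name.rjust(12) for name in COMPLEXITIES))
    for size, row in operation_counts().items():
        print(size.ljust(8) + "".join(f"{v:12.1e}" for v in row.values()))
```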

Complexity
Computational Feasibility on a Pentium PC (10 megaflop performance assumed)

n        n^(1/2)         n               n log(n)          n^(3/2)         n^2
tiny     10^-6 seconds   10^-5 seconds   2x10^-5 seconds   .0001 seconds   .001 seconds
small    10^-5 seconds   .001 seconds    .004 seconds      .1 seconds      10 seconds
medium   .0001 seconds   .1 seconds      .6 seconds        1.67 minutes    1.16 days
large    .001 seconds    10 seconds      1.3 minutes       1.16 days       31.7 years
huge     .01 seconds     16.7 minutes    2.78 hours        3.17 years      317,000 years
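Each entry in a feasibility table is an operation count divided by the assumed floating-point rate of the machine. The sketch below shows that arithmetic for the 10 megaflop Pentium case. The helper names seconds_needed and humanize are illustrative, and the 365-day year is an assumption chosen because it matches the rounded figures on the slide.

```python
PENTIUM_FLOPS = 10e6  # 10 megaflops, as assumed on the slide

# Unit sizes in seconds, largest first. The 365-day year is an assumption.
UNITS = [
    ("years", 365 * 24 * 3600),
    ("days", 24 * 3600),
    ("hours", 3600),
    ("minutes", 60),
]


def seconds_needed(operations: float, flops: float) -> float:
    """Run time in seconds for a given operation count and flop rate."""
    return operations / flops


def humanize(seconds: float):
    """Return (value, unit) using the largest unit that gives a value >= 1."""
    for name, size in UNITS:
        if seconds >= size:
            return seconds / size, name
    return seconds, "seconds"


if __name__ == "__main__":
    # O(n^2) on a huge data set: (10^10)^2 = 10^20 operations.
    value, unit = humanize(seconds_needed(1e20, PENTIUM_FLOPS))
    print(f"{value:,.0f} {unit}")
```

Substituting 300e6, 4.2e9, or 1000e9 for the flop rate gives the corresponding figures for the other three machines, up to the rounding used on the slides.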

Complexity
Computational Feasibility on a Silicon Graphics Onyx Workstation (300 megaflop performance assumed)

n        n^(1/2)             n                   n log(n)            n^(3/2)             n^2
tiny     3.3x10^-8 seconds   3.3x10^-7 seconds   6.7x10^-7 seconds   3.3x10^-6 seconds   3.3x10^-5 seconds
small    3.3x10^-7 seconds   3.3x10^-5 seconds   1.3x10^-4 seconds   3.3x10^-3 seconds   .33 seconds
medium   3.3x10^-6 seconds   3.3x10^-3 seconds   .02 seconds         3.3 seconds         55 minutes
large    3.3x10^-5 seconds   .33 seconds         2.7 seconds         55 minutes          1.04 years
huge     3.3x10^-4 seconds   33 seconds          5.5 minutes         38.2 days           10,464 years

Complexity
Computational Feasibility on an Intel Paragon XP/S A4 (4.2 gigaflop performance assumed)

n        n^(1/2)             n                   n log(n)            n^(3/2)             n^2
tiny     2.4x10^-9 seconds   2.4x10^-8 seconds   4.8x10^-8 seconds   2.4x10^-7 seconds   2.4x10^-6 seconds
small    2.4x10^-8 seconds   2.4x10^-6 seconds   9.5x10^-6 seconds   2.4x10^-4 seconds   .024 seconds
medium   2.4x10^-7 seconds   2.4x10^-4 seconds   .0014 seconds       .24 seconds         4.0 minutes
large    2.4x10^-6 seconds   .024 seconds        .19 seconds         4.0 minutes         27.8 days
huge     2.4x10^-5 seconds   2.4 seconds         24 seconds          66.7 hours          761 years

Complexity
Computational Feasibility on a Teraflop Grand Challenge Computer (1000 gigaflop performance assumed)

n        n^(1/2)          n                n log(n)           n^(3/2)         n^2
tiny     10^-11 seconds   10^-10 seconds   2x10^-10 seconds   10^-9 seconds   10^-8 seconds
small    10^-10 seconds   10^-8 seconds    4x10^-8 seconds    10^-6 seconds   10^-4 seconds
medium   10^-9 seconds    10^-6 seconds    6x10^-6 seconds    .001 seconds    1 second
large    10^-8 seconds    10^-4 seconds    8x10^-4 seconds    1 second        2.8 hours
huge     10^-7 seconds    .01 seconds      .1 seconds         16.7 minutes    3.2 years

Complexity
Types of Computers for Interactive Feasibility (Response Time < 1 second)

n        n^(1/2)             n                   n log(n)            n^(3/2)             n^2
tiny     Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
small    Personal Computer   Personal Computer   Personal Computer   Personal Computer   Super Computer
medium   Personal Computer   Personal Computer   Personal Computer   Super Computer      Teraflop Computer
large    Personal Computer   Workstation         Super Computer      Teraflop Computer   ---
huge     Personal Computer   Super Computer      Teraflop Computer   ---                 ---

Complexity
Types of Computers for Feasibility (Response Time < 1 week)

n        n^(1/2)             n                   n log(n)            n^(3/2)             n^2
tiny     Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
small    Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
medium   Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
large    Personal Computer   Personal Computer   Personal Computer   Personal Computer   Teraflop Computer
huge     Personal Computer   Personal Computer   Personal Computer   Super Computer      ---
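The one-week table can be approximated by a simple rule: walk through the four machines from slowest to fastest and take the first whose run time is under the limit. The sketch below assumes that rule and the flop rates quoted on the earlier slides; the function name smallest_feasible_machine is illustrative, and the slides do not state how the tables were actually produced.

```python
# (label, assumed flops), slowest first; rates are those quoted on the slides.
MACHINES = [
    ("Personal Computer", 10e6),
    ("Workstation", 300e6),
    ("Super Computer", 4.2e9),
    ("Teraflop Computer", 1000e9),
]

ONE_WEEK = 7 * 24 * 3600  # seconds


def smallest_feasible_machine(operations: float, limit_seconds: float):
    """Return the slowest machine finishing under the limit, or None."""
    for label, flops in MACHINES:
        if operations / flops < limit_seconds:
            return label
    return None  # corresponds to "---" in the table


if __name__ == "__main__":
    # O(n^(3/2)) on a huge data set: (10^10)^(3/2) = 10^15 operations.
    print(smallest_feasible_machine(1e15, ONE_WEEK))
    # O(n^2) on a huge data set: 10^20 operations, infeasible on all four.
    print(smallest_feasible_machine(1e20, ONE_WEEK))
```

With a strict one-second limit the same rule does not match every cell of the interactive table (for the small n^2 case it selects the workstation, where the slide lists a super computer), so it is best treated as an approximation of the slide's reasoning.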

Massive Data Sets
Major Issues
– Complexity
– Non-homogeneity

Examples
– Air Traffic Control
– Highway Maintenance

Massive Data Sets
Air Traffic Control
– 6 to 12 radar stations, several hundred aircraft, 64-byte record per radar per aircraft per antenna turn
– A megabyte of data per minute
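The megabyte-per-minute figure can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the slide gives the record size and the range of radar stations, but the aircraft count of 300 and the rate of 5 antenna turns per minute are assumptions chosen for the example, not values from the slide.

```python
def bytes_per_minute(radars: int, aircraft: int,
                     record_bytes: int, turns_per_minute: float) -> float:
    """Data volume per minute: one record per radar per aircraft per turn."""
    return radars * aircraft * record_bytes * turns_per_minute


if __name__ == "__main__":
    rate = bytes_per_minute(
        radars=12,           # upper end of the 6 to 12 range on the slide
        aircraft=300,        # assumption for "several hundred"
        record_bytes=64,     # from the slide
        turns_per_minute=5,  # assumption; not given on the slide
    )
    print(f"{rate / 1e6:.2f} MB per minute")
```

Under these assumptions the volume comes to roughly one megabyte per minute, the same order of magnitude as the slide's figure.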

Massive Data Sets
Highway Maintenance
– Maintenance records and measurements of road quality for several decades
– Records of uneven quality
– Records missing


				