# 2-Datamining_Edward by honeytech

```
Data Mining: How Hard Is It to Find the Nuggets?

The Huber Taxonomy of Data Set Sizes

Descriptor   Data Set Size in Bytes   Storage Mode
Tiny         10^2                     Piece of Paper
Small        10^4                     A Few Pieces of Paper
Medium       10^6                     A Floppy Disk
Large        10^8                     Hard Disk
Huge         10^10                    Multiple Hard Disks, e.g. RAID Storage
Massive      10^12                    Robotic Magnetic Tape

Complexity

O(n^(1/2))    Plot a scatter plot
O(n)          Calculate means, variances
O(n log(n))   Calculate fast Fourier transforms
O(nc)         Solve multiple linear regression
O(n^2)        Solve most clustering algorithms

Number of Operations for Algorithms of Various Computational Complexities and Various Data Set Sizes

Complexity   tiny     small    medium   large    huge
n^(1/2)      10       10^2     10^3     10^4     10^5
n            10^2     10^4     10^6     10^8     10^10
n log(n)     2x10^2   4x10^4   6x10^6   8x10^8   10^11
n^(3/2)      10^3     10^6     10^9     10^12    10^15
n^2          10^4     10^8     10^12    10^16    10^20

Computational Feasibility on a Pentium PC (10 megaflop performance assumed)

Complexity   tiny              small            medium           large           huge
n^(1/2)      10^-6 seconds     10^-5 seconds    0.0001 seconds   0.001 seconds   0.01 seconds
n            10^-5 seconds     0.001 seconds    0.1 seconds      10 seconds      16.7 minutes
n log(n)     2x10^-5 seconds   0.004 seconds    0.6 seconds      1.3 minutes     2.78 hours
n^(3/2)      0.0001 seconds    0.1 seconds      1.67 minutes     1.16 days       3.17 years
n^2          0.001 seconds     10 seconds       1.16 days        31.7 years      317,000 years

Computational Feasibility on a Silicon Graphics Onyx Workstation (300 megaflop performance assumed)

Complexity   tiny                small               medium              large               huge
n^(1/2)      3.3x10^-8 seconds   3.3x10^-7 seconds   3.3x10^-6 seconds   3.3x10^-5 seconds   3.3x10^-4 seconds
n            3.3x10^-7 seconds   3.3x10^-5 seconds   3.3x10^-3 seconds   0.33 seconds        33 seconds
n log(n)     6.7x10^-7 seconds   1.3x10^-4 seconds   0.02 seconds        2.7 seconds         5.5 minutes
n^(3/2)      3.3x10^-6 seconds   3.3x10^-3 seconds   3.3 seconds         55 minutes          38.2 days
n^2          3.3x10^-5 seconds   0.33 seconds        55 minutes          1.04 years          10,464 years

Computational Feasibility on an Intel Paragon XP/S A4 (4.2 gigaflop performance assumed)

Complexity   tiny                small               medium              large               huge
n^(1/2)      2.4x10^-9 seconds   2.4x10^-8 seconds   2.4x10^-7 seconds   2.4x10^-6 seconds   2.4x10^-5 seconds
n            2.4x10^-8 seconds   2.4x10^-6 seconds   2.4x10^-4 seconds   0.024 seconds       2.4 seconds
n log(n)     4.8x10^-8 seconds   9.5x10^-6 seconds   0.0014 seconds      0.19 seconds        24 seconds
n^(3/2)      2.4x10^-7 seconds   2.4x10^-4 seconds   0.24 seconds        4.0 minutes         66.7 hours
n^2          2.4x10^-6 seconds   0.024 seconds       4.0 minutes         27.8 days           761 years

Computational Feasibility on a Teraflop Grand Challenge Computer (1000 gigaflop performance assumed)

Complexity   tiny                small             medium            large             huge
n^(1/2)      10^-11 seconds      10^-10 seconds    10^-9 seconds     10^-8 seconds     10^-7 seconds
n            10^-10 seconds      10^-8 seconds     10^-6 seconds     10^-4 seconds     0.01 seconds
n log(n)     2x10^-10 seconds    4x10^-8 seconds   6x10^-6 seconds   8x10^-4 seconds   0.1 seconds
n^(3/2)      10^-9 seconds       10^-6 seconds     0.001 seconds     1 second          16.7 minutes
n^2          10^-8 seconds       10^-4 seconds     1 second          2.8 hours         3.2 years

Types of Computers for Interactive Feasibility (Response Time < 1 second)

Complexity   tiny                small               medium              large               huge
n^(1/2)      Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
n            Personal Computer   Personal Computer   Personal Computer   Workstation         Super Computer
n log(n)     Personal Computer   Personal Computer   Personal Computer   Super Computer      Teraflop Computer
n^(3/2)      Personal Computer   Personal Computer   Super Computer      Teraflop Computer   ---
n^2          Personal Computer   Super Computer      Teraflop Computer   ---                 ---

Types of Computers for Feasibility (Response Time < 1 week)

Complexity   tiny                small               medium              large               huge
n^(1/2)      Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
n            Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
n log(n)     Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
n^(3/2)      Personal Computer   Personal Computer   Personal Computer   Personal Computer   Super Computer
n^2          Personal Computer   Personal Computer   Personal Computer   Teraflop Computer   ---

Massive Data Sets

Major Issues
  – Complexity
  – Non-homogeneity
Examples
  – Air Traffic Control
  – Highway Maintenance

Massive Data Sets: Air Traffic Control
  – 6 to 12 radar stations, several hundred aircraft, 64-byte record per radar per aircraft per antenna turn
  – A megabyte of data per minute

Massive Data Sets: Highway Maintenance
  – Maintenance records and measurements of road quality for several decades
  – Records of uneven quality
  – Records missing

```
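
The feasibility tables in the slides follow from one division: operation count over machine speed. The sketch below recomputes them under a few assumptions. The `n log(n)` row appears to use a base-10 logarithm, since only that matches the slide's 2x10^2 for n = 10^2. The flop rates are taken as the slides state them. The names `SIZES`, `COMPLEXITIES`, `MACHINES`, `run_time_seconds`, and `slowest_feasible` are illustrative and not from the original deck.

```python
import math

# Data set sizes in bytes, from the Huber taxonomy (tiny through huge).
SIZES = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

# Operation counts per complexity class.
# Assumption: log is base 10, which matches the slide's n log(n) row.
COMPLEXITIES = {
    "n^(1/2)": lambda n: math.sqrt(n),
    "n": lambda n: n,
    "n log(n)": lambda n: n * math.log10(n),
    "n^(3/2)": lambda n: n ** 1.5,
    "n^2": lambda n: n ** 2,
}

# Assumed rates in floating-point operations per second, slowest first.
MACHINES = {
    "Pentium PC": 1e7,        # 10 megaflops
    "SGI Onyx": 3e8,          # 300 megaflops
    "Intel Paragon": 4.2e9,   # 4.2 gigaflops
    "Teraflop": 1e12,         # 1000 gigaflops
}


def run_time_seconds(complexity, size, machine):
    """Estimated run time: operation count divided by machine rate."""
    operations = COMPLEXITIES[complexity](SIZES[size])
    return operations / MACHINES[machine]


def slowest_feasible(complexity, size, limit_seconds):
    """Return the slowest machine finishing under the limit, or None."""
    for machine in MACHINES:  # dicts keep insertion order (slowest first)
        if run_time_seconds(complexity, size, machine) < limit_seconds:
            return machine
    return None


if __name__ == "__main__":
    one_week = 7 * 24 * 3600
    for complexity in COMPLEXITIES:
        row = [slowest_feasible(complexity, size, one_week) for size in SIZES]
        print(f"{complexity:10s}", row)
```

The slides seem to apply the one-second limit loosely in a few borderline cells, so a strict `<` comparison like this one may not reproduce every entry of the interactive table.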
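
The air traffic control figure of a megabyte per minute can be sanity-checked with simple arithmetic. The sketch below does so under two assumptions that are not in the slides: "several hundred aircraft" is taken as 300, and each antenna completes 5 turns per minute. It also assumes every radar station records every aircraft on every turn. The function name `bytes_per_minute` is hypothetical.

```python
RECORD_BYTES = 64  # record size per radar per aircraft per antenna turn (from the slide)


def bytes_per_minute(radars, aircraft, turns_per_minute):
    """Data volume if every radar records every aircraft on every turn."""
    return radars * aircraft * RECORD_BYTES * turns_per_minute


# Assumptions (not from the slides): 300 aircraft, 5 antenna turns per minute.
low = bytes_per_minute(radars=6, aircraft=300, turns_per_minute=5)
high = bytes_per_minute(radars=12, aircraft=300, turns_per_minute=5)
print(low, high)  # 576000 1152000
```

Under these assumed values the volume falls between roughly 0.6 and 1.2 megabytes per minute, the same order of magnitude as the slide's figure.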