Visions: The Evolution of Statistics
Edward J. Wegman Center for Computational Statistics
“Prediction is very hard, especially when it’s about the future”
Yogi Berra
(Italian-American Philosopher and Baseball Player)
Three Scientific Revolutions of the Twentieth Century
• Quantum Revolution - Unlocking the Secrets of the Atom
• DNA Revolution - Unraveling the Secrets of Life
• Computing Revolution - Extending Human Intellect
• Michio Kaku, CUNY Physicist and Futurist in his book VISIONS
The Fourth Scientific Revolution of the Twentieth Century
• Statistical Theory and Data Analysis
– A major impact on the ability of the Earth’s societies to care for the billions of people and trillions of dollars that currently inhabit the planet
– Modern societies could not exist without this 4th scientific revolution
An Unpopular and Perhaps Radical View
• Mathematical statistics as we know it is essentially a completed theory
– The general principles are well-developed and contemporary statistical theory of traditional parametric and nonparametric methods is largely a solved problem
– That is not to say there are no valuable niches yet to be developed nor that the application of statistical methodology will not continue for the foreseeable future
– Mathematical statistics occupies much the same position as Newtonian mechanics. It is a tremendously valuable tool that will continue to be applied in many practical settings and in the framework of which new specialized methods will continue to be developed
Whither Statistics?
• What is the future of statistics and what are the new tools and techniques for statistics?
– Statistics is essentially a computational science: it is about numbers, counts, and measurements, and about exploring them in order to make inferences
– The computing revolution is leading, in my view, to a statistical revolution: perhaps not in statistics as we conventionally understand it, but in data analysis and inference
– There is a reasonable prospect that data analysis and inference will be subsumed under the larger mantle of computer science and that what we understand today as statistics will become a subset of a larger data analysis and inference enterprise
How Data Are Changing
TRADITIONAL STATISTICS | COMPUTATIONAL STATISTICS
Small to Moderate Sample Size | Large to Very Large Sample Size
I.I.D. Data Sets | Nonhomogeneous Data Sets
One or Low Dimensional | High Dimensional
Manually Computable | Computationally Intensive
Mathematically Tractable | Numerically Tractable
Well Focused Questions | Imprecise Questions
Strong Unverifiable Assumptions in Relationships (linearity; additivity), in Error Structures (normality) | Weak or No Assumptions in Relationships (nonlinearity), in Error Structures (distribution free)
Statistical Inference | Structural Inference
Predominantly Closed Form Algorithms | Iterative Algorithms Possible
Statistical Optimality | Statistical Robustness
Table 1. Comparison of Traditional and Computational Statistics
How Data Are Changing
DESCRIPTOR | DATA SET SIZE IN BYTES | STORAGE MODE
Tiny | 10^2 | Piece of Paper
Small | 10^4 | A Few Pieces of Paper
Medium | 10^6 | A Floppy Disk
Large | 10^8 | Hard Disk
Huge | 10^10 | Multiple Hard Disks, e.g. RAID Storage
Massive | 10^12 | Robotic Magnetic Tape Storage Silos
Table 2. The Huber Taxonomy of Data Set Sizes
How Data Are Changing
• Consider, for example, an O(n^2) clustering algorithm applied to a massive data set (n = 10^12). This would require O(10^24) computations, which on a teraflop computer (10^12 computations per second) would take 10^12 seconds, or roughly 3 × 10^4 years. Clearly this is prohibitive.
• Standard ethernet operates at a maximum of 10 megabits per second. That same massive data set would take on the order of 10^6 seconds, or roughly one to two weeks, to transfer over standard ethernet operating at maximal efficiency.
• The human eye contains approximately 10^7 cones. Even with a visualization capability of one observation per cone, our eyes would be hopelessly overloaded: a massive data set would require us to visualize 10^5 observations per cone.
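The orders of magnitude in the example above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: it takes the slide's round figures as inputs and assumes that one byte corresponds to one observation, that a year has 365.25 days, and that the ethernet link runs at exactly 10 megabits per second with no protocol overhead. The function names are hypothetical.

```python
# Back-of-envelope check of the scaling example.
# Assumptions (not from the slides): 1 byte = 1 observation,
# 365.25-day year, ethernet at exactly 10 Mbit/s with no overhead.

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

N_MASSIVE = 10**12          # "massive" data set size from the Huber taxonomy
TERAFLOP = 10**12           # computations per second
ETHERNET_BITS_PER_S = 10 * 10**6
CONES = 10**7               # approximate cone count used on the slide


def quadratic_runtime_seconds(n, ops_per_second):
    """Seconds needed to perform n**2 operations at the given rate."""
    return n**2 / ops_per_second


def transfer_seconds(n_bytes, bits_per_second):
    """Seconds needed to move n_bytes over a link of the given speed."""
    return n_bytes * 8 / bits_per_second


def observations_per_cone(n, cones):
    """Observations each cone would have to represent."""
    return n // cones


runtime_s = quadratic_runtime_seconds(N_MASSIVE, TERAFLOP)
transfer_s = transfer_seconds(N_MASSIVE, ETHERNET_BITS_PER_S)

print(f"O(n^2) runtime: {runtime_s:.1e} s = {runtime_s / SECONDS_PER_YEAR:,.0f} years")
print(f"Ethernet transfer: {transfer_s:.1e} s = {transfer_s / SECONDS_PER_DAY:.1f} days")
print(f"Observations per cone: {observations_per_cone(N_MASSIVE, CONES):.0e}")
```

Under these assumptions the quadratic algorithm needs 10^12 seconds, the transfer takes on the order of 10^6 seconds, and each cone would have to represent 10^5 observations.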
How Data Are Changing
• If gigabyte and larger data sets and O(n^(3/2)) complexity algorithms are problems, then to what extent are these factors appearing in real data? The answer is that they appear fairly commonly.
• Airline booking transactions, point-of-sale commercial purchases, bank transactions, and telephone call records are just a few such commercial databases that one might wish to exploit.
• In the scientific realm, catalogs of celestial objects, data from satellite remote sensing of the Earth, text and multimedia data exploitation from internet usage, ultrasound nondestructive evaluation data, radar data used in air traffic control, and image understanding and exploitation are just a few examples for which data quickly accumulate into the terabyte and higher range.
How Statistics Is Evolving
Statistics: The Guardian of the Scientific Method
or
Statistics: The Tool for Analyzing Data
How Statistics Is Evolving
• The focus of our discipline, I believe, should be on data and the inferences to be made from data.
• If the nature of data is changing, then the methods for analyzing and making inferences from those data must correspondingly change.
• Many traditional methods are extremely valuable and will continue to be employed for the foreseeable future.
• However, new data types require new methods and techniques.
How Statistics Is Evolving
• So how will statistics as a methodology and a discipline evolve? I believe many of the traditional dichotomies will become anachronisms.
• The Bayesian versus classical perspective will essentially disappear. Both of these approaches tend to refer to parametric techniques; these are poor at coping with really large scale data.
• Nonparametric versus parametric techniques still refer to model-based views of data. For many purposes models are unnecessary if the data speak in a sufficiently compelling fashion. If the data are not collected according to probabilistic sampling, then both parametric and nonparametric statistical models are essentially irrelevant except as a heuristic tool.
How Statistics Is Evolving
I hope statistics as a discipline will embrace a larger view of the field and will take data, rather than methodology, to be the fundamental common denominator of the discipline. With this view, the focus of the discipline is not only traditional statistics and probability; topics like data mining, scientific visualization, image analysis, pattern recognition, databases, and related computational methods also become fundamental features of the discipline.
Data Mining: An Issue for Statisticians
– “Despite … somewhat lofty definitions, DM so far has been largely a commercial enterprise. As in most gold rushes of the past, the goal is to ‘mine the miners’. The largest profits are made by selling tools to the miners, rather than doing the actual mining. The concept of DM is used as a device to sell computer hardware and software.”
– Jerry Friedman, 1998
Data Mining: An Issue for Statisticians
Data mining is exploratory data analysis with little or no human interaction using computationally feasible techniques, i.e., the attempt to find interesting structure unknown a priori.
- Wegman, 1997
Data Mining: An Issue for Statisticians
• Traditional Statistical Tools
– classification and clustering
– neural networks and genetic algorithms
– CART
– nonparametric regression
– time series: trend and spectral estimation
– density estimation, including the estimation of bumps and ridges
Data Mining: An Issue for Statisticians
• Other Tools
– machine learning
– pattern recognition
– thinning and binning
– data visualization
  • scintillation
  • saturation brushing
  • grand tour
Computing, Networks, Distributed Data, Data Access
• Metadata Center (MdC)
– Automated Creation of Metadata
– Query and Search
  • Client Browser
  • Expert System for Query Refinement
  • Search Engine
  • Reporting Mechanism
The Evolution of Statistics
This talk was the keynote talk at the Conference entitled New Techniques and Technologies in Statistics 98 sponsored by Eurostat, the IASC and the ISI. It was videotaped and a streaming video version is available locally at ftp://www.galaxy.gmu.edu/pub/papers/keynote.asx (keynote.asx is a redirector file and points to the streaming video server in Italy. The Windows Media Player plugin is required.)
The Evolution of Statistics
The paper entitled “Visions: The Evolution of Statistics” will be published in the journal Research in Official Statistics. It is available at
ftp://www.galaxy.gmu.edu/pub/papers/visionstheevolutionofstatistics.pdf.
This powerpoint file is available at
ftp://www.galaxy.gmu.edu/pub/papers/EvolutionStatistics.ppt