Exploratory Data Analysis
GG 313 Fall 2003 8/25/05
Save October 24-27 (Monday-Thurs)
For a STUDENT CRUISE on the
R/V Kilo Moana
• We did an example Tuesday with the
tide data - let’s look at another:
These data were taken at Scripps Pier (La Jolla,
CA) on Dec. 26, 2004. The data are
taken every second and the units are
We’ll do a very simple MATLAB plot. There
are too many data points for Excel to
handle. The data are on my computer in the file
df07301.txt, as a single column of numbers.
The MATLAB commands are:
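A minimal sketch of such commands, assuming df07301.txt sits in the current folder. The variable names (data, t) are mine, not necessarily the ones used in class, and the synthetic fallback is there only so the sketch runs without the file:

```matlab
% Sketch only: read one column of numbers and plot it against time.
fname = 'df07301.txt';               % file name from the notes
if exist(fname, 'file')
    data = load(fname);              % plain text, one value per line
else
    % Synthetic stand-in (NOT the real data) so the sketch still runs.
    data = sin(2*pi*(0:9999)'/44700);
end
data = data(:);                      % force a column vector
t = (0:numel(data)-1)';              % time in seconds, 1 sample per second
plot(t, data)
xlabel('Time (s)')
ylabel('Water level (units as recorded)')
```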
These data show the tidal components as the long-
period oscillation, the normal ocean waves as the
thickening of the blue line, and the signal from the
Sumatra earthquake tsunami as the larger thickening.
What does this plot tell us about our data?
It’s clean - no wild points
If we’re after the tsunami signal, we’ve got it
If we want to see it better, we need to do some
There are 86400 sec in a day, so this plot is
500000/86400 ≈ 5.79 days long.
Just for fun, let’s try one more technique before we leave
this data set.
The tidal signal is noise for us, so let’s subtract it from the
data. To do this we first apply a FILTER to the data to
isolate the tidal signal:
windowsize=3600 ; % that’s 1 hour
This does a pretty good job of SMOOTHING the data,
isolating the tidal signal from the waves - both tsunami
and wind waves.
Now we subtract the filtered data from the original data:
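Both steps, the smoothing and the subtraction, might look like the sketch below. It assumes the filter is a plain running mean over windowsize points, which may not be exactly the filter used in class, and it runs on a synthetic record (a slow "tide" plus fast "waves") rather than the pier data. All variable names other than windowsize are mine.

```matlab
% Synthetic stand-in for the pier record (NOT the real data).
t     = (0:2*86400-1)';                  % two days, 1 sample per second
tide  = sin(2*pi*t/44700);               % long-period "tide" (about 12.4 h)
waves = 0.1*sin(2*pi*t/10);              % short-period "waves" (10 s)
data  = tide + waves;

windowsize = 3600;                       % that's 1 hour
b = ones(windowsize, 1)/windowsize;      % equal weights: a running mean
tide_est = conv(data, b, 'same');        % centered window, (almost) no time shift
resid    = data - tide_est;              % original minus filtered: tide removed

% Skip the first and last hour, where the window runs off the ends.
inner = (windowsize:numel(data)-windowsize)';
fprintf('largest tide error: %.3f\n', max(abs(tide_est(inner) - tide(inner))));
```

A running mean built with filter() instead of conv(...,'same') would delay the smoothed curve by half a window, so the subtraction would leave part of the tide behind.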
What do BAD data look like?
Be extremely careful before discounting the validity of
data. Some of the most important theories have come
from data that looked wrong.
Early El Niño data were rejected by a computer
program because they were so far from normal!
Recognition of bad data takes experience.
Determination of the origin of these data is particularly
important to be sure that rejection is justified.
Here are some examples:
[Figure: depth record with two labeled levels: "Expected deepest depth" and "Anomalously deep data"]
The anomalous data don’t look bad, but they are too
deep by 750 m. This is a key number: sound
travels at about 1500 m/s in water, and it has to go down
to the ocean floor and back, so each second of travel
time corresponds to 750 m of depth. That is, sound returning
to the ship one second after it was generated implies a water
depth of 750 m.
We often generate sound pulses once per second, so
it’s easy to make a 750 m mistake.
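The arithmetic behind the 750 m number, as a short sketch. The variable names are mine, and the "wrong ping" case assumes the echo is credited to a pulse sent one second away from the one that actually produced it:

```matlab
v     = 1500;             % speed of sound in water, m/s (value from the notes)
twt   = 1;                % two-way travel time, s
depth = v*twt/2;          % sound goes down and back, so divide by 2

% Echo credited to the wrong ping, one ping interval (1 s) off:
depth_wrong = v*(twt + 1)/2;
offset = depth_wrong - depth;
fprintf('depth = %g m, offset = %g m\n', depth, offset);
% prints: depth = 750 m, offset = 750 m
```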
A good example of how anomalous data can lead to
discovery was presented by Lord Rayleigh in 1894. In an
investigation of the density of nitrogen, he collected the data
These data don’t look much different from each other, but
take a closer look:
The scatter in these data certainly does not look like what
would be expected from sampling of a single population
- and the distribution of the data does not look like it
could be caused by measurement error. In fact, the
higher weight data come from air, and the lower weights
come from nitrogen extracted from chemical compounds.
Rayleigh used these data to prove that another element
was present in air - argon.
Box and whisker plot
Another important plot for preliminary analysis is the box
and whisker plot. At least five statistical values are plotted
to get a quick look at some basic statistics of data
[Figure: box and whisker plot, labeled with the 25%, median, and 75% values]
The box shows the region containing the middle half of the
data, and the vertical line shows the middle, or median,
Box and whisker plots are most informative to compare
The plots above show what might be expected from samples
from an experiment with a Poisson distribution.
The box and whisker plot for Lord Rayleigh’s data looks
This plot tells us that the distribution of these data is weird,
and we should look closely at it to see what’s going on.
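The five values that a box and whisker plot draws can be computed by hand. This is a minimal sketch on a made-up sample (not Rayleigh’s data), taking the 25% and 75% values as the medians of the lower and upper halves of the sorted data. Other quartile conventions exist and give slightly different numbers.

```matlab
x  = [7 2 9 4 12 5 15 4 10];        % hypothetical sample, NOT real data
xs = sort(x);                       % 2 4 4 5 7 9 10 12 15
n  = numel(xs);

med  = median(xs);                  % middle value
half = floor(n/2);                  % size of each half (middle point left out if n is odd)
q25  = median(xs(1:half));          % median of the lower half
q75  = median(xs(end-half+1:end));  % median of the upper half

five = [xs(1) q25 med q75 xs(end)]; % min, 25%, median, 75%, max
disp(five)
```

The box runs from q25 to q75, the line inside it sits at med, and the whiskers reach out to the minimum and maximum.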
Histograms are used to plot the frequency of occurrence
of particular events. For example, the time between large
earthquakes in the Aleutian subduction zone:
(these are not real data)
[Figure: histogram of the time between large earthquakes; horizontal axis from 0 to 20]
We will see more of these plots as we study the basics
of statistics and probability.
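A histogram is just a count of how many values fall in each bin. This is a minimal sketch with made-up intervals (again, not real data). The units, the bin width of 5, and the variable names are all assumptions of mine.

```matlab
% Hypothetical times between events; NOT real data.
dt = [3 7 8 12 4 9 11 16 6 13 8 2];

binwidth = 5;
nbins    = 4;                                  % bins cover 0 to 20
idx      = floor(dt/binwidth) + 1;             % bin number for each value
counts   = accumarray(idx(:), 1, [nbins 1])';  % how many values land in each bin

centers = binwidth*((1:nbins) - 0.5);          % 2.5, 7.5, 12.5, 17.5
bar(centers, counts, 1)                        % bars touch, as in a histogram
xlabel('Time between events')
ylabel('Number of occurrences')
```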
As we saw earlier, filtering, which we will discuss in detail
later, can go a long way to separate signals from noise.
Different filters have different functions and characteristics.
Functions in exploratory analysis include removal of
occasional bad data points, removal of high frequency
noise, and removal of trends.
Removal of occasional bad points is best done by a
median filter. This filter compares three consecutive
points, replacing the middle point with the median of the
three. This is a very effective filter for removing noise
spikes in data, as long as the spikes are separated by more
than 1 point.
In the Scripps Pier data, I’ve replaced occasional data
points by zeroes. This is a common problem in real data.
Application of the median filter adds a small amount of
noise while removing the spikes.
After applying the median filter, the data are (nearly) back
The MATLAB function for this operation is medfilt1(x,3),
where x is the data and 3 is the number of points in the
filter window.
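To see what the 3-point median filter does, here is a hand-written version that needs no toolbox. It is a sketch on made-up numbers with two zero "dropouts". medfilt1 should give the same values away from the ends, but it may treat the first and last points differently.

```matlab
x = [5 5.1 0 5.3 5.4 5.5 0 5.7 5.8];   % hypothetical data with two zero spikes
y = x;                                 % end points are left unchanged here
for i = 2:numel(x)-1
    y(i) = median(x(i-1:i+1));         % middle point -> median of the three
end
disp(y)
```

The zeros are gone, but note that the second point changed from 5.1 to 5. That is the small amount of added noise: a good point next to a spike can be replaced by its neighbor.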
Smoothing can also involve the removal of high frequency
noise, as we did to isolate the tidal signal in the Scripps
Pier data. A Hanning filter can do this by marching through
the data 3 points at a time, weighting the middle point
twice as heavily as the ones on each side:
$$y_s(i) = \frac{y(i-1) + 2\,y(i) + y(i+1)}{4}$$
Note: Wessel’s notes are not quite correct for this
This filter works poorly for spikes in the data -
spreading the spikes out, rather than removing them.
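A quick sketch of the spreading, using weights of 1/4, 1/2, 1/4 on a made-up series that is zero except for one spike (the variable names are mine):

```matlab
x = [0 0 0 4 0 0 0];          % hypothetical data: a single spike of height 4
w = [1 2 1]/4;                % 3-point Hanning weights
y = conv(x, w, 'same');       % weighted average of each point and its neighbors
disp(y)
% displays: 0 0 1 2 1 0 0
```

The spike is half as tall but three points wide, and the sum of the data is unchanged. Compare this with the median filter, which would remove the spike entirely.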
Often data can be divided into parts - a smooth trend
and a higher frequency signal. Your signal could be
either the trend or the residual after the trend is
removed.
To remove a linear trend from data, you could pick two
points to define the line, (x1, y1) and (x2, y2). The linear
trend is then:
$$y_{trend} = y_1 + \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1)$$
This is the equation of a straight line which can be
subtracted from the data.
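A minimal sketch of the two-point approach on made-up data, where the trend and the signal are both known so the result can be checked. The first and last points are used to define the line, which is my choice for the example:

```matlab
x      = 0:10;
signal = sin(2*pi*x/5);                  % the part we want to keep
y      = 2 + 0.5*x + signal;             % hypothetical data: linear trend plus signal

x1 = x(1);   y1 = y(1);                  % two points picked to define the line
x2 = x(end); y2 = y(end);
ytrend = y1 + (y2 - y1)/(x2 - x1)*(x - x1);

resid = y - ytrend;                      % data with the trend removed
fprintf('largest difference from signal: %.2e\n', max(abs(resid - signal)));
```

This works cleanly here only because the signal happens to be zero at both chosen points. If the two points sit on a wave crest or a noise spike, the line is tilted or shifted and part of the trend stays in the residual.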
The trend need not be linear, and other functions can be
tried to remove a trend, such as √y, log(y), y^2, etc. -
whatever fits. Often, a good understanding of why the
trend is there can aid in its removal.
Let’s run a MATLAB program Dr. Wessel wrote to display
some of the topics we’ve been discussing.