Methods of Exploratory Data Analysis by nikeborome


Methods of Exploratory Data Analysis
GG 313 Fall 2003 8/25/05
           CRUISE
Save October 24-27 (Monday-Thurs)
 For a STUDENT CRUISE on the
         R/V Kilo Moana
            Scatter Plots
• We did an example Tuesday with the
  tide data - let’s look at another:

These data were taken at Scripps Pier (La Jolla,
CA) on Dec. 26, 2004. The data are sampled once
per second, and the units are cm.
We’ll do a very simple MATLAB plot. There
are too many data points for Excel to
handle. The data are on my computer in the file
df07301.txt, as a single column of numbers.

The MATLAB commands are:

load 'df07301.txt'
plot(df07301)
These data show the tidal components as the long-
period oscillation, the normal ocean waves as the
thickening of the blue line, and the signal from the
Sumatra earthquake tsunami as the larger thickening.

What does this plot tell us about our data?

       It’s clean - no wild points
       If we’re after the tsunami signal, we’ve got it
       If we want to see it better, we need to do some
               analysis

There are 86400 sec in a day, so this plot is
500000/86400 ≈ 5.79 days long.
Just for fun, let’s try one more technique before we leave
this data set.

The tidal signal is noise for us, so let’s subtract it from the
data. To do this we first apply a FILTER to the data to
isolate the tidal signal:

windowsize=3600 ; % that’s 1 hour
lpout=filter(ones(1,windowsize)/windowsize,1,df07301);

This does a pretty good job of SMOOTHING the data,
isolating the tidal signal from the waves - both tsunami
and wind waves.

Now we subtract the filtered data from the original data:

Hipass=df07301-lpout;
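
The two steps above can be tried without df07301.txt. The sketch below is only an illustration on synthetic data: the "tide" and "waves" series, their amplitudes and periods, and the names t, x, lpc and hipassc are all made up for the example. One thing it may help to see is that filter() averages the current and previous samples, so its output lags the data by about half a window, and some tide is left in the residual. A centered running mean (conv with 'same', which is not what the slides use) is included for comparison.

```matlab
% Synthetic stand-in for the pier record; every number here is made up.
t     = (0:2*86400-1)';                 % two days, one sample per second
tide  = 100*sin(2*pi*t/(12.42*3600));   % slow "tide", cm
waves = 5*sin(2*pi*t/10);               % fast "waves" with a 10 s period, cm
x     = tide + waves;

windowsize = 3600;                      % 1 hour of samples
b = ones(1,windowsize)/windowsize;      % equal weights: a running mean

lpout  = filter(b,1,x);                 % as in the slides: mean of the past hour
hipass = x - lpout;                     % residual: waves plus some leftover tide

lpc     = conv(x,b,'same');             % alternative (assumption): centered mean
hipassc = x - lpc;                      % residual with much less leftover tide

mid = windowsize:numel(x)-windowsize;   % skip the ends, where the window is incomplete
plot(t(mid)/3600,hipass(mid),t(mid)/3600,hipassc(mid))
xlabel('hours'); ylabel('cm'); legend('filter','centered')
```

In this made-up case the residual from filter() still carries a slow oscillation, while the centered version is much closer to the waves alone.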
            What does BAD data look like?
Be extremely careful before discounting the validity of
data. Some of the most important theories have come
from data that looked wrong.
Early El Niño data were rejected by a computer
program because they were so far from normal!
Recognition of bad data takes experience.
Determination of the origin of these data is particularly
important to be sure that rejection is justified.
Here are some examples:
[Figure: depth profile with a depth axis running from 0 to 3000, annotated "Expected deepest depth" and "Anomalously deep data"]


The anomalous data don’t look bad, but they are too
deep by 750 m. This is a key number: sound travels
at about 1500 m/s in water, and an echo from the
ocean floor has to travel down and back, so each
second of travel time corresponds to 750 m of water
depth. That is, sound returning to the ship one second
after it was generated implies a water depth of 750 m.
We often generate sound pulses once per second, so
it’s easy to make a 750 m mistake.
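
The arithmetic behind the 750 m figure can be written out as a short sketch. The sound speed is the nominal 1500 m/s from the text; the travel times and the names v, t_twoway and ping_interval are illustrative. The last two lines assume one plausible reading of the mistake, namely that an echo is timed from the wrong ping.

```matlab
v = 1500;                  % nominal sound speed in sea water, m/s (from the text)
t_twoway = [1 2 3 4];      % example two-way travel times, s (illustrative)
depth = v*t_twoway/2;      % sound goes down and back, so halve the path
disp(depth)

% Assumed failure mode: the echo is timed from the wrong ping.
ping_interval = 1;                 % s between pings
depth_error = v*ping_interval/2;   % depth offset caused by a one-ping slip, m
disp(depth_error)
```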
A good example of how anomalous data can lead to
discovery was presented by Lord Rayleigh in 1894. In an
investigation of the density of nitrogen, he collected the data
shown below:




  These data don’t look much different from each other, but
  take a closer look:
The scatter in these data certainly does not look like what
would be expected from sampling a single population
- and the distribution of the data does not look like it
could be caused by measurement error. In fact, the
higher weights come from nitrogen extracted from air,
and the lower weights come from nitrogen made from
chemicals. Rayleigh used these data to show that
another element was present in air: argon.
                  Box and whisker plot


 Another important plot for preliminary analysis is the box
 and whisker plot. At least five statistical values are plotted
 to get a quick look at some basic statistics of data
 samples.



[Figure: box and whisker plot, labeled left to right: minimum value, 25%, median, 75%, maximum value]

 The box shows the region containing the middle half of the
 data, and the vertical line shows the middle, or median,
 value.
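
The five plotted values can be computed by hand, as in the sketch below. The sample is made up, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is only one of several conventions in use; other software may give slightly different 25% and 75% values. The names lo_half, hi_half and five are illustrative.

```matlab
x = [7 1 3 9 5 11 13 15];          % made-up sample, for illustration only
s = sort(x);
n = numel(s);
lo_half = s(1:floor(n/2));         % lower half (middle value left out when n is odd)
hi_half = s(ceil(n/2)+1:end);      % upper half
% minimum, 25%, median, 75%, maximum
five = [s(1) median(lo_half) median(s) median(hi_half) s(end)];
disp(five)
```

If the Statistics and Machine Learning Toolbox is available, boxplot(x) draws the plot directly.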
 Box and whisker plots are most informative when comparing
 different samples:





The plots above show what might be expected from samples
from an experiment with a Poisson distribution.
 The box and whisker plot for Lord Rayleigh’s data looks
 like this:




This plot tells us that the distribution of these data is weird,
and we should look closely at it to see what’s going on.
                    Histograms

Histograms are used to plot the frequency of occurrence
of particular events. For example, the time between large
earthquakes in the Aleutian subduction zone:



[Figure: histogram of the time between earthquakes, horizontal axis "Years" from 0 to 20 (these are not real data)]




We will see more of these plots as we study the basics
of statistics and probability.
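
A histogram like the one described above comes down to counting how many values fall in each bin. The sketch below does the counting by hand; the intervals are invented for the example (they are not real data), and the 5-year bin width and the names intervals, binwidth, counts and centers are assumptions.

```matlab
% Made-up intervals between events, in years (not real data)
intervals = [2 3 7 8 8 9 12 13 14 17];
binwidth = 5;
nbins = 4;                                  % bins: [0,5) [5,10) [10,15) [15,20)
idx = floor(intervals/binwidth) + 1;        % bin number for each value
counts = accumarray(idx(:),1,[nbins 1])';   % tally the values in each bin
centers = binwidth*((1:nbins) - 0.5);       % bin centers: 2.5 7.5 12.5 17.5
bar(centers,counts,1)                       % width 1 makes the bars touch
xlabel('Years'); ylabel('Number of intervals')
```

This sketch assumes every value is below 20 years; a value of 20 or more would need an extra bin.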
                       Smoothing
As we saw earlier, filtering, which we will discuss in detail
later, can go a long way to separate signals from noise.
Different filters have different functions and characteristics.

Functions in exploratory analysis include removal of
occasional bad data points, removal of high frequency
noise, and removal of trends.

Removal of occasional bad points is best done by a
median filter. This filter looks at three consecutive
points and replaces the middle point with the median of
the three. This is a very effective filter for removing noise
spikes in data, as long as the spikes are separated by more
than 1 point.
In the Scripps Pier data, I’ve replaced occasional data
points by zeroes. This is a common problem in real data.
Application of the median filter adds a small amount of
noise while removing the spikes.
After applying the median filter, the data are (nearly) back
to normal:




   The MATLAB function for this operation is medfilt1(x,3),
   where x is the data and 3 is the number of points in the
   window.
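
The three-point median can also be written with basic MATLAB functions, which may be useful where medfilt1 (part of the Signal Processing Toolbox) is not installed. This is a minimal sketch on a made-up series with two points replaced by zeroes; the end points are simply left unchanged here, and medfilt1 may treat them differently.

```matlab
x = [10 11 12 0 13 14 15 0 16 17]';       % made-up series with two zeroed points
y = x;                                    % end points are left unchanged
win = [x(1:end-2) x(2:end-1) x(3:end)];   % each row holds three neighbours
y(2:end-1) = median(win,2);               % middle point <- median of the three
disp([x y])                               % original and filtered, side by side
```

The zeroes are gone in the second column, and a few good points next to them have shifted slightly, which is the small amount of added noise mentioned above.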
Smoothing can also involve the removal of high frequency
noise, like we did to isolate the tidal signal in the Scripps
pier data. A Hanning filter can do this by marching through
the data 3 points at a time, weighting the middle point
higher than the ones on each side:

$$\text{filtered\_data}(i) = \frac{y(i-1) + 2\,y(i) + y(i+1)}{4}$$
Note: Wessel’s notes are not quite correct for this
equation.


This filter works poorly for spikes in the data -
spreading the spikes out, rather than removing
them.
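
The three-point weighting described above, and what it does to a spike, can be seen in a short sketch. The input series is made up, the end points are left unchanged, and the names w and f are illustrative.

```matlab
y = [0 0 0 8 0 0 0];              % made-up series: a single spike
w = [1 2 1]/4;                    % weights 1/4, 1/2, 1/4
f = y;                            % end points left unchanged
f(2:end-1) = conv(y,w,'valid');   % (y(i-1) + 2*y(i) + y(i+1))/4
disp(f)
```

The spike of height 8 becomes 2, 4, 2: it is lower but wider, and its sum is unchanged.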
                 Residual plots

Often data can be divided into parts - a smooth trend
and a higher frequency signal. Your signal could be
either the trend or the residual after the trend is
removed.

To remove a linear trend from data, you could pick two
points to define the line, x1, y1 and x2, y2. The linear
trend is then:

$$y_{\text{trend}} = y_1 + \left(\frac{y_2 - y_1}{x_2 - x_1}\right)(x - x_1)$$

This is the equation of a straight line which can be
subtracted from the data.
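
As a minimal sketch of the two-point approach, the code below builds made-up data (a straight line plus an oscillation), takes the first and last points as the two chosen points, and subtracts the resulting line. The data and the names ytrend and resid are assumptions for illustration.

```matlab
x = (0:10)';
y = 3 + 0.5*x + sin(2*pi*x/5);      % made-up data: line plus oscillation
x1 = x(1);   y1 = y(1);             % first chosen point
x2 = x(end); y2 = y(end);           % second chosen point
ytrend = y1 + (y2 - y1)/(x2 - x1)*(x - x1);   % straight line through both points
resid = y - ytrend;                 % what is left after the trend is removed
plot(x,y,'o-',x,ytrend,'--',x,resid,'s-')
legend('data','trend','residual')
```

The result depends on which two points are picked. A least-squares line, for example from polyfit(x,y,1), is a common alternative when no two points are obviously representative.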
The trend need not be linear, and other functions can be
tried to remove a trend, such as √y, log(y), y^2, etc. -
whatever fits. Often, a good understanding of why the
trend is there can aid in its removal.


Let’s run a MATLAB program Dr. Wessel wrote to display
some of the topics we’ve been discussing.


             gg313_EDA.m

								