Lecture 08

Software for Engineer Design

Dataset
Subjects:

    Tables of data

    Importing Data

    Numerical data versus String data



                                     Keyword
Cell, header, non-numerical data, import, format.


                                     Abstract
This lecture focuses on data acquisition, manipulation, plotting, and meshing.

8.1 Datasets
Finding data sets is not easy, and it is just as difficult to create your own, especially when the
data is geographical and/or social in nature.

The U.S. government is comprised of hundreds of agencies that produce a vast amount of
statistical data related to all sectors of society. A few selected links follow:

       U.S. Department of Education
       U.S. Department of Health
       U.S. Census Bureau
       U.S. Department of Commerce
       U.S. Department of Labor
       Gateway to statistics from over 100 U.S. Federal Agencies

Some of these sites publish ready-made PDF files with tabulated entries. Others have
custom-designed interfaces that allow for access to data. Unfortunately, neither the sites, nor
the ready-made material, nor the query systems follow any common guidelines.

Another source to consider is hard-copy publications, for example:

Statistical Abstract of the United States: 2006, The National Data Book, 125th Edition, U.S.
Census Bureau. This book contains statistics from all sectors of society, including
consumption, production, education, disabilities, etc. This book is available in the library.

8.2 Data Files

8.2.1 Tables of Data
Statistical data tends to come in a tabular format, e.g.:
Date           Location        Item       Price in cents
7/8/1997       New York        Apple      59
8/5/2000       Los Angeles     Banana     69
...            ...             ...        ...
The first row in this example is considered a "header", and the other rows are observations.
Some more complicated tables may have several lines of headers, and may include sub-
tables. For the purpose of using this data in Matlab, it is recommended that tables be re-
formatted as in this example. If necessary, remove undesired entries, move data around, and
possibly merge or split data. A good amount can be done through spreadsheets (e.g.
Microsoft Excel), if the data set is small enough.

8.2.2 File Format: CSV
Comma Separated Value files (CSV) are text files that contain human-readable data. Special
delimiters (commas, tabs, carriage returns, quotes) are put in place for separating columns
and rows. Glancing over such a data file may not reveal a sensible structure, but once loaded
into a spreadsheet application, columns and rows can be identified more easily. A typical
comma-separated file may look as follows:
Date,Location,Item,Price in cents
7/8/1997,New York,Apple,59
8/5/2000,Los Angeles,Banana,69
...
Note that every field in this example is separated by a comma, hence Comma Separated
Value. Sometimes, individual values are enclosed in double quotes:
"Date","Location","Item","Price in cents"
"7/8/1997","New York","Apple","59"
"8/5/2000","Los Angeles","Banana","69"
...
Matlab is able to import from CSV files, which are among the most universally portable file
formats.
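
As a minimal sketch of a command-line import, the sample table above can be written to a
temporary file and read back with textscan. The temporary file name and the variable names
(cols, dates, locations, items, prices) are assumptions made for this illustration only.

```matlab
% Write the sample table from the text to a temporary CSV file
fname = [tempname() '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Date,Location,Item,Price in cents\n');
fprintf(fid, '7/8/1997,New York,Apple,59\n');
fprintf(fid, '8/5/2000,Los Angeles,Banana,69\n');
fclose(fid);

% Read it back: three text columns and one numeric column,
% splitting on commas and skipping the single header line
fid = fopen(fname, 'r');
cols = textscan(fid, '%s %s %s %f', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid);
delete(fname);

dates     = cols{1};   % cell array of strings
locations = cols{2};   % cell array of strings
items     = cols{3};   % cell array of strings
prices    = cols{4};   % numeric column vector
```

The text columns come back as cell arrays and the price column as a numeric vector, which
mirrors the split between textual and numerical data described later in this lecture.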

8.2.3 File Format: XLS
Proprietary spreadsheet file formats that Matlab can import include MS Excel XLS files.
Unless errors occur when importing an XLS file, no pre-processing is required. Should errors
occur, it is recommended to inspect the file using MS Excel and re-save it. If this does not
help, the XLS data should be exported to CSV format within Excel and imported into Matlab.

8.2.4 File Format: MAT
MAT is Matlab's own format for storing data. It is possible to save an entire Workspace of
data, or selected matrices. It is unlikely that statistics are distributed in this format.
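
A short sketch of saving selected variables to a MAT file and loading them back. The
variable names and the temporary file name are made up for this illustration.

```matlab
% Two example variables: a numeric matrix and a cell array of strings
A = magic(4);
labels = {'a', 'b'};

% Save only the selected variables (save(fname) alone would store the whole workspace)
fname = [tempname() '.mat'];
save(fname, 'A', 'labels');
clear A labels

% Load the file into a struct whose fields are the saved variables
S = load(fname);
delete(fname);
restoredA = S.A;
restoredLabels = S.labels;
```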

8.2.5 Size issues
MS Excel imposes a size limit on a single data set table in the XLS format. This limit is 65536
rows and 256 columns (i.e. 2^16 * 2^8 = 2^24 = 16,777,216 cells).

Matlab does not impose a pro forma limit on the dimensions and size of matrices. Keep in
mind that hardware memory and hard disk size (for swapping memory) are the ultimate
deciding factors of how much data can be loaded. For comparison, a reasonable size of data
that can be loaded into Matlab exceeds the capabilities of Excel by far. It is thus possible to
read in numerical values for a million data rows of five columns.
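
The arithmetic behind these figures can be checked directly. The sketch below assumes the
data is stored as Double, which occupies 8 bytes per value; the variable names are
illustrative.

```matlab
% Cell limit of a single XLS table
excelCells = 2^16 * 2^8;

% Memory needed for one million rows of five Double columns
rows = 1e6;
cols = 5;
bytesPerDouble = 8;
megabytes = rows * cols * bytesPerDouble / 2^20;   % about 38 MB
```

A table of this size has five million values, yet it needs only a few tens of megabytes
of memory, while its row count alone is far beyond the 65536-row limit.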

8.2.6 Pre-processing data with Matlab
If the size of a data set exceeds the limitations of Excel or other applications, it must be pre-
processed in Matlab. This may include one or more of the following steps:

       Cleaning of data by removing rows.
       Splitting data: if a data set contains data from several types of observations, filtering
       and splitting of the data may be necessary. This can be done by iterating over the data
       set and selectively moving data rows to other matrices. For example, if a dataset
       includes observations for U.S. states, U.S. regions, and U.S. cities, the three types
       may have to be moved to 3 different matrices.
       Sorting data: data sets can be sorted using the command sortrows.

An example of processing data with Matlab can be found in the collection of M-files in the
beginning of this lecture.
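
The three steps above can be sketched on a small made-up matrix. Here column 1 is assumed
to hold a type code (1 = state, 2 = region, 3 = city) and column 2 a value. The sketch uses
logical indexing instead of an explicit loop over the rows; the result is the same.

```matlab
% Hypothetical observations: [type code, value]
obs = [1 59; 3 12; 2 40; 1 17; 3 88; 2 5];

% Cleaning: remove rows whose value is below 10
obs = obs(obs(:,2) >= 10, :);

% Splitting: move each type of observation to its own matrix
stateRows  = obs(obs(:,1) == 1, :);
regionRows = obs(obs(:,1) == 2, :);
cityRows   = obs(obs(:,1) == 3, :);

% Sorting: order the state rows by their value (column 2)
stateRowsSorted = sortrows(stateRows, 2);
```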

8.2.7 Pre-processing data outside of Matlab
When possible, it is recommended to prepare datasets in a spreadsheet program before
importing into Matlab. While Matlab does have a spreadsheet-like editor, it is not meant to
replace a spreadsheet program.

To prepare data in a spreadsheet, keep in mind that each column should maintain the same
data type (double, int, string, ...). Column (or row) headers should be distinguishable, and
preferably one per column (or row). Try to refrain from merging cells.

8.3 Importing Data
Matlab has several command-line functions that can be used to import many data types,
including CSV files. However, for simplicity we will use the graphical interface. In the
"Current Directory" file listing, highlight a data file. At this point, we can either use
"File->Import", or open the context menu and choose "Import Data".

Figure 8.1


Depending on the file type, the Import Wizard may start at different points in the import
process. When importing from an XLS file, only the last of the Import Wizard screens
appears. When importing from a CSV file, the process is slightly longer.

The first page of the wizard displays a portion of the text file, as well as a preview of the
matrix version of the data.

Figure 8.2


The preview is broken into 2 spreadsheets: one for "data" and one for "textdata". "data" refers
to numerical data, while "textdata" refers to anything that is not unambiguously numerical.
That is, the strings "my house" and "60m" are considered "textdata", while "4" and
"624.92746" are considered "data". Matlab distinguishes between the two and does not allow
mixing of these data types in matrices. When importing a CSV file with numerical and textual
data, Matlab thus splits the data and creates two matrices, one for each data type. More on the
differences is discussed below.

Figure 8.3
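
The distinction can be illustrated with str2double, which returns NaN for any string that is
not a plain number. This is only a sketch of the idea; it is not necessarily the rule the
Import Wizard applies internally, and the variable names are made up.

```matlab
% The four sample strings from the text
fields = {'my house', '60m', '4', '624.92746'};

% str2double converts each string, yielding NaN where conversion fails
values = str2double(fields);

% Fields that did not convert would be treated as text
isText = isnan(values);
numericValues = values(~isText);
textValues = fields(isText);
```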

Several data-specific decisions have to be made on the first page. Under "Select Column
Separator(s)", select the delimiter that delimits columns. In many cases, this is the comma.
Under "Number of text header lines", select the number of non-data rows that appear in the
beginning of your data file. The text header lines will then not appear in the numerical data
matrix.

Figure 8.4

Note that the preview on the first page may not accurately depict the final matrices. The
second screen of the Import Wizard shows a more realistic version of the final matrices.

The second page of the Import Wizard displays a preview of the parsed matrices from the data
file. From here, matrices can be renamed and excluded for the final import. It is recommended
to inspect all matrices and their sizes before proceeding with the import. When ready, hit the
"Finish" button.

Figure 8.5

After the import process is complete, the imported matrices will appear in the workspace.

It is sometimes desirable to create sub-matrices out of the imported ones, especially if the
import process did not successfully interpret all of the data. For example, the file regions.csv
clearly contains column headers (region name), row headers (dates), and numerical values
(price in cents). During the import process row and column headers were not identified, and
instead were cast into one large text matrix. The following expressions dissect the text matrix
for easy processing later:

The first row of the textdata matrix contains column headers, including the field "Date" and
region names. We extract the region names:

regions=textdata(1,2:size(textdata,2));

The first column of the textdata matrix contains row headers, including the field "Date" and
individual dates. We extract the dates:

dates=textdata(2:size(textdata,1),1);

There is no need to further process the data matrix.

Figure 8.6
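
The same extraction can be tried on a small stand-in for the textdata matrix. The region
names and dates below are invented for illustration and are not taken from regions.csv.
The keyword end is used here as a shorter equivalent of size(textdata,2) and
size(textdata,1).

```matlab
% Hypothetical textdata: headers in the first row and first column
textdata = {'Date',      'East Coast', 'Midwest';
            '1/4/1993',  '',           '';
            '1/11/1993', '',           ''};

% Column headers without the leading 'Date' field
regions = textdata(1, 2:end);

% Row headers without the leading 'Date' field
dates = textdata(2:end, 1);
```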



8.4 Numerical data versus String data
Matlab uses several data types for differently typed data, and depending on the type, certain
operations are allowed and others are not. The predominant type is "Double", which can be
used for any numerical data: real, rational, and natural. "Uint8" is another numerical type,
which is constrained to natural numbers in the range of 0..255.

"Char" is a data type used for non-numerical data, i.e. textual data. Several strings are
stored together in a "Cell" array. Most imported data sets contain textual data, such as header
lines. Manipulation of "Cell" data is somewhat different from numerical data. Below is a
comparison of numerical versus textual data:

Operation               Numerical                  Textual
Scalar type             Double                     Char
Assignment              a=5                        a='hello'
Multidimensional type   Double                     Cell
Vector                  b=[1,5,2]                  b={'abc','def','ghi'}
Indexing                b(2)                       b{2}
Matrix                  c=[1,2,3;4,5,6]            c={'abc','def','ghi';'jkl','mno','pqr'}
Addition                d=3+4                      d=strcat('ab','cd')
Conversion in between   str2num('52.23')           num2str(6)
(Example)               s='52.23'; 5+str2num(s)    i=6; strcat('Hello Nr.', num2str(i))

Table 8.1
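
The operations in the table can be run as a short script. The variable names are
illustrative. As a swapped-in alternative, the sketch uses str2double rather than str2num
for the conversion, because str2num evaluates its input as an expression.

```matlab
% Scalars: a Double and a Char string
a = 5;
s = 'hello';

% Vectors: numerical data in brackets, strings in a Cell array
b = [1, 5, 2];
names = {'abc', 'def', 'ghi'};

% Indexing: parentheses for numbers, curly braces for Cell contents
second = b(2);
secondName = names{2};

% "Addition": arithmetic for numbers, concatenation for strings
d = 3 + 4;
joined = strcat('ab', 'cd');

% Conversion between the two kinds of data
total = 5 + str2double('52.23');
label = ['Hello Nr. ', num2str(6)];
```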




Figure 8.7: Doubles, Chars, and Cells
Figure 8.8: Addition and Concatenation


There are many other functions by which char and cell can be manipulated. The main application
for text manipulation for our purposes is plotting and meshing, especially for assigning x,y,z
labels, titles, etc. For example, when column headers have been imported as textual data
(cells), we are now able to manipulate and use them for the purpose of building bar graphs,
plots, etc.

Figure 8.9: Indexing




8.5 Plotting
In Matlab, plotting refers to producing 2-dimensional graphs, while meshing refers to 3-
dimensional graphs. Since a 2-dimensional graph is merely a collection of points, the
command plot takes as input a vector and simply plots the numbers.

Given vector Y, the command plot(Y) plots the points in the vector. Without passing a separate
vector with x-values, each point in vector Y is mapped linearly to a point on the x-axis. For
example, if Y = [10, 7, -9, 0, 1], then the corresponding X values are [1, 2, 3, 4, 5],
respectively. If this scale is not desirable, an X vector with a different scale can be passed
as an argument to the function plot.

Figure 8.10
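
The default x-values can be read back from the plotted line. This sketch draws into an
invisible figure, which is an assumption made only so that the script runs without opening
a window; the variable names are illustrative.

```matlab
Y = [10, 7, -9, 0, 1];

% Plot into a hidden figure and keep the handle of the line
fig = figure('Visible', 'off');
h = plot(Y);

% The x-values Matlab assigned to the five points
xUsed = get(h, 'XData');
close(fig);
```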

Given vectors X and Y, where X contains regularly or irregularly spaced sample points on the X
axis, and Y contains the corresponding values in the Y direction, the command plot(X,Y) plots a
graph of Y with the scale of X.

Figure 8.11


For example, consider the following data points:

y=[16,50,70,104,106,104,95,80,67,59,87,124,153,157,144,127,...
109,90,71,100,134,163,178,179,174,161,141,117,93,76,89,105,...
123,140,153,156,144,128,106,86,65,48,30,17,24,29,25,21,16,7];

Plotting in the regular (default) scale assigns each data point an x value that increases by 1
from one point to the next:

plot(y);

Figure 8.12


x=[10,5,3,2,9,14,17,20,25,27,28,29,30,38,45,49,52,54,58,...
59,60,62,66,72,78,81,82,84,87,90,97,102,106,109,112,119,...
125,128,126,122,118,117,121,134,154,174,190,194,194,185];

Given a different data range corresponding to the same y-data the graph exhibits distinct
differences.

plot(x,y);

Figure 8.13


String modifiers can be used to change color, data point, and line styles. A summary is given
in table 8.2.

Colors       Point style           Line style
b blue       . point               - solid
g green      o circle              : dotted
r red        x x-mark              -. dashdot
c cyan       + plus                -- dashed
m magenta    * star                (none) no line
y yellow     s square
k black      d diamond
             v triangle (down)
             ^ triangle (up)
             < triangle (left)
             > triangle (right)
             p pentagram
             h hexagram

Table 8.2

At most one modifier can be taken from each column and concatenated to result in a unique
line/point/color style. For example:

plot(x,y,'r');

Figure 8.14

plot(x,y,'g:');

Figure 8.15

plot(x,y,'md:');

Figure 8.16
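
To see how a modifier string is taken apart, the properties of the resulting line can be
queried. The data and the hidden figure below are assumptions for illustration only.

```matlab
% Simple made-up data
xs = 1:10;
ys = xs.^2;

% 'md:' = magenta color, diamond markers, dotted line
fig = figure('Visible', 'off');
h = plot(xs, ys, 'md:');

lineColor = get(h, 'Color');       % RGB triple for magenta
lineStyle = get(h, 'LineStyle');   % dotted line style
close(fig);
```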

A figure's background color can be changed using the command whitebg. For example:

whitebg('k')

Figure 8.17

whitebg('y')

Figure 8.18



8.5.1 Bar Charts
2D bar graphs plot data points in terms of their area. Given some random data:

xBar=[1:10];
yBar=rand(1,10) * 100;

a bar chart is produced using:

bar(xBar,yBar);

Figure 8.19


2D bar charts can also be created for matrices, in which case each row in the matrix is
considered as one group of bars. The resulting graph distinguishes matrix columns with
different colors.

yBar=rand(7,3);
bar(yBar);

Figure 8.20

3D bar graphs are easily obtained from matrices as well, using the bar3 command:

bar3(yBar)

Figure 8.21



8.5.2 Labels
Common properties of all figures, whether 2D or 3D, plots, bar graphs, meshes, etc. are axis
labels and titles. Every figure should be properly labeled for clarity.

Axes labels can be assigned using commands xlabel, ylabel, or zlabel, selectively or in
combination:

plot(x, y, 'r*-.'), xlabel('South'), ylabel('West')

Figure 8.22

A title is added by using the command title:

plot(x, y, 'r*-.'), xlabel('South'), ylabel('West'),
title('Mysterious Constellation of a Waving Hand');

Figure 8.23


By default, the data range of x is used for labeling individual tick marks on the x-axis.
Alternatively, named values can be used as replacements.

x=1:8;
y=rand(1,8) * 100;
plot(x,y);
set(gca, 'XTickLabel', {'Earth', 'Mercury', 'Saturn', 'Venus', ...
    'Pluto', 'Neptune', 'Mars', 'Jupiter'})

Figure 8.24

The command set in this case changes the property XTickLabel of the axes handle gca (the
current axes) to a vector of strings.


Using the data imported from regions.csv, we plot the data matrix, and assign labels from the
previously created vector dates:

plot(data(:,1));
set(gca, 'XTickLabel', dates);

This plot does not, however, exhibit the correct labels. Because of the large number of data
points, Matlab places tick marks only at widely spaced positions, and the labels are then
assigned to those few tick marks in order, regardless of their position.

Figure 8.25


To display all tick marks on the x-axis, the following series of commands is necessary:

plot(data(:,1));
set(gca, 'XTick', 1:length(dates));

However, this x-axis is not readable, which is why Matlab distributed the tick marks in the
first place.

Figure 8.26

The following expressions help in spacing out the tick marks, while maintaining the correct
index into the label vector:

plot(data(:,1));
set(gca, 'XTick', 20:50:length(dates));

Essentially, starting at label index 20, every 50th label index is used for tick marks.

Figure 8.27


To show the actual dates for these tick marks, we replace them using the XTickLabel feature:

plot(data(:,1));
set(gca, 'XTick', 20:50:length(dates));
set(gca, 'XTickLabel', dates(20:50:length(dates)));

Figure 8.28
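
The tick mechanism can be tried without the imported file. The sketch below substitutes
made-up values and made-up date strings for the imported matrices, and draws into a hidden
figure; all names and values are assumptions for illustration.

```matlab
% Stand-ins for the imported data: 300 values and 300 text labels
n = 300;
values = cumsum(randn(n, 1));
dateLabels = arrayfun(@(k) sprintf('day %d', k), (1:n)', 'UniformOutput', false);

% Every 50th index, starting at index 20
idx = 20:50:n;

fig = figure('Visible', 'off');
plot(values);

% Use the same index vector for tick positions and for their labels
set(gca, 'XTick', idx, 'XTickLabel', dateLabels(idx));
ticks = get(gca, 'XTick');
close(fig);
```

Indexing the label vector with the same index vector as the tick positions is what keeps
each label attached to the correct data point.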


In cases where the x-axis is labeled with long strings per tick mark, it is desirable to use
slanted labels. While Matlab's plot function does not allow for rotation of labels, there exist
functions that replace the mechanism by which labels are placed on the x-axis. One such
function can be downloaded here:

       xticklabel_rotate.m


Given an existing plot with numerical labels on the x-axis, the function xticklabel_rotate
rotates all labels by 90 degrees.

a=rand(1,30);
plot(a);
xticklabel_rotate;

Figure 8.29


To rotate the labels by a different amount, the angle in degrees can be passed as a second
parameter. The first parameter in this example remains empty (empty set []). This signifies
that the x-ticks or labels should not be changed, but merely rotated.

plot(a);
xticklabel_rotate([], 45);

Figure 8.30

Note: Once the function xticklabel_rotate has been applied to a given graph, it cannot be
applied again. The plot command needs to be re-executed, and xticklabel_rotate needs to be
called again.

If it is desirable to use different x-tick spacings, as discussed above (see Figures 8.26, 8.27,
and 8.28), the function xticklabel_rotate can be used instead of the function set(gca, ...).
xticklabel_rotate can set the vector of x-ticks as well as the labels, whether numerical or text.

Using xticklabel_rotate on the dataset of gasoline prices, it makes sense to space the x-tick
marks farther apart, because there are too many to fit on the x-axis. We pass an indexed vector
as a first parameter to space out the x-tick marks:

plot(data(:,1));
xticklabel_rotate(20:15:size(data,1), 45);

This displays and rotates every 15th tick label on the x-axis.

Figure 8.31


Finally, to display dates (textual data) as opposed to numerical labels, we pass the cell
vector as a third parameter. The cell vector is properly indexed to match the x-tick vector
(first parameter):

plot(data(:,1));
xticklabel_rotate(20:15:size(data,1), 45, dates(20:15:size(dates,1)));

Figure 8.32



8.5.3 Overlaying plots
Several plots can be placed in the same figure by overlaying them.

The simplest approach is to plot a matrix of values, in which each column is interpreted as one
vector.

y=rand(10, 3);
plot(y);

Figure 8.33


For the example of gasoline prices in regions.csv:

plot(data);

Figure 8.34


Alternatively, vectors can also be placed in the same graph individually by using the hold on
and hold off functions:

hold on;
plot(data(:,2),'r');
plot(data(:,4),'g');
plot(data(:,6),'b');
plot(data(:,8),'y');
hold off;

Figure 8.35


For multi-line graphs, legends can be added for descriptive purposes. The function legend takes
as many string arguments as there are plots, and assigns each string to a plot, in the order in
which they were placed in the graph:

legend('the red graph', 'the green graph', 'the blue graph', ...
    'the yellow graph');

Figure 8.36

Using the actual labels from textdata:

legend(regions(1,2:2:8));

Figure 8.37
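
The overlay and the legend can be combined in one loop. The sketch below uses random
numbers and invented region names (R1 to R8) in place of the imported gasoline data, and
draws into a hidden figure; all of these are assumptions for illustration.

```matlab
% Stand-ins for the imported matrices
values = rand(30, 8);
regionNames = {'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8'};

% Every second column, as in the example above
cols = 2:2:8;

fig = figure('Visible', 'off');
hold on;
for k = cols
    plot(values(:, k));   % each call adds one line to the same axes
end
hold off;

% Count the lines before adding the legend
nLines = numel(findobj(gca, 'Type', 'line'));

% Index the names with the same column vector so legend entries match the lines
legendNames = regionNames(cols);
legend(legendNames);
close(fig);
```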



8.5.4 Meshing (3d graphs)
3D graphs are generated using the function mesh or surf . Given a 2D matrix of values,
each value is used as a z-value (elevation), and placed in a 3D view.

Given a function of sine and cosine:

z=[];
for i=1:100
    for j=1:100
        z(i,j) = sin(i/10) + cos(j/10);
    end
end
mesh(z);

creates a mesh (with holes).

Figure 8.38

surf(z);

creates a mesh with filled patches (a surface).

Figure 8.39
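
The double loop that builds z can also be written without loops. The sketch below is an
alternative formulation, not the method used in the lecture's M-files; it uses meshgrid to
build the index matrices and compares the result against the loop version. The names I, J,
zVec and zLoop are illustrative.

```matlab
% Index matrices: I(i,j) = i and J(i,j) = j
[J, I] = meshgrid(1:100, 1:100);

% Vectorized evaluation of the same function
zVec = sin(I/10) + cos(J/10);

% Loop version for comparison, with the matrix preallocated
zLoop = zeros(100, 100);
for i = 1:100
    for j = 1:100
        zLoop(i,j) = sin(i/10) + cos(j/10);
    end
end

% Largest difference between the two results
maxDiff = max(abs(zVec(:) - zLoop(:)));
```

Either matrix can be passed to mesh or surf.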

For the data set of gasoline prices per region, a viable mesh would be:

mesh(data);

Figure 8.40


Labels and titles can be added as appropriate:

mesh(data);
title('Gasoline prices in U.S. regions');

xlabel('Region');
ylabel('Date');
zlabel('Price in cents');

set(gca, 'XTick', 1:length(regions));
set(gca, 'XTickLabel', regions);
set(gca, 'YTick', 20:50:length(dates));
set(gca, 'YTickLabel', dates(20:50:length(dates)));

Figure 8.41

8.5.5 Multiple plots
Using the subplot function, it is possible to generate separate plots in a grid of figures.

subplot(M, N, P) divides the figure into a grid of M rows and N columns, and places the next
plot in the Pth cell, counted row by row.
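
The row-by-row numbering of the cells can be worked out with a little arithmetic. The
variable names below are illustrative.

```matlab
% A grid of 3 rows and 2 columns, as in the example that follows
M = 3;
N = 2;
P = 1:(M*N);

% Grid row and column of each cell index P
rowOf = ceil(P / N);
colOf = mod(P - 1, N) + 1;
```

For this grid, cells 1 and 2 form the first row, cells 3 and 4 the second, and cells 5 and
6 the third.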

Example:

subplot(3, 2, 1), plot(rand(1, 10));
subplot(3, 2, 2), bar(rand(1, 10));
subplot(3, 2, 3), surf(rand(20));
subplot(3, 2, 4), hist(rand(50));
subplot(3, 2, 5), plot(sin([0:0.1:10]));
subplot(3, 2, 6), plot(rand(1,100),'gd:');

Figure 8.42


Each figure can be assigned its own labels and titles:

subplot(3, 2, 1), plot(rand(1, 10)), xlabel('x-axis 1'),
ylabel('y-axis 1'), title('random line');
subplot(3, 2, 2), bar(rand(1, 10)), xlabel('x-axis 2'),
ylabel('y-axis 2'), title('random bars');
subplot(3, 2, 3), surf(rand(20)), xlabel('x-axis 3'),
ylabel('y-axis 3'), zlabel('z-axis 3'), title('random surface');
subplot(3, 2, 4), hist(rand(50)), xlabel('x-axis 4'),
ylabel('y-axis 4'), title('random histogram');
subplot(3, 2, 5), plot(sin([0:0.1:10])), xlabel('x-axis 5'),
ylabel('y-axis 5'), title('sine wave');
subplot(3, 2, 6), plot(rand(1,100),'gd:'), xlabel('x-axis 6'),
ylabel('y-axis 6'), title('Matrix, the Movie');

Figure 8.43

                                     Links
http://202.5.195.17/emust/web/

http://uranchimeg.com/Education/?page_id=1534

http://www.aquaphoenix.com/

				