WELCOME TO STT200


```					    WELCOME TO STT 200
• INSTRUCTOR: DR. Elijah E. DIKONG

• VISITING PROFESSOR

• COUNTRY: CAMEROON [AFRICA]
• CLASS WEBSITE:

– http://www.stt.msu.edu
1
What Is Statistics?
Statistics: Two Different Meanings:
(a) IN PLURAL SENSE, STATISTICS MEANS A SET
OF OBSERVATIONS, USUALLY COLLECTED BY
MEASUREMENTS OR COUNTING, COLLECTIVELY
KNOWN AS DATA.
(b) IN SINGULAR SENSE, STATISTICS REFERS TO A
GROUP OF SCIENTIFIC METHODS USED FOR
* collecting data
* interpreting and analyzing data
* making conclusions or inferences.
2
TYPES OF STATISTICS

DESCRIPTIVE STATISTICS          INFERENTIAL STATISTICS

STATISTICS REQUIRES SIX PROCEDURES

1. UNDERSTANDING   2. ANALYZING   3. PLANNING
4. EXECUTING       5. CHECKING    6. REPORTING

3
DESCRIPTIVE STATISTICS
• DEFINED AS THOSE METHODS INVOLVING THE COLLECTION,
PRESENTATION, AND CHARACTERIZATION OF A SET OF
DATA IN ORDER TO DESCRIBE PROPERLY THE VARIOUS
FEATURES OF THAT SET OF DATA. TO ACHIEVE THESE,
STATISTICIANS USE TABLES – EITHER FREQUENCY OR
CONTINGENCY; BAR AND PIE CHARTS; STEM-AND-LEAF
DISPLAYS; BOX-AND-WHISKER PLOTS; PARETO DIAGRAMS;
HISTOGRAMS.

• INFERENTIAL STATISTICS
DEFINED AS THOSE METHODS, E.G. PROBABILITY THEORY,
THAT MAKE POSSIBLE THE ESTIMATION OF A
CHARACTERISTIC OF A POPULATION OR THE MAKING OF A
DECISION CONCERNING A POPULATION BASED ONLY ON
SAMPLE RESULTS.
4
RELEVANT

STATISTICAL

TERMINOLOGIES
5
POPULATION VERSUS SAMPLE

• POPULATION – A POPULATION IS THE ENTIRE GROUP
ABOUT WHOM YOU WANT TO MAKE
CONCLUSIONS.
• SAMPLE: A REPRESENTATIVE SUBSET OF
THE POPULATION FOR WHOM YOU
ACTUALLY HAVE DATA.

• ILLUSTRATION – “POT OF SOUP”

6
EXAMPLE: IDENTIFY THE
POPULATION AND THE SAMPLE
• A QUESTION POSTED ON THE LYCOS WEBSITE IN THE USA
ON 18 JUNE 2000 ASKED VISITORS TO THE SITE TO SAY
WHETHER THEY THOUGHT MARIJUANA SHOULD BE
LEGALLY AVAILABLE FOR MEDICINAL PURPOSES.

• THE GALLUP POLL INTERVIEWED 1007 RANDOMLY
SELECTED U.S. ADULTS AGED 18 AND OLDER, MARCH 23 –
25, 2007. GALLUP REPORTS THAT WHEN ASKED WHEN, IF EVER,
THE EFFECTS OF GLOBAL WARMING WILL BEGIN TO
HAPPEN, 60% OF THE RESPONDENTS SAID THE EFFECTS
WOULD NEVER HAPPEN.

7
DEFINITIONS – PARAMETER VERSUS
STATISTIC

• PARAMETER (POPULATION PARAMETER):
A PARAMETER IS A NUMERICAL SUMMARY
OF THE POPULATION.

• STATISTIC (SAMPLE STATISTIC) – A
STATISTIC IS A NUMERICAL SUMMARY OF
A SAMPLE TAKEN FROM THE POPULATION.

• ILLUSTRATION:

8
DATA: SYSTEMATICALLY RECORDED INFORMATION,
WHETHER NUMBERS OR LABELS, TOGETHER WITH
ITS CONTEXT
CONTEXT TELLS WHO, WHAT, WHEN, WHERE, HOW and
WHY IS BEING MEASURED.
CONTEXT
• WHO – ABOUT WHOM DATA ARE COLLECTED (PARTICIPANTS,
RESPONDENTS, SUBJECTS, EXPERIMENTAL UNITS, RECORDS, CASES)
• WHAT – CHARACTERISTICS OF EACH INDIVIDUAL (VARIABLES)
• WHEN – TIME [DAYS, YEARS, ETC.]
• WHERE – PLACE, E.G. CITY
• WHY – PURPOSE OF STUDY
• HOW – METHOD OF COLLECTING DATA, E.G. SURVEY
9
CLASS DISCUSSION 1

• BECAUSE OF THE DIFFICULTY OF
WEIGHING A BEAR IN THE WOODS,
RESEARCHERS CAUGHT AND MEASURED
54 BEARS, RECORDING THEIR WEIGHT,
NECK SIZE, LENGTH, AND SEX. THEY
HOPED TO FIND A WAY TO ESTIMATE THE
WEIGHT FROM THE OTHER, MORE EASILY
DETERMINED QUANTITIES. IDENTIFY THE
W’S.

10
DATA TABLE – AN ARRANGEMENT OF DATA IN WHICH EACH
ROW REPRESENTS A CASE[AN INDIVIDUAL ABOUT WHOM OR WHICH
WE HAVE DATA] AND EACH COLUMN REPRESENTS A VARIABLE.

NAME    AGE (YR)   TIME (DAYS)   AREA CODE   NEAREST STADIUM   INTERNET PURCHASE   CATALOG NUMBER   ARTIST
CATHY   22         130           312         ALI               Y                   7TY73            MASS
SAM     24         18            305         LINCO             N                   CKJ24            BOST
CHRIS   43         368           610         VET               Y                   JKN23            FLORI
LINDA   35         5             413         SPAR              Y                   7O28Y            APRIL

11
VARIABLES
CHARACTERISTICS RECORDED ABOUT EACH INDIVIDUAL ARE CALLED VARIABLES.

TYPES OF VARIABLES

CATEGORICAL (QUALITATIVE)
OUTCOMES FALL INTO CATEGORIES. OUTCOMES
MAY BE IN WORDS OR NUMERALS. EXAMPLES:
*COLOR OF EYES (BLUE, BROWN, …)
*PROFESSION (ENGINEER, FARMER, TEACHER, …)

QUANTITATIVE (NUMERICAL)
OUTCOMES ARE NUMBERS, EITHER DISCRETE OR
CONTINUOUS. EXAMPLES:
*HEIGHTS OF MSU STUDENTS
*NUMBER OF FLOWERS ON A PLANT
*NUMBER OF SUCCESSFUL SURGERIES AT SPARROW
HOSPITAL LAST FALL
12
QUANTITATIVE VARIABLES

• DISCRETE QUANTITATIVE VARIABLE: A
VARIABLE IS DISCRETE IF IT TAKES ITS
VALUE FROM A COUNTABLE SET OF
NUMBERS LIKE {0, 1, 2, 3, 4, … }

• CONTINUOUS QUANTITATIVE VARIABLE: A
VARIABLE IS CONTINUOUS IF IT TAKES ITS
POSSIBLE VALUES FROM AN INTERVAL OR
A CONTINUUM LIKE [2, 7], (- 5, 10), OR THE
ENTIRE NUMBER LINE, R.
13
QUANTITATIVE AND
QUALITATIVE(CATEGORICAL) DATA

• DATA COLLECTED FROM A
QUANTITATIVE VARIABLE IS CALLED
QUANTITATIVE DATA.
• EXAMPLES INCLUDE HEIGHT,
WEIGHT, OF STUDENTS. TIME TO
• DATA COLLECTED FROM A
CATEGORICAL VARIABLE IS CALLED
CATEGORICAL DATA.
14
CLASS WORK 2
IN JUNE 2003 CONSUMER REPORTS PUBLISHED AN ARTICLE
ON SOME SPORT UTILITY VEHICLES THEY HAD TESTED
RECENTLY. THEY REPORTED SOME BASIC INFORMATION
ABOUT EACH OF THE VEHICLES AND THE RESULTS OF SOME
TESTS CONDUCTED BY THEIR STAFF. AMONG OTHER THINGS,
THE ARTICLE TOLD THE BRAND OF EACH VEHICLE, ITS PRICE,
AND WHETHER IT HAD A STANDARD OR AUTOMATIC
TRANSMISSION. THEY REPORTED THE VEHICLE’S FUEL
ECONOMY, ITS ACCELERATION(NUMBER OF SECONDS TO
GO FROM ZERO TO 60MPH), AND ITS BRAKING DISTANCE
TO STOP FROM 60MPH. THE ARTICLE ALSO RATED EACH
VEHICLE’S RELIABILITY AS BETTER THAN AVERAGE,
AVERAGE, WORSE, OR MUCH WORSE THAN AVERAGE.

IDENTIFY THE W’S. LIST THE VARIABLES. INDICATE WHETHER
EACH VARIABLE IS CATEGORICAL OR QUANTITATIVE. IF
THE VARIABLE IS QUANTITATIVE, TELL THE UNITS.

15
CLASS WORK 3
IN JUNE 2000, A HOMEOWNER IN TUSCOLA, ILLINOIS,
WANTED TO DETERMINE IF GENERIC FERTILIZER
AND WEED KILLER IS AS EFFECTIVE AS THE
MORE EXPENSIVE NAME BRAND PRODUCT.
AFTER THE SPRING RAINS AND EARLY SUMMER
WARMTH, HE COUNTED THE NUMBER OF WEEDS
IDENTIFY WHO, WHERE, WHEN, AND WHY FOR THE
SITUATION DESCRIBED.
A. A HOMEOWNER; TUSCOLA, ILLINOIS; JUNE 2000;
COMPARE PRODUCTS.
B. TWO PATCHES OF LAWN; TUSCOLA, ILLINOIS;
JUNE 2001; COMPARE PRODUCTS.
C. TWO PATCHES OF LAWN; ARCOLA, ILLINOIS;
JUNE 2000; COMPARE PRODUCTS.
D. A HOMEOWNER; ARCOLA, ILLINOIS; JUNE 2000;
COMPARE PRODUCTS.
E. TWO PATCHES OF LAWN; TUSCOLA, ILLINOIS;
JUNE 2000; COMPARE PRODUCTS.
16
CLASS WORK 4
AN ADMINISTRATOR IN A SCHOOL DISTRICT WITH SEVERAL
FIFTH GRADE CLASSROOMS OF ESSENTIALLY THE SAME
SIZE COLLECTED DATA ON THE VARIOUS CLASSES. AMONG
THE VARIABLES WERE THE NUMBER OF SINGLE PARENT
FAMILIES, AVERAGE FAMILY INCOME, STRUCTURE OF
SCHOOL(K-5, 5-8, K-8), NUMBER ELIGIBLE FOR
LUNCH(YES/NO), AVERAGE DISTANCE TO SCHOOL, AND
NUMBER OF PARENTAL VISITS TO SCHOOL.

SELECT THE STATEMENT THAT CLASSIFIES THE VARIABLES
IN ORDER WITH Q REPRESENTING A QUANTITATIVE
VARIABLE AND C REPRESENTING A CATEGORICAL
VARIABLE.

(A)   C,Q,C,Q,C,Q,Q
(B)   Q,C,Q,C,Q,C,C
(C)   Q,Q,C,Q,C,Q,C
(D)   C,C,Q,C,Q,C,C
(E)   Q,Q,C,Q,C,Q,Q
17
MEASURES OF CENTER OF
QUANTITATIVE DATA
• THE CENTER IS A VALUE THAT
ATTEMPTS THE IMPOSSIBLE BY
SUMMARIZING THE ENTIRE
DISTRIBUTION OR DATA SET WITH A
SINGLE NUMBER, A “TYPICAL”
VALUE. MEASURES OF CENTER
INCLUDE THE MEAN AND THE
MEDIAN.

18
DEFINITION

• MEAN: THE MEAN IS THE SUM OF THE
OBSERVATIONS DIVIDED BY THE
NUMBER OF OBSERVATIONS.

• MEDIAN: THE MEDIAN IS THE
MIDPOINT OF THE OBSERVATIONS
WHEN THEY ARE ORDERED FROM
THE SMALLEST TO THE LARGEST (OR
FROM THE LARGEST TO SMALLEST).
19
EXAMPLE
• FIND THE MEAN AND MEDIAN OF THE
SET OF OBSERVATIONS: 7, 1, 5, 3, 4.

• FIND THE MEAN AND MEDIAN OF 4, 2,
8, 6.

20
CHALLENGE QUESTION

• PROFESSOR DIKONG GAVE HIS FIRST
TEST TO HIS STT 200 STUDENTS. HIS
COLLEAGUE IS INTERESTED IN HOW HIS
STUDENTS PERFORMED IN THE TEST.
• HOW SHOULD PROFESSOR DIKONG
ANSWER IN ORDER TO GIVE HIS
COLLEAGUE A BETTER IDEA OF HOW
HIS STUDENTS PERFORMED IN THE
TEST?
21
OUTLIERS
• OUTLIERS ARE UNUSUAL OR EXTREME
VALUES THAT DO NOT APPEAR TO
BELONG WITH THE REST OF THE DATA.
• SUCH STRAGGLERS STAND APART
FROM THE BODY OF THE DISTRIBUTION OF
THE DATA SET.
• OUTLIERS CAN AFFECT MANY
STATISTICAL ANALYSES, SO YOU SHOULD

22
MEASURES OF SPREAD OF
QUANTITATIVE DATA
• A MEASURE OF SPREAD IS A
NUMERICAL SUMMARY OF HOW
TIGHTLY THE VALUES ARE
CLUSTERED AROUND THE CENTER.
– STANDARD DEVIATION
– INTERQUARTILE RANGE (IQR)
– RANGE

23
RANGE = (MAXIMUM OBSERVATION) –
(MINIMUM OBSERVATION)
• EXAMPLE: FIND THE RANGE OF THE
DATA SET: 45, 46, 49, 35, 76, 80, 89, 94,
37, 61, 62, 64, 68, 56, 57, 57, 71, 72

• MAXIMUM OBSERVATION = 94
• MINIMUM OBSERVATION = 35
• RANGE = MAX – MIN = 94 – 35 = 59

24
VARIANCE AND STANDARD
DEVIATION
• THE RANGE USES ONLY THE
LARGEST AND SMALLEST
OBSERVATIONS. THE MOST
POPULAR SUMMARY OF SPREAD USES
ALL THE OBSERVATIONS.
IT IS CALLED THE STANDARD
DEVIATION.
25
COMPUTING THE MEASURES OF SPREAD –
VARIANCE AND STANDARD DEVIATION
\mathrm{VAR}(X) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

\mathrm{SD}(X) = \sqrt{\mathrm{VAR}(X)}

where \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
26
ILLUSTRATION

• HERE ARE THE AGES FOR A SAMPLE OF N = 5
CHILDREN: 1, 3, 5, 7, 9. FIND THE STANDARD
DEVIATION FOR THIS DATA SET

27
INTERQUARTILE RANGE (IQR)

• WE SHALL CONSIDER THE FOLLOWING
DATA SET TO ILLUSTRATE INTERQUARTILE
RANGE (IQR)

DATA: 45, 46, 49, 35, 76, 80, 89, 94, 37, 61,
62, 64, 68, 56, 57, 57, 59, 71, 72.

SORTED DATA: 35, 37, 45, 46, 49, 56, 57,
57, 59, 61, 62, 64, 68, 71,
72, 76, 80, 89, 94.

28
NOTATION

• INTERQUARTILE RANGE (IQR) = Q3 – Q1

Q3 = UPPER QUARTILE

= MEDIAN OF UPPER HALF OF DATA(INCLUDE MEDIAN IF
n IS ODD)

Q1 = LOWER QUARTILE
= MEDIAN OF LOWER HALF OF DATA(INCLUDE MEDIAN
IF n IS ODD)

29
Quartiles
EXAMPLE: (odd number of observations, 19)

Median = 61
UPPER HALF
35 37 45 46 49 56 57 57 59 [61 62 64 68 71 72 76 80 89
94]

Q3 = (71 +72) / 2 = 71.5

LOWER HALF
[35 37 45 46 49 56 57 57 59 61] 62 64 68 71 72 76 80 89
94
Q1 = (49 + 56) / 2 = 52.5
IQR = 71.5 – 52.5 = 19
Note: Include the median in the calculation of both
quartiles if n is odd.
30
Quartiles
EXAMPLE: (even number of observations, 18)

35 37 45 46 49 56 57 57 59 [60] 61 62 64 68 71 72 76 80 89

60 = Median = (59+61)/2 (Average of the middle two
numbers)

UPPER HALF
35 37 45 46 49 56 57 57 59 [61 62 64 68 71 72 76 80 89]
Q3 = 71
LOWER HALF
[35 37 45 46 49 56 57 57 59] 61 62 64 68 71 72 76 80 89
Q1 = 49
IQR = 71 – 49 = 22
31
Classroom Problems
• 1. Here are costs of 10 electric smooth-top
ranges rated very good or excellent by
Consumer Reports in August 2002.

• 850       900   1400      1200       1050
• 1000      750   1250      1050       565

•   Find the following statistics by hand:
•   a) mean
•   b) median and quartiles
•   c) range and IQR
32
SOLUTION
• Step 1: Sort Data:

565               Mean = 1001.5
750               Median =1025
850               Q1=850
900               Q3=1200
1000              Range = 835
1050              IQR= 350
1050
1200
1250
1400
33
5 – NUMBER SUMMARY
•   THE 5-NUMBER SUMMARY OF A DISTRIBUTION REPORTS ITS
MEDIAN, QUARTILES, AND EXTREMES(MINIMUM AND MAXIMUM)

•   MAX = 94

•   Q3 = 71.5

•   MEDIAN = 61

•   Q1 = 52.5

•   MIN=35

OUTLIERS: DATA VALUES WHICH ARE BEYOND FENCES

IQR = Q3 – Q1 = 19

UPPER FENCE = Q3 + 1.5 × IQR = 71.5 + 1.5 × 19 = 100
LOWER FENCE = Q1 – 1.5 × IQR = 52.5 – 1.5 × 19 = 24
34
DISPLAYING QUANTITATIVE DATA
(Chapter 4)
WHY DISPLAY DATA?
DATA TABLES OFTEN DO NOT HELP US
SEE (APPRECIATE) WHAT IS GOING ON. WE
NEED WAYS TO SHOW THE DATA SO THAT
WE CAN SEE

•   PATTERNS
•   RELATIONSHIPS
•   TRENDS
•   EXCEPTIONS.

35
BOXPLOTS

WHENEVER WE HAVE A 5-NUMBER SUMMARY OF A
(QUANTITATIVE) VARIABLE, WE CAN DISPLAY THE
INFORMATION IN A BOXPLOT.

• THE CENTER OF A BOXPLOT IS A BOX THAT SHOWS THE
MIDDLE HALF OF THE DATA, BETWEEN THE QUARTILES.

• THE HEIGHT OF THE BOX IS EQUAL TO THE IQR.

• IF THE MEDIAN IS ROUGHLY CENTERED BETWEEN THE
QUARTILES, THEN THE MIDDLE HALF OF THE DATA IS
ROUGHLY SYMMETRIC. IF IT IS NOT CENTERED, THE
DISTRIBUTION IS SKEWED.

• THE MAIN USE FOR BOXPLOTS IS TO COMPARE GROUPS.
36
BOXPLOT OF THE PREVIOUS EXAMPLE

[FIGURE: BOXPLOT OF C1; VERTICAL AXIS C1, SCALE FROM 30 TO 100]

37
CLASS DISCUSSION

38
HISTOGRAMS
A HISTOGRAM IS A SUMMARY GRAPH
SHOWING A COUNT OF THE DATA FALLING
IN VARIOUS RANGES OR CLASSES OR
GROUPS.

PURPOSE: TO GRAPHICALLY SUMMARIZE
AND DISPLAY THE DISTRIBUTION OF A
PROCESS DATA SET.

39
HISTOGRAM

• It is particularly useful when
there are a large number of
observations.

• The observations or data sets
for which we draw a histogram
are QUANTITATIVE variables.
40
CONSTRUCTING A HISTOGRAM

• A HISTOGRAM CAN BE CONSTRUCTED BY
SEGMENTING THE RANGE OF THE DATA INTO
EQUAL SIZED BINS (ALSO CALLED SEGMENTS,
GROUPS OR CLASSES).

FOR EXAMPLE, IF YOUR DATA RANGES FROM 1.1
TO 1.8, YOU COULD HAVE EQUAL BINS OF 0.1
CONSISTING OF SEGMENTS 1 TO 1.1; 1.1 TO 1.2;
1.2 TO 1.3; 1.3 TO 1.4; AND SO ON.

• THE VERTICAL OR Y AXIS OF THE HISTOGRAM IS
LABELED FREQUENCY (THE NUMBER OF COUNTS
FOR EACH BIN), AND THE HORIZONTAL OR X AXIS
OF THE HISTOGRAM IS LABELED WITH THE RANGE
OF THE RESPONSE VARIABLE.
41
•YOU THEN DETERMINE THE NUMBER OF
DATA POINTS THAT RESIDE WITHIN EACH
BIN AND CONSTRUCT THE HISTOGRAM.

• THE BIN SIZE CAN BE DEFINED BY THE USER, BY
SOME COMMON RULE, OR BY SOFTWARE
METHODS (SUCH AS MINITAB)

• THE BINS AND THE COUNTS IN EACH BIN GIVE THE
DISTRIBUTION OF THE QUANTITATIVE VARIABLE.

• LIKE A BAR CHART, A HISTOGRAM PLOTS THE BIN
COUNTS AS THE HEIGHTS OF BARS.

42
Histogram
• Example: Test Scores

Group    Count
0-9      1
10-19    2
20-29    3
30-39    4
40-49    5
50-59    4
60-69    3
70-79    2
80-89    2
90-100   1

43
Histogram
Example
(http://cnx.org/content/m10160/latest/)

• Scores of 642 students on a psychology
test. The test consists of 197 items each
graded as "correct" or "incorrect." The
students' scores ranged from 46 to 167.

44
Grouped Frequency Distribution of Psychology
Test
Interval’s Lower Limit   Interval’s upper Limit   Class Frequency

39.5                     49.5                      3
49.5                     59.5                     10
59.5                     69.5                     53
69.5                     79.5                     107
79.5                     89.5                     147
89.5                     99.5                     130
99.5                     109.5                    78
109.5                    119.5                    59
119.5                    129.5                    36
129.5                    139.5                    11
139.5                    149.5                     6
149.5                    159.5                     1
159.5                    169.5                     1
45
Histogram

46
Histograms
• Example : THE WEIGHTS OF 23
“THREE-POUND” BAGS OF APPLES
ARE GIVEN AS FOLLOWS:
• 3.26 3.62 3.39 3.12 3.53 3.30 3.10 3.26
3.19 3.22 3.14 3.39 3.31 3.49 3.41 3.02
3.17 3.20 3.12 3.42 3.36 3.21 3.26

• USE THESE DATA TO CONSTRUCT A
HISTOGRAM FOR THE WEIGHT DATA
47
GROUPED FREQUENCY DISTRIBUTION FOR
WEIGHTS OF 3 LB APPLE BAGS WITH BIN WIDTH = 0.1

BINS              FREQUENCY
2.95 TO 3.05              1
3.05 TO 3.15              4
3.15 TO 3.25              5
3.25 TO 3.35              5
3.35 TO 3.45              5
3.45 TO 3.55              2
3.55 TO 3.65              1
48
Histogram

[FIGURE: HISTOGRAM OF WEIGHTS OF 3 LB APPLE BAGS; X AXIS C1 FROM 3.0 TO 3.6, Y AXIS FREQUENCY FROM 0 TO 5]

49
Histogram (Excel)
[FIGURE: EXCEL HISTOGRAM; X AXIS BIN (3.02, 3.17, 3.32, 3.47, MORE), Y AXIS FREQUENCY FROM 0 TO 10]
50
Histogram (Minitab Commands)
• Open Minitab
• Click on Graph > Histogram > Simple > OK
• Click on C1 > Select
• Click on Labels > Title (Write the title of
• Click OK > Click OK

51
Histogram
EXAMPLE 2.
-4.50, -3.25, -1.75, -1.59, -1.44,
-1.22, -1.16, -0.88, -0.75, -0.72,
-0.69, -0.50, -0.50, -0.38, -0.28,
-0.22, -0.16, 0.03, 0.12, 0.34, 0.47,
0.62, 0.69, 0.75, 0.78, 0.81, 1.16,
1.47, 2.06, 2.22, 2.44, 3.28, 3.34,
4.12, 4.31, 5.62 , 5.85
52
FREQUENCY DISTRIBUTION OF CLASS DATA
CLASSES          FREQUENCY
-4.5 TO -3.5          1
-3.5 TO -2.5          1
-2.5 TO -1.5           2
-1.5 TO -0.5           7
-0.5 TO 0.5           10
0.5 TO 1.5            7
1.5 TO 2.5            3
2.5 TO 3.5            2
3.5 TO 4.5            2
4.5 TO 5.5            0
5.5 TO 6.5            2
53
Histogram

[FIGURE: HISTOGRAM OF CLASS DATA; X AXIS C1 FROM -4 TO 6, Y AXIS FREQUENCY FROM 0 TO 10]

54
[FIGURE: EXCEL HISTOGRAM; X AXIS BIN (-4.5, -2.775, -1.05, 0.675, 2.4, 4.125, MORE), Y AXIS FREQUENCY FROM 0 TO 20]

55
DESCRIBING THE DISTRIBUTION OF A
QUANTITATIVE VARIABLE FROM
HISTOGRAMS

• WHEN YOU DESCRIBE THE DISTRIBUTION
OF A [QUANTITATIVE] VARIABLE, YOU
SHOULD ADDRESS THESE THINGS:
•   SHAPE
•   CENTER
•   UNUSUAL FEATURES OR OUTLIERS
56
THE SHAPE OF A DISTRIBUTION
1. DOES THE HISTOGRAM HAVE A SINGLE,
CENTRAL HUMP OR SEVERAL SEPARATED
HUMPS? THESE HUMPS ARE CALLED
MODES.
A HISTOGRAM WITH ONE PEAK IS DUBBED
UNIMODAL; HISTOGRAMS WITH TWO PEAKS
ARE CALLED BIMODAL, AND THOSE WITH
THREE OR MORE PEAKS ARE CALLED
MULTIMODAL. A HISTOGRAM THAT DOESN’T
APPEAR TO HAVE ANY MODE AND IN WHICH
ALL THE BARS ARE APPROXIMATELY THE
SAME HEIGHT IS CALLED UNIFORM.

57
UNIMODAL, BIMODAL, MULTI-MODAL,
UNIFORM HISTOGRAMS

58
2. IS THE HISTOGRAM SYMMETRIC?
• CAN YOU FOLD THE HISTOGRAM ALONG A
VERTICAL LINE THROUGH THE MIDDLE AND HAVE
THE EDGES MATCH PRETTY CLOSELY, OR ARE
MORE OF THE VALUES ON ONE SIDE?
• THE (USUALLY) THINNER ENDS OF A DISTRIBUTION
ARE CALLED TAILS. IF ONE TAIL STRETCHES OUT
FARTHER THAN THE OTHER, THE HISTOGRAM IS
SAID TO BE SKEWED TO THE SIDE OF THE LONGER
TAIL.
• A “SKEWED RIGHT” DISTRIBUTION IS ONE IN WHICH
THE TAIL IS ON THE RIGHT SIDE.
• A “SKEWED LEFT” DISTRIBUTION IS ONE IN WHICH
THE TAIL IS ON THE LEFT SIDE.

59
RIGHT-SKEWED HISTOGRAM

60
SYMMETRIC HISTOGRAM

61
LEFT-SKEWED HISTOGRAM

62
3. DO ANY UNUSUAL FEATURES STICK
OUT?
• UNUSUAL FEATURES OR OUTLIERS ARE
EXTREME VALUES THAT DO NOT APPEAR
TO BELONG WITH THE REST OF THE DATA.
SUCH STRAGGLERS STAND OFF AWAY
FROM THE BODY OF THE DISTRIBUTION.
OUTLIERS CAN AFFECT MANY STATISTICAL
ANALYSES, SO YOU SHOULD ALWAYS BE

63
ILLUSTRATION

64
THE CENTER OF THE DISTRIBUTION:
THE MEDIAN
• THE CENTER IS A VALUE THAT ATTEMPTS
THE IMPOSSIBLE BY SUMMARIZING THE
ENTIRE DISTRIBUTION WITH A SINGLE
NUMBER, A “TYPICAL” VALUE. MEASURES
OF CENTER INCLUDE THE MEAN AND
MEDIAN.
• WHEN A HISTOGRAM IS UNIMODAL AND
SYMMETRIC, WE’D AGREE ON THE CENTER
OF SYMMETRY, WHERE WE WOULD FOLD
THE HISTOGRAM TO MATCH THE TWO
SIDES.
65
•WHEN THE DISTRIBUTION IS SKEWED OR
POSSIBLY MULTIMODAL, DEFINING THE
CENTER IS MORE OF A CHALLENGE.
• CAN THE MIDRANGE = [MAX. + MIN.]/2
HELP OUT?
• NOT AT ALL!!!
• WHY?
• IT IS TOO SENSITIVE TO THE
OUTLYING VALUES TO BE SAFE FOR
SUMMARIZING THE WHOLE
DISTRIBUTION.
66
BEATING THE CHALLENGE
• A MORE REASONABLE CHOICE OF
TYPICAL VALUE IS THE VALUE THAT IS
LITERALLY IN THE MIDDLE, WITH HALF
THE VALUES BELOW IT AND HALF
ABOVE IT. SUCH A MEASURE OF
CENTER IS THE MEDIAN.

67
NOTE THE FOLLOWING

68
• A NUMERICAL SUMMARY OF HOW
TIGHTLY THE VALUES ARE
CLUSTERED AROUND THE CENTER.

– STANDARD DEVIATION
– INTERQUARTILE RANGE (IQR)
– RANGE
SEE LECTURES OF WEEK 3
69
STEM AND LEAF DISPLAY
• HISTOGRAMS PROVIDE AN EASY-TO-
UNDERSTAND SUMMARY OF THE
DISTRIBUTION OF A QUANTITATIVE
VARIABLE, BUT THEY DON’T SHOW THE
DATA VALUES THEMSELVES.
• A STEM AND LEAF DIAGRAM IS AN
EXPLORATORY DATA-ANALYSIS
TECHNIQUE THAT ALLOWS US TO GROUP
DATA WITHOUT LOSING THE ORIGINAL
DATA. WE USE THE LEADING DIGIT(S) AS
THE “STEM” AND THE TRAILING DIGIT(S)
AS THE “LEAVES,” SO THAT THE NUMBERS
THEMSELVES BECOME A GRAPH OF THE
DATA.
70
• TO MAKE A STEM-AND-LEAF DISPLAY, WE CUT
EACH DATA VALUE INTO LEADING DIGITS (WHICH
BECOME THE “STEM”) AND TRAILING DIGITS (THE
“LEAVES”). THEN WE USE THE STEMS TO LABEL
THE BINS.

• STEM-AND-LEAF DISPLAYS CONTAIN ALL THE
INFORMATION FOUND IN A HISTOGRAM AND,
WHEN CAREFULLY DRAWN, SATISFY THE AREA
PRINCIPLE AND SHOW THE DISTRIBUTION. IN
ADDITION, THEY PRESERVE
THE INDIVIDUAL DATA VALUES.

• UNLIKE A HISTOGRAM, STEM-AND-LEAF DISPLAYS
ALSO SHOW THE DIGITS IN THE BINS, SO THEY
CAN REVEAL UNEXPECTED PATTERNS IN THE
DATA.

71
EXAMPLE : CONSIDER THE SORTED
AND ROUNDED DATA BELOW.
-4.5, -3.3, -2, -1.8, -1.6, -1.4, -1.2, -0.9, -0.9, -0.8, -0.7, -0.7, -0.5, -0.5, -0.4,
-0.3, -0.2, -0.2, 0.0, 0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.8, 0.8, 1.2, 1.5,
2.1, 2.2, 2.4, 3.3, 3.3, 4.1, 4.3, 5.6
STEM | LEAVES
-4 | 5
-3 | 3
-2 | 0
-1 | 8642
-0 | 99877554322
0 | 013567888
1 | 25
2 | 124
3 | 33
4 | 13
5 | 6

72
EXAMPLE : USING THE WEIGHTS OF THE BAGS
OF APPLES GIVEN IN THE EXAMPLE OF SLIDE 47,
CONSTRUCT A STEM-AND-LEAF DIAGRAM.
STEM          LEAVES
3.0            2
3.1            209472
3.2            662016
3.3            90916
3.4            912
3.5            3
3.6            2

THE WEIGHTS OF THE BAGS RANGE FROM 3.02 TO
3.62, SO WE CAN USE AS STEMS THE VALUES 3.0 – 3.6.
THE LEAVES ARE DETERMINED BY THE DIGIT
FOUND IN THE HUNDREDTHS PLACE OF THE
ORIGINAL DATA.

73
DOTPLOTS
• A DOTPLOT GRAPHS A DOT FOR EACH
CASE AGAINST A SINGLE AXIS.

• IT IS LIKE A STEM-AND-LEAF DISPLAY, BUT
WITH DOTS INSTEAD OF DIGITS FOR ALL
THE LEAVES.

• SOME DOTPLOTS STRETCH OUT
HORIZONTALLY, WITH THE COUNTS ON
THE VERTICAL AXIS, LIKE A HISTOGRAM.
OTHERS RUN VERTICALLY, LIKE A
STEM-AND-LEAF DISPLAY.
74
Example
THE DATA BELOW GIVE THE NUMBER OF
HURRICANES THAT HAPPENED EACH
YEAR FROM 1944 THROUGH 2000 AS
REPORTED BY SCIENCE MAGAZINE.

• 3,2,1,2,4,3,7,2,3,3,2,5,2,2,4,2,2,6,0,
2,5,1,3,1,0,3,2,1,0,1,2,3,2,1,2,2,2,3,
1,1,1,3,0,1,3,2,1,2,1,1,0,5,6,1,3,5,3

75
Dot Plot For Hurricane Data
[FIGURE: DOT PLOT OF HURRICANE DATA; AXIS C6 FROM 0 TO 7]

76
DESCRIPTION OF THE DISTRIBUTION
• EACH DOT REPRESENTS A YEAR IN WHICH
THERE WERE THAT MANY HURRICANES.

• THE DISTRIBUTION OF THE NUMBER OF
HURRICANES PER YEAR IS UNIMODAL
• SKEWED TO THE RIGHT
• WITH CENTER AROUND 2 HURRICANES
PER YEAR.
• THE NUMBER OF HURRICANES PER YEAR
RANGES FROM 0 TO 7.
• THERE ARE NO OUTLIERS.

77
DISPLAYING CATEGORICAL DATA AND
CONDITIONAL DISTRIBUTIONS (Chap. 3)

• THE BAR CHART

• THE PIE CHART

78
EXAMPLE: CONSIDER THE TITANIC

• WHO: THE 2201 PEOPLE ON THE TITANIC;

• WHAT (VARIABLES):

– SURVIVAL STATUS (DEAD OR ALIVE);
– TICKET CLASS (FIRST, SECOND, THIRD, CREW);
– GENDER (MALE OR FEMALE);

• WHEN: APRIL 14, 1912;
• WHERE: NORTH ATLANTIC;
• HOW: A VARIETY OF SOURCES AND INTERNET
SITES;
• WHY: HISTORICAL INTEREST.

79
ONE VARIABLE ANALYSIS
WHO: THE 2201 PEOPLE ON THE TITANIC
WHAT: TICKET CLASS DISTRIBUTION

FREQUENCY TABLE: A FREQUENCY TABLE LISTS
THE CATEGORIES IN A CATEGORICAL VARIABLE
AND GIVES THE COUNT OR PERCENTAGE OF
OBSERVATIONS OF EACH CATEGORY.

CLASS       COUNT OR FREQUENCY   % OR RELATIVE FREQUENCY
FIRST       325                  14.766
SECOND      285                  12.949
THIRD       706                  32.076
CREW        885                  40.209
TOTAL       2201                 100

80
DISTRIBUTION OF A VARIABLE
* GIVES THE POSSIBLE VALUES OF THE VARIABLE, AND
* THE RELATIVE FREQUENCY OF EACH VALUE.
GRAPHICAL DISPLAY OF A DISTRIBUTION OF CATEGORICAL
DATA

BAR CHART
A BAR CHART DISPLAYS THE DISTRIBUTION OF A
CATEGORICAL VARIABLE, SHOWING THE COUNTS FOR
EACH CATEGORY NEXT TO EACH OTHER FOR EASY
COMPARISON.

PIE CHART
PIE CHARTS SHOW THE WHOLE GROUP OF CASES AS A
CIRCLE. THEY SLICE THE CIRCLE INTO PIECES WHOSE
SIZE IS PROPORTIONAL TO THE FRACTION OF THE WHOLE
IN EACH CATEGORY.

81
BAR CHART OF THE PEOPLE(WHO) ON THE
TITANIC WITH TICKET CLASS DISTRIBUTION(WHAT)

[FIGURE: BAR CHART; X AXIS TICKET CLASS (FIRST, SECOND, THIRD, CREW), Y AXIS COUNT FROM 0 TO 900]

82
PIE CHART OF PEOPLE ON THE TITANIC(WHO)
WITH TICKET CLASS DISTRIBUTION(WHAT)

[FIGURE: PIE CHART; FIRST 15%, SECOND 13%, THIRD 32%, CREW 40%]

83
THE AREA PRINCIPLE: THE AREA OCCUPIED BY A
PART OF THE GRAPH SHOULD CORRESPOND TO
THE MAGNITUDE OF THE VALUE IT REPRESENTS.

TIPS
• FIRST RULE OF DATA ANALYSIS IS ‘MAKE A
PICTURE.’

• BEFORE YOU MAKE A BAR CHART OR A PIE
CHART, ALWAYS CHECK THE CATEGORICAL DATA
CONDITION. THE DATA ARE COUNTS OR
PERCENTAGES OF INDIVIDUALS IN CATEGORIES.

• IF YOU WANT TO MAKE A RELATIVE FREQUENCY
BAR CHART OR PIE CHART, YOU’LL NEED TO
ALSO MAKE SURE THAT THE CATEGORIES DON’T
OVERLAP, SO NO INDIVIDUAL IS COUNTED TWICE.
84
TWO VARIABLES ANALYSIS
• QUESTION: WAS THERE A
RELATIONSHIP BETWEEN THE KIND
OF TICKET A PASSENGER HELD AND
THE PASSENGER’S CHANCES OF
MAKING IT INTO THE LIFEBOAT
(SURVIVAL)?
• TO ANSWER: ANALYZE THE TWO
CATEGORICAL VARIABLES TICKET
CLASS (FIRST, SECOND, THIRD,
CREW) AND SURVIVAL STATUS (DEAD OR ALIVE).
85
TO LOOK AT TWO CATEGORICAL VARIABLES
TOGETHER, ARRANGE THE COUNTS IN A TWO-WAY
TABLE OR CONTINGENCY TABLE

                    TICKET CLASS
SURVIVAL   FIRST   SECOND   THIRD   CREW   TOTAL
ALIVE      203     118      178     212    711
DEAD       122     167      528     673    1490
TOTAL      325     285      706     885    2201

86
NOTE:
• BECAUSE THE TABLE SHOWS HOW THE INDIVIDUALS
ARE DISTRIBUTED ALONG EACH VARIABLE,
CONTINGENT ON THE VALUE OF THE OTHER
VARIABLE, SUCH A TABLE IS CALLED A
CONTINGENCY TABLE.
• THE MARGINS OF THE TABLE, BOTH ON THE RIGHT
AND AT THE BOTTOM, GIVE TOTALS.
• THE BOTTOM LINE OF THE TABLE IS JUST THE
FREQUENCY DISTRIBUTION OF THE TICKET CLASS.
• THE RIGHT COLUMN OF THE TABLE IS THE
FREQUENCY DISTRIBUTION OF THE VARIABLE
SURVIVAL.
• WHEN PRESENTED LIKE THIS, IN THE MARGINS OF A
CONTINGENCY TABLE, THE FREQUENCY DISTRIBUTION
OF ONE OF THE VARIABLES IS CALLED ITS MARGINAL
DISTRIBUTION.
87
WERE SECOND-CLASS PASSENGERS MORE LIKELY
TO SURVIVE? QUESTIONS LIKE THIS ARE MORE

                    TICKET CLASS
SURVIVAL   FIRST    SECOND   THIRD    CREW     TOTAL
ALIVE      203      118      178      212      711
           9.2%     5.4%     8.1%     9.6%     32.3%
DEAD       122      167      528      673      1490
           5.5%     7.6%     24%      30.6%    67.7%
TOTAL      325      285      706      885      2201
           14.8%    12.9%    32.1%    40.2%    100%
BOTTOM ROW: MARGINAL DISTRIBUTION FOR TICKET CLASS
RIGHT COLUMN: MARGINAL DISTRIBUTION FOR SURVIVAL STATUS
88
DID THE CHANCE OF SURVIVING THE TITANIC
SINKING DEPEND(CONDITION) ON THE TICKET
CLASS? TO ANSWER, WE CREATE A CONDITIONAL
DISTRIBUTION TABLE.
• PERCENTAGES OF COLUMN – THE WHO IS RESTRICTED TO
THE NUMBER OF PASSENGERS IN EACH CLASS.
• TYPICAL QUESTION: WHAT IS THE CONDITIONAL
DISTRIBUTION OF SURVIVAL BY TICKET CLASS?

                    TICKET CLASS
SURVIVAL   1ST      2ND      3RD      CREW     TOT
ALIVE      203      118      178      212      711
           62.5%    41.4%    25.2%    24%      32.3%
DEAD       122      167      528      673      1490
           37.5%    58.6%    74.8%    76%      67.7%
TOT        325      285      706      885      2201
           100%     100%     100%     100%     100%
89
CONDITIONAL DISTRIBUTION TABLES:
PERCENTAGES OF ROW:WHO IS RESTRICTED

                    TICKET CLASS
SURVIVAL   FIRST    SECOND   THIRD    CREW     TOTAL
ALIVE      203      118      178      212      711
           28.6%    16.6%    25%      29.8%    100%
DEAD       122      167      528      673      1490
           8.2%     11.2%    35.4%    45.2%    100%
TOTAL      325      285      706      885      2201
           14.8%    12.9%    32.1%    40.2%    100%


90
A DISTRIBUTION OF ONE VARIABLE, GIVEN THE
VALUE OF ANOTHER IS CALLED A CONDITIONAL
DISTRIBUTION
• THE DISTRIBUTION OF A VARIABLE RESTRICTING
THE WHO TO CONSIDER ONLY A SMALLER GROUP
OF INDIVIDUALS IS CALLED A CONDITIONAL
DISTRIBUTION.

THE CONDITIONAL DISTRIBUTION OF TICKET CLASS,
CONDITIONAL ON HAVING SURVIVED

        FIRST   SECOND   THIRD   CREW    TOTAL

ALIVE   203     118      178     212     711

        28.6%   16.6%    25%     29.8%   100%
91
THE CONDITIONAL DISTRIBUTION OF TICKET
CLASS, CONDITIONAL ON HAVING PERISHED.

FIRST   SECOND THIRD    CREW    TOTAL

DEAD   122     167     528     673     1490

8.2%    11.2%   35.4%   45.2%   100%

92
INDEPENDENCE
• VARIABLES ARE SAID TO BE INDEPENDENT IF THE
CONDITIONAL DISTRIBUTION OF ONE VARIABLE IS
THE SAME FOR EACH CATEGORY OF THE OTHER.
• IN A CONTINGENCY TABLE, WHEN THE
DISTRIBUTION OF ONE VARIABLE IS THE SAME
FOR ALL CATEGORIES OF ANOTHER, WE SAY THE
VARIABLES ARE INDEPENDENT.

[READ “SEGMENTED BAR CHARTS,” PAGES 24 – 32 OF
THE TEXTBOOK FOR FURTHER UNDERSTANDING]

93
CLASS EXAMPLE

• STUDENTS IN AN INTRO STATS COURSE DESCRIBE THEIR
POLITICS AS “LIBERAL,” “MODERATE,” OR
“CONSERVATIVE.” THE RESULTS ARE IN THE TABLE:

         L    M    C    TOT
FEMALE   35   36   6    77
MALE     50   44   21   115
TOT      85   80   27   192

94
(A) WHAT PERCENT OF THE CLASS IS MALE? [59.9%]
(B) WHAT PERCENT OF THE CLASS CONSIDERS
THEMSELVES TO BE “CONSERVATIVE”? [14.1%]
(C) WHAT PERCENT OF THE MALES IN THE CLASS
CONSIDER THEMSELVES TO BE “CONSERVATIVE”?
[18.3%]
(D) WHAT PERCENT OF ALL STUDENTS IN THE
CLASS ARE MALES WHO CONSIDER THEMSELVES
TO BE “CONSERVATIVE”? [10.9%]
(E) WHAT PERCENT OF ALL FEMALES IN THE CLASS
ARE “LIBERALS”? [45.45%]
(F) WHAT PERCENT OF ALL MALES IN THE CLASS
ARE “LIBERALS”? [43.48%]

95
(G) FIND THE CONDITIONAL DISTRIBUTIONS (PERCENTAGES)
OF POLITICAL VIEWS FOR THE FEMALES.

(H) FIND THE CONDITIONAL DISTRIBUTIONS
(PERCENTAGES) OF POLITICAL VIEWS FOR THE
MALES.

(I) MAKE A GRAPHICAL DISPLAY THAT COMPARES
THE TWO DISTRIBUTIONS.

96
(J) DO THE VARIABLES POLITICS AND SEX APPEAR
TO BE INDEPENDENT? EXPLAIN.

97

```
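
The numerical summaries in the slides (quartiles, IQR, fences, and standard deviation) can be checked with a short script. The following is only a minimal sketch in Python, which the course itself does not use (the slides rely on Minitab and Excel). The function names `quartiles`, `five_number_summary`, `fences`, and `sample_sd` are made up for illustration. The quartile rule is the one stated in the slides: split the sorted data at the median and include the median in both halves when n is odd. Other software may use a different quartile convention and give slightly different values.

```python
from math import sqrt
from statistics import mean, median


def quartiles(values):
    """Return (Q1, Q3) using the slides' rule: medians of the lower and
    upper halves, with the median included in both halves when n is odd."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2            # half size; counts the median if n is odd
    lower = data[:half]
    upper = data[n - half:]
    return median(lower), median(upper)


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, q3 = quartiles(values)
    return min(values), q1, median(values), q3, max(values)


def fences(values):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def sample_sd(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    m = mean(values)
    variance = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return sqrt(variance)


# The 19-value data set used in the IQR slides.
scores = [45, 46, 49, 35, 76, 80, 89, 94, 37, 61,
          62, 64, 68, 56, 57, 57, 59, 71, 72]

print(five_number_summary(scores))   # (35, 52.5, 61, 71.5, 94)
print(fences(scores))                # (24.0, 100.0)
print(sample_sd([1, 3, 5, 7, 9]))    # square root of 10, about 3.16
```

Since every score lies between the fences 24 and 100, this rule flags no outliers in that data set.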
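
The conditional distribution tables for the Titanic data can be reproduced from the counts alone. The sketch below is illustrative rather than part of the course materials: it assumes Python, and the names `counts`, `column_percentages`, and `row_percentages` are hypothetical. Column percentages condition on ticket class (survival within each class); row percentages condition on survival status (ticket class among survivors or among the dead).

```python
# Counts from the Titanic contingency table in the slides.
counts = {
    "ALIVE": {"FIRST": 203, "SECOND": 118, "THIRD": 178, "CREW": 212},
    "DEAD":  {"FIRST": 122, "SECOND": 167, "THIRD": 528, "CREW": 673},
}


def column_percentages(table):
    """Percent of each column total: distribution of the row variable
    (survival) within each column category (ticket class)."""
    columns = list(next(iter(table.values())).keys())
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {r: {c: round(100 * row[c] / totals[c], 1) for c in columns}
            for r, row in table.items()}


def row_percentages(table):
    """Percent of each row total: distribution of the column variable
    (ticket class) within each row category (survival status)."""
    return {r: {c: round(100 * v / sum(row.values()), 1)
                for c, v in row.items()}
            for r, row in table.items()}


by_class = column_percentages(counts)
by_survival = row_percentages(counts)

print(by_class["ALIVE"])     # survival rate within each ticket class
print(by_survival["DEAD"])   # ticket class distribution among the dead
```

If ticket class and survival were independent, the survival percentages in `by_class["ALIVE"]` would be roughly equal across the four classes. Here they range from about 24% to 62.5%, which suggests the variables are not independent.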