WELCOME TO STT200
WELCOME TO STT 200
• INSTRUCTOR: DR. Elijah E. DIKONG
• VISITING PROFESSOR
• COUNTRY: CAMEROON [AFRICA]
• CLASS WEBSITE:
– http://www.stt.msu.edu
1
What Is Statistics?
Statistics: Two Different Meanings:
(a) IN PLURAL SENSE, STATISTICS MEANS A SET
OF OBSERVATIONS, USUALLY COLLECTED BY
MEASUREMENTS OR COUNTING, COLLECTIVELY
KNOWN AS DATA.
(b) IN THE SINGULAR SENSE, STATISTICS REFERS TO A
GROUP OF SCIENTIFIC METHODS USED FOR
* collecting data
* interpreting and analyzing data
* making conclusions or inferences.
2
TYPES OF STATISTICS
• DESCRIPTIVE STATISTICS
• INFERENTIAL STATISTICS
STATISTICS REQUIRES SIX PROCEDURES
1. UNDERSTANDING
2. ANALYZING
3. PLANNING
4. EXECUTING
5. CHECKING
6. REPORTING
3
DESCRIPTIVE STATISTICS
• DEFINED AS THOSE METHODS INVOLVING THE COLLECTION,
PRESENTATION, AND CHARACTERIZATION OF A SET OF
DATA IN ORDER TO DESCRIBE PROPERLY THE VARIOUS
FEATURES OF THAT SET OF DATA. TO ACHIEVE THESE,
STATISTICIANS USE TABLES – EITHER FREQUENCY OR
CONTINGENCY; BAR AND PIE CHARTS; STEM-AND-LEAF
DISPLAYS; BOX-AND-WHISKER PLOTS; PARETO DIAGRAMS;
HISTOGRAMS.
• INFERENTIAL STATISTICS
DEFINED AS THOSE METHODS, E.G. PROBABILITY THEORY,
THAT MAKE POSSIBLE THE ESTIMATION OF A
CHARACTERISTIC OF A POPULATION OR THE MAKING OF A
DECISION CONCERNING A POPULATION BASED ONLY ON
SAMPLE RESULTS.
4
RELEVANT
STATISTICAL
TERMINOLOGIES
5
POPULATION VERSUS SAMPLE
• POPULATION – A POPULATION IS THE
TOTAL GROUP OF INDIVIDUALS ABOUT
WHOM YOU WANT TO MAKE
CONCLUSIONS.
• SAMPLE: A REPRESENTATIVE SUBSET OF
THE POPULATION FOR WHOM YOU
ACTUALLY HAVE DATA.
• ILLUSTRATION – “POT OF SOUP”
6
EXAMPLE: IDENTIFY THE
POPULATION AND THE SAMPLE
• A QUESTION POSTED ON THE LYCOS WEBSITE IN THE USA
ON 18 JUNE 2000 ASKED VISITORS TO THE SITE TO SAY
WHETHER THEY THOUGHT MARIJUANA SHOULD BE
LEGALLY AVAILABLE FOR MEDICINAL PURPOSES.
• THE GALLUP POLL INTERVIEWED 1007 RANDOMLY
SELECTED U.S. ADULTS AGED 18 AND OLDER, MARCH 23 –
25, 2007. GALLUP REPORTS THAT WHEN ASKED WHEN, IF EVER,
THE EFFECTS OF GLOBAL WARMING WILL BEGIN TO
HAPPEN, 60% OF THE RESPONDENTS SAID THE EFFECTS
HAD ALREADY BEGUN. ONLY 11% THOUGHT THAT THEY
WOULD NEVER HAPPEN.
7
DEFINITIONS – PARAMETER VERSUS
STATISTIC
• PARAMETER (POPULATION PARAMETER):
A PARAMETER IS A NUMERICAL SUMMARY
OF THE POPULATION.
• STATISTIC (SAMPLE STATISTIC) – A
STATISTIC IS A NUMERICAL SUMMARY OF
A SAMPLE TAKEN FROM THE POPULATION.
• ILLUSTRATION:
8
DATA: SYSTEMATICALLY RECORDED INFORMATION,
WHETHER NUMBERS OR LABELS, TOGETHER WITH
ITS CONTEXT.
CONTEXT TELLS THE WHO, WHAT, WHEN, WHERE, HOW AND
WHY OF THE DATA.
• WHO: INDIVIDUALS ABOUT WHOM DATA ARE
COLLECTED (PARTICIPANTS, RESPONDENTS, SUBJECTS,
EXPERIMENTAL UNITS, RECORDS, CASES)
• WHAT: CHARACTERISTICS RECORDED ABOUT
EACH INDIVIDUAL (VARIABLES)
• WHEN: TIME [DAYS, YEARS, ETC.]
• WHERE: PLACE, E.G. CITY
• WHY: PURPOSE OF STUDY
• HOW: METHOD OF COLLECTING DATA, E.G. SURVEY
9
CLASS DISCUSSION 1
• BECAUSE OF THE DIFFICULTY OF
WEIGHING A BEAR IN THE WOODS,
RESEARCHERS CAUGHT AND MEASURED
54 BEARS, RECORDING THEIR WEIGHT,
NECK SIZE, LENGTH, AND SEX. THEY
HOPED TO FIND A WAY TO ESTIMATE THE
WEIGHT FROM THE OTHER, MORE EASILY
DETERMINED QUANTITIES. IDENTIFY THE
W’S.
10
DATA TABLE – AN ARRANGEMENT OF DATA IN WHICH EACH
ROW REPRESENTS A CASE [AN INDIVIDUAL ABOUT WHOM OR WHICH
WE HAVE DATA] AND EACH COLUMN REPRESENTS A VARIABLE.
NAME   AGE (YR)  TIME (DAYS)  AREA CODE  NEAREST STADIUM  INTERNET PURCHASE  CATALOG NUMBER  ARTIST
CATHY  22        130          312        ALI              Y                  7TY73           MASS
SAM    24        18           305        LINCO            N                  CKJ24           BOST
CHRIS  43        368          610        VET              Y                  JKN23           FLORI
LINDA  35        5            413        SPAR             Y                  7O28Y           APRIL
11
VARIABLES
DEFINITION: THE CHARACTERISTICS RECORDED ABOUT
EACH INDIVIDUAL ARE CALLED VARIABLES.
TYPES OF VARIABLES
CATEGORICAL (QUALITATIVE)
OUTCOMES FALL INTO CATEGORIES. OUTCOMES
MAY BE IN WORDS OR NUMERALS. EXAMPLES:
* COLOR OF EYES (BLUE, BROWN, …)
* PROFESSION (ENGINEER, FARMER, TEACHER, …)
QUANTITATIVE (NUMERICAL)
OUTCOMES ARE NUMBERS, EITHER DISCRETE OR
CONTINUOUS. EXAMPLES:
* HEIGHTS OF MSU STUDENTS
* NUMBER OF FLOWERS ON A PLANT
* NUMBER OF SUCCESSFUL SURGERIES AT SPARROW
HOSPITAL LAST FALL
12
QUANTITATIVE VARIABLES
• DISCRETE QUANTITATIVE VARIABLE: A
VARIABLE IS DISCRETE IF IT TAKES ITS
VALUE FROM A COUNTABLE SET OF
NUMBERS LIKE {0, 1, 2, 3, 4, … }
• CONTINUOUS QUANTITATIVE VARIABLE: A
VARIABLE IS CONTINUOUS IF IT TAKES ITS
POSSIBLE VALUES FROM AN INTERVAL OR
A CONTINUUM LIKE [2, 7], (- 5, 10), OR THE
ENTIRE NUMBER LINE, R.
13
QUANTITATIVE AND
QUALITATIVE(CATEGORICAL) DATA
• DATA COLLECTED FROM A
QUANTITATIVE VARIABLE IS CALLED
QUANTITATIVE DATA.
• EXAMPLES INCLUDE THE HEIGHTS AND
WEIGHTS OF STUDENTS AND THE TIME TO
COMPLETE DIFFERENT TASKS.
• DATA COLLECTED FROM A
CATEGORICAL VARIABLE IS CALLED
CATEGORICAL DATA.
14
CLASS WORK 2
IN JUNE 2003 CONSUMER REPORTS PUBLISHED AN ARTICLE
ON SOME SPORT UTILITY VEHICLES THEY HAD TESTED
RECENTLY. THEY REPORTED SOME BASIC INFORMATION
ABOUT EACH OF THE VEHICLES AND THE RESULTS OF SOME
TESTS CONDUCTED BY THEIR STAFF. AMONG OTHER THINGS,
THE ARTICLE TOLD THE BRAND OF EACH VEHICLE, ITS PRICE,
AND WHETHER IT HAD A STANDARD OR AUTOMATIC
TRANSMISSION. THEY REPORTED THE VEHICLE’S FUEL
ECONOMY, ITS ACCELERATION(NUMBER OF SECONDS TO
GO FROM ZERO TO 60MPH), AND ITS BRAKING DISTANCE
TO STOP FROM 60MPH. THE ARTICLE ALSO RATED EACH
VEHICLE’S RELIABILITY BETTER THAN AVERAGE,
AVERAGE, WORSE, OR MUCH WORSE THAN AVERAGE.
IDENTIFY THE W’S. LIST THE VARIABLES. INDICATE WHETHER
EACH VARIABLE IS CATEGORICAL OR QUANTITATIVE. IF
THE VARIABLE IS QUANTITATIVE, TELL THE UNITS.
15
CLASS WORK 3
IN JUNE 2000, A HOMEOWNER IN TUSCOLA, ILLINOIS,
WANTED TO DETERMINE IF GENERIC FERTILIZER
AND WEED KILLER IS AS EFFECTIVE AS THE
MORE EXPENSIVE NAME BRAND PRODUCT.
AFTER THE SPRING RAINS AND EARLY SUMMER
WARMTH, HE COUNTED THE NUMBER OF WEEDS
AND DENSITY OF GRASS BLADES.
IDENTIFY WHO, WHERE, WHEN, AND WHY FOR THE
SITUATION DESCRIBED.
A. A HOMEOWNER; TUSCOLA, ILLINOIS; JUNE 2000;
COMPARE PRODUCTS.
B. TWO PATCHES OF LAWN; TUSCOLA, ILLINOIS;
JUNE 2001; COMPARE PRODUCTS.
C. TWO PATCHES OF LAWN; ARCOLA, ILLINOIS;
JUNE 2000; COMPARE PRODUCTS.
D. A HOMEOWNER; ARCOLA, ILLINOIS; JUNE 2000;
COMPARE PRODUCTS.
E. TWO PATCHES OF LAWN; TUSCOLA, ILLINOIS;
JUNE 2000; COMPARE PRODUCTS.
16
CLASS WORK 4
AN ADMINISTRATOR IN A SCHOOL DISTRICT WITH SEVERAL
FIFTH GRADE CLASSROOMS OF ESSENTIALLY THE SAME
SIZE COLLECTS DATA ON THE VARIOUS CLASSES. AMONG
THE VARIABLES WERE THE NUMBER OF SINGLE PARENT
FAMILIES, AVERAGE FAMILY INCOME, STRUCTURE OF
SCHOOL(K-5, 5-8, K-8), NUMBER ELIGIBLE FOR
FREE/REDUCED LUNCH, MAJORITY BRING/BUY
LUNCH(YES/NO), AVERAGE DISTANCE TO SCHOOL, AND
NUMBER OF PARENTAL VISITS TO SCHOOL.
SELECT THE STATEMENT THAT CLASSIFIES THE VARIABLES
IN ORDER WITH Q REPRESENTING A QUANTITATIVE
VARIABLE AND C REPRESENTING A CATEGORICAL
VARIABLE.
(A) C,Q,C,Q,C,Q,Q
(B) Q,C,Q,C,Q,C,C
(C) Q,Q,C,Q,C,Q,C
(D) C,C,Q,C,Q,C,C
(E) Q,Q,C,Q,C,Q,Q
17
MEASURES OF CENTER OF
QUANTITATIVE DATA
• THE CENTER IS A VALUE THAT
ATTEMPTS THE IMPOSSIBLE BY
SUMMARIZING THE ENTIRE
DISTRIBUTION OR DATA SET WITH A
SINGLE NUMBER, A “TYPICAL”
VALUE. MEASURES OF CENTER
INCLUDE THE MEAN AND THE
MEDIAN.
18
DEFINITION
• MEAN: THE MEAN IS THE SUM OF THE
OBSERVATIONS DIVIDED BY THE
NUMBER OF OBSERVATIONS.
• MEDIAN: THE MEDIAN IS THE
MIDPOINT OF THE OBSERVATIONS
WHEN THEY ARE ORDERED FROM
THE SMALLEST TO THE LARGEST (OR
FROM THE LARGEST TO SMALLEST).
19
EXAMPLE
• FIND THE MEAN AND MEDIAN OF THE
SET OF OBSERVATIONS: 7, 1, 5, 3, 4.
• FIND THE MEAN AND MEDIAN OF 4, 2,
8, 6.
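The two exercises above can be checked with a short Python sketch. The function names `mean` and `median` are illustrative choices, not part of the course material; the definitions follow the slide on mean and median.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Midpoint of the ordered observations.

    With an even number of observations, average the middle two.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


first = [7, 1, 5, 3, 4]
second = [4, 2, 8, 6]

print(mean(first), median(first))    # 4.0 4
print(mean(second), median(second))  # 5.0 5.0
```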
20
CHALLENGE QUESTION
• PROFESSOR DIKONG GAVE HIS FIRST
TEST TO HIS STT 200 STUDENTS. HIS
COLLEAGUE IS INTERESTED IN HOW HIS
STUDENTS PERFORMED IN THE TEST.
• HOW SHOULD PROFESSOR DIKONG
ANSWER IN ORDER TO GIVE HIS
COLLEAGUE A BETTER IDEA OF HOW
HIS STUDENTS PERFORMED IN THE
TEST?
21
OUTLIERS
• OUTLIERS ARE UNUSUAL OR EXTREME
VALUES THAT DO NOT APPEAR TO
BELONG WITH THE REST OF THE DATA.
• SUCH STRAGGLERS STAND OFF AWAY
FROM THE BODY OF THE DISTRIBUTION OF
THE DATA SET.
• OUTLIERS CAN AFFECT MANY
STATISTICAL ANALYSES, SO YOU SHOULD
ALWAYS BE ALERT FOR THEM.
22
MEASURES OF SPREAD OF
QUANTITATIVE DATA
• A MEASURE OF SPREAD IS A
NUMERICAL SUMMARY OF HOW
TIGHTLY THE VALUES ARE
CLUSTERED AROUND THE CENTER.
• MEASURES OF SPREAD ARE:
– STANDARD DEVIATION
– INTERQUARTILE RANGE (IQR)
– RANGE
23
RANGE = (MAXIMUM OBSERVATION) –
(MINIMUM OBSERVATION)
• EXAMPLE: FIND THE RANGE OF THE
DATA SET: 45, 46, 49, 35, 76, 80, 89, 94,
37, 61, 62, 64, 68, 56, 57, 57, 71, 72
• MAXIMUM OBSERVATION = 94
• MINIMUM OBSERVATION = 35
• RANGE = MAX – MIN = 94 – 35 = 59
24
VARIANCE AND STANDARD
DEVIATION
• THE RANGE USES ONLY THE
LARGEST AND SMALLEST
OBSERVATIONS. THE MOST
POPULAR SUMMARY OF
SPREAD USES ALL THE DATA.
IT IS CALLED THE STANDARD
DEVIATION.
25
COMPUTING THE MEASURES OF SPREAD –
VARIANCE AND STANDARD DEVIATION
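\[ \mathrm{VAR}(X) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
\[ \mathrm{SD}(X) = \sqrt{\mathrm{VAR}(X)} \]
where \( \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \)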
26
ILLUSTRATION
• HERE ARE THE AGES FOR A SAMPLE OF n = 5
CHILDREN: 1, 3, 5, 7, 9. FIND THE STANDARD
DEVIATION FOR THIS DATA SET.
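As a check on the hand calculation, here is a minimal Python sketch of the sample variance and standard deviation with the n - 1 divisor. The function names `sample_variance` and `sample_sd` are assumptions made for illustration.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))


ages = [1, 3, 5, 7, 9]

# Mean is 5; squared deviations are 16, 4, 0, 4, 16, which sum to 40.
print(sample_variance(ages))      # 10.0
print(round(sample_sd(ages), 3))  # 3.162
```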
27
INTERQUARTILE RANGE (IQR)
• WE SHALL CONSIDER THE FOLLOWING
DATA SET TO ILLUSTRATE INTERQUARTILE
RANGE (IQR)
DATA: 45, 46, 49, 35, 76, 80, 89, 94, 37, 61,
62, 64, 68, 56, 57, 57, 59, 71, 72.
SORTED DATA: 35, 37, 45, 46, 49, 56, 57,
57, 59, 61, 62, 64, 68, 71,
72, 76, 80, 89, 94.
28
NOTATION
• INTERQUARTILE RANGE (IQR) = Q3 – Q1
Q3 = UPPER QUARTILE
= MEDIAN OF UPPER HALF OF DATA(INCLUDE MEDIAN IF
n IS ODD)
Q1 = LOWER QUARTILE
= MEDIAN OF LOWER HALF OF DATA(INCLUDE MEDIAN
IF n IS ODD)
29
Quartiles
EXAMPLE: (odd number of observations, 19)
Median = 61
UPPER HALF
35 37 45 46 49 56 57 57 59 [61 62 64 68 71 72 76 80 89
94]
Q3 = (71 +72) / 2 = 71.5
LOWER HALF
[35 37 45 46 49 56 57 57 59 61] 62 64 68 71 72 76 80 89
94
Q1 = (49 + 56) / 2 = 52.5
IQR = 71.5 – 52.5 = 19
Note: Include the median in the calculation of both
quartiles if n is odd.
30
Quartiles
EXAMPLE: (even number of observations, 18)
35 37 45 46 49 56 57 57 59 [60] 61 62 64 68 71 72 76 80 89
60 = Median = (59 + 61) / 2 (average of the middle two
numbers)
UPPER HALF
35 37 45 46 49 56 57 57 59 [61 62 64 68 71 72 76 80 89]
Q3 = 71
LOWER HALF
[35 37 45 46 49 56 57 57 59] 61 62 64 68 71 72 76 80 89
Q1 = 49
IQR = 71 – 49 = 22
31
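The quartile rule used in these two examples (split the sorted data in half, and include the median in both halves when n is odd) can be sketched in Python as follows. The helper names are illustrative. Other textbooks and software packages use different quartile conventions, so their results may differ slightly.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) using the rule from these slides."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # Odd n: the median belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(upper)


odd_data = [35, 37, 45, 46, 49, 56, 57, 57, 59, 61,
            62, 64, 68, 71, 72, 76, 80, 89, 94]
even_data = odd_data[:-1]  # drop 94, leaving 18 observations

q1, q3 = quartiles(odd_data)
print(q1, q3, q3 - q1)  # 52.5 71.5 19.0

q1_even, q3_even = quartiles(even_data)
print(q1_even, q3_even, q3_even - q1_even)  # 49 71 22
```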
Classroom Problems
• 1. Here are the costs of 10 electric smooth-top
ranges rated very good or excellent by
Consumer Reports in August 2002.
• 850 900 1400 1200 1050
• 1000 750 1250 1050 565
• Find the following statistics by hand:
• a) mean
• b) median and quartiles
• c) range and IQR
32
SOLUTION
• Step 1: Sort Data:
565 750 850 900 1000 1050 1050 1200 1250 1400
• Step 2: Compute the statistics:
Mean = 1001.5
Median = 1025
Q1 = 850
Q3 = 1200
Range = 835
IQR = 350
33
5 – NUMBER SUMMARY
• THE 5-NUMBER SUMMARY OF A DISTRIBUTION REPORTS ITS
MEDIAN, QUARTILES, AND EXTREMES(MINIMUM AND MAXIMUM)
• MAX = 94
• Q3 = 71.5
• MEDIAN = 61
• Q1 = 52.5
• MIN=35
OUTLIERS: DATA VALUES THAT LIE BEYOND THE FENCES
IQR = Q3 – Q1 = 19
UPPER FENCE = Q3 + 1.5 × IQR = 71.5 + 1.5 × 19 = 100
LOWER FENCE = Q1 – 1.5 × IQR = 52.5 – 1.5 × 19 = 24
34
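A small Python sketch of the fence rule, using the quartiles from the 19-observation example. The function names `fences` and `find_outliers` are illustrative assumptions.

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Values lying below the lower fence or above the upper fence."""
    low, high = fences(q1, q3)
    return [x for x in values if x < low or x > high]


data = [35, 37, 45, 46, 49, 56, 57, 57, 59, 61,
        62, 64, 68, 71, 72, 76, 80, 89, 94]

low, high = fences(52.5, 71.5)
print(low, high)                        # 24.0 100.0
print(find_outliers(data, 52.5, 71.5))  # []
```

Every observation lies between 24 and 100, so this data set has no outliers by the fence rule.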
DISPLAYING QUANTITATIVE DATA
(Chapter 4)
WHY DISPLAY DATA?
DATA TABLES DO NOT OFTEN HELP US
SEE (APPRECIATE) WHAT IS GOING ON. WE
NEED WAYS TO SHOW THE DATA SO THAT
WE CAN SEE
• PATTERNS
• RELATIONSHIPS
• TRENDS
• EXCEPTIONS.
35
BOXPLOTS
WHENEVER WE HAVE A 5-NUMBER SUMMARY OF A
(QUANTITATIVE) VARIABLE, WE CAN DISPLAY THE
INFORMATION IN A BOXPLOT.
• THE CENTER OF A BOXPLOT IS A BOX THAT SHOWS THE
MIDDLE HALF OF THE DATA, BETWEEN THE QUARTILES.
• THE HEIGHT OF THE BOX IS EQUAL TO THE IQR.
• IF THE MEDIAN IS ROUGHLY CENTERED BETWEEN THE
QUARTILES, THEN THE MIDDLE HALF OF THE DATA IS
ROUGHLY SYMMETRIC. IF IT IS NOT CENTERED, THE
DISTRIBUTION IS SKEWED.
• THE MAIN USE FOR BOXPLOTS IS TO COMPARE GROUPS.
36
BOXPLOT OF THE PREVIOUS EXAMPLE
[Figure: boxplot of C1; vertical axis labeled C1, scaled from 30 to 100]
37
CLASS DISCUSSION
38
HISTOGRAMS
A HISTOGRAM IS A SUMMARY GRAPH
SHOWING A COUNT OF THE DATA FALLING
IN VARIOUS RANGES OR CLASSES OR
GROUPS.
PURPOSE: TO GRAPHICALLY SUMMARIZE
AND DISPLAY THE DISTRIBUTION OF A
PROCESS DATA SET.
39
HISTOGRAM
• It is particularly useful when
there are a large number of
observations.
• The observations or data sets
for which we draw a histogram
are QUANTITATIVE variables.
40
CONSTRUCTING A HISTOGRAM
• A HISTOGRAM CAN BE CONSTRUCTED BY
SEGMENTING THE RANGE OF THE DATA INTO
EQUAL SIZED BINS (ALSO CALLED SEGMENTS,
GROUPS OR CLASSES).
FOR EXAMPLE, IF YOUR DATA RANGES FROM 1.1
TO 1.8, YOU COULD HAVE EQUAL BINS OF 0.1
CONSISTING OF SEGMENTS 1 TO 1.1; 1.1 TO 1.2;
1.2 TO 1.3; 1.3 TO 1.4; AND SO ON.
• THE VERTICAL OR Y AXIS OF THE HISTOGRAM IS
LABELED FREQUENCY (THE NUMBER OF COUNTS
FOR EACH BIN), AND THE HORIZONTAL OR X AXIS
OF THE HISTOGRAM IS LABELED WITH THE RANGE
OF THE RESPONSE VARIABLE.
41
• YOU THEN DETERMINE THE NUMBER OF
DATA POINTS THAT RESIDE WITHIN EACH
BIN AND CONSTRUCT THE HISTOGRAM.
• THE BIN SIZE CAN BE DEFINED BY THE USER, BY
SOME COMMON RULE, OR BY SOFTWARE
METHODS (SUCH AS MINITAB)
• THE BINS AND THE COUNTS IN EACH BIN GIVE THE
DISTRIBUTION OF THE QUANTITATIVE VARIABLE.
• LIKE A BAR CHART, A HISTOGRAM PLOTS THE BIN
COUNTS AS THE HEIGHTS OF BARS.
42
Histogram
• Example: Test Scores
Group    Count
0-9      1
10-19    2
20-29    3
30-39    4
40-49    5
50-59    4
60-69    3
70-79    2
80-89    2
90-100   1
43
Histogram
Example
(http://cnx.org/content/m10160/latest/)
• Scores of 642 students on a psychology
test. The test consists of 197 items each
graded as "correct" or "incorrect." The
students' scores ranged from 46 to 167.
44
Grouped Frequency Distribution of Psychology Test Scores
Interval's Lower Limit   Interval's Upper Limit   Class Frequency
39.5                     49.5                     3
49.5                     59.5                     10
59.5                     69.5                     53
69.5                     79.5                     107
79.5                     89.5                     147
89.5                     99.5                     130
99.5                     109.5                    78
109.5                    119.5                    59
119.5                    129.5                    36
129.5                    139.5                    11
139.5                    149.5                    6
149.5                    159.5                    1
159.5                    169.5                    1
45
Histogram
46
Histograms
• Example : THE WEIGHTS OF 23
“THREE-POUND” BAGS OF APPLES
ARE GIVEN AS FOLLOWS:
• 3.26 3.62 3.39 3.12 3.53 3.30 3.10 3.26
3.19 3.22 3.14 3.39 3.31 3.49 3.41 3.02
3.17 3.20 3.12 3.42 3.36 3.21 3.26
• USE THESE DATA TO CONSTRUCT A
HISTOGRAM FOR THE WEIGHT DATA
47
GROUPED FREQUENCY DISTRIBUTION FOR
WEIGHTS OF 3 LB APPLE BAGS WITH BIN WIDTH = 0.1
BINS FREQUENCY
2.95 TO 3.05 1
3.05 TO 3.15 4
3.15 TO 3.25 5
3.25 TO 3.35 5
3.35 TO 3.45 5
3.45 TO 3.55 2
3.55 TO 3.65 1
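The bin counts in this table can be reproduced with a short Python sketch. It works in hundredths of a pound to avoid floating-point surprises at the bin edges; the constant names `START`, `WIDTH`, and `NUM_BINS` are assumptions made for illustration.

```python
weights = [3.26, 3.62, 3.39, 3.12, 3.53, 3.30, 3.10, 3.26,
           3.19, 3.22, 3.14, 3.39, 3.31, 3.49, 3.41, 3.02,
           3.17, 3.20, 3.12, 3.42, 3.36, 3.21, 3.26]

START = 295    # lower edge of the first bin, in hundredths (2.95)
WIDTH = 10     # bin width, in hundredths (0.1)
NUM_BINS = 7   # bins from 2.95 up to 3.65

counts = [0] * NUM_BINS
for w in weights:
    hundredths = round(w * 100)
    counts[(hundredths - START) // WIDTH] += 1

# Print each bin with a row of stars as a rough text histogram.
for i, count in enumerate(counts):
    low = (START + i * WIDTH) / 100
    high = (START + (i + 1) * WIDTH) / 100
    print(f"{low:.2f} TO {high:.2f}  {'*' * count}  {count}")
```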
48
Histogram
[Figure: Histogram of Weights of 3 lb Apple Bags; horizontal axis C1 from 3.0 to 3.6, vertical axis Frequency from 0 to 5]
49
Histogram (Excel)
[Figure: Excel histogram; horizontal axis Bin (3.02, 3.17, 3.32, 3.47, More), vertical axis Frequency from 0 to 10]
50
Histogram (Minitab Commands)
• Open Minitab
• Click on Graph > Histogram > Simple > OK
• Click on C1 > Select
• Click on Labels > Title (write the title of
your histogram)
• Click OK > Click OK
51
Histogram
EXAMPLE 2.
-4.50, -3.25, -1.75, -1.59, -1.44,
-1.22, -1.16, -0.88, -0.75, -0.72,
-0.69, -0.50, -0.50, -0.38, -0.28,
-0.22, -0.16, 0.03, 0.12, 0.34, 0.47,
0.62, 0.69, 0.75, 0.78, 0.81, 1.16,
1.47, 2.06, 2.22, 2.44, 3.28, 3.34,
4.12, 4.31, 5.62 , 5.85
52
FREQUENCY DISTRIBUTION OF CLASS DATA
CLASSES FREQUENCY
-4.5 TO -3.5 1
-3.5 TO -2.5 1
-2.5 TO -1.5 2
-1.5 TO -0.5 7
-0.5 TO 0.5 10
0.5 TO 1.5 7
1.5 TO 2.5 3
2.5 TO 3.5 2
3.5 TO 4.5 2
4.5 TO 5.5 0
5.5 TO 6.5 2
53
Histogram
[Figure: Histogram of class data; horizontal axis C1 from -4 to 6, vertical axis Frequency from 0 to 10]
54
[Figure: Excel histogram of the class data; horizontal axis Bin (-4.5, -2.775, -1.05, 0.675, 2.4, 4.125, More), vertical axis Frequency from 0 to 20]
55
DESCRIBING THE DISTRIBUTION OF A
QUANTITATIVE VARIABLE FROM
HISTOGRAMS
• WHEN YOU DESCRIBE THE DISTRIBUTION
OF A [QUANTITATIVE] VARIABLE, YOU
SHOULD ALWAYS TELL ABOUT FOUR
THINGS:
• SHAPE
• CENTER
• SPREAD
• UNUSUAL FEATURES OR OUTLIERS
56
THE SHAPE OF A DISTRIBUTION
1. DOES THE HISTOGRAM HAVE A SINGLE,
CENTRAL HUMP OR SEVERAL SEPARATED
HUMPS? THESE HUMPS ARE CALLED
MODES.
A HISTOGRAM WITH ONE PEAK IS DUBBED
UNIMODAL; HISTOGRAMS WITH TWO PEAKS
ARE CALLED BIMODAL, AND THOSE WITH
THREE OR MORE PEAKS ARE CALLED
MULTIMODAL. A HISTOGRAM THAT DOESN’T
APPEAR TO HAVE ANY MODE AND IN WHICH
ALL THE BARS ARE APPROXIMATELY THE
SAME HEIGHT IS CALLED UNIFORM.
57
UNIMODAL, BIMODAL, MULTI-MODAL,
UNIFORM HISTOGRAMS
58
2. IS THE HISTOGRAM SYMMETRIC?
• CAN YOU FOLD THE HISTOGRAM ALONG A
VERTICAL LINE THROUGH THE MIDDLE AND HAVE
THE EDGES MATCH PRETTY CLOSELY, OR ARE
MORE OF THE VALUES ON ONE SIDE?
• THE (USUALLY) THINNER ENDS OF A DISTRIBUTION
ARE CALLED TAILS. IF ONE TAIL STRETCHES OUT
FARTHER THAN THE OTHER, THE HISTOGRAM IS
SAID TO BE SKEWED TO THE SIDE OF THE LONGER
TAIL.
• A “SKEWED RIGHT” DISTRIBUTION IS ONE IN WHICH
THE TAIL IS ON THE RIGHT SIDE.
• A “SKEWED LEFT” DISTRIBUTION IS ONE IN WHICH
THE TAIL IS ON THE LEFT SIDE.
59
RIGHT-SKEWED HISTOGRAM
60
SYMMETRIC HISTOGRAM
61
LEFT-SKEWED HISTOGRAM
62
3. DO ANY UNUSUAL FEATURES STICK
OUT?
• UNUSUAL FEATURES OR OUTLIERS ARE
EXTREME VALUES THAT DO NOT APPEAR
TO BELONG WITH THE REST OF THE DATA.
SUCH STRAGGLERS STAND OFF AWAY
FROM THE BODY OF THE DISTRIBUTION.
OUTLIERS CAN AFFECT MANY STATISTICAL
ANALYSES, SO YOU SHOULD ALWAYS BE
ALERT FOR THEM.
63
ILLUSTRATION
64
THE CENTER OF THE DISTRIBUTION:
THE MEDIAN
• THE CENTER IS A VALUE THAT ATTEMPTS
THE IMPOSSIBLE BY SUMMARIZING THE
ENTIRE DISTRIBUTION WITH A SINGLE
NUMBER, A “TYPICAL” VALUE. MEASURES
OF CENTER INCLUDE THE MEAN AND
MEDIAN.
• WHEN A HISTOGRAM IS UNIMODAL AND
SYMMETRIC, WE’D AGREE ON THE CENTER
OF SYMMETRY, WHERE WE WOULD FOLD
THE HISTOGRAM TO MATCH THE TWO
SIDES.
65
• WHEN THE DISTRIBUTION IS SKEWED OR
POSSIBLY MULTIMODAL, DEFINING THE
CENTER IS MORE OF A CHALLENGE.
• CAN THE MIDRANGE = [MAX. + MIN.]/2
HELP OUT?
• NOT AT ALL!!!
• WHY?
• IT IS TOO SENSITIVE TO THE
OUTLYING VALUES TO BE SAFE FOR
SUMMARIZING THE WHOLE
DISTRIBUTION.
66
BEATING THE CHALLENGE
• A MORE REASONABLE CHOICE OF
TYPICAL VALUE IS THE VALUE THAT IS
LITERALLY IN THE MIDDLE, WITH HALF
THE VALUES BELOW IT AND HALF
ABOVE IT. SUCH A MEASURE OF
CENTER IS THE MEDIAN.
67
NOTE THE FOLLOWING
68
SPREAD
• A NUMERICAL SUMMARY OF HOW
TIGHTLY THE VALUES ARE
CLUSTERED AROUND THE CENTER.
• MEASURES OF SPREAD ARE:
– STANDARD DEVIATION
– INTERQUARTILE RANGE (IQR)
– RANGE
SEE LECTURES OF WEEK 3
69
STEM AND LEAF DISPLAY
• HISTOGRAMS PROVIDE AN EASY-TO-
UNDERSTAND SUMMARY OF THE
DISTRIBUTION OF A QUANTITATIVE
VARIABLE, BUT THEY DON’T SHOW THE
DATA VALUES THEMSELVES.
• A STEM AND LEAF DIAGRAM IS AN
EXPLORATORY DATA-ANALYSIS
TECHNIQUE THAT ALLOWS US TO GROUP
DATA WITHOUT LOSING THE ORIGINAL
DATA. WE USE THE LEADING DIGIT(S) AS
THE “STEM” AND THE TRAILING DIGIT(S)
AS THE “LEAVES,” SO THAT THE NUMBERS
THEMSELVES BECOME A GRAPH OF THE
DATA.
70
• TO MAKE A STEM-AND-LEAF DISPLAY, WE CUT
EACH DATA VALUE INTO LEADING DIGITS (WHICH
BECOME THE “STEM”) AND TRAILING DIGITS (THE
“LEAVES”). THEN WE USE THE STEMS TO LABEL
THE BINS.
• STEM-AND-LEAF DISPLAYS CONTAIN ALL THE
INFORMATION FOUND IN A HISTOGRAM AND,
WHEN CAREFULLY DRAWN, SATISFY THE AREA
PRINCIPLE AND SHOW THE DISTRIBUTION. IN
ADDITION, STEM-AND-LEAF DISPLAYS PRESERVE
THE INDIVIDUAL DATA VALUES.
• UNLIKE A HISTOGRAM, STEM-AND-LEAF DISPLAYS
ALSO SHOW THE DIGITS IN THE BINS, SO THEY
CAN REVEAL UNEXPECTED PATTERNS IN THE
DATA.
71
EXAMPLE : CONSIDER THE SORTED
AND ROUNDED DATA BELOW.
-4.5, -3.3, -2, -1.8, -1.6, -1.4, -1.2, -0.9, -0.9, -0.8, -0.7, -0.7, -0.5, -0.5, -0.4,
-0.3, -0.2, -0.2, 0.0, 0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.8, 0.8, 1.2, 1.5,
2.1, 2.2, 2.4, 3.3, 3.3, 4.1, 4.3, 5.6
STEM LEAVES
-4 5
-3 3
-2 0
-1 8642
-0 99877554322
0 013567888
1 25
2 124
3 33
4 13
5 6
72
EXAMPLE : USING THE WEIGHTS OF THE BAGS
OF APPLES GIVEN IN THE EXAMPLE OF SLIDE 47,
CONSTRUCT A STEM-AND-LEAF DIAGRAM.
STEM LEAVES
3.0 2
3.1 209472
3.2 662016
3.3 90916
3.4 912
3.5 3
3.6 2
THE WEIGHTS OF THE BAGS RANGE FROM 3.02 TO
3.62, SO WE CAN USE AS STEMS THE VALUES 3.0 – 3.6.
THE LEAVES ARE DETERMINED BY THE DIGIT
FOUND IN THE HUNDREDTHS PLACE OF THE
ORIGINAL DATA.
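A stem-and-leaf display for the apple-bag weights can be built with a few lines of Python. This is a sketch that assumes two-decimal data, with the stem taken as the first two digits and the leaf as the hundredths digit; the variable names are illustrative.

```python
from collections import defaultdict

weights = [3.26, 3.62, 3.39, 3.12, 3.53, 3.30, 3.10, 3.26,
           3.19, 3.22, 3.14, 3.39, 3.31, 3.49, 3.41, 3.02,
           3.17, 3.20, 3.12, 3.42, 3.36, 3.21, 3.26]

# Map each stem (in tenths, e.g. 32 for 3.2) to its string of leaves.
leaves = defaultdict(str)
for w in weights:
    hundredths = round(w * 100)       # 3.26 -> 326
    stem, leaf = divmod(hundredths, 10)  # 326 -> stem 32, leaf 6
    leaves[stem] += str(leaf)

# Leaves appear in the order the data were listed, not sorted.
for stem in sorted(leaves):
    print(f"{stem / 10:.1f}  {leaves[stem]}")
```

Each leaf corresponds to exactly one bag, so the leaves total 23 digits, one for each weight in the data set.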
73
DOTPLOTS
• A DOTPLOT GRAPHS A DOT FOR EACH
CASE AGAINST A SINGLE AXIS.
• IT IS LIKE A STEM-AND-LEAF DISPLAY, BUT
WITH DOTS INSTEAD OF DIGITS FOR ALL
THE LEAVES.
• SOME DOTPLOTS STRETCH OUT
HORIZONTALLY, WITH THE COUNTS ON
THE VERTICAL AXIS, LIKE A HISTOGRAM.
OTHERS RUN VERTICALLY, LIKE A STEM-
AND-LEAF DISPLAY.
74
Example
THE DATA BELOW GIVE THE NUMBER OF
HURRICANES THAT HAPPENED EACH
YEAR FROM 1944 THROUGH 2000 AS
REPORTED BY SCIENCE MAGAZINE.
• 3,2,1,2,4,3,7,2,3,3,2,5,2,2,4,2,2,6,0,
2,5,1,3,1,0,3,2,1,0,1,2,3,2,1,2,2,2,3,
1,1,1,3,0,1,3,2,1,2,1,1,0,5,6,1,3,5,3
75
Dot Plot For Hurricane Data
[Figure: dot plot of the hurricane data; horizontal axis C6 from 0 to 7]
76
DESCRIPTION OF THE DISTRIBUTION
• EACH DOT REPRESENTS A YEAR IN WHICH
THERE WERE THAT MANY HURRICANES.
• THE DISTRIBUTION OF THE NUMBER OF
HURRICANES PER YEAR IS UNIMODAL
• SKEWED TO THE RIGHT
• WITH CENTER AROUND 2 HURRICANES
PER YEAR.
• THE NUMBER OF HURRICANES PER YEAR
RANGES FROM 0 TO 7.
• THERE ARE NO OUTLIERS.
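The description above can be checked by tallying the hurricane counts. The sketch below prints a rough horizontal text dotplot (one dot per year) and finds the median; it uses only the Python standard library, and the variable names are illustrative.

```python
from collections import Counter

hurricanes = [3, 2, 1, 2, 4, 3, 7, 2, 3, 3, 2, 5, 2, 2, 4, 2, 2, 6, 0,
              2, 5, 1, 3, 1, 0, 3, 2, 1, 0, 1, 2, 3, 2, 1, 2, 2, 2, 3,
              1, 1, 1, 3, 0, 1, 3, 2, 1, 2, 1, 1, 0, 5, 6, 1, 3, 5, 3]

counts = Counter(hurricanes)

# One row per possible value, one dot per year with that many hurricanes.
for value in range(min(hurricanes), max(hurricanes) + 1):
    print(f"{value}  {'.' * counts[value]}")

ordered = sorted(hurricanes)
middle = ordered[len(ordered) // 2]  # 57 years, so the 29th value is the median
print("median:", middle)
```

The longest row of dots is at 2 hurricanes per year, with a tail stretching toward 7, consistent with a unimodal, right-skewed distribution centered near 2.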
77
DISPLAYING CATEGORICAL DATA AND
CONDITIONAL DISTRIBUTIONS (Chap. 3)
• THE BAR CHART
• THE PIE CHART
78
EXAMPLE: CONSIDER THE TITANIC
• WHO: THE 2201 PEOPLE ON THE TITANIC;
• WHAT (VARIABLES):
– SURVIVAL STATUS (DEAD OR ALIVE);
– TICKET CLASS (FIRST, SECOND, THIRD, CREW);
– GENDER (MALE OR FEMALE);
• WHEN: APRIL 14, 1912;
• WHERE: NORTH ATLANTIC;
• HOW: A VARIETY OF SOURCES AND INTERNET
SITES;
• WHY: HISTORICAL INTEREST.
79
ONE VARIABLE ANALYSIS
WHO: THE 2201 PEOPLE ON THE TITANIC
WHAT: TICKET CLASS DISTRIBUTION
FREQUENCY TABLE: A FREQUENCY TABLE LISTS
THE CATEGORIES IN A CATEGORICAL VARIABLE
AND GIVES THE COUNT OR PERCENTAGE OF
OBSERVATIONS OF EACH CATEGORY.
CLASS    COUNT OR FREQUENCY   % OR RELATIVE FREQUENCY
FIRST    325                  14.766
SECOND   285                  12.949
THIRD    706                  32.076
CREW     885                  40.209
TOTAL    2201                 100
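The relative frequencies are each count divided by the total, times 100. A minimal Python sketch, with the dictionary name `counts` chosen for illustration:

```python
counts = {"FIRST": 325, "SECOND": 285, "THIRD": 706, "CREW": 885}
total = sum(counts.values())

# Relative frequency of each ticket class, as a percentage of all 2201 people.
relative = {name: 100 * count / total for name, count in counts.items()}

for name, count in counts.items():
    print(f"{name:<8}{count:>6}{relative[name]:>10.3f}")
print(f"{'TOTAL':<8}{total:>6}{sum(relative.values()):>10.0f}")
```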
80
DISTRIBUTION OF A VARIABLE
* GIVES THE POSSIBLE VALUES OF THE VARIABLE, AND
* THE RELATIVE FREQUENCY OF EACH VALUE.
GRAPHICAL DISPLAY OF A DISTRIBUTION OF CATEGORICAL
DATA
BAR CHART: A BAR CHART DISPLAYS THE
DISTRIBUTION OF A CATEGORICAL VARIABLE,
SHOWING THE COUNTS FOR EACH CATEGORY
NEXT TO EACH OTHER FOR EASY COMPARISON.
PIE CHART: PIE CHARTS SHOW THE WHOLE
GROUP OF CASES AS A CIRCLE. THEY SLICE
THE CIRCLE INTO PIECES WHOSE SIZE IS
PROPORTIONAL TO THE FRACTION OF THE
WHOLE IN EACH CATEGORY.
81
BAR CHART OF THE PEOPLE(WHO) ON THE
TITANIC WITH TICKET CLASS DISTRIBUTION(WHAT)
[Figure: bar chart; horizontal axis FIRST, SECOND, THIRD, CREW; vertical axis counts from 0 to 900]
82
PIE CHART OF PEOPLE ON THE TITANIC(WHO)
WITH TICKET CLASS DISTRIBUTION(WHAT)
[Figure: pie chart; FIRST 15%, SECOND 13%, THIRD 32%, CREW 40%]
83
THE AREA PRINCIPLE: THE AREA OCCUPIED BY A
PART OF THE GRAPH SHOULD CORRESPOND TO
THE MAGNITUDE OF THE VALUE IT REPRESENTS.
TIPS
• FIRST RULE OF DATA ANALYSIS IS ‘MAKE A
PICTURE.’
• BEFORE YOU MAKE A BAR CHART OR A PIE
CHART, ALWAYS CHECK THE CATEGORICAL DATA
CONDITION. THE DATA ARE COUNTS OR
PERCENTAGES OF INDIVIDUALS IN CATEGORIES.
• IF YOU WANT TO MAKE A RELATIVE FREQUENCY
BAR CHART OR PIE CHART, YOU’LL NEED TO
ALSO MAKE SURE THAT THE CATEGORIES DON’T
OVERLAP, SO NO INDIVIDUAL IS COUNTED TWICE.
84
TWO VARIABLES ANALYSIS
• QUESTION: WAS THERE A
RELATIONSHIP BETWEEN THE KIND
OF TICKET A PASSENGER HELD AND
THE PASSENGER’S CHANCES OF
MAKING IT INTO THE LIFEBOAT
(SURVIVAL)?
• TO ANSWER: ANALYZE THE TWO
CATEGORICAL VARIABLES TICKET
CLASS(FIRST, SECOND, THIRD,
CREW) AND SURVIVAL(ALIVE, DEAD)
85
TO LOOK AT TWO CATEGORICAL VARIABLES
TOGETHER, ARRANGE THE COUNTS IN A TWO-WAY
TABLE OR CONTINGENCY TABLE
SURVIVAL (ROWS) BY TICKET CLASS (COLUMNS)
         FIRST   SECOND   THIRD   CREW   TOTAL
ALIVE    203     118      178     212    711
DEAD     122     167      528     673    1490
TOTAL    325     285      706     885    2201
86
NOTE:
• BECAUSE THE TABLE SHOWS HOW THE INDIVIDUALS
ARE DISTRIBUTED ALONG EACH VARIABLE,
CONTINGENT ON THE VALUE OF THE OTHER
VARIABLE, SUCH A TABLE IS CALLED A
CONTINGENCY TABLE.
• THE MARGINS OF THE TABLE, BOTH ON THE RIGHT
AND AT THE BOTTOM, GIVE TOTALS.
• THE BOTTOM LINE OF THE TABLE IS JUST THE
FREQUENCY DISTRIBUTION OF THE TICKET CLASS.
• THE RIGHT COLUMN OF THE TABLE IS THE
FREQUENCY DISTRIBUTION OF THE VARIABLE
SURVIVAL.
• WHEN PRESENTED LIKE THIS, IN THE MARGINS OF A
CONTINGENCY TABLE, THE FREQUENCY DISTRIBUTION
OF ONE OF THE VARIABLES IS CALLED ITS MARGINAL
DISTRIBUTION.
87
WERE SECOND-CLASS PASSENGERS MORE LIKELY
TO SURVIVE? QUESTIONS LIKE THIS ARE MORE
NATURALLY ADDRESSED USING PERCENTAGES.
SURVIVAL (ROWS) BY TICKET CLASS (COLUMNS), WITH PERCENTAGES OF THE OVERALL TOTAL
         FIRST   SECOND   THIRD   CREW    TOTAL
ALIVE    203     118      178     212     711
         9.2%    5.4%     8.1%    9.6%    32.3%
DEAD     122     167      528     673     1490
         5.6%    7.6%     24%     30.6%   67.7%
TOTAL    325     285      706     885     2201
         14.8%   12.9%    32.1%   40.2%   100%
THE BOTTOM ROW IS THE MARGINAL DISTRIBUTION FOR TICKET CLASS.
THE RIGHT COLUMN IS THE MARGINAL DISTRIBUTION FOR SURVIVAL STATUS.
88
DID THE CHANCE OF SURVIVING THE TITANIC
SINKING DEPEND(CONDITION) ON THE TICKET
CLASS? TO ANSWER, WE CREATE A CONDITIONAL
DISTRIBUTION TABLE.
• PERCENTAGES OF COLUMN – THE WHO IS
RESTRICTED TO THE NUMBER OF
PASSENGERS IN EACH CLASS.
• TYPICAL QUESTION: WHAT IS THE
CONDITIONAL DISTRIBUTION OF
SURVIVAL BY TICKET CLASS?
SURVIVAL (ROWS) BY TICKET CLASS (COLUMNS), WITH COLUMN PERCENTAGES
         1ST     2ND     3RD     CREW    TOT
ALIVE    203     118     178     212     711
         62.5%   41.4%   25.2%   24%     32.3%
DEAD     122     167     528     673     1490
         37.5%   58.6%   74.8%   76%     67.7%
TOT      325     285     706     885     2201
         100%    100%    100%    100%    100%
89
CONDITIONAL DISTRIBUTION TABLES:
PERCENTAGES OF ROW: THE WHO IS RESTRICTED TO THE
NUMBER OF PEOPLE IN EACH SURVIVAL CATEGORY
SURVIVAL (ROWS) BY TICKET CLASS (COLUMNS), WITH ROW PERCENTAGES
         FIRST   SECOND   THIRD   CREW    TOTAL
ALIVE    203     118      178     212     711
         28.6%   16.6%    25%     29.8%   100%
DEAD     122     167      528     673     1490
         8.2%    11.2%    35.4%   45.2%   100%
TOTAL    325     285      706     885     2201
         14.8%   12.9%    32.1%   40.2%   100%
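Both kinds of conditional distribution come from the same counts; only the divisor changes. The Python sketch below computes column percentages (survival within each ticket class) and row percentages (ticket class within each survival status). The dictionary layout and variable names are assumptions made for illustration.

```python
table = {
    "ALIVE": {"FIRST": 203, "SECOND": 118, "THIRD": 178, "CREW": 212},
    "DEAD":  {"FIRST": 122, "SECOND": 167, "THIRD": 528, "CREW": 673},
}
classes = ["FIRST", "SECOND", "THIRD", "CREW"]

# Marginal totals: one per ticket class (columns), one per survival status (rows).
col_totals = {c: sum(table[s][c] for s in table) for c in classes}
row_totals = {s: sum(table[s].values()) for s in table}

# Column percentages: divide each count by its ticket-class total.
survival_by_class = {
    c: {s: 100 * table[s][c] / col_totals[c] for s in table} for c in classes
}

# Row percentages: divide each count by its survival-status total.
class_by_survival = {
    s: {c: 100 * table[s][c] / row_totals[s] for c in classes} for s in table
}

for c in classes:
    print(c, {s: round(p, 1) for s, p in survival_by_class[c].items()})
for s in table:
    print(s, {c: round(p, 1) for c, p in class_by_survival[s].items()})
```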
90
A DISTRIBUTION OF ONE VARIABLE, GIVEN THE
VALUE OF ANOTHER IS CALLED A CONDITIONAL
DISTRIBUTION
• THE DISTRIBUTION OF A VARIABLE RESTRICTING
THE WHO TO CONSIDER ONLY A SMALLER GROUP
OF INDIVIDUALS IS CALLED A CONDITIONAL
DISTRIBUTION.
THE CONDITIONAL DISTRIBUTION OF TICKET CLASS,
CONDITIONAL ON HAVING SURVIVED
         FIRST   SECOND   THIRD   CREW    TOTAL
ALIVE    203     118      178     212     711
         28.6%   16.6%    25%     29.8%   100%
91
THE CONDITIONAL DISTRIBUTION OF TICKET
CLASS, CONDITIONAL ON HAVING PERISHED.
FIRST SECOND THIRD CREW TOTAL
DEAD 122 167 528 673 1490
8.2% 11.2% 35.4% 45.2% 100%
92
INDEPENDENCE
• VARIABLES ARE SAID TO BE INDEPENDENT IF THE
CONDITIONAL DISTRIBUTION OF ONE VARIABLE IS
THE SAME FOR EACH CATEGORY OF THE OTHER.
• IN A CONTINGENCY TABLE, WHEN THE
DISTRIBUTION OF ONE VARIABLE IS THE SAME
FOR ALL CATEGORIES OF ANOTHER, WE SAY THE
VARIABLES ARE INDEPENDENT.
• [PLEASE READ “CONTINGENCY TABLES” AND
“SEGMENTED BAR CHARTS,” PAGES 24 – 32 OF
THE TEXTBOOK FOR FURTHER UNDERSTANDING]
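One informal way to look for independence is to compare the conditional distributions side by side. The sketch below does this for survival on the Titanic by ticket class. The helper name `conditional_percentages` and the use of a simple gap between the largest and smallest survival rates are illustrative assumptions, not a formal test.

```python
def conditional_percentages(group_counts):
    """Turn the counts for one group into percentages of that group's total."""
    total = sum(group_counts.values())
    return {k: 100 * v / total for k, v in group_counts.items()}


by_class = {
    "FIRST":  {"ALIVE": 203, "DEAD": 122},
    "SECOND": {"ALIVE": 118, "DEAD": 167},
    "THIRD":  {"ALIVE": 178, "DEAD": 528},
    "CREW":   {"ALIVE": 212, "DEAD": 673},
}

# Percentage surviving within each ticket class.
survival_rates = {
    ticket: conditional_percentages(group)["ALIVE"]
    for ticket, group in by_class.items()
}

for ticket, rate in survival_rates.items():
    print(f"{ticket:<8}{rate:6.1f}% survived")

# If the variables were independent, these rates would be about equal.
gap = max(survival_rates.values()) - min(survival_rates.values())
print(f"gap between highest and lowest rate: {gap:.1f} percentage points")
```

The survival rates differ widely across ticket classes, so survival and ticket class do not appear to be independent.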
93
CLASS EXAMPLE
• STUDENTS IN AN INTRO STATS COURSE
WERE ASKED TO DESCRIBE THEIR
POLITICS AS “LIBERAL,” “MODERATE,” OR
“CONSERVATIVE.” THE RESULTS ARE IN THE
TABLE (L = LIBERAL, M = MODERATE, C = CONSERVATIVE):
          L     M     C     TOT
FEMALE    35    36    6     77
MALE      50    44    21    115
TOT       85    80    27    192
94
(A) WHAT PERCENT OF THE CLASS IS MALE? [59.9%]
(B) WHAT PERCENT OF THE CLASS CONSIDERS
THEMSELVES TO BE “CONSERVATIVE”? [14.1%]
(C) WHAT PERCENT OF THE MALES IN THE CLASS
CONSIDER THEMSELVES TO BE “CONSERVATIVE”?
[18.3%]
(D) WHAT PERCENT OF ALL STUDENTS IN THE
CLASS ARE MALES WHO CONSIDER THEMSELVES
TO BE “CONSERVATIVE”? [10.9%]
(E) WHAT PERCENT OF ALL FEMALES IN THE CLASS
ARE “LIBERALS”? [45.45%]
(F) WHAT PERCENT OF ALL MALES IN THE CLASS
ARE “LIBERALS”? [43.48%]
95
(G) FIND THE CONDITIONAL DISTRIBUTIONS (PERCENTAGES)
OF POLITICAL VIEWS FOR THE FEMALES.
(H) FIND THE CONDITIONAL DISTRIBUTIONS
(PERCENTAGES) OF POLITICAL VIEWS FOR THE
MALES.
(I) MAKE A GRAPHICAL DISPLAY THAT COMPARES
THE TWO DISTRIBUTIONS.
96
(J) DO THE VARIABLES POLITICS AND SEX APPEAR
TO BE INDEPENDENT? EXPLAIN.
97