The Value in Evaluation
Erica Friedman
Assistant Dean of Evaluation
MSSM
[Figure: assessment pyramid. Levels: Knows (MCQ); Knows How (MEQ, essay, application test, note review); Shows How (practical, oral, SP, OSCE); Does (longitudinal clinical observation). Axes: Written to Observed; Formative to Summative.]
Session Goals
Understand the purposes of assessment
Understand the framework for selecting and
developing assessment methods
Recognize the benefits and limitations of different
methods of assessment
Conference Objectives
Review the goals and objectives for your course or
clerkship in the context of assessment
Identify the best methods of assessing your goals
and objectives
Purpose of Evaluation
To certify individual competence
To assure successful completion of goals/objectives
To provide feedback
To students
To faculty, course and clerkship directors
As a statement of values (what is most critical to
learn)
For Program Evaluation- evaluation of an aggregate,
not an individual (ex. average ability of students to
perform a focused history and physical)
Consequences of evaluation
Steering effect- exams “drive the learning”-
students study/learn for the exam
Impetus for change (feedback from
students, Executive Curriculum, LCME)
Definitions- Reliability
The consistency of a measurement over time or
by different observers (ex. a thermometer always
reads 98 degrees C when placed in boiling,
distilled water at sea level)
The proportion of variability in a score due to the
true difference between subjects (ex. the
difference between Greenwich time and the time
on your watch)
Inter-rater reliability (correlation between scores of 2
raters)
Internal reliability (correlation between items within an
exam)
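The two reliability measures above can be sketched in a few lines. This is a minimal illustration, not the deck's method: the rater scores are made-up numbers, the function names are hypothetical, and Cronbach's alpha is swapped in here as one common way to summarize the correlation between items within an exam.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation between two raters' scores for the same students."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cronbach_alpha(item_scores):
    """Internal reliability. item_scores holds one list per exam item,
    each containing every student's score on that item (same student order)."""
    k = len(item_scores)
    totals = [sum(student) for student in zip(*item_scores)]
    item_variance = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance / pvariance(totals))

# Hypothetical scores given by two raters to the same five students.
rater_a = [7, 5, 9, 6, 8]
rater_b = [6, 5, 9, 7, 8]
print(round(pearson(rater_a, rater_b), 2))  # 0.9
```

A value near 1 means the raters (or items) rank students consistently; it says nothing about validity.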
Definitions-Validity
The ability to measure what was intended (the
thermometer reading is reliable but not valid)
Four types-
Face/content
Criterion
Construct/predictive
Internal
Types of validity
Face/content- Would experts agree that it assesses what’s
important?-(driver’s test mirroring actual driving situation and
conditions)
Criterion- draw an inference from test scores to actual performance.
Ex. if a simulated driver’s test score predicts the road test score, the
simulation test is claimed to have a high degree of criterion validity.
Construct/predictive- does it assess what it intended to assess (ex.
Driver’s test as a predictor of the likelihood of accidents- results of
your course exam predict the student’s performance on that section
of Step 1)
Internal- do other methods assessing the same domain obtain similar
results (similar scores from multiple SPs assessing history taking
skills)
Types of Evaluations- Formative
and Summative Definitions:
Formative evaluation- provide feedback so
the learner can modify their learning
approach- “When the chef tastes the sauce,
that’s formative evaluation”
Summative evaluation- done to decide if a
student has met the minimum course
requirements (pass or fail)- usually judged
against normative standards- “when the
customer tastes the sauce, that’s summative
evaluation”
Conclusions about formative
assessments
Stakes are lower (not determining passing or
failing, so lower reliability is tolerated)
Desire more information, so they may require
multiple modalities (it is rare for one assessment
method to identify all critical domains) for validity
and reliability
Use evaluation methods that support and reinforce
teaching modalities and steer students’ learning
May only identify deficiencies but not define how
to remediate
Conclusions about summative assessments
Stakes are higher- incompetent students may pass,
or students may fail and require remediation
Desire high reliability (>0.8) so often require
multiple questions/problems or cases (20-30
stations/OSCE, 15-20 cases for oral presentations,
700 questions for an MCQ)
Desire high content validity (single cases have low
content validity and are not representative)
Desire high predictive validity (correlation with
future performance), which is often hard to
achieve
Consider reliability, validity, benefit and cost
(resources, time and $) in determining the best
assessment tools
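Why more questions, cases or stations raise reliability can be illustrated with the Spearman-Brown prediction formula, a standard psychometric result that the slides do not name. The sketch below assumes the added items are comparable to the existing ones; the 10-station OSCE with reliability 0.6 is a hypothetical starting point, not data from this deck.

```python
import math

def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor
    (2.0 means twice as many comparable items, cases or stations)."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def length_factor_needed(current, target):
    """How many times longer the test must be to reach the target reliability."""
    return target * (1 - current) / (current * (1 - target))

# Hypothetical: a 10-station OSCE with reliability 0.6, aiming for > 0.8.
stations = 10
factor = length_factor_needed(0.6, 0.8)
print(math.ceil(stations * factor))  # 27
```

Under these assumptions the answer lands in the same range as the 20-30 stations quoted above.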
How to Match Assessment to Goals
and Teaching Methods
Define the type of learning (lecture, small
group, computer module/self study, etc)
Define the domain to be assessed
(knowledge, skill, behavior) and the level of
performance expected (knows, knows how,
shows how or does)
Determine the type of feedback required
Purpose of feedback
For students: To provide a good platform to
support and enhance student learning
For faculty: To determine what works (what
facilitated learning and who were
appropriate role models)
For students and faculty: To determine
areas that require improvement
Types of Feedback
Quantitative
Total score compared to other students, providing
the high, low and mean score and minimum
requirement for passing grade
Qualitative
Written personal feedback identifying areas of
strength and weakness
Oral feedback one on one or in a group to discuss
the areas of deficiency to help guide further
learning
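The quantitative feedback described above (a student's total score alongside the high, low and mean scores and the passing requirement) could be assembled as follows. This is only a sketch: the function name, the student labels and the passing score of 70 are hypothetical.

```python
from statistics import mean

def score_report(scores, student, passing_score):
    """Build the quantitative feedback for one student.
    scores maps each student to a total exam score."""
    values = list(scores.values())
    own = scores[student]
    return {
        "score": own,
        "high": max(values),
        "low": min(values),
        "mean": mean(values),
        "passing_score": passing_score,
        "passed": own >= passing_score,
    }

# Hypothetical class of three students and a hypothetical passing score.
class_scores = {"A": 82, "B": 67, "C": 91}
print(score_report(class_scores, "B", passing_score=70))
```

The qualitative feedback (written or oral) still has to come from a person; the report only gives the student a reference point.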
Evaluation Bias-Pitfall
Can occur with any evaluation requiring
interpretation by an individual (all methods other
than MCQ)
Expectation bias (halo effect)- prior knowledge or
expectation of the outcome influences the ratings
(especially a global rating)
Audience effect- a learner’s performance is
influenced by the presence of an observer (seen
especially with skills and behaviors)
Rater traits- the training of the rater or the rater’s
traits affect the reliability of the observation
Types of assessment tools-
Written
Does not require an evaluator to be present during
the assessment and can be open or closed book
• Multiple choice question (MCQ)
• Modified short answer essay question (MEQ)-
Patient management problem is a variation of this
• Essay
• Application test
• Medical note/chart review
Types of assessment tools-
Observer Dependent Interaction
Usually requires active involvement of an
assessor and occurs as a single event
• Practical
• Medical record review
• Standardized Patient(s) (SP)
• Objective Structured Clinical Examination
(OSCE)
• Oral examination- chart stimulated recall;
triple jump or direct observation
Types of assessment tools-
Observer Dependent
Longitudinal Interaction
Continual evaluation over time
• Preceptor evaluation – either completion of
a critical incident report or structured
rating form based on direct observation over
time
• Peer evaluation
• Self evaluation
MCQ
Definition: A test composed of questions in which each stem is
followed by several alternative answers. The examinee must select
the most correct answer.
Measures: Knows and Knows how
Pros: Efficient; cheap; samples large content domain (60
questions/hour); high reliability; easy objective scoring, direct
correlate of knowledge with expertise
Cons: Often a recall of facts; provides opportunity for guessing
(good test-taker); unrealistic; doesn’t provide information about the
thought process; encourages learning to recall
Suggestions: Create questions that can be answered from the stem
alone; avoid terms such as "always", "frequently", "all" or "none";
randomly assign the position of the correct answer; can correct for
guessing (penalty formula)
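The slide does not say which penalty formula is meant. A commonly used one, assumed here, is "formula scoring": subtract a fraction of the wrong answers so that blind guessing has an expected score of zero, and do not penalize omitted items. The function name and the numbers are illustrative only.

```python
def corrected_score(right, wrong, options_per_item):
    """Formula scoring: right minus wrong / (options - 1).
    Omitted items are neither rewarded nor penalized."""
    return right - wrong / (options_per_item - 1)

# Hypothetical 60-item exam with 5 options per item:
# 40 right, 15 wrong, 5 left blank.
print(corrected_score(40, 15, 5))  # 36.25

# Blind guessing on all 60 items: about 12 right and 48 wrong.
print(corrected_score(12, 48, 5))  # 0.0
```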
MEQ
Definition: A series of sequential questions in a linear format based
on an initial limited amount of information. It requires immediate
short answers followed by additional information and subsequent
questions. (patient management problem is a variation of this type)
Measures: Knows and Knows how
Pros: Can assess problem solving, hypothesis generation and data
interpretation
Cons: Low inter-case reliability; less content validity; harder to
administer; time consuming to grade and variable inter-rater
reliability
Suggestions: Use directed (not open ended) questions; provide
extensive answer key
Open ended essay question
Definition: Question allowing a student the freedom to
decide the topic to address and the position to take- it
can be take home
Measures: Knows, Knows how
Pros: Assesses ability to think (generate ideas, weigh
arguments, organize information, build and support
conclusions and communicate thoughts); high face
validity
Cons: Low reliability; time intensive to grade; narrow
coverage of content
Suggestions: strictly define the response and the rating
criteria
Application test
Definition: Open book problem solving test incorporating
a variety of MCQs and MEQs. It provides a description of
a problem with data. The examinee is asked to interpret
the data to solve the problem. (ex. Quiz item 3)
Measures: Knows and knows how
Pros: Assesses higher learning; good face/content
validity; reasonable reliability; useful for formative and
summative feedback
Cons: Harder to create and grade
Practical Exam
Definition: Hands on exam to demonstrate and apply knowledge
(ex. Culture and identify the bacteria on the glove of a Sinai
cafeteria worker, or performance of a history and physical on a
patient)
Measures: Knows, Knows how, and possibly Shows how and Does
Pros: Can test multiple domains, actively involves the learner (good
steering effect); best suited for procedural/technical skills; higher
face validity
Cons: Labor intensive (creation and grading); hard to identify gold
standard, so subjective grading; high rate of item failure
(unanticipated problems with administration)
Suggestions: Pilot first; adequate, specific instructions and goals;
specific, defined criteria for grading and train raters; for direct
observation, require multiple encounters for higher reliability
Medical record/note review
Definition: Examiner reviews learner’s previously created
document; can be random
Measures: Knows how and Does
Pros: Can review multiple records for higher reliability;
high face validity; less costly than oral (done without
learner and at examiner’s convenience)
Cons: Lower inter-rater reliability; less immediate
feedback; unable to determine basis for decisions
Suggestions: Create a template with specific ratings for
skills
Standardized Patients
Definition: Simulated patient/actor trained to present
history in reliable, consistent manner and to use a
checklist to assess students' skills and behaviors
Measures: Knows, Knows how, Shows how and Does
Pros: High face validity; can assess multiple domains; can
be standardized; can give immediate feedback
Cons: Costly; labor intensive; must use multiple SPs for
high reliability
OSCE (Objective Structured Clinical
Exam)
Definition: Task oriented, multi-station
exam; stations can be 5-30 minutes and
require written answers or observation (ex.
Take orthostatic VS; perform a cardiac
exam; smoking cessation counseling; read
and interpret CXR or EKG results;
communicate lab results and advise a
patient)
Measures: Knows, Knows how, Shows how
and Does
Pros: Assesses clinical competency; tests a wide
range of knowledge, skills and behaviors; can give
immediate feedback; good test-retest reliability;
good content and construct validity; less patient
and examiner variability than with direct
observation
Cons: Costly (manpower and $); case specific;
requires > 20 stations for internal consistency;
weaker criterion validity
Oral Examination
Definition: Method of evaluating a
learner’s knowledge by asking a series of
questions. The process is open ended with
the examiner directing the questions. –(ex.
chart stimulated patient recall or a triple
jump)
Measures: Knows, Knows how, sometimes
Shows how and does
Pros: Can measure clinical judgement,
interpersonal skills (communication) and
behavior; high face validity; flexible; can provide
direct feedback
Cons: Poor inter-rater reliability (dove vs hawk
and observer bias); content specific so low
reliability (must use > 6 cases to increase
reliability); labor intensive
Suggestions: multiple short cases; define
questions and answers; provide simple rating
scales and train raters
Triple Jump
Definition: Three step written and oral exam- written,
research and then oral part- (ex. COMPASS 1)
Measures: Knows, knows how, shows how and does
Pros: Assesses hypothesis generation, use of resources,
application of knowledge to problem solve and self
directed learning; provides immediate feedback; high face
validity
Cons: only for formative assessment (poor reliability);
time/faculty intensive; too content specific and
inconsistent rater evaluations
Clinical Observations
Definition: Assessment of various domains
longitudinally by an observer- either
preceptor, peer or self (small group
evaluations during first two years and
preceptor ratings during clinical exposure)
Measures: Knows, knows how, Shows how
and Does
Pros: Simple; efficient; high face validity;
formative and summative
Cons: low reliability (often only recent encounters
influence the grade); halo effect (lack of domain
discrimination); more often a judgement of
personality; “Lake Wobegon” effect (all
students are rated above average);
unwillingness to document negative ratings (fear
of failing someone)
Suggestions: Frequent ratings and feedback;
increase the number of observations; multiple
assessors (with group discussion about specific
ratings)
Peer/Self Evaluation
Pros: Useful for formative feedback
Cons: Lack of correlation with faculty
evaluations; same cons as others (measure of “nice
guy”, low reliability, halo effect); peer evaluations
have friend effect, fear of retribution or desire to
penalize
Suggestions: limit the # of behaviors assessed;
clarify the difference between evaluation of
professional and personal aspects; develop
operationally proven criteria for rating; provide
multiple opportunities for students to do this and
provide feedback from faculty
Erica Friedman’s Educational Pyramid
Does: Direct Observation, Practical
Shows How: OSCE, Triple Jump, Oral, SP, Practical
Knows How: Chart review, MEQ, Essay
Knows: MCQ
Type of assessment                 Content areas tested             Potential reliability   Potential validity
Written
MCQ                                Knows, Knows How                 ++                      ++
MEQ                                Knows, Knows How                 +                       ++
Application test, Practical        Knows, Knows How, Shows How      +                       +
Essay                              Knows, Knows How                 0                       0
Medical note review                Knows How, Does                  +                       ++
Observed
SP                                 Shows How, Does                  ++                      ++
OSCE                               Shows How, Does                  ++                      ++
Oral                               Shows How, Does                  +                       +
Longitudinal clinical experience   Shows How, Does                  0                       0
Critical factors for choosing an
evaluation tool
Type of evaluation and feedback desired:
formative/summative
Focus of evaluation:
Knowledge, skills, behaviors (attitudes)
Level of evaluation:
Knows, Knows how, Shows how, Does
Pros/Cons:
Validity, Reliability, Cost (time, $ resources)
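One way to apply these factors is to treat this deck's reliability/validity table as a lookup and filter it by the level to be assessed. The sketch below copies the ratings from that table; the numeric ordering of "0", "+" and "++" and the function name are assumptions made for illustration, and cost is left out because the table does not rate it.

```python
# Ratings as given in the reliability/validity table; the ordering is an assumption.
RATING = {"0": 0, "+": 1, "++": 2}

# tool: (levels tested, potential reliability, potential validity)
TOOLS = {
    "MCQ": ({"Knows", "Knows How"}, "++", "++"),
    "MEQ": ({"Knows", "Knows How"}, "+", "++"),
    "Application test, Practical": ({"Knows", "Knows How", "Shows How"}, "+", "+"),
    "Essay": ({"Knows", "Knows How"}, "0", "0"),
    "Medical note review": ({"Knows How", "Does"}, "+", "++"),
    "SP": ({"Shows How", "Does"}, "++", "++"),
    "OSCE": ({"Shows How", "Does"}, "++", "++"),
    "Oral": ({"Shows How", "Does"}, "+", "+"),
    "Longitudinal clinical experience": ({"Shows How", "Does"}, "0", "0"),
}

def candidate_tools(level, min_reliability="0", min_validity="0"):
    """List tools that test the given level and meet the minimum ratings."""
    return [
        name
        for name, (levels, reliability, validity) in TOOLS.items()
        if level in levels
        and RATING[reliability] >= RATING[min_reliability]
        and RATING[validity] >= RATING[min_validity]
    ]

# A summative assessment of "Shows How" that needs high reliability:
print(candidate_tools("Shows How", min_reliability="++"))  # ['SP', 'OSCE']
```

For a formative assessment the minimum reliability can be relaxed, which widens the list.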
How to be successful
Students should be clear about the
course/clerkship goals and the specifics about the
types of assessments used and the criteria for
passing (and if relevant, just short of honors and
honors)
Make sure the choice of assessments is consistent
with the values of your course and the school
Final judgments about students’ progress should
be based on multiple assessments using a variety
of methods over a period of time (instead of one
time point)
Number of courses or clerkships using a specific
assessment tool-assessing our assessment methods
Year (# courses    MCQ   MEQ   Essay   Practical     SP   OSCE   Oral   Small group    # using   # using
or clerkships)                         hands-on                         or preceptor   1 tool    > 1 tool
1 (n=11)           9     5     1       3             1    -      -      5              4         7
2 (n=13)           12    4     4       -             1    1      1      3              4         9
3 (n=8)            7     -             1 simulator   2    -      7      8              -         8
4 (n=4)            1     -     -       1 simulator   -    -      -      4              2         2
Why assess ourselves?
Assure successful completion of our course
goals and objectives
Assure integration with the mission of the
school
Direct our teaching/learning-(determine
what worked and what needs changing)
How we currently assess ourselves
Student evaluations (quantitative and qualitative)-
most often summative
Performance of students on our exam and specific
sections of USMLE
Focus and feedback groups (formative and
currently done by Dean’s office)
Peer evaluations of course/clerkship- by ECC
Self evaluations- yearly grid completed by course
directors and core faculty
Consider peer evaluation of teaching and teaching
materials