On rubrics, good and bad
Thanks to Colleen for sharing the Patricia Rogers presentation, which summarises and links to
work by Jane Davidson on rubrics.
Because a succinct definition of a rubric was not provided in these presentations, I looked for
and found a definition online. In the education field a rubric is “a document that articulates the
expectations for an assignment by listing the criteria, or what counts, and describing levels of
quality from excellent to poor.”
In the development evaluation field, amongst others, they are useful as a means of
aggregating many micro-judgements into one macro judgement. [synthesis] Those micro-
judgements can be made using both qualitative and quantitative data. You might be
interested to note that the UK Independent Committee for Aid Impact (ICAI) has recently
proposed using one, as a means of summarising the findings in each of the evaluations it
commissions. See “ICAI’s Approach to Effectiveness and Value for Money”, its first attempt to
spell out its approach to evaluation, especially the explanation of the “traffic light” system in
section 4. You can also see my comments on the paper as a whole here. They include two
criticisms of the traffic light system. One is the lack of transparency about how the
micro-judgements involved in any evaluation process will be made and then used to form
aggregate judgements, in the form of one of the four lights (green, green/amber, amber/red,
red). The second is the lack of clarity about what constitutes success versus failure, a
surprising omission given that the ICAI are emphasising that “Our reports are written to be
accessible to a general readership” and accountability to the UK public is a high priority. I
should have also added a third criticism, concerning an option that is included in the rubric
example given by Davidson but is missing here. That is, there is no option for saying something
like “There is not enough information to make a judgement”. This option has been present in
DFID’s own internal project performance rating system for many years. I suspect that honest
evaluators might be tempted to make frequent use of this option, perhaps too frequent for the
liking of bodies like the ICAI.
The ICAI example is not evidence that rubrics don’t or can’t work, but it is an example of a
poorly developed one, at least as it has been presented so far.
“The meaning of ICAI’s traffic light ratings”
Rating: What it means
Green: The programme meets all or almost all of the criteria for effectiveness and value for
money and is performing strongly. Very few or no improvements are needed.
Green-Amber: The programme meets most of the criteria for effectiveness and value for money
and is performing well. Some improvements should be made.
Amber-Red: The programme meets some of the criteria for effectiveness and value for money but
is not performing well. Significant improvements should be made.
Red: The programme meets few of the criteria for effectiveness and value for money. It is
performing poorly. Immediate and major changes need to be made.
Patricia Rogers’ summary of the Davidson presentation provides useful information on how to
develop rubrics, but information is also needed on how to assess their quality, which is an
essential part of the last step in their development: “Debate, recalibrate, field test, hone”. In
her presentation, Jane Davidson highlights two important quality criteria for synthesis
processes generally, including rubrics. That is, they should be both systematic and
transparent. She also points out that evaluative conclusions should combine both descriptive
data and statements about value. Doing so in a way that is both systematic and transparent is
by no means an easy task.
From what I can see through reading Jane’s presentation, it appears that the guidance
provided for the use of each performance grade combines both descriptive and value
judgements in the same sentence, and it is then up to the user to decide whether this
combination is present or absent. They cannot contest the value judgements embedded in the
descriptions provided [notwithstanding how consultative and informed their development may
have been].
There is an alternative, which is to make a much more visible separation between descriptive
and value judgements, which has the effect of widening the users’ choices of how they can
respond and making their judgement process more transparent and systematic. This
approach can be found in what can be called “weighted checklists”, which I have described at
length here. They perform the same function as the rubrics that have been described above,
helping aggregate many particular judgements into one general judgement.
I will describe two examples. The first is a “customer satisfaction” type form that was sent to
me by a company whose services I had used in the UK.
In this form I was asked to provide value information in the first column, by rating the
importance of each service attribute. More descriptive information was then provided in the
second column, by rating actual performance (albeit with a judgement element).
Once the responses have been collected, weighted scores for individual respondents can then
be calculated, along with an average score for all respondents. The process is as follows:
1. Multiply the importance rating by the actual performance rating for each item
2. The sum of these is the actual raw score
3. Multiply the importance rating by the highest possible performance rating for each item
4. The sum of these is the highest possible raw score
5. Divide the actual raw score (2) by the highest possible raw score (4), to get a
percentage score for the respondent. A high percentage = high degree of satisfaction,
and vice versa
6. Calculate the average percentage score for all the respondents
The overall process is transparent and systematic.
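The six steps above can be sketched in a few lines of Python. This is only an illustration: the 1 to 5 rating scale, the function name and the sample ratings are my assumptions, since the form itself may have used a different scale.

```python
from statistics import mean

def satisfaction_percentage(ratings, max_performance=5):
    """Percentage score for one respondent.

    ratings: list of (importance, performance) pairs, one per service attribute.
    max_performance: highest possible performance rating (assumed to be 5 here).
    """
    # Steps 1-2: importance x actual performance, summed = actual raw score
    actual_raw = sum(importance * performance for importance, performance in ratings)
    # Steps 3-4: importance x highest possible performance, summed = highest possible raw score
    highest_raw = sum(importance * max_performance for importance, _ in ratings)
    if highest_raw == 0:
        raise ValueError("at least one item needs a non-zero importance rating")
    # Step 5: actual raw score as a percentage of the highest possible raw score
    return 100 * actual_raw / highest_raw

# Hypothetical responses from two respondents (importance, performance)
respondents = {
    "A": [(5, 4), (3, 2), (1, 5)],
    "B": [(4, 5), (4, 3), (2, 1)],
}

scores = {name: satisfaction_percentage(r) for name, r in respondents.items()}
# Step 6: average percentage score across all respondents
average_score = mean(scores.values())
```

Because each respondent's score is divided by their own highest possible score, people who rate everything as very important and people who rate little as important end up on the same 0 to 100 scale.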
The second example is the development of the Basic Necessities Survey, a poverty
measurement instrument that was used in Ha Tinh province in Vietnam in 1997 and then
repeated with the same population in 2006. In this survey respondents were asked first for
value information: “Which of the items on this list do you think everyone in Vietnam should be
able to have and no one should have to go without?” They were then asked for descriptive
information: “Which of the items on this [same] list does your household have?” Items that
50% or more of respondents agreed were basic necessities were deemed as such and given
a weighting which was the percentage of people saying the item was a necessity (which could
be anywhere between 50 and 100%). Individual scores were calculated by adding the
weights of all the items a household possessed, then converting this sum to a percentage of the
maximum possible score, i.e. the score a household would have if it had all the basic
necessities. More details are available
here. As with the first example above, the process used value and descriptive data and was
transparent and systematic.
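The scoring logic described above can be sketched as follows. The item names and percentages are invented for illustration and are not taken from the Ha Tinh survey; the function names are also hypothetical.

```python
def necessity_weights(votes, threshold=50.0):
    """Keep items that at least `threshold` percent of respondents called a necessity.

    votes: dict mapping item -> percentage of respondents saying it is a necessity.
    The weight of each retained item is that same percentage.
    """
    return {item: pct for item, pct in votes.items() if pct >= threshold}

def bns_score(possessed, weights):
    """Household score as a percentage of the maximum possible score."""
    max_score = sum(weights.values())
    if max_score == 0:
        raise ValueError("no item qualified as a basic necessity")
    raw_score = sum(w for item, w in weights.items() if item in possessed)
    return 100 * raw_score / max_score

# Hypothetical value data: percentage of respondents calling each item a necessity
votes = {"bicycle": 80.0, "blanket": 100.0, "television": 40.0, "well": 60.0}
weights = necessity_weights(votes)  # "television" drops out, being below 50%

# Hypothetical descriptive data: what one household actually has
household = {"blanket", "television", "well"}
score = bns_score(household, weights)
```

Note that owning an item which did not qualify as a basic necessity (the television in this made-up case) adds nothing to the score.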
There are many other circumstances where such weighted checklists could be used as a
means of developing summary judgements about complex developments. The performance
of local schools and health clinics could easily be assessed by an adaptation of the same kind
of instrument, as a kind of public opinion survey. In India, Praxis has developed a more
sophisticated version of the same instrument to assess the performance of community-based
organisations (CBOs) working on HIV/AIDS issues. Their instrument has a nested structure, whereby
aspects of performance are grouped together into units, then these into larger units, then
these into larger units, with weightings being applied at each level. Both the weightings and
performance attribute judgements have then been shared and constructively criticised in open
sessions with the CBOs being assessed. Again, it is transparent and systematic, with value
and observational data both clearly visible.
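To show how weightings can be applied at each level of a nested structure, here is a generic sketch. It is not the Praxis instrument: the dictionary layout, the 0 to 1 scoring scale, and all the weights and scores are assumptions made up for illustration.

```python
def nested_score(node):
    """Weighted average score (0 to 1) for a node in a nested checklist.

    A leaf is {"weight": w, "score": s}; a group is {"weight": w, "children": [...]}.
    A group's score is the weighted average of its children's scores.
    """
    if "children" not in node:
        return node["score"]
    total_weight = sum(child["weight"] for child in node["children"])
    if total_weight == 0:
        raise ValueError("a group needs at least one child with non-zero weight")
    weighted_sum = sum(child["weight"] * nested_score(child) for child in node["children"])
    return weighted_sum / total_weight

# Hypothetical assessment of one organisation, with two units of two attributes each
cbo = {
    "weight": 1,
    "children": [
        {"weight": 3, "children": [          # a unit judged three times as important
            {"weight": 2, "score": 1.0},
            {"weight": 1, "score": 0.5},
        ]},
        {"weight": 1, "children": [
            {"weight": 1, "score": 0.0},
            {"weight": 1, "score": 1.0},
        ]},
    ],
}

overall = nested_score(cbo)
```

Keeping the weights (value data) and the scores (descriptive data) as separate fields is what makes it possible to share and contest each of them independently.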
For those who want to explore in more depth, look at “The synthesis problem: Issues and
methods in the combination of evaluation results into overall evaluative conclusions” by
Michael Scriven (Claremont Graduate University) and E. Jane Davidson (CGU & Alliant University),
a demonstration presented at the annual meeting of the American Evaluation Association,
Honolulu, HI, November 2000. This examines some of the problems with weighted scoring
mechanisms, and some of the possible solutions. Thanks to Patricia Rogers for referring me
to this paper some time ago.