Alternative mathematics assessment as a guide
for curriculum innovation
Department of Education in the Natural Sciences and Mathematics
University Eduardo Mondlane, Mozambique
In the Netherlands since 1993 the mathematics curriculum for junior secondary
schools includes applied investigational skills. However, in the first years after
introduction of this curriculum, teachers were struggling with its implementation,
guided by national tests that remained of the paper-and-pencil form, and did not
cover all objectives from the new curriculum.
In a monitoring project, mathematical hands-on tasks in a laboratory environment
were used throughout the country. These practical tasks required investigational
skills and aligned well with the intentions of the new curriculum. The tasks showed
what mathematics assessment could look like under the new curriculum. Thereafter,
the National Institute for Educational Measurement also issued a practical test,
challenging teachers to adapt their teaching to meet the intentions of the new
curriculum in full. In this way, teachers who teach to the test were better guided to
implement all objectives of the new curriculum.
Of course, there is still much experience to be gained from alternative assessment. A
barrier to the adoption of this kind of testing is its possibly lower reliability. However, this is well
compensated by its high level of validity.
Three decades ago Hans Freudenthal and his colleagues started to transform the mathematics
curriculum in the Netherlands with a treatise, known as Realistic Mathematics Education (RME)
(Freudenthal, 1973). In 1993, a common core curriculum based on RME for Dutch junior
secondary schools was legislated. This curriculum emphasized data modelling and interpretation,
visual 3-d geometry, approximation and rules of thumb, and the use of calculators and computers
(Kok et al., 1992). The approach to the subject was more practical and investigational.
The new, RME-based curriculum was considerably different from the prior curriculum.
Therefore, a large exercise was undertaken to introduce secondary school mathematics teachers
to the new content and its approach. Assessment practice was adapted according to the new
curriculum. For pragmatic reasons, the format of paper-and-pencil tests was maintained. As a
result, the tests did not require students to apply a number of the skills, although these were
required according to the new curriculum (Kleijne, 1999). Generally, test items based on the new
curriculum describe a life situation, which might not be truly realistic, but is convincing. The test
items require students to apply mathematization skills, in which mathematics is linked to
applications. Figure 1 shows a typical item for grade 8.
To calculate for any girl her future length as a grown-up, the school doctor
uses the following formula:
Length of daughter (in cm) = (length of father (cm) + length of mother (cm) - 12) / 2 + 3
Danielle's father is 1.82 m tall, her mother is 1.68 m.
How tall will Danielle grow according to the formula?
According to the formula, is it possible that a daughter can grow taller than her
father? If you want, you can show your answer with an example.
Figure 1: Exemplary grade 8 item of the new Dutch curriculum
In the item, algebraic expressions are concretized through the use of a word formula. Of course,
there are no real-life school doctors in the Netherlands using the given formula, but the situation
is imaginable, as students know that there are “experts” who make reasonable predictions based
on genetic characteristics.
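The arithmetic the item asks for can be traced in a few lines. The sketch below assumes that the word formula reads (father + mother - 12) / 2 + 3, with all lengths in cm; the function name and the second pair of parents are illustrative only and do not come from the test.

```python
def predicted_daughter_height(father_cm: float, mother_cm: float) -> float:
    """Word formula as assumed here: (father + mother - 12) / 2 + 3, in cm."""
    return (father_cm + mother_cm - 12) / 2 + 3


# First question: Danielle's father is 1.82 m, her mother 1.68 m (converted to cm).
danielle = predicted_daughter_height(182, 168)
print(danielle)  # 172.0

# Second question: can a daughter grow taller than her father?
# Made-up example with a mother who is clearly taller than the father.
example = predicted_daughter_height(170, 180)
print(example, example > 170)  # 172.0 True
```

Under this reading of the formula, the predicted length exceeds the father's length whenever the mother is more than 6 cm taller than the father.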
The item is also exemplary, as it displays a general weak point in the assessment of the new
curriculum: the model (the formula) is given, and students are required to perform calculations
using the model, and to derive characteristics of the formula. Students do not need to collect data,
nor to model (derive a formula). This shortage of modelling activities in national assessment had
curricular implications. Generally, teachers have not been compelled to organize classroom work,
in which students are asked to model an observed phenomenon. As a result, students do not
develop modelling skills to the same extent as the other mathematizing skills, although the
intended curriculum requires this.
As in many other countries, teachers and textbook writers in the Netherlands use the
mandated tests as a guideline for curriculum implementation. Test items describe at a very
concrete level what needs to be learnt (Hawker & Ollerton, 1999). As a result of the limitations of
the newly developed tests, the curriculum implementation went hand-in-hand with a dilution of
the initial ideals. In the first years of the curriculum implementation, teachers' practices and
students' performances did not align well with the intended curriculum (Van Dormolen, 1999;
IvhO, 1999). To monitor the curriculum implementation process, several projects were carried
out. One of these studies investigated whether grade 8 students' ability to apply mathematics in
practical situations could be measured in an alternative way, and not through pure paper-and-
pencil tests. The study used a test with an emphasis on modelling skills. The study also served as
a basis for studying validity and reliability of alternative assessment methods for innovative
curricula. Additionally, the study meant to offer assessment developers and teachers
supplementary options on how the new curriculum could be assessed. The test and some of its
curricular implications are presented in this paper.
Innovating mathematics assessment
All over the world, in the past fifteen years attempts have been made to explore alternatives to
standardised paper-and-pencil tests. Labels used for innovative assessment are: performance
assessment, practical assessment, alternative assessment, or authentic assessment (e.g. Burton,
1996; Clarke, 1996; Niss, 1993; Wiggins, 1989). The listed terminology is applied if some of the
following criteria are met: (a) testing through open questions and for higher order skills, (b)
being open to a range of methods or approaches, (c) making students disclose their own
understanding, (d) allowing students to undertake practical work, (e) asking for performances and
products, (f) being worthwhile as an activity for students' learning, and (g) integrating real-life
situations and several subjects.
In this paper, the term alternative assessment will be used, and I will concentrate on
assessment that can be applied on a nation-wide scale, for example, to monitor curriculum
developments. In this area of study, a number of issues have emerged. First, formats such as
observation, interviews and portfolios have been shown to be labor- and cost-intensive. Second, the
interpretation of students' answers can result in unreliable data because of inconsistencies
between examiners (Haines, Izard & Le Masurier, 1993; Kitchen & Williams, 1993). In particular,
the coding of borderline answers (which are neither totally correct nor totally incorrect)
depends on the coders' background (e.g. coding experience, subject matter knowledge,
teaching experience) (Zuzovsky, 1999). Despite these disadvantages, nation-wide alternative
assessments of mathematics have been carried out. For example, in 1995, countries participating
in the Third International Mathematics and Science Study (TIMSS) could additionally administer
a laboratory-based assessment at grade 8 level (students aged approximately 14 years),
complementing the standard TIMSS written test. This practical test was the TIMSS Performance
Assessment, and it consisted of investigative tasks in science and mathematics (Harmon et al.,
1997). The test was administered in a laboratory environment with simple utensils and
instruments, allowing any classroom to be used for test administration.
A laboratory-based mathematics test
The TIMSS Performance Assessment was developed from the educational vision that seeks
coherence between procedural, declarative and conditional cognition. Students were expected
to investigate systematically, being provided with a practical context (manipulatives and
instruments). They were tested through open-ended tasks such as designing and executing an
experiment, observing and describing observations, looking for regularities, and explaining and
predicting measurements. The TIMSS Performance Assessment was administered in a circuit
format, in which students took turns visiting stations. At each station they found a task, which
guided them to carry out a small investigation. Each task was estimated to take 30 minutes.
Students had to write their answers on a worksheet and hand in products (lumps of plasticine, cut-
out models, etc.). The use of manipulatives was considered very appropriate as these help
students to better understand the context of the question. Instead of describing real-life situations
in words, the equipment put the context into students' hands. Second language learners and
students with lower reading abilities in particular were expected to gain from these circumstances.
The test comprised mathematics tasks, science tasks and combined tasks (overlapping between
science and mathematics). One example is the task Around the bend (see Figure 2), which is related
to scale drawing and finding rules: students are given a cardboard model of a corridor and have to
cut rectangles (modelling furniture). By testing which rectangle fits through the corridor, they
have to find a rule for the [...]
Figure 2: The task Around the bend.
Besides tasks with a mathematical focus, the test also contained tasks from the natural sciences,
in which science investigations met with mathematical activities, such as measuring with instruments
(stopwatches, rulers, thermometers, and scales). For example, the science task Rubber Band
included mathematical topics, such as graphing and extrapolating. In this task, a number of
washers were attached to a rubber band. Students had to measure the stretching of the band,
related to the number of washers. With only ten washers given, students were asked to predict the
length of the rubber band if twelve washers were attached. Details of all tasks can be found
in Harmon et al. (1997).
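The extrapolation step in Rubber Band can be illustrated with a straight-line fit. The sketch below is not taken from the test materials: the measurements are made up, the helper name fit_line is hypothetical, and a linear relation between the number of washers and the band length is assumed, which a real rubber band may only follow approximately.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements (not from the study): band length in cm per number of washers.
washers = [2, 4, 6, 8, 10]
length_cm = [11.0, 13.1, 14.9, 17.0, 19.0]

slope, intercept = fit_line(washers, length_cm)

# Extrapolate beyond the ten washers that are available to the student.
predicted_12 = slope * 12 + intercept
print(f"Predicted length with 12 washers: {predicted_12:.1f} cm")
```

A student would typically do the same by drawing a line through the plotted points and reading off the value at twelve washers.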
In 1995, the test was administered in 21 countries, including the Netherlands. The test
raised questions on reliability and international comparability; in the international report a league
table of countries was avoided (Harmon et al., 1997; Zuzovsky, 1999). However, in the
Netherlands, the test was judged to be very valid in light of the new mathematics curriculum
(Kuiper, Bos & Plomp, 1999), to such an extent that the test was replicated in 2000. In this way,
trend results allowed for monitoring the implementation of the new curriculum. Moreover, a
replication could provide experience in analyzing issues of validity, reliability and comparability of
performance assessments. Other countries were invited to join the replication of 2000, but
none did.
In 1995, the TIMSS Performance Assessment was administered to a random sample of Dutch
grade 8 students (n=437 from 49 secondary schools). In 2000, the test was replicated at a slightly
smaller scale because of financial constraints (n=234 from 27 secondary schools). The research
question was: to what extent can a practical, laboratory-based test, such as the TIMSS
Performance Assessment, be a valid supplement to mathematics assessment within the new
curriculum? Of course, within the Dutch context, we were also interested in the trend between
1995 and 2000 in the achievement of Dutch grade 8 students on this test. However, the results of
Dutch students' performance are beyond the scope of this paper; they can be found in Vos (2002).
The TIMSS Performance Assessment was not especially designed to evaluate the new Dutch
curriculum. To gain more insight into the curricular validity of the TIMSS Performance
Assessment in the Netherlands, two checks were carried out. First, an expert appraisal
was carried out on the curricular validity of the test with respect to the Dutch RME-based
intended mathematics curriculum for junior secondary schools. Six experts were invited to assess
the test items. The experts were from (1) a research institute for mathematics curriculum
development, (2) the national institute for curriculum development, (3) the national institute for
educational measurement, (4) an in-service training institute, (5) a pre-service training institute,
and (6) the association of mathematics teachers. The appraisal showed that the experts
considered eight out of twelve tasks to match well with the intended mathematics curriculum
(Vos, 2002). These eight tasks comprised all five mathematical tasks, but also three tasks from a
hybrid of science and mathematics: Dice, Calculator, Folding, Around the Bend, Packaging,
Rubber Band, Shadows and Plasticine. The other four tasks were from biology, physics and
chemistry. These four tasks were maintained in the test to preserve the sequence of test items, but
they were not considered relevant for the measurement of mathematics achievement.
Besides the expert appraisal, the curricular validity was additionally checked on mathematization,
through an assessment grid, especially designed for modelling and applying mathematics, as in
Kitchen and Williams (1993). The grid contained the following assessment categories: modelling,
rewriting (generalizing and simplifying), interpreting, and reflecting. All test items were allocated
to one of these categories. If appropriate, an item could be fitted into two categories, but then the
weight of that item would be spread. Two curriculum experts (a researcher and a teacher trainer)
were asked to categorize all test items independently. Their inter-rater agreement was 87%, and their
average results are reported in Table 1. For comparison, a standard written RME-based test for
the same level of schooling was analyzed through the same procedure. It was the Afsluitingstoets
voor de Basisvorming 1999 (Final test for the core curriculum 1999), which was developed by the
National Institute for Educational Measurement. The items of the TIMSS Performance Assessment
were spread well over the grid. The categories modelling and reflecting are better covered in
the TIMSS Performance Assessment than in the written test. As a result, the TIMSS Performance
Assessment can be considered valid on its spread of mathematization activities.
Table 1: Percentage of test items per mathematization category: comparison between the TIMSS
Performance Assessment and a standard written test based on the curriculum.
                                Modelling  Generalizing  Simplifying  Interpreting  Reflecting
TIMSS Performance Assessment       35          20            14           16            15
Standard written test              19          25            13           38             6
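The way percentages such as those in Table 1 arise from a grid allocation can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the four items and their categories are invented, each item is assumed to carry a weight of 1, and that weight is assumed to be split evenly when an item falls into two categories.

```python
from collections import defaultdict

CATEGORIES = ["modelling", "generalizing", "simplifying", "interpreting", "reflecting"]


def category_percentages(item_categories):
    """Spread each item's weight of 1 evenly over its categories; return percentages."""
    totals = defaultdict(float)
    for cats in item_categories.values():
        for cat in cats:
            totals[cat] += 1 / len(cats)
    n_items = len(item_categories)
    return {cat: 100 * totals[cat] / n_items for cat in CATEGORIES}


# Hypothetical allocation of four items by one rater (not data from the study).
items = {
    "item_1": ["modelling"],
    "item_2": ["modelling", "reflecting"],  # weight split over two categories
    "item_3": ["interpreting"],
    "item_4": ["generalizing", "simplifying"],
}

shares = category_percentages(items)
for cat in CATEGORIES:
    print(f"{cat:13s} {shares[cat]:5.1f}")
```

With two raters, the same function could be applied to each rater's allocation and the two results averaged per category.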
Results and discussion
The TIMSS Performance Assessment was carried out in the Netherlands in 1995 and 2000, and
the achievement results were not particularly satisfying (Vos, 2002). These results were probably
caused by classroom practice, in which students never encountered practical, laboratory-based
tasks that required them to model. In the classroom, assessment remained in the
paper-and-pencil format, making students read texts about real-life contexts and offering them a [...]
Anecdotal evidence showed that the TIMSS Performance Assessment was an eye-opener to many
Dutch mathematics teachers. During the testing sessions, they observed the tasks and how their
students coped with these. Some teachers admitted that they had never thought mathematics
could be tested through hands-on tasks in a laboratory environment. They associated
manipulatives with 'fun mathematics' (Moyer, 2002), but the assessment context created a
serious atmosphere. As such, the TIMSS Performance Assessment proved to be exemplary
curriculum material, showing how the objectives of the mathematics curriculum could be
assessed in an alternative way, asking students for investigational activities.
Since the administration of the TIMSS Performance Assessment, another practical test has
entered Dutch national testing practice. In 2000, the Dutch National Institute for Educational
Measurement issued, for the first time in their history, a laboratory-based mathematics test at
junior secondary school level: the Potatoes Task. In this task, the teacher brings a 5 kg bag of
potatoes (the Dutch staple food), asking students to analyze their volume and weight, and
establish a relation between these (Verhage, 2004). The developer of the test stated that he had
looked at the tasks of the TIMSS Performance Assessment for inspiration on alternative
mathematics assessment that could easily be carried out with readily available materials in an
ordinary mathematics classroom (Boertien, personal communication). With alternative assessment
receiving authority and official status, teachers are compelled to include a wider range of
curricular objectives in their teaching, thus giving students a chance to practice the practical
skills needed for applying mathematics in small investigations. One reason is that the alternative
assessment shows at a very concrete level what the curricular objectives are. A second reason is
that the alternative assessment offers teachers examples of how they can activate their students.
Of course, not all is gained with alternative assessment. For example, testing conditions need to
be well-controlled, because unreliable measurements will indicate incorrect differences between
learners, between schools and between measurements from different years. Vos (2002) describes
how small changes in equipment can destroy the reliability of a measurement. However, as
experience with alternative assessment grows, the reliability of the data will improve. And despite a
possibly lower reliability, the high validity of alternative assessments can compensate for this.
References
Burton, L. (1996). Assessment of Mathematics: What is the Agenda? In M. Birenbaum & F.J.R.C. Dochy (Eds.), Alternatives in
Assessment of Achievements, Learning Processes, and Prior Knowledge (pp. 31-62). Dordrecht, NL: Kluwer.
Clarke, D. (1996). Assessment. In A.J. Bishop, K. Clement, C. Keitel, J. Kilpatrick & C. Laborde (Eds.), International Handbook
of Mathematics Education (pp. 327-370). Dordrecht, NL: Kluwer.
Freudenthal, H. (1973). Mathematics as an Educational Task. Dordrecht, NL: Reidel.
Harmon, M., Smith, T.A., Martin, M.O., Kelly, D.L., Beaton, A.E., Mullis, I.V.S., Gonzales, E.J. & Orpwood, G. (1997).
Performance Assessment in IEA's Third International Mathematics and Science Study. Boston, MA: Boston College.
Haines, C., Izard, J. & Le Masurier, J. (1993). Modelling Intentions Realised: Assessing the Full Range of Developed Skills. In
T. Breiteig, I. Huntley & G. Kaiser-Messmer (Eds.), Teaching and Learning Mathematics in Context (pp. 200-212). New York: Ellis
Hawker, D., & Ollerton, M. (1999). National tests in mathematics: Two perspectives. Mathematics Teaching, 168, 16-25.
Inspectie van het Onderwijs (IvhO) (1999). Wiskunde in de Basisvorming. Evaluatie van de Eerste Vijf Jaar [Mathematics in the
core curriculum for basic education: evaluation of the first five years]. Utrecht, Netherlands: Author.
Kitchen, A. & Williams, J. (1993). Implementing and Assessing Mathematics Modelling in the Academic 16-19 Curriculum. In
T. Breiteig, I. Huntley & G. Kaiser-Messmer (Eds.), Teaching and Learning Mathematics in Context (pp. 138-150). New York: Ellis
Kleijne, W. (1999). Wiskunde in de basisvorming; evaluatie van de eerste vijf jaar [Mathematics in the core curriculum;
evaluation of the first five years]. Euclides, 75(3), 75-80.
Kok, D., Meeder, M., Wijers, M. & Van Dormolen, J. (1992). Wiskunde 12-16, een Boek voor Docenten [Maths 12-16, a book for
teachers]. Utrecht, NL: Freudenthal Institute.
Kuiper, W.A.J.M., Bos, K.Tj., & Plomp, Tj. (1999). Mathematics Achievement in the Netherlands and Appropriateness of the
TIMSS Mathematics Test. Educational Research and Evaluation, 5(2), 85-104.
Lomask, M.S., Baron, J.B. & Greig, J. (1998). Large-scale Science Performance Assessment in Connecticut: Challenges and
Resolutions. In B.J. Fraser, and K.G. Tobin (Eds.) International Handbook on Science Education (pp. 823-844).
Dordrecht, NL: Kluwer.
Moyer, P. (2002). Are We Having Fun Yet? How Teachers Use Manipulatives to Teach Mathematics. Educational Studies in
Mathematics, 47, 175-197.
Niss, M. (Ed.) (1993). Investigations into Assessment in Mathematics Education, an ICMI Study. Dordrecht, NL: Kluwer.
Van Dormolen, J. (1999). Reaktie op het Inspectierapport [Reaction on the Inspectorate's Report]. Nieuwe Wiskrant, 19(2), 51.
Verhage, H. (2004). Nederland aardappelland [The Netherlands - potato country]. Euclides, 79(5), 140-148.
Vos, F.P. (2002). Like an Ocean Liner Changing Course; the Grade 8 Mathematics Curriculum in the Netherlands 1995-2000.
Enschede, NL: University of Twente.
Wiggins, G. (1989). A True Test: Towards More Authentic and Equitable Assessment. Phi Delta Kappan, 76(9), 703-713.
Zuzovsky, R. (1999). Problematic Aspects of the Scoring of the TIMSS Performance Assessment: Some Examples. Studies in
Educational Evaluation, 25(3), 315-323.