


          Issues Related to Judging the Alignment of Curriculum
                        Standards and Assessments

                                    Norman L. Webb

                        Wisconsin Center for Education Research

                            University of Wisconsin–Madison

      Annual Meeting of the American Educational Research Association,

                                  Montreal, April 11, 2005

This work was supported by a subgrant from the U.S. Department of Education
(S368A030011) to the State of Oklahoma and a grant from the National Science
Foundation (EHR 0233445) to the University of Wisconsin–Madison. Any opinions,
findings, or conclusions are those of the author and do not necessarily reflect the views
of the supporting agencies.


         Alignment among policy documents, curriculum materials, and instructional
practice has taken on increased importance over the past 10 to 15 years. In the early
1990s, a major tenet of the efforts toward systemic reform was to have the system
components aligned with one another (Smith & O’Day, 1991). The 1994 reauthorization
of Title I of the Elementary and Secondary Education Act of 1965 (ESEA), the
Improving America’s Schools Act of 1994 (IASA), included the requirement that states
use assessments aligned with curriculum standards, a requirement very much attuned to
the theory of systemic reform. Continuing with the same principle, the No Child Left
Behind Act of 2001 made the assessment requirements in reading and mathematics more
explicit and required states to demonstrate that their assessments in grades 3 through 8,
and once during high school, are aligned with challenging academic content standards.

        Aware of the increasing importance of alignment, Webb (1997) wrote a
monograph on the criteria for judging alignment for the National Institute for Science
Education, encouraged by Andrew Porter, who was the principal investigator for the
institute. This monograph discussed in some detail the methods states and other
jurisdictions used to determine alignment and the criteria that can be used to evaluate
the alignment of a system. The monograph was one of several documents produced for a
study of evaluations of systemic reform, motivated by the National Science Foundation’s
systemic reform program and conducted in close cooperation with the Council of Chief
State School Officers (CCSSO). The monograph describes in detail criteria that can be
used to judge the alignment between standards and assessments within an educational
system. The five major alignment criteria developed by Webb are content focus,
pedagogical implications, equity, articulation across grades and ages, and system
applicability.

        During the mid-1990s, CCSSO devoted significant effort to analyzing state
standards and was interested in a process for analyzing agreement between state
standards and assessments. In cooperation with CCSSO, Webb then developed a process
for conducting a content analysis to judge the alignment between standards and
assessments. This content analysis used four of the six criteria identified under the major
content-focus criterion described in the alignment monograph: categorical concurrence,
depth-of-knowledge consistency, range-of-knowledge correspondence, and balance of
representation.

        In 1998, the newly developed alignment process was used for the first time to
analyze, with the cooperation of CCSSO, the alignment of curriculum standards and
assessment of four states. Four to five reviewers coded the depth-of-knowledge (DOK)
levels of standards and the assessment items using paper-and-pencil forms. These data
were hand-entered into an Excel file and then analyzed using procedures developed with
the help of John Smithson.

        Over the next two years, the alignment process was refined and used to conduct
alignment analyses in additional states. The definitions for the depth-of-knowledge
(DOK) levels for four content areas (reading and writing, mathematics, science, and
social studies) were written and refined after each analysis. Another monograph on the
alignment process was published by CCSSO in 2002 (Webb, 2002).

        In 2003, the state of Oklahoma, with the cooperation of CCSSO and the Technical
Issues for Large Scale Assessment (TILSA) collaborative, received a grant from the
United States Department of Education to develop an electronic tool on a CD that could
be used to do the alignment analysis. The work on the electronic tool was begun with the
support of a grant from the National Science Foundation in 2002 for the purpose of
providing technical assistance to the initiative to create Mathematics and Science
Partnerships among K–12 school districts and institutions of higher education. The major
work on the Web Alignment Tool (WAT) began in 2003. The alpha test for the WAT
was conducted in Delaware in August, 2003, by analyzing standards and assessments
from three states—Delaware English Language Arts (grades 3, 5, 8, and 10), mathematics
(3 and 8), and science (4, 6, 8, and 11); South Carolina English Language Arts (grade
10) and science (high school biology); and Oklahoma (mathematics grade 8 and Algebra
I) and science (high school biology). The on-line beta test of the tool was conducted in
Delaware in September, 2003, for mathematics grades 5 and 10. The beta test of the CD
version of the WAT was conducted in Alabama in January, 2004, for mathematics grades
3, 5, 7, and 9.

        In 2004, the on-line WAT was used to conduct additional analyses for four states.
Currently, the WAT exists both as an on-line tool and on a CD. One dissemination
conference on how to use the alignment tools was conducted for states west of the
Mississippi on February 28 and March 1, 2005, in Phoenix. A second dissemination
conference is to be conducted for states east of the Mississippi in Boston on July 25 and
26, 2005.

        The Webb alignment process is one of a handful of alignment processes (Blank, 2002).
Porter and Smithson (Porter, 2002) developed a process referred to as the Survey of the
Enacted Curriculum (SEC). Central to this process is a content-by-cognitive level matrix.
Reviewers systematically categorize standards, assessments, curriculum, or instructional
practices onto the matrix indicating the degree of emphasis in each cell. Comparisons, or
the degree of alignment, are made by considering the amount of overlap of cells on the
matrix between any two elements of the analysis (assessment and standards, curriculum
and standards, standards and instruction, etc.).

       Achieve, Inc., has developed another process that is based on a group of experts
reaching consensus on the degree to which the assessment-by-standard mapping
conducted by a state or district is valid. This process reports on five criteria: Content
Centrality, Performance Centrality, Source of Challenge, Balance, and Range. For
Content Centrality and Performance Centrality, reviewers reach a consensus as to
whether the item and the intended objective(s) correspond fully, partially, or not at all.
Achieve prepares an extensive narrative to describe the results from the review and will
include a “policy audit” of standards and the assessment system if desired.

                                Webb Alignment Process

        Generally, the alignment process is performed during a three-day Alignment
Analysis Institute. The length of the institute is dependent on the number of grades to be
analyzed, the length of the standards, the length of the assessments, and the number of
assessment forms. Five to eight reviewers generally conduct each analysis; a larger
number of reviewers increases the reliability of the results. Reviewers should be content-
area experts, district content-area supervisors, and content-area teachers.

        To standardize the language, the process employs the convention of standards,
goals, and objectives to describe three levels of expectations for what students are to
know and do. Standard is used here as the most general term (for instance, Data Analysis
and Statistics). A standard, most of the time, will consist of a specific number of goals,
which in turn consist of a specific number of objectives. Generally, but not always, there
is an assumption that the objectives are intended to span the content of the goals and
standards under which they fall.

        Reviewers are trained to identify the depth-of-knowledge of objectives and
assessment items. This training includes reviewing the definitions of the four depth-of-
knowledge (DOK) levels and then reviewing examples of each. Then the reviewers
participate in 1) a consensus process to determine the depth-of-knowledge levels of the
state’s objectives and 2) individual analyses of the assessment items of each of the
assessments. Following individual analyses of the items, reviewers participate in a
debriefing discussion in which they give their overall impressions of the alignment
between the assessment and the state’s curriculum standards.

         To derive the results on the degree of agreement between the state’s standards and
each assessment, the reviewers’ responses are averaged. Any variance among reviewers
is considered legitimate, with the true depth-of-knowledge level for the item falling
somewhere in between two or more assigned values. Such variation could signify a lack
of clarity in how the objectives were written, the robustness of an item that can
legitimately correspond to more than one objective, and/or a depth of knowledge that
falls in between two of the four defined levels. Reviewers are allowed to identify one
assessment item as corresponding to up to three objectives: one primary hit (objective)
and up to two secondary hits. However, reviewers can code only one depth-of-knowledge
level for each assessment item, even if the item corresponds to more than one objective.
Finally, in addition to learning the process, reviewers are asked to provide suggestions for
improving the process.
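The coding constraints just described (exactly one DOK level per item, one primary hit, and at most two secondary hits) can be sketched as a simple record. The class and field names below are illustrative only; they are not part of the WAT.

```python
from dataclasses import dataclass, field

@dataclass
class ItemCoding:
    """One reviewer's coding of a single assessment item: exactly one
    depth-of-knowledge level, one primary objective (hit), and up to
    two secondary objectives."""
    item_id: str
    dok: int                                        # one DOK level per item
    primary: str                                    # primary hit (objective code)
    secondary: list = field(default_factory=list)   # up to two secondary hits

    def __post_init__(self):
        if not 1 <= self.dok <= 4:
            raise ValueError("DOK levels run from 1 to 4")
        if len(self.secondary) > 2:
            raise ValueError("at most two secondary hits per item")
```

A record like `ItemCoding("item-07", 2, "3.2.1", ["3.2.4"])` would then capture one reviewer's judgment for one item.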

       Reviewers are instructed to focus primarily on the alignment between the state
standards and the various assessments. However, they are encouraged to offer their
opinions on the quality of the standards, or of the assessment activities/items, by writing a
note about the items. Reviewers can also indicate whether there is a source-of-challenge
issue with the item—i.e., a problem with the item that might cause the student who
knows the material to give a wrong answer, or enable someone who does not have the
knowledge being tested to answer the item correctly. For example, a mathematics item
that involves an excessive amount of reading may represent a source-of-challenge issue
because the skill required to answer is more a reading skill than a mathematics skill.
Source-of-challenge can be considered a fifth alignment criterion in the analysis and
was originally so defined by Achieve, Inc.

         The results produced from the institute pertain only to the issue of agreement
between the state standards and the assessment instruments. Thus, the alignment analysis
does not serve as external verification of the general quality of a state’s standards or
assessments. Rather, only the degree of alignment is discussed in the results. The
averages of the reviewers’ coding are used to determine whether the alignment criteria
are met. When reviewers do vary in their judgments, the averages lessen the error that
might result from any one reviewer’s finding. Standard deviations, which give one
indication of the variance among reviewers, are reported.

         To report on the results of an alignment study of a state’s curriculum standards
and assessments for different grade levels, the study addresses specific criteria related to
the content agreement between the state standards and grade-level assessments. The four
alignment criteria receive major attention in the reports: categorical concurrence, depth-
of-knowledge consistency, range-of-knowledge correspondence, and balance of
representation.
                       Alignment Criteria Used for This Analysis

        The analysis, which judges the alignment between standards and assessments on
the basis of four criteria, also reports on the quality of assessment items by identifying
those items with sources of challenge and other issues. For each alignment criterion, an
acceptable level is defined by what would be required to assure that a student had met
the standard.
Categorical Concurrence

        An important aspect of alignment between standards and assessments is whether
both address the same content categories. The categorical-concurrence criterion provides
a very general indication of alignment if both documents incorporate the same content.
The criterion of categorical concurrence between standards and assessment is met if the
same or consistent categories of content appear in both documents. This criterion was
judged by determining whether the assessment included items measuring content from
each standard. The analysis assumed that the assessment had to have at least six items
measuring content from a standard in order for an acceptable level of categorical
concurrence to exist between the standard and the assessment. The number of items, six,
is based on estimating the number of items that could produce a reasonably reliable
subscale for estimating students’ mastery of content on that subscale. Of course, many
factors have to be considered in determining what a reasonable number is, including the
reliability of the subscale, the mean score, and cutoff score for determining mastery.
Using a procedure developed by Subkoviak (1988) and assuming that the cutoff score is
the mean and that the reliability of one item is .1, it was estimated that six items would
produce an agreement coefficient of at least .63. This indicates that about 63% of the
group would be consistently classified as masters or nonmasters if two equivalent test
administrations were employed. The agreement coefficient would increase to .77 if the
cutoff score were raised to one standard deviation above the mean, and to .88 with a
cutoff score of 1.5 standard deviations above the mean. Usually, states do not report student
results by standards, or require students to achieve a specified cutoff score on subscales
related to a standard. If a state did do this, then the state would seek a higher agreement
coefficient than .63. Six items were assumed as a minimum for an assessment measuring
content knowledge related to a standard and as a basis for making some decisions about
students’ knowledge of that standard. If the mean for six items is 3 and one standard
deviation is one item, then a cutoff score set at 4 would produce an agreement coefficient
of .77. With any fewer items and a mean of one-half the items, the cutoff would have to
allow a student to miss only one item. This would be a very stringent requirement,
considering a reasonable standard error of measurement on the subscale.
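Subkoviak's procedure is analytic; the sketch below is only a Monte Carlo illustration of the quantity being estimated above, namely the proportion of examinees classified the same way (master/nonmaster) on two equivalent six-item administrations. The Beta(2, 2) ability distribution is an assumption made for this example and is not part of Subkoviak's method.

```python
import random

def agreement_coefficient(n_items=6, cutoff=3, n_examinees=50_000, seed=1):
    """Monte Carlo estimate of the proportion of examinees classified
    consistently as masters or nonmasters on two equivalent n_items-item
    subscales. Each examinee's true proportion-correct is drawn from a
    Beta(2, 2) distribution -- an illustrative assumption only."""
    rng = random.Random(seed)
    consistent = 0
    for _ in range(n_examinees):
        p = rng.betavariate(2, 2)                      # assumed true ability
        form1 = sum(rng.random() < p for _ in range(n_items))  # first administration
        form2 = sum(rng.random() < p for _ in range(n_items))  # second administration
        consistent += (form1 >= cutoff) == (form2 >= cutoff)
    return consistent / n_examinees
```

Under these assumptions, moving the cutoff away from the mean score raises the estimated agreement, which mirrors the .63/.77/.88 pattern described in the text.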

Depth-of-Knowledge Consistency

        Standards and assessments can be aligned not only on the category of content
covered by each, but also on the basis of the complexity of knowledge required by each.
Depth-of-knowledge consistency between standards and assessment indicates alignment
if what is elicited from students on the assessment is as demanding cognitively as what
students are expected to know and do as stated in the standards. For consistency to exist
between the assessment and the standard, as judged in this analysis, at least 50% of the
items corresponding to an objective had to be at or above the level of knowledge of the
objective: 50%, a conservative cutoff point, is based on the assumption that a minimal
passing score for any one standard of 50% or higher would require the student to
successfully answer at least some items at or above the depth-of-knowledge level of the
corresponding objectives. For example, assume an assessment included six items related
to one standard and students were required to answer correctly four of those items to be
judged proficient—i.e., 67% of the items. If three, 50%, of the six items were at or above
the depth-of-knowledge level of the corresponding objectives, then for a student to
achieve a proficient score would require the student to answer correctly at least one item
at or above the depth-of-knowledge level of one objective. Some leeway was used in the
analysis on this criterion: if a standard had between 40% and 50% of items at or above
the depth-of-knowledge levels of the corresponding objectives, then it was reported that
the criterion was “weakly” met.
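The 50%/40% decision rule just described can be sketched as a small function; the function and verdict names are illustrative, not part of the Webb process or the WAT.

```python
def dok_consistency(item_doks, objective_doks):
    """Judge depth-of-knowledge consistency for one standard.

    item_doks: DOK level coded for each item hitting the standard.
    objective_doks: DOK level of the objective each item was matched to
    (parallel lists). Returns the percent of items at or above the
    objective's level and a verdict under the 50%/40% rule."""
    at_or_above = sum(i >= o for i, o in zip(item_doks, objective_doks))
    pct = 100.0 * at_or_above / len(item_doks)
    if pct >= 50:
        verdict = "YES"        # criterion met
    elif pct >= 40:
        verdict = "WEAK"       # criterion weakly met
    else:
        verdict = "NO"         # criterion not met
    return pct, verdict
```

For the six-item example in the text, three of six items at or above the objectives' levels gives exactly 50% and a verdict of "YES".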

       Interpreting and assigning depth-of-knowledge levels to both objectives within
standards and to assessment items is an essential requirement of alignment analysis.
These descriptions help to clarify what the different levels represent in, for example,
mathematics.

        Level 1 (Recall) includes the recall of information such as a fact, definition, term,
or a simple procedure, as well as performing a simple algorithm or applying a formula.
That is, in mathematics, a one-step, well-defined, and straightforward algorithmic
procedure should be included at this lowest level. Other key words that signify Level 1
include “identify,” “recall,” “recognize,” “use,” and “measure.” Verbs such as “describe”
and “explain” could be classified at different levels, depending on what is to be described
and explained.
        Level 2 (Skill/Concept) includes the engagement of some mental processing
beyond a habitual response. A Level 2 assessment item requires students to make some
decisions as to how to approach the problem or activity, whereas Level 1 requires
students to demonstrate a rote response, perform a well-known algorithm, follow a set
procedure (like a recipe), or perform a clearly defined series of steps. Keywords that
generally distinguish a Level 2 item include “classify,” “organize,” “estimate,” “make
observations,” “collect and display data,” and “compare data.” These actions imply more
than one step. For example, to compare data requires first identifying characteristics of
the objects or phenomena and then grouping or ordering the objects. Some action verbs,
such as “explain,” “describe,” or “interpret,” could be classified at different levels,
depending on the object of the action. For example, interpreting information from a
simple graph, which requires reading information from the graph, is at Level 2.
Interpreting information from a complex graph that requires some decisions on what
features of the graph need to be considered and how information from the graph can be
aggregated is at Level 3. Level 2 activities are not limited only to number skills, but can
involve visualization skills and probability skills. Other Level 2 activities include
noticing and describing non-trivial patterns; explaining the purpose and use of
experimental procedures; carrying out experimental procedures; making observations and
collecting data; classifying, organizing, and comparing data; and organizing and
displaying data in tables, graphs, and charts.

        Level 3 (Strategic Thinking) requires reasoning, planning, using evidence, and a
higher level of thinking than the previous two levels. In most instances, requiring
students to explain their thinking is a Level 3. Activities that require students to make
conjectures are also at this level. The cognitive demands at Level 3 are complex and
abstract. The complexity does not result from the fact that there are multiple answers, a
possibility for both Levels 1 and 2, but because the task requires more demanding
reasoning. An activity, however, that has more than one possible answer and requires
students to justify the response they give would most likely be a Level 3.
Other Level 3 activities include drawing conclusions from observations; citing evidence
and developing a logical argument for concepts; explaining phenomena in terms of
concepts; and using concepts to solve problems.

        Level 4 (Extended Thinking) requires complex reasoning, planning, developing,
and thinking most likely over an extended period of time. The extended time period is not
a distinguishing factor if the required work is only repetitive and does not require
applying significant conceptual understanding and higher-order thinking. For example, if
a student has to take the water temperature from a river each day for a month and then
construct a graph, this would be classified as a Level 2. However, if the student is to
conduct a river study that requires taking into consideration a number of variables, this
would be at Level 4. At Level 4, the cognitive demands of the task should be high and the
work should be very complex. Students should be required to make several
connections—relate ideas within the content area or among content areas—and would
have to select one approach among many alternatives on how the situation should be
solved, in order to be at this highest level. Level 4 activities include developing and
proving conjectures; designing and conducting experiments; making connections between
a finding and related concepts and phenomena; combining and synthesizing ideas into
new concepts; and critiquing experimental designs.

Range-of-Knowledge Correspondence

         For standards and assessments to be aligned, the breadth of knowledge required
on both should be comparable. The range-of-knowledge criterion is used to judge
whether the span of knowledge expected of students by a standard corresponds to the
span of knowledge that students need in order to correctly answer the assessment
items/activities. The criterion for correspondence between span of
knowledge for a standard and an assessment considers the number of objectives within
the standard with one related assessment item/activity. Fifty percent of the objectives for
a standard had to have at least one related assessment item in order for the alignment on
this criterion to be judged acceptable. This level is based on the assumption that students’
knowledge should be tested on content from over half of the domain of knowledge for a
standard. This assumes that each objective for a standard should be given equal weight.
Depending on the balance in the distribution of items and the necessity for having a low
number of items related to any one objective, the requirement that assessment items need
to be related to more than 50% of the objectives for a standard increases the likelihood
that students will have to demonstrate knowledge on more than one objective per
standard to achieve a minimal passing score. As with the other criteria, a state may
choose to make the acceptable level on this criterion more rigorous by requiring an
assessment to include items related to a greater number of the objectives. However, any
restriction on the number of items included on the test will place an upper limit on the
number of objectives that can be assessed. Range-of-knowledge correspondence is more
difficult to attain if the content expectations are partitioned among a greater number of
standards and a large number of objectives. If 50% or more of the objectives for a
standard had a corresponding assessment item, then the range-of-knowledge criterion was
met. If between 40% and 50% of the objectives for a standard had a corresponding
assessment item, the criterion was “weakly” met.
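The range-of-knowledge rule works the same way as the DOK rule, but counts objectives with at least one hit instead of items; the sketch below uses illustrative names, not the WAT's.

```python
def range_of_knowledge(hits_per_objective):
    """Judge range-of-knowledge correspondence for one standard.

    hits_per_objective: number of assessment items matched to each
    objective under the standard (zero for objectives with no hit).
    Returns the percent of objectives with at least one hit and a
    verdict under the 50%/40% rule."""
    n_objectives = len(hits_per_objective)
    hit = sum(1 for h in hits_per_objective if h > 0)
    pct = 100.0 * hit / n_objectives
    if pct >= 50:
        verdict = "YES"        # criterion met
    elif pct >= 40:
        verdict = "WEAK"       # criterion weakly met
    else:
        verdict = "NO"         # criterion not met
    return pct, verdict
```

So a standard with six objectives of which three receive at least one item is judged acceptable (50%), while two of five hit objectives (40%) would be only weakly met.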

Balance of Representation

       In addition to comparable depth and breadth of knowledge, aligned standards and
assessments require that knowledge be distributed equally in both. The range-of-
knowledge criterion only considers the number of objectives within a standard hit (a
standard with a corresponding item); it does not take into consideration how the hits (or
assessment items/activities) are distributed among these objectives. The balance-of-
representation criterion is used to indicate the degree to which one objective is given
more emphasis on the assessment than another. An index is used to judge the distribution
of assessment items. This index only considers the objectives for a standard that have at
least one hit—i.e., one related assessment item per objective. The index is computed by
considering the difference in the proportion of objectives and the proportion of hits
assigned to the objective. An index value of 1 signifies perfect balance and is obtained if
the hits (corresponding items) related to a standard are equally distributed among the
objectives for the given standard. Index values that approach 0 signify that a large
proportion of the hits are on only one or two of all of the objectives hit. Depending on the
number of objectives and the number of hits, a unimodal distribution (most items related
to one objective and only one item related to each of the remaining objectives) has an
index value of less than .5. A bimodal distribution has an index value of around .55 or .6.
Index values of .7 or higher indicate that items/activities are distributed among all of the
objectives at least to some degree (e.g., every objective has at least two items); .7 is used
as the acceptable level on this criterion. Index values between .6 and .7 indicate the
balance-of-representation criterion has only been “weakly” met.
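The paper does not spell out the index formula. The sketch below uses the form commonly given for Webb's balance-of-representation index, 1 − (Σ|1/O − h/H|)/2, computed over the O objectives with at least one hit (H total hits); treat the formula as an assumption rather than a quotation from this document.

```python
def balance_index(hits_per_objective):
    """Balance-of-representation index for one standard, computed over
    the objectives with at least one hit. 1.0 means hits are spread
    evenly across those objectives; values near 0 mean one or two
    objectives absorb most of the hits. (Formula assumed; see lead-in.)"""
    hits = [h for h in hits_per_objective if h > 0]   # only objectives with a hit
    n, total = len(hits), sum(hits)
    return 1 - sum(abs(1 / n - h / total) for h in hits) / 2
```

Evenly distributed hits such as `[2, 2, 2]` give an index of 1, while a distribution concentrated on one objective, such as `[10, 1, 1]`, drops toward the weakly-met range.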


Source of Challenge

        The source-of-challenge criterion is used only to identify items whose major
cognitive demand is inadvertently placed on something other than the targeted
mathematics skill, concept, or application. Cultural bias or specialized knowledge could
be reasons for an item to have a source-of-challenge problem. Such item characteristics
may result in some students not answering an assessment item, or answering an
assessment item incorrectly, or at a lower level, even though they possess the
understanding and skills being assessed.

                            Reporting the Alignment Results

               The reports of an alignment analysis generally are quite lengthy, 150
pages or more. In a report, the distribution of the depth-of-knowledge levels of the
objectives under each set of standards is summarized. This process provides some
information on the rigor of the standards and, across grades, on the increased level of
expectations. Then, the degree of alignment for each grade is described by each criterion
and the changes required to achieve acceptable alignment. Reporting by the five
alignment criteria will produce information that will indicate whether

           1. there are a sufficient number of items on a test for each strand,
           2. the items are at an appropriate level of complexity,
           3. a sufficient proportion of the standards under each strand is assessed,
           4. the degree of emphasis among the standards is appropriate within each
              strand, and
           5. there are any items that may have a source of challenge.
Reviewers’ comments are then reported, followed by a report of the intraclass correlation
of the assignment of the depth-of-knowledge level to each item for each analysis. The
narrative of the report concludes by summarizing the alignment results.

       The appendices to the report include detailed information by standard, objective,
and item for each analysis. Appendix A reports the DOK levels for each objective for all
standards. Appendix B includes 11 tables for each analysis:

      • Summary of results for each of the four alignment criteria (four tables)
      • Comments made by reviewers on items identified as having a source-of-challenge
        issue, by item number
      • The depth-of-knowledge (DOK) level for each assessment item given by each
        reviewer, with the intraclass correlation for the group of reviewers on the last row
      • All notes made by reviewers on items, by item number
      • The DOK level and objective code assigned by each reviewer for each item
      • Objectives coded to each item, by reviewer
      • Items coded by reviewers for each objective
      • Number of reviewers coding an item, by objective

                                  Challenges and Issues

       In this section, I will identify some of the issues that have arisen in doing
alignment studies and address some of the basic principles of aligning content standards
and assessments.

Acceptable Level for Number of Items Per Standard

         The first issue is what number of items constitutes an adequate number to claim
that an assessment is aligned with a standard. The Webb alignment process uses six items
measuring content related to a standard as the acceptable level. This number, as discussed
above (page 5), was derived using a procedure developed by Subkoviak (1988) to
determine the reliability of judging a person’s mastery based on assessment items. The
WAT has a feature that allows users to vary the number used for an acceptable level, so
the process does have some flexibility. However, some situations have come up that raise
questions about six as the number. Table 1 reports the findings from the State A science
alignment analysis for grade 3 for the Categorical Concurrence criterion. Of the six
science standards, three met the acceptable level of having six hits and three standards
did not. The mean hits for the standards are shown in Table 1, along with the proportions
of items for each standard as specified in the state test blueprint. Clearly the state gives
more emphasis to two of the standards, 3.2 (Inquiry) and 3.4 (Subject Matter and
Concepts), and equal emphasis to the other four standards. This is reflected by the
distribution of items on the assessment. However, standard 3.5 (Design and Applications)
and standard 3.6 (Personal and Social) are more difficult to assess on an on-demand
assessment and were given less emphasis, even less than specified by the test blueprint.
The report indicated that the alignment was not acceptable because of an insufficient
number of items for three of the grade 3 standards. At issue: is six items a reasonable
minimum, or should adjustments be made in this acceptable level? If adjustments are to
be made, then what should be the decision rule?

Table 1
State A: Categorical Concurrence for Grade 3 Science
(N = 55 items)

Standard (test blueprint %)     Hits Mean   S.D.   Cat. Concurr.
3.1 History/Nature (8%)             1        0          NO
3.2 Inquiry (30%)                 17.38     2.12        YES
3.3 Unifying Themes (8%)           7.5      4           YES
3.4 Subj Matter/Conc (38%)        33.5      1.94        YES
3.5 Design/Applic (8%)             2.12     1.27        NO
3.6 Personal/Social (8%)           4.75     1.09        NO
Total                             66.25     5.78

Distribution of Items Related to a Standard by Depth-of-Knowledge Level

         A second issue regards the distribution of items on an assessment by the depth-of-
knowledge level. Is 50% of the items coded to a standard with a DOK at or above the
DOK of the corresponding objective appropriate as the minimal acceptable level? Table 2
and Figure 1 display the data for one state and one grade where this acceptable level was
met for four of the six standards. For Standard III only 42% of the over 13 items coded as
corresponding to that standard, on the average, had a depth-of-knowledge level that was
the same or above the DOK level of the corresponding objective. Since this is within 10%
of the acceptable level of 50% of items with DOK levels at or above, it was judged that this
standard and assessment only weakly met the alignment criterion of Depth-of-Knowledge
Consistency. Thus, a student could answer 8 of the 13 items corresponding to Standard
III, generally a level sufficient to be declared proficient on a standard, without ever
answering a question with a DOK level that is at least as high as the corresponding
objective. Standard I and the assessment failed to be acceptable on the Depth-of-
Knowledge Consistency criterion because only 17% of the nearly 10 items corresponding
to that standard had a DOK level that was at least comparable to the DOK level of the
corresponding objective.

        The acceptable level for the DOK criterion is based on the assumption that
students with a perceived minimal proficient score of 50% of the items correct should
have answered at least one item with a DOK level at least as complex as the
corresponding content objective. However, what is considered acceptable should depend
to some degree on the purpose of the assessment. If the purpose of the assessment is to
differentiate students who are proficient from students who are not, then an argument
could be made that all or nearly all of the item DOK levels should be the same as the
DOK levels of the corresponding objectives. However, if the purpose of the assessment
is to place students on a range of proficiency levels (e.g., below basic, basic, proficient,
and advanced), then it is reasonable to have items with a range of DOK levels in
comparison to the corresponding objectives.
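The decision rule and its weak band can be sketched as follows (a minimal illustration; the 50% threshold and the 10-point weak band are taken from the text):

```python
def dok_consistency(pct_at_or_above):
    """Judge Depth-of-Knowledge Consistency for one standard, given the
    percent of its items whose DOK level is at or above the DOK level of
    the corresponding objective: 50% or more is acceptable, and within
    10 percentage points of 50% is judged weak."""
    if pct_at_or_above >= 50:
        return "YES"
    if pct_at_or_above >= 40:
        return "WEAK"
    return "NO"

# Values from Table 2 (% At + % Above):
dok_consistency(17 + 0)    # Standard I   -> "NO"
dok_consistency(40 + 2)    # Standard III -> "WEAK"
dok_consistency(51 + 29)   # Standard II  -> "YES"
```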

        Content standards, and many objectives under content standards, cover a broad
range of content that students are expected to attain. Thus, the domain of items for
measuring students’ knowledge related to an objective or standard can be very large and
can vary by complexity or depth-of-knowledge level. The alignment process devised by
Webb has reviewers assign one DOK level to each objective. Reviewers, who are experts
in the content area, assign a DOK level to an objective by judging the complexity of the
most representative assessment items or content expressed by the objective. Because
many objectives cover a broad range of content, it may be reasonable to have items with
different DOK levels corresponding to the same objective: some below the DOK level of
the objective, some at, and some above. The decision rule imposed in the alignment
analysis discussed here is based on judging whether students are proficient. Another
decision rule could be based on having items that are more representative of the range of
complexity in objectives and standards, such as 20% with a DOK level of 1, 60% with a
DOK level of 2, and 20% with a DOK level of 3. Or, the range of complexity could be
decided by a certain percentage of items that are below, at, or above the corresponding
objectives. The issue remains that there are different ways of considering what is an
acceptable distribution of items by complexity, and the choice depends largely on the
purpose of the assessment.
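One way to operationalize a representativeness rule like the 20/60/20 split is to compare the observed shares of items at each DOK level against target shares within a tolerance (a sketch; the target shares and the 10-point tolerance are illustrative assumptions, not part of the alignment process):

```python
DEFAULT_TARGET = {1: 0.20, 2: 0.60, 3: 0.20}  # illustrative target shares

def matches_target(items_per_dok, target=DEFAULT_TARGET, tolerance=0.10):
    """Accept when the observed share of items at each DOK level is
    within `tolerance` of the target share for that level."""
    total = sum(items_per_dok.values())
    return all(abs(items_per_dok.get(level, 0) / total - share) <= tolerance
               for level, share in target.items())

matches_target({1: 2, 2: 6, 3: 2})   # True: exactly 20/60/20
matches_target({1: 8, 2: 2, 3: 0})   # False: level-1 items dominate
```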

Table 2
State B
Depth-of-Knowledge Consistency, High School Mathematics
(N = 51 items)

Standard: Title                              # Hits M   % Under M   % At M   % Above M   DOK
I - Patterns, Relationships and Functions      10.44       83         17         0        NO
II - Geometry and Measurement                  13.00       20         51        29       YES
III - Data Analysis and Statistics             13.44       58         40         2      WEAK
IV - Number Sense and Numeration                2.78       25         61        14       YES
V - Numerical and Algebraic Operations
    and Analytical ...                         10.67       30         57        12       YES
VI - Probability and Discrete Mathematics       6.89       42         56         2       YES
Total                                          57.22       43         47        11
Figure 1. Percent of items with DOK levels Under, At, and Above corresponding
objectives for each standard. [Stacked bar chart, "State B DOK Consistency":
x-axis, Mathematics Standards; series, % Under, % At, % Above.]

Breadth in Content Coverage of a Standard

        A third issue regards what constitutes the appropriate breadth of coverage for a
standard. The decision rule currently being used is that 50% or more of the objectives
under a standard must have at least one corresponding assessment item for a minimally
acceptable breadth of coverage. The number of objectives under a standard is highly
related to the difficulty of meeting the range-of-knowledge correspondence (breadth in
content) criterion. If a state lists a large number of objectives under a standard, then it is
more difficult for the state to meet an acceptable level on range of knowledge because of
the limited number of items that can be used on an assessment. For example, State B
(Table 3) in mathematics had six standards. Each standard had from 9 to 18 objectives,
for a total of 77 objectives. The high school test had a total of 51 items. Except for
Standard V, all of the standards had from 17% to 38% of their objectives with at least
one corresponding item, well below the acceptable level of 50% of the objectives.
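The 50% decision rule reads directly off the mean number of objectives hit (a minimal sketch using the Table 3 figures):

```python
def range_of_knowledge(num_objectives, mean_objectives_hit):
    """Judge Range-of-Knowledge Correspondence: acceptable (YES) when at
    least 50% of a standard's objectives have one or more corresponding
    assessment items."""
    pct_hit = 100 * mean_objectives_hit / num_objectives
    return "YES" if pct_hit >= 50 else "NO"

range_of_knowledge(9, 5.22)    # Standard V:  58% of objectives hit -> "YES"
range_of_knowledge(14, 2.44)   # Standard IV: 17% of objectives hit -> "NO"
```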

        Having an adequate breadth of content on an assessment can be a trade-off with
the length of the assessment. An assessment with fewer items will have more difficulty
assessing all of the objectives at least partially. Other factors come into play when
considering breadth. Some standards may have a larger number of objectives because the
standard covers more content. For example, for State B as depicted in Table 3 and Figure
2, Standard II (Geometry and Measurement) has more objectives than Standard V
(Numerical and Algebraic Operations): 18 objectives compared to 9. This suggests that
the content under Standard II has been partitioned in more ways than the content under
Standard V. It could be that the objectives under geometry and measurement are more
specific, or it could be that the state considered geometry and measurement to have more
content to cover. Another factor is that some of the objectives under Standard II may be
more difficult to assess on an on-demand assessment, particularly if each item measures
content related to only one objective. An on-demand assessment could cover more
content by including more robust items that measure content associated with more than
one objective or standard.

Table 3
State B
Range-of-Knowledge Correspondence
High School Mathematics (N = 51 items)

Standard: Title              # Goals  # Objs  # Hits Mean  # Objs Hit Mean  % Objs Hit Mean  Rng. of Know.
I - Patterns, Relationships
    and Functions               2       11       10.44           4.22             38               NO
II - Geometry and
    Measurement                 3       18       13.00           5.78             32               NO
III - Data Analysis and
    Statistics                  3       14       13.44           5.00             35               NO
IV - Number Sense and
    Numeration                  3       14        2.78           2.44             17               NO
V - Numerical and Algebraic
    Operations and
    Analytical ...              2        9       10.67           5.22             55              YES
VI - Probability and
    Discrete Mathematics        2       11        6.89           3.67             33               NO
Total                          15       77       57.22           4.39             35

        The current decision rule of 50% of the objectives with at least one hit clearly is
a very minimal requirement for alignment. A number of factors could be considered in
judging the adequate range of content, including the breadth of content covered by a
standard, the length of the assessment, the suitability of the content for an on-demand
assessment, and the difference in importance of different objectives under a standard.
These and other factors could then be used to develop other decision rules, such as
randomly sampling objectives under a standard, setting a minimum number of objectives
under a standard that must have a hit, or differentiating the importance of some standards
from others by requiring more objectives to be assessed under the most important
standards than under the least important ones. As with the other issues, there are multiple
considerations that need to be given to judging the adequacy of the alignment between an
assessment and a set of standards.

Figure 2. Percent of objectives with one or more hits for each standard for State B high
school mathematics. [Bar chart, "State B Range of Knowledge": x-axis, Mathematics
Standards.]

Degree of Emphasis Given Some Objectives

        It is reasonable to have some standards be more important than other standards
and for some objectives under a standard to be more important than other objectives. The
Balance of Representation alignment criterion, however, assumes that items should be
fairly evenly distributed among the objectives under a standard. An index is used to
depict balance:
                Balance Index = 1 - (Σ_k |1/O - I(k)/H|) / 2

                where O    = total number of objectives hit for the standard
                      I(k) = number of items hit corresponding to objective k
                      H    = total number of items hit for the standard
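In code, the index for a single standard can be computed directly from the hit counts (a sketch; the example counts are hypothetical):

```python
def balance_index(hits_per_objective):
    """Balance of Representation index for one standard.

    hits_per_objective: number of items (hits) coded to each objective
    under the standard that received at least one hit.
    """
    O = len(hits_per_objective)    # total objectives hit for the standard
    H = sum(hits_per_objective)    # total items hit for the standard
    return 1 - sum(abs(1 / O - i_k / H) for i_k in hits_per_objective) / 2

balance_index([5, 5, 5, 5])    # 1.0: hits spread perfectly evenly
balance_index([17, 1, 1, 1])   # 0.4: hits concentrated on one objective
```

An index of 1 indicates a perfectly even distribution of hits across the objectives hit; heavy concentration on a single objective drives the index down.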
Table 4 shows the index value for three language arts standards from an analysis for one
state. Figure 3 gives a pictorial representation of the distribution of hits (test items)
coded as corresponding to the different objectives for Standards I and II. For both of
these standards, reviewers coded a large number of hits (over 250) as corresponding to
one objective under each standard. This resulted in index values of .57 and .68, below the
acceptable level of .70 (Table 4). The large number of hits is related to a writing sample
that has a weighting of 12, compared to a weighting of 1 for most items.

        At issue with balance is the degree to which the amount of emphasis given to
different objectives under a standard should vary. It is possible for a state to accept a
balance index value lower than .70. State B, then, could be satisfied with a balance index
value of .68 for Standard II (Table 4), with a large emphasis on objective 2.4 (Figure 3).
However, sometimes one objective is emphasized more than others simply because it is
easier to write assessment items for some objectives than for others. The main issue to
resolve is how alignment analyses should treat differences in emphasis across objectives.
This issue relates to how the assessment blueprint differentiates among objectives and
whether it is appropriate to have large variations among objectives.

Table 4
State B Balance of Representation
High School Language Arts (3 of 12 standards) (N = 116 items)

Standard: Title                      # Goals  # Objs  % Hits in Std/Ttl   Balance Index    Bal. of
                                                        Mean     S.D.     Mean    S.D.    Represent.
I - Meaning and Communication—          1      5.11      28       8       0.57    0.12        NO
II - Meaning and Communication—         1      4         48       7       0.68    0.14      WEAK
VIII - Genre and Craft of Language      1      5         17       6       0.63    0.16      WEAK
Total                                  12     55.33       8      18       0.36    0.21
Figure 3. Number of hits for each objective under two standards for State B high school
language arts. [Horizontal bar chart, "State B Balance of Representation": y-axis,
Language Arts Standards; x-axis, Number of Hits (0–400).]

Change in Depth-of-Knowledge Level Across Grades

        The final issue to be discussed is the change in complexity of content across
grade levels. It is reasonable to expect that as students proceed through the grades, they
will be expected to do more reasoning and analysis and less simple recall and
recognition. This was the case for State A in mathematics (Figure 4) and in language arts
(Figure 5). For both mathematics and language arts, the percent of objectives with a
depth-of-knowledge level of 1 (recall and recognition) decreased while the percent of
objectives with a depth-of-knowledge level of 3 (strategic reasoning) increased from
grade 3 to grade 10 (Figures 4 and 5). However, DOK levels depend somewhat on grade
level and on what a typical student at a grade level can be expected to know and do.
Reviewers in an alignment analysis developed by Webb are instructed to think about
what a typical student should be expected to know and do when assigning DOK levels to
the content objectives. In reading, the increase in complexity across grades may be due
to having more sophisticated passages while the actual behavioral or cognitive
requirements, such as determining the main idea, stay relatively constant. However, if
along with more sophisticated passages students are expected to do more drawing of
inferences or paraphrasing, then the DOK levels may increase across grades. Currently
there are no fixed guidelines as to what is an acceptable progression in content
complexity from grade to grade. In the absence of such guidelines, the progression of
content complexity depicted in Figures 4 and 5 for State A seems reasonable. The work
of Wise and Alt (2005) on vertical alignment will help inform the field on this issue.

Figure 4. State A mathematics DOK levels for objectives by grade. [Stacked bar chart:
y-axis, Percent of DOK Levels; x-axis, grades 3, 4, 5, 6, 7, 8, and 10; series, DOK
Levels 1–4.]
Figure 5. State A reading language arts DOK levels for objectives by grade. [Stacked bar
chart: y-axis, Percent of DOK Levels; x-axis, grades 3, 4, 5, 6, 7, 8, and 10; series, DOK
Levels 1–4.]

         In this paper, I have described a process that has been used to analyze the
agreement between state academic content standards and state assessments. The Webb
Alignment Process was developed for the National Institute for Science Education
(NISE) and the Council of Chief State School Officers in 1997 and has evolved over
time. A web-based tool is now available for aiding in conducting the process and
analyzing the results. The process produces information about the relationship between a
set of standards and an assessment by reporting on four main alignment criteria—
categorical concurrence, depth-of-knowledge consistency, range-of-knowledge
correspondence, and balance of representation.

        Five alignment issues were discussed. Each of these issues is related to one or
more of the alignment criteria. These issues center on the basic question of what
alignment is good enough. Specific rationales described in this paper have been used to
set acceptable levels for each of the four alignment criteria. These acceptable levels have
been specified primarily for pragmatic reasons, such as assumptions about what would
be considered a passing score, the number of items needed to make some decisions about
student learning, and the relatively low number of items that can be included on an
on-demand assessment. The issues discussed arise from changing these underlying
assumptions and considering variations in the purpose of an assessment. The issues
themselves are not resolved in this paper, nor was it the intent of the paper to do so even
if they could be. The existence of these and other related issues points to the fact that
judging the alignment among standards and assessments requires some subjectivity and
cannot be based solely on a clear set of objective rules. This makes it critical for any
alignment analysis to make clear what the underlying assumptions are and how
conclusions are reached.

References

Blank, R. (2002). Models for alignment analysis and assistance to states. Council of
       Chief State School Officers summary document. Washington, DC: CCSSO.

Porter, A. C. (2002, October). Measuring the content of instruction: Uses in
        research and practice. Educational Researcher, 31(7), 3–14.

Smith, M. S., & O’Day, J. A. (1991). Systemic school reform. In S. H. Fuhrman & B.
       Malen (Eds.), The politics of curriculum and testing (pp. 233–267). Bristol, PA:
       Falmer.

Subkoviak, M. J. (1988). A practitioner’s guide to computation and interpretation of
      reliability indices for mastery tests. Journal of Educational Measurement, 25(1),

Webb, N. L. (1997). Criteria for alignment of expectations and assessments in
      mathematics and science education. Council of Chief State School Officers and
      National Institute for Science Education Research Monograph No. 6. Madison:
      University of Wisconsin, Wisconsin Center for Education Research.

Webb, N. L. (2002). Alignment study in language arts, mathematics, science, and social
      studies of state standards and assessments for four states. Washington, DC:
      Council of Chief State School Officers.

Wise, L. L., & Alt, M. (2005). Assessing vertical alignment. Paper prepared for the
       Council of Chief State School Officers. Alexandria, VA: Human Resources
       Research Organization, DFR 05-19.
