DRAFT

Issues Related to Judging the Alignment of Curriculum Standards and Assessments

Norman L. Webb
Wisconsin Center for Education Research
University of Wisconsin–Madison

Annual Meeting of the American Educational Research Association, Montreal, April 11, 2005

This work was supported by a subgrant from the U.S. Department of Education (S368A030011) to the State of Oklahoma and a grant from the National Science Foundation (EHR 0233445) to the University of Wisconsin–Madison. Any opinions, findings, or conclusions are those of the author and do not necessarily reflect the views of the supporting agencies.

Issues Related to Judging the Alignment of Curriculum Standards and Assessments

Norman L. Webb

Introduction

Alignment among policy documents, curriculum materials, and instructional practice has received increased attention over the past 10 to 15 years. In the early 1990s, a major tenet of the efforts toward systemic reform was to have the system components aligned with one another (Smith & O'Day, 1991). The Title I reauthorization of the Elementary and Secondary Education Act of 1965 (ESEA)—the Improving America's Schools Act of 1994 (IASA)—included the requirement that states use assessments aligned with curriculum standards, a requirement very much attuned to the theory of systemic reform. Continuing with the same principle, the No Child Left Behind Act of 2001 made the assessment requirements in reading and mathematics more explicit and required states to indicate that their assessments in grades 3 through 8, and once during high school, are aligned with challenging academic content standards.

Aware of the increasing importance of alignment, Webb (1997) wrote a monograph on the criteria for judging alignment for the National Institute for Science Education, encouraged by Andrew Porter, who was the principal investigator for the institute. This monograph discussed in some detail the methods states and other jurisdictions used to determine alignment and the criteria that can be used to evaluate the alignment of a system. The monograph was written as one document related to the study of evaluations of systemic reform, motivated by the National Science Foundation's systemic reform program, and in close cooperation with the Council of Chief State School Officers (CCSSO). The monograph describes in detail criteria that can be used to judge the alignment between standards and assessments within an educational system. The five major alignment criteria developed by Webb are content focus, pedagogical implications, equity, articulation across grades and ages, and system applicability.

During the mid-1990s, CCSSO devoted significant effort to analyzing state standards and was interested in a process for analyzing agreement between state standards and assessments. In cooperation with CCSSO, Webb then developed a content-analysis process for judging the alignment between standards and assessments. This content analysis used four of the six criteria identified under the major content-focus criterion described in the alignment monograph—categorical concurrence, depth-of-knowledge consistency, range-of-knowledge correspondence, and balance of representation. In 1998, the newly developed alignment process was used for the first time, with the cooperation of CCSSO, to analyze the alignment of the curriculum standards and assessments of four states.
Four to five reviewers coded the depth-of-knowledge (DOK) levels of the standards and the assessment items using paper-and-pencil forms. These data were hand-entered into an Excel file and then analyzed using procedures developed with the help of John Smithson. Over the next two years, the alignment process was refined and used to conduct alignment analyses in additional states. The definitions of the DOK levels for four content areas (reading and writing, mathematics, science, and social studies) were written and refined after each analysis. Another monograph on the alignment process was published by CCSSO in 2002 (Webb, 2002).

In 2003, the state of Oklahoma, with the cooperation of CCSSO and the Technical Issues in Large-Scale Assessment (TILSA) collaborative, received a grant from the United States Department of Education to develop an electronic tool, on a CD, that could be used to do the alignment analysis. Work on the electronic tool had begun with the support of a 2002 grant from the National Science Foundation for the purpose of providing technical assistance to the initiative to create Mathematics and Science Partnerships among K–12 school districts and institutions of higher education. The major work on the Web Alignment Tool (WAT) began in 2003. The alpha test of the WAT was conducted in Delaware in August 2003 by analyzing standards and assessments from three states—Delaware English language arts (grades 3, 5, 8, and 10), mathematics (grades 3 and 8), and science (grades 4, 6, 8, and 11); South Carolina English language arts (grade 10) and science (high school biology); and Oklahoma mathematics (grade 8 and Algebra I) and science (high school biology). The on-line beta test of the tool was conducted in Delaware in September 2003 for mathematics grades 5 and 10. The beta test of the CD version of the WAT was conducted in Alabama in January 2004 for mathematics grades 3, 5, 7, and 9. In 2004, the on-line WAT was used to conduct additional analyses for four states. Currently, the WAT exists as an on-line tool (http://www.wcer.wisc.edu/WAT) and on a CD. One dissemination conference on how to use the alignment tools was conducted for states west of the Mississippi on February 28 and March 1, 2005, in Phoenix. A second dissemination conference is to be conducted for states east of the Mississippi in Boston on July 25 and 26, 2005.

The Webb alignment process is one of a handful of such processes (Blank, 2002). Porter and Smithson (Porter, 2002) developed a process referred to as the Survey of the Enacted Curriculum (SEC). Central to this process is a content-by-cognitive-level matrix. Reviewers systematically categorize standards, assessments, curriculum, or instructional practices onto the matrix, indicating the degree of emphasis in each cell. Comparisons, or the degree of alignment, are made by considering the amount of overlap of cells on the matrix between any two elements of the analysis (assessment and standards, curriculum and standards, standards and instruction, etc.).

Achieve, Inc., has developed another process that is based on a group of experts reaching consensus on the degree to which the assessment-by-standard mapping conducted by a state or district is valid. This process reports on five criteria: Content Centrality, Performance Centrality, Source of Challenge, Balance, and Range. For Content Centrality and Performance Centrality, reviewers reach a consensus as to whether the item and the intended objective(s) correspond fully, partially, or not at all. Achieve prepares an extensive narrative to describe the results from the review and will include a "policy audit" of standards and the assessment system if desired.
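To make the SEC-style cell-overlap comparison concrete, the sketch below computes the agreement between two emphasis matrices. It follows the commonly cited form of Porter's alignment index (one minus half the summed absolute differences between cell proportions); the matrices are hypothetical, and details of any particular SEC implementation may differ.

```python
# Sketch of the cell-overlap comparison central to the SEC process.
# Assumes each matrix holds proportions of emphasis (cells sum to 1).

def porter_alignment_index(x, y):
    """Return 1 minus half the total absolute difference between
    cell proportions; 1.0 means identical emphasis, 0.0 none shared."""
    total_diff = sum(abs(a - b)
                     for row_x, row_y in zip(x, y)
                     for a, b in zip(row_x, row_y))
    return 1 - total_diff / 2

# Hypothetical 3 content topics x 2 cognitive levels
standards_emphasis = [[0.20, 0.10],
                      [0.25, 0.15],
                      [0.20, 0.10]]
assessment_emphasis = [[0.30, 0.05],
                       [0.30, 0.05],
                       [0.25, 0.05]]
print(porter_alignment_index(standards_emphasis, assessment_emphasis))  # 0.8
```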
Webb Alignment Process

Generally, the alignment process is performed during a three-day Alignment Analysis Institute. The length of the institute depends on the number of grades to be analyzed, the length of the standards, the length of the assessments, and the number of assessment forms. Five to eight reviewers generally conduct each analysis; a larger number of reviewers increases the reliability of the results. Reviewers should be content-area experts, district content-area supervisors, and content-area teachers.

To standardize the language, the process employs the convention of standards, goals, and objectives to describe three levels of expectations for what students are to know and do. Standard is used here as the most general level (for instance, Data Analysis and Statistics). A standard, most of the time, comprises a specific number of goals, which in turn comprise a specific number of objectives. Generally, but not always, there is an assumption that the objectives are intended to span the content of the goals and standards under which they fall.

Reviewers are trained to identify the depth-of-knowledge levels of objectives and assessment items. This training includes reviewing the definitions of the four depth-of-knowledge (DOK) levels and then reviewing examples of each. The reviewers then participate in (1) a consensus process to determine the depth-of-knowledge levels of the state's objectives and (2) individual analyses of the items on each of the assessments. Following the individual analyses of the items, reviewers participate in a debriefing discussion in which they give their overall impressions of the alignment between the assessment and the state's curriculum standards.

To derive the results on the degree of agreement between the state's standards and each assessment, the reviewers' responses are averaged. Any variance among reviewers is considered legitimate, with the true depth-of-knowledge level for an item falling somewhere between two or more assigned values. Such variation could signify a lack of clarity in how the objectives were written, the robustness of an item that can legitimately correspond to more than one objective, and/or a depth of knowledge that falls between two of the four defined levels. Reviewers are allowed to identify one assessment item as corresponding to up to three objectives—one primary hit (objective) and up to two secondary hits. However, reviewers can code only one depth-of-knowledge level for each assessment item, even if the item corresponds to more than one objective. Finally, in addition to learning the process, reviewers are asked to provide suggestions for improving the process.

Reviewers are instructed to focus primarily on the alignment between the state standards and the various assessments. However, they are encouraged to offer their opinions on the quality of the standards, or of the assessment activities/items, by writing notes about the items. Reviewers can also indicate whether there is a source-of-challenge issue with an item—i.e., a problem with the item that might cause a student who knows the material to give a wrong answer, or enable someone who does not have the knowledge being tested to answer the item correctly.
For example, a mathematics item that involves an excessive amount of reading may represent a source-of-challenge issue because the skill required to answer it is more a reading skill than a mathematics skill. Source-of-challenge can be considered a fifth alignment criterion in the analysis and was originally so defined by Achieve, Inc.

The results produced from the institute pertain only to the issue of agreement between the state standards and the assessment instruments. Thus, the alignment analysis does not serve as external verification of the general quality of a state's standards or assessments. Rather, only the degree of alignment is discussed in the results. The averages of the reviewers' codings are used to determine whether the alignment criteria are met. When reviewers do vary in their judgments, the averages lessen the error that might result from any one reviewer's finding. Standard deviations, which give one indication of the variance among reviewers, are reported.

To report on the results of an alignment study of a state's curriculum standards and assessments for different grade levels, the study addresses specific criteria related to the content agreement between the state standards and the grade-level assessments. Four alignment criteria receive major attention in the reports: categorical concurrence, depth-of-knowledge consistency, range-of-knowledge correspondence, and balance of representation.

Alignment Criteria Used for This Analysis

The analysis, which judges the alignment between standards and assessments on the basis of four criteria, also reports on the quality of assessment items by identifying those items with sources of challenge and other issues. For each alignment criterion, an acceptable level is defined by what would be required to assure that a student had met the standards.

Categorical Concurrence

An important aspect of alignment between standards and assessments is whether both address the same content categories. The categorical-concurrence criterion provides a very general indication of alignment if both documents incorporate the same content. The criterion of categorical concurrence between standards and assessment is met if the same or consistent categories of content appear in both documents. This criterion was judged by determining whether the assessment included items measuring content from each standard. The analysis assumed that the assessment had to have at least six items measuring content from a standard in order for an acceptable level of categorical concurrence to exist between the standard and the assessment. The number six is based on estimating the number of items that could produce a reasonably reliable subscale for estimating students' mastery of the content on that subscale. Of course, many factors have to be considered in determining what a reasonable number is, including the reliability of the subscale, the mean score, and the cutoff score for determining mastery. Using a procedure developed by Subkoviak (1988), and assuming that the cutoff score is the mean and that the reliability of one item is .1, it was estimated that six items would produce an agreement coefficient of at least .63. This indicates that about 63% of the group would be consistently classified as masters or nonmasters if two equivalent test administrations were employed.
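The classification-consistency idea behind this coefficient can be illustrated with a rough Monte Carlo sketch: simulate two equivalent six-item administrations and count how often examinees land in the same mastery category. This is not Subkoviak's analytic procedure; the Beta ability distribution and all parameter values below are assumptions chosen only to make the idea concrete.

```python
# Rough illustration of an agreement coefficient: the proportion of
# examinees classified the same way (master/nonmaster) on two
# equivalent administrations of a short subscale. Not Subkoviak (1988);
# the ability distribution below is an assumption for illustration.
import random

def agreement_coefficient(n_items=6, cutoff=3, n_examinees=100_000, seed=1):
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_examinees):
        p = rng.betavariate(2, 2)  # assumed true-proficiency distribution
        score1 = sum(rng.random() < p for _ in range(n_items))  # form 1
        score2 = sum(rng.random() < p for _ in range(n_items))  # form 2
        agree += (score1 >= cutoff) == (score2 >= cutoff)
    return agree / n_examinees

print(agreement_coefficient(cutoff=3))  # cutoff at the mean score
print(agreement_coefficient(cutoff=4))  # cutoff above the mean
```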
The agreement coefficient would increase to .77 if the cutoff score were set at one standard deviation from the mean, and to .88 with a cutoff score of 1.5 standard deviations from the mean. Usually, states do not report student results by standard or require students to achieve a specified cutoff score on subscales related to a standard. If a state did do this, then the state would seek a higher agreement coefficient than .63. Six items were assumed as a minimum for an assessment measuring content knowledge related to a standard and as a basis for making some decisions about students' knowledge of that standard. If the mean for six items is 3 and one standard deviation is one item, then a cutoff score set at 4 would produce an agreement coefficient of .77. Any fewer items, with a mean of one-half of the items, would require a cutoff that would allow a student to miss only one item. This would be a very stringent requirement, considering a reasonable standard error of measurement on the subscale.

Depth-of-Knowledge Consistency

Standards and assessments can be aligned not only on the category of content covered by each, but also on the basis of the complexity of knowledge required by each. Depth-of-knowledge consistency between standards and assessment indicates alignment if what is elicited from students on the assessment is as demanding cognitively as what students are expected to know and do as stated in the standards. For consistency to exist between the assessment and the standard, as judged in this analysis, at least 50% of the items corresponding to an objective had to be at or above the depth-of-knowledge level of the objective. This conservative cutoff point is based on the assumption that a minimal passing score for any one standard of 50% or higher would require the student to successfully answer at least some items at or above the depth-of-knowledge level of the corresponding objectives. For example, assume an assessment included six items related to one standard and students were required to answer four of those items correctly—i.e., 67% of the items—to be judged proficient. If three (50%) of the six items were at or above the depth-of-knowledge level of the corresponding objectives, then a proficient score would require the student to answer correctly at least one item at or above the depth-of-knowledge level of one objective. Some leeway was used in the analysis on this criterion. If a standard had between 40% and 50% of items at or above the depth-of-knowledge levels of the objectives, then the criterion was reported as "weakly" met.

Interpreting and assigning depth-of-knowledge levels to both objectives within standards and to assessment items is an essential requirement of alignment analysis. These descriptions help to clarify what the different levels represent in, for example, mathematics:

Level 1 (Recall) includes the recall of information such as a fact, definition, term, or simple procedure, as well as performing a simple algorithm or applying a formula. That is, in mathematics, a one-step, well-defined, and straight algorithmic procedure should be included at this lowest level. Other key words that signify Level 1 include "identify," "recall," "recognize," "use," and "measure." Verbs such as "describe" and "explain" could be classified at different levels, depending on what is to be described and explained.

Level 2 (Skill/Concept) includes the engagement of some mental processing beyond a habitual response.
A Level 2 assessment item requires students to make some decisions as to how to approach the problem or activity, whereas Level 1 requires students to demonstrate a rote response, perform a well-known algorithm, follow a set procedure (like a recipe), or perform a clearly defined series of steps. Keywords that generally distinguish a Level 2 item include "classify," "organize," "estimate," "make observations," "collect and display data," and "compare data." These actions imply more than one step. For example, to compare data requires first identifying characteristics of the objects or phenomena and then grouping or ordering the objects. Some action verbs, such as "explain," "describe," or "interpret," could be classified at different levels, depending on the object of the action. For example, interpreting information from a simple graph, or reading information from a graph, is at Level 2. Interpreting information from a complex graph that requires some decisions on what features of the graph need to be considered and how information from the graph can be aggregated is at Level 3. Level 2 activities are not limited to number skills, but can involve visualization skills and probability skills. Other Level 2 activities include noticing and describing nontrivial patterns; explaining the purpose and use of experimental procedures; carrying out experimental procedures; making observations and collecting data; classifying, organizing, and comparing data; and organizing and displaying data in tables, graphs, and charts.

Level 3 (Strategic Thinking) requires reasoning, planning, using evidence, and a higher level of thinking than the previous two levels. In most instances, requiring students to explain their thinking is at Level 3. Activities that require students to make conjectures are also at this level. The cognitive demands at Level 3 are complex and abstract. The complexity does not result from the fact that there are multiple answers, a possibility for both Levels 1 and 2, but because the task requires more demanding reasoning. An activity that has more than one possible answer and requires students to justify the response they give would most likely be at Level 3. Other Level 3 activities include drawing conclusions from observations; citing evidence and developing a logical argument for concepts; explaining phenomena in terms of concepts; and using concepts to solve problems.

Level 4 (Extended Thinking) requires complex reasoning, planning, developing, and thinking, most likely over an extended period of time. The extended time period is not a distinguishing factor if the required work is only repetitive and does not require applying significant conceptual understanding and higher-order thinking. For example, if a student has to take the water temperature from a river each day for a month and then construct a graph, this would be classified as a Level 2 activity. However, if the student is to conduct a river study that requires taking into consideration a number of variables, this would be at Level 4. At Level 4, the cognitive demands of the task should be high and the work should be very complex. Students should be required to make several connections—relate ideas within the content area or among content areas—and would have to select one approach among many alternatives for how the situation should be solved, in order to be at this highest level.
Level 4 activities include developing and proving conjectures; designing and conducting experiments; making connections between a finding and related concepts and phenomena; combining and synthesizing ideas into new concepts; and critiquing experimental designs.

Range-of-Knowledge Correspondence

For standards and assessments to be aligned, the breadth of knowledge required by both should be comparable. The range-of-knowledge criterion is used to judge whether the span of knowledge expected of students by a standard corresponds to the span of knowledge that students need in order to correctly answer the assessment items/activities. The criterion for correspondence between the span of knowledge for a standard and an assessment considers the number of objectives within the standard with at least one related assessment item/activity. Fifty percent of the objectives for a standard had to have at least one related assessment item in order for the alignment on this criterion to be judged acceptable. This level is based on the assumption that students' knowledge should be tested on content from over half of the domain of knowledge for a standard, and it assumes that each objective for a standard should be given equal weight. Depending on the balance in the distribution of items and the necessity of having a low number of items related to any one objective, the requirement that assessment items be related to more than 50% of the objectives for a standard increases the likelihood that students will have to demonstrate knowledge on more than one objective per standard to achieve a minimal passing score. As with the other criteria, a state may choose to make the acceptable level on this criterion more rigorous by requiring an assessment to include items related to a greater number of the objectives. However, any restriction on the number of items included on the test will place an upper limit on the number of objectives that can be assessed. Range-of-knowledge correspondence is more difficult to attain if the content expectations are partitioned among a greater number of standards and a large number of objectives. If 50% or more of the objectives for a standard had a corresponding assessment item, then the range-of-knowledge criterion was met. If between 40% and 50% of the objectives for a standard had a corresponding assessment item, the criterion was "weakly" met.

Balance of Representation

In addition to comparable depth and breadth of knowledge, aligned standards and assessments require that knowledge be distributed equally in both. The range-of-knowledge criterion considers only the number of objectives within a standard that are hit (i.e., that have at least one corresponding item); it does not take into consideration how the hits (or assessment items/activities) are distributed among these objectives. The balance-of-representation criterion is used to indicate the degree to which one objective is given more emphasis on the assessment than another. An index is used to judge the distribution of assessment items. This index considers only the objectives for a standard that have at least one hit—i.e., one related assessment item per objective. The index is computed by considering the difference between the proportion of objectives and the proportion of hits assigned to each objective. An index value of 1 signifies perfect balance and is obtained if the hits (corresponding items) related to a standard are equally distributed among the objectives for the given standard.
Index values that approach 0 signify that a large proportion of the hits are on only one or two of all of the objectives hit. Depending on the number of objectives and the number of hits, a unimodal distribution (most items related to one objective and only one item related to each of the remaining objectives) has an index value of less than .5. A bimodal distribution has an index value of around .55 or .6. Index values of .7 or higher indicate that items/activities are distributed among all of the objectives at least to some degree (e.g., every objective has at least two items); .7 is used as the acceptable level on this criterion. Index values between .6 and .7 indicate that the balance-of-representation criterion has only been "weakly" met.

Source-of-Challenge

The source-of-challenge criterion is used only to identify items on which the major cognitive demand is inadvertently placed on something other than the targeted mathematics skill, concept, or application. Cultural bias or specialized knowledge could be reasons for an item to have a source-of-challenge problem. Such item characteristics may result in some students not answering an assessment item, or answering it incorrectly or at a lower level, even though they possess the understanding and skills being assessed.

Reporting the Alignment Results

The reports of an alignment analysis generally are quite lengthy, 150 pages or more. In a report, the distribution of the depth-of-knowledge levels of the objectives under each set of standards is summarized. This provides some information on the rigor of the standards and, across grades, on the increase in the level of expectations. Then the degree of alignment for each grade is described by each criterion, along with the changes required to achieve acceptable alignment. Reporting by the five alignment criteria produces information that indicates whether

1. there are a sufficient number of items on a test for each strand,
2. the items are at an appropriate level of complexity,
3. a sufficient proportion of the standards under each strand is assessed,
4. the degree of emphasis among the standards is appropriate within each strand, and
5. there are any items that may have a source of challenge.

Reviewers' comments are then reported, followed by a report of the intraclass correlation of the assignment of depth-of-knowledge levels to each item for each analysis. The narrative of the report concludes by summarizing the alignment results. The appendices to the report include detailed information by standard, objective, and item for each analysis. Appendix A reports the DOK levels of each objective for all standards. Appendix B includes 11 tables for each analysis:

Summary of results for each of the four alignment criteria (four tables)
Comments made by reviewers on items identified as having a source-of-challenge issue, by item number
The depth-of-knowledge (DOK) level assigned to each assessment item by each reviewer, with the intraclass correlation for the group of reviewers given in the last row
All notes made by reviewers on items, by item number
The DOK level and objective code assigned by each reviewer for each item
Objectives coded to each item, by reviewer
Items coded by reviewers for each objective
Number of reviewers coding an item, by objective
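The intraclass correlation reported for reviewers' DOK assignments can be computed in more than one way, and the variant is not specified here; the sketch below implements the one-way random-effects ICC as one plausible choice, applied to hypothetical ratings (rows are items, columns are reviewers).

```python
# One-way random-effects intraclass correlation, ICC(1): a plausible
# form of the reviewer-agreement statistic; the reports do not specify
# which ICC variant is used. Rows = items, columns = reviewers.

def icc_oneway(ratings):
    n = len(ratings)      # number of items
    k = len(ratings[0])   # number of reviewers
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    item_means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in item_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, item_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical DOK levels assigned by four reviewers to five items
dok_ratings = [
    [1, 1, 1, 2],
    [2, 2, 2, 2],
    [3, 3, 2, 3],
    [1, 2, 1, 1],
    [3, 3, 3, 3],
]
print(round(icc_oneway(dok_ratings), 3))  # about 0.81 for these data
```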
Challenges and Issues

In this section, I identify some of the issues that have arisen in doing alignment studies and address some of the basic principles of aligning content standards and assessments.

Acceptable Level for Number of Items Per Standard

The first issue is what number of items constitutes an adequate number to claim that an assessment is aligned with a standard. The Webb alignment process uses six items measuring content related to a standard as the acceptable level. This number, as discussed above, was derived using a procedure developed by Subkoviak (1988) for estimating the reliability of judging a person's mastery on the basis of a set of assessment items. The WAT has a feature that allows users to vary the number used for the acceptable level, so the process does have some flexibility. However, some situations have arisen that raise questions about six as the number.

Table 1 reports the findings from the State A science alignment analysis for grade 3 on the categorical-concurrence criterion. Of the six science standards, three met the acceptable level of at least six hits and three did not. The mean hits for the standards are shown in Table 1, along with the proportion of items for each standard as specified in the state test blueprint. Clearly the state gives more emphasis to two of the standards, 3.2 (Inquiry) and 3.4 (Subject Matter and Concepts), and equal emphasis to the other four standards. This is reflected in the distribution of items on the assessment. However, standard 3.5 (Design and Applications) and standard 3.6 (Personal and Social) are more difficult to assess on an on-demand assessment and were given less emphasis, even less than specified by the test blueprint. The report indicated that the alignment was not acceptable because of an insufficient number of items for three of the grade 3 standards. At issue: is six items a reasonable minimum, or should adjustments be made to this acceptable level? If adjustments are to be made, what should the decision rule be?

Table 1
State A Categorical Concurrence for Grade 3 Science (N = 55 items)

Standard (test blueprint %)      Hits (Mean)   Hits (S.D.)   Cat. Concurr.
3.1 History/Nature (8%)              1.00          0.00          NO
3.2 Inquiry (30%)                   17.38          2.12          YES
3.3 Unifying Themes (8%)             7.50          4.00          YES
3.4 Subj Matter/Conc (38%)          33.50          1.94          YES
3.5 Design/Applic (8%)               2.12          1.27          NO
3.6 Personal/Social (8%)             4.75          1.09          NO
Total                               66.25          5.78
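A minimal sketch of this decision rule, with the acceptable level left adjustable (the WAT permits varying it); the mean-hit values echo Table 1, and the function itself is illustrative rather than part of the WAT.

```python
# Categorical-concurrence decision rule: a standard meets the criterion
# if, on average, at least `min_items` assessment items (hits) measure
# its content. Six is the default acceptable level discussed above.

def categorical_concurrence(mean_hits, min_items=6):
    return "YES" if mean_hits >= min_items else "NO"

# Mean hits per standard from Table 1 (State A, grade 3 science)
mean_hits_by_standard = {
    "3.1 History/Nature": 1.00,
    "3.2 Inquiry": 17.38,
    "3.3 Unifying Themes": 7.50,
    "3.4 Subj Matter/Conc": 33.50,
    "3.5 Design/Applic": 2.12,
    "3.6 Personal/Social": 4.75,
}
for standard, hits in mean_hits_by_standard.items():
    print(standard, categorical_concurrence(hits))
```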
Distribution of Items Related to a Standard by Depth-of-Knowledge Level

A second issue regards the distribution of items on an assessment by depth-of-knowledge level. Is 50% of the items coded to a standard having a DOK level at or above the DOK level of the corresponding objective appropriate as the minimal acceptable level? Table 2 and Figure 1 display the data for one state and one grade where this acceptable level was met for four of the six standards. For Standard III, only 42% of the more than 13 items coded as corresponding to that standard, on average, had a depth-of-knowledge level at or above the DOK level of the corresponding objective. Since this is within 10% of the acceptable level of 50%, it was judged that this standard and the assessment only weakly met the criterion of depth-of-knowledge consistency. Thus, a student could answer 8 of the 13 items corresponding to Standard III—generally a level sufficient to be declared proficient on a standard—without ever answering a question with a DOK level at least as high as that of the corresponding objective. Standard I and the assessment failed to be acceptable on the depth-of-knowledge consistency criterion because only 17% of the nearly 10 items corresponding to that standard had a DOK level at least comparable to the DOK level of the corresponding objective.

The acceptable level for DOK consistency is based on the assumption that students with a minimal proficient score of 50% of the items correct should have answered at least one item with a DOK level that is at least the same level of complexity as the corresponding content objective. However, what is considered acceptable should depend to some degree on the purpose of the assessment. If the purpose of the assessment is to differentiate students who are proficient from students who are not, then an argument could be made that all or nearly all of the item DOK levels should be the same as the DOK levels of the corresponding objectives. However, if the purpose of the assessment is to place students on a range of proficiency levels (e.g., below basic, basic, proficient, and advanced), then it is reasonable to have items with a range of DOK levels in comparison to the corresponding objectives.

Content standards and many objectives under content standards cover a broad range of content that students are expected to attain. Thus, the domain of items for measuring students' knowledge related to an objective or standard can be very large and can vary by complexity or depth-of-knowledge level. The alignment process devised by Webb has reviewers assign one DOK level to each objective. Reviewers, who are experts in the content area, assign a DOK level to an objective by judging the complexity of the most representative assessment items or content expressed by the objective. Realizing that many objectives cover a broad range of content, it may be reasonable to have items with different DOK levels corresponding to the same objective—some below the DOK level of the objective, some at it, and some above. The decision rule imposed in the alignment analysis discussed here is based on judging whether students are proficient. Another decision rule could be based on having items that are more representative of the range of complexity in objectives and standards, such as 20% with a DOK level of 1, 60% with a DOK level of 2, and 20% with a DOK level of 3. Or, the range of complexity could be decided by a certain percentage of items that are below, at, or above the level of the corresponding objectives. The issue remains that there are different ways of considering what is an acceptable distribution of items by complexity, and the appropriate choice depends largely on the purpose of the assessment.

Table 2
State B Depth-of-Knowledge Consistency, High School Mathematics (N = 51 items)

Standard                                      # Hits (M)  % Under (M)  % At (M)  % Above (M)  Consistency
I    Patterns, Relationships and Functions      10.44         83          17          0          NO
II   Geometry and Measurement                   13.00         20          51         29          YES
III  Data Analysis and Statistics               13.44         58          40          2          WEAK
IV   Number Sense and Numeration                 2.78         25          61         14          YES
V    Numerical and Algebraic Operations
     and Analytical ...                         10.67         30          57         12          YES
VI   Probability and Discrete Mathematics        6.89         42          56          2          YES
Total                                           57.22         43          47         11

Figure 1. Percent of items with DOK levels under, at, and above the corresponding objectives for each standard (State B DOK consistency). [Bar chart omitted.]
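The 50% decision rule, with its 40–50% "weakly met" band, reduces to a simple computation over the hits for a standard; a minimal sketch with hypothetical item/objective DOK pairs:

```python
# Depth-of-knowledge consistency decision rule: at least 50% of the hits
# for a standard must have an item DOK level at or above the DOK level
# of the corresponding objective; 40-50% is reported as weakly met.

def dok_consistency(hits, accept=0.50, weak=0.40):
    """hits: list of (item_dok, objective_dok) pairs for one standard."""
    share_at_or_above = sum(item >= obj for item, obj in hits) / len(hits)
    if share_at_or_above >= accept:
        return "YES"
    return "WEAK" if share_at_or_above >= weak else "NO"

# Hypothetical: six hits on one standard
pairs = [(1, 2), (2, 2), (2, 1), (1, 3), (3, 2), (2, 2)]
print(dok_consistency(pairs))  # 4 of 6 at or above -> YES
```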
Breadth in Content Coverage of a Standard

A third issue concerns what constitutes the appropriate breadth of coverage for a standard. The decision rule currently being used is that 50% or more of the objectives under a standard must have at least one corresponding assessment item for a minimally acceptable breadth of coverage. The number of objectives under a standard is highly related to the difficulty of meeting the range-of-knowledge correspondence (breadth in content) criterion. If a state lists a large number of objectives under a standard, then it is more difficult for the state to meet an acceptable level on range of knowledge because of the limited number of items that can be used on an assessment. For example, State B (Table 3) in mathematics had six standards. Each standard had from 9 to 18 objectives, for a total of 77 objectives. The high school test had a total of 51 items. Except for Standard V, all of the standards had from 17% to 38% of their objectives with at least one corresponding item—well below the acceptable level of 50% of the objectives.

Having adequate breadth of content on an assessment can be a trade-off with the length of the assessment. An assessment with fewer items will have more difficulty assessing, at least partially, all of the objectives. Other factors come into play when considering breadth. Some standards may have a larger number of objectives because the standard covers more content. For example, for State B, as depicted in Table 3 and Figure 2, Standard II (Geometry and Measurement) has more objectives than Standard V (Numerical and Algebraic Operations), 18 objectives compared to 9. This suggests that the content under Standard II has been partitioned in more ways than the content under Standard V. It could be that the objectives under geometry and measurement are more specific, or it could be that the state considered geometry and measurement to have more content to cover. Another factor is that some of the objectives under Standard II may be more difficult to assess on an on-demand assessment, particularly if one item measures content related to only one objective. An on-demand assessment could cover more content by including more robust items that measure content associated with more than one objective or standard.

Table 3
State B Range-of-Knowledge Correspondence, High School Mathematics (N = 51 items)

Standard                                    Goals (#)  Objs (#)  # Hits (M)  # Objs Hit (M)  % Objs Hit (M)  Rng. of Know.
I    Patterns, Relationships and Functions      2         11        10.44         4.22             38             NO
II   Geometry and Measurement                   3         18        13.00         5.78             32             NO
III  Data Analysis and Statistics               3         14        13.44         5.00             35             NO
IV   Number Sense and Numeration                3         14         2.78         2.44             17             NO
V    Numerical and Algebraic Operations
     and Analytical ...                         2          9        10.67         5.22             55             YES
VI   Probability and Discrete Mathematics       2         11         6.89         3.67             33             NO
Total                                          15         77        57.22         4.39             35

The current decision rule of 50% of the objectives with at least one hit is clearly a very minimal requirement for alignment. A number of factors could be considered in judging the adequacy of the range of content, including the breadth of content covered by a standard, the length of the assessment, the suitability of the content for an on-demand assessment, and differences in the importance of different objectives under a standard. These and other factors could be used to develop other decision rules, such as randomly sampling objectives under a standard, setting a minimum number of objectives under a standard that must have a hit, or differentiating the importance of standards by requiring more objectives to be assessed under the most important standards than under the least important ones. As with the other issues, there are multiple considerations that need to be weighed in judging the adequacy of the alignment between an assessment and a set of standards.
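The range-of-knowledge decision rule can likewise be sketched directly from the quantities in the table; the counts below echo Standard II in Table 3, and the thresholds are the ones discussed above.

```python
# Range-of-knowledge decision rule: at least 50% of the objectives under
# a standard must have one or more corresponding items; 40-50% counts
# as weakly met.

def range_of_knowledge(n_objectives, n_objectives_hit,
                       accept=0.50, weak=0.40):
    share_hit = n_objectives_hit / n_objectives
    if share_hit >= accept:
        return "YES", share_hit
    return ("WEAK" if share_hit >= weak else "NO"), share_hit

# Standard II in Table 3: 18 objectives, 5.78 hit on average
print(range_of_knowledge(18, 5.78))  # ('NO', 0.32...)
```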
Figure 2. Percent of objectives with one or more hits for each standard, State B high school mathematics. [Bar chart omitted.]

Degree of Emphasis Given to Some Objectives

It is reasonable for some standards to be more important than other standards, and for some objectives under a standard to be more important than other objectives. The balance-of-representation alignment criterion, however, assumes that items should be fairly evenly distributed among the objectives under a standard. An index is used to depict balance:

Balance Index = 1 − ( ∑_{k=1}^{O} | 1/O − I(k)/H | ) / 2

where
O = total number of objectives hit for the standard,
I(k) = number of items (hits) corresponding to objective k, and
H = total number of items hit for the standard.

Table 4 shows the index values for three language arts standards from an analysis for one state. Figure 3 gives a pictorial representation of the distribution of hits (test items) coded as corresponding to the different objectives for Standards I and II. For both of these standards, reviewers coded a large number of hits (over 250) as corresponding to one objective under each standard. This resulted in index values of .57 and .68, which are below the acceptable level of .70 (Table 4). The large number of hits is related to a writing sample that has a weighting of 12, compared to a weighting of 1 for most items.

At issue with balance is the degree to which the amount of emphasis given to different objectives under a standard should vary. It is possible for a state to accept a balance index value lower than .70. So State B could be satisfied with a balance index value for Standard II of .68 (Table 4), with a large emphasis on objective 2.4 (Figure 3). However, sometimes one objective is emphasized more than others because it is easier to write assessment items for some objectives than for others. The main issue to resolve is how alignment analyses should consider differences in emphasis among objectives. This issue relates to how the assessment blueprint differentiates among objectives and whether or not it is appropriate to have large variations among objectives.

Table 4
State B Balance of Representation, High School Language Arts (3 of 12 standards; N = 116 items)

Standard                                 Goals   Objs    % Hits in Std/Ttl Hits     Balance Index       Bal. of
                                           (#)           (Mean)      (S.D.)       (Mean)    (S.D.)    Represent.
I.    Meaning and Communication—Reading     1     5.11     28           8           0.57      0.12        NO
II.   Meaning and Communication—Writing     1     4        48           7           0.68      0.14        WEAK
VIII. Genre and Craft of Language           1     5        17           6           0.63      0.16        WEAK
Total                                      12    55.33      8          18           0.36      0.21

Figure 3. Number of hits for each objective under two standards (Standards I and II), State B high school language arts. [Bar chart omitted; horizontal axis shows number of hits, 0–400.]
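The balance index defined above translates directly into code; a minimal sketch with hypothetical hit counts (a full analysis computes the index per reviewer and averages, which is omitted here for brevity):

```python
# Balance-of-representation index: 1 - (sum over objectives hit of
# |1/O - I(k)/H|) / 2, where O is the number of objectives hit and H
# the total hits for the standard. A value of 1 signifies perfect balance.

def balance_index(hits_per_objective):
    counts = [h for h in hits_per_objective if h > 0]  # objectives hit only
    O, H = len(counts), sum(counts)
    return 1 - sum(abs(1 / O - i_k / H) for i_k in counts) / 2

print(balance_index([3, 3, 3]))      # 1.0: hits spread equally
print(balance_index([10, 1, 1, 1]))  # about .48: unimodal, below .5
```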
Change in Depth-of-Knowledge Level Across Grades

The final issue to be discussed is the change in complexity of content across grade levels. It is reasonable to expect that, as students proceed through the grades, they will be expected to do more reasoning and analysis and less simple recall and recognition. This was the case for State A in mathematics (Figure 4) and in language arts (Figure 5). For both mathematics and language arts, the percent of objectives with a depth-of-knowledge level of 1 (recall and recognition) decreased while the percent of objectives with a depth-of-knowledge level of 3 (strategic reasoning) increased from grade 3 to grade 10 (Figures 4 and 5). However, DOK levels depend somewhat on grade level and on what a typical student at a grade level can be expected to know and do. Reviewers in an alignment analysis developed by Webb are instructed to think about what a typical student should be expected to know and do when assigning DOK levels to the content objectives. In reading, an increase in complexity across grades may be due to more sophisticated passages while the actual behavior or cognitive requirement, such as determining the main idea, stays relatively constant. However, if along with more sophisticated passages students are expected to do more drawing of inferences or paraphrasing, then the DOK levels may increase across grades. Currently there are no fixed guidelines as to what is an acceptable progression in content complexity from grade to grade. In the absence of such guidelines, the progression of content complexity depicted in Figures 4 and 5 for State A seems reasonable. The work of Wise and Alt (2005) on vertical alignment will help inform the field on this issue.

Figure 4. State A mathematics DOK levels for objectives by grade (percent of objectives at each of DOK Levels 1–4, grades 3–8 and 10). [Stacked bar chart omitted.]

Figure 5. State A reading/language arts DOK levels for objectives by grade (percent of objectives at each of DOK Levels 1–4, grades 3–8 and 10). [Stacked bar chart omitted.]

Conclusions

In this paper, I have described a process that has been used to analyze the agreement between state academic content standards and state assessments. The Webb Alignment Process was developed for the National Institute for Science Education (NISE) and the Council of Chief State School Officers in 1997 and has evolved over time. A web-based tool is now available to aid in conducting the process and analyzing the results. The process produces information about the relationship between a set of standards and an assessment by reporting on four main alignment criteria—categorical concurrence, depth-of-knowledge consistency, range-of-knowledge correspondence, and balance of representation.

Five alignment issues were discussed. Each of these issues is related to one or more of the alignment criteria. These issues center on the basic question of what alignment is good enough. Specific rationales described in this paper have been used to set acceptable levels for each of the four alignment criteria. These acceptable levels have been specified primarily for pragmatic reasons, such as assumptions about what would be considered a passing score, the number of items needed to make some decisions about student learning, and the relatively low number of items that can be included on an on-demand assessment.
The issues discussed arise from changes in these underlying assumptions and from considering variations in the purpose of an assessment. The issues themselves are not resolved in this paper, nor was that the intent of the paper, even if they could be resolved. The existence of these and other related issues points to the fact that judging the alignment between standards and assessments requires some subjectivity and cannot be based solely on a clear set of objective rules. This makes it critical for any alignment analysis to make clear what the underlying assumptions are and how conclusions are reached.

References

Blank, R. (2002). Models for alignment analysis and assistance to states. Council of Chief State School Officers summary document. Washington, DC: CCSSO. http://www.ccsso.org/content/pdfs/AlignmentModels.pdf

Porter, A. C. (2002, October). Measuring the content of instruction: Uses in research and practice. Educational Researcher, 31(7), 3–14.

Smith, M. S., & O'Day, J. A. (1991). Systemic school reform. In S. H. Fuhrman & B. Malen (Eds.), The politics of curriculum and testing (pp. 233–267). Bristol, PA: Falmer.

Subkoviak, M. J. (1988). A practitioner's guide to computation and interpretation of reliability indices for mastery tests. Journal of Educational Measurement, 25(1), 47–55.

Webb, N. L. (1997). Criteria for alignment of expectations and assessments in mathematics and science education (Council of Chief State School Officers and National Institute for Science Education Research Monograph No. 6). Madison: University of Wisconsin, Wisconsin Center for Education Research.

Webb, N. L. (2002). Alignment study in language arts, mathematics, science, and social studies of state standards and assessments for four states. Washington, DC: Council of Chief State School Officers.

Wise, L. L., & Alt, M. (2005). Assessing vertical alignment. Paper prepared for the Council of Chief State School Officers. Alexandria, VA: Human Resources Research Organization, DFR 05-19.