The Looming Shadow research by Jamie Thackray




Can the threat of vouchers persuade a public school to turn itself around? The case of Florida suggests yes

ability system with teeth. Each public school is assigned a grade based on the performance of its students on the Florida Comprehensive Assessment Test (FCAT) in reading, math, and writing. Reading and writing FCATs are administered in the 4th, 8th, and 10th grades; students take the math FCAT in the 5th, 8th, and 10th grades. The scale-score results from these tests are divided into five categories. The letter grade that each school receives is determined by the percentage of its students scoring above the thresholds established by these five categories or levels. If a school receives two F grades in a four-year period, its students are offered vouchers that they can use to attend a private school. They are also offered the opportunity to attend a better-performing public school.
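
The voucher trigger described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text (two F grades within a four-year period), not the state's actual procedure. The function name `voucher_eligible`, the year-to-grade mapping, and the reading of "four-year period" as four consecutive calendar years are all assumptions made for the example, and the grade histories in the usage note are hypothetical.

```python
from typing import Mapping


def voucher_eligible(grades_by_year: Mapping[int, str], window: int = 4) -> bool:
    """Return True if two or more F grades fall within any `window`-year period.

    `grades_by_year` maps a school year (e.g. 1999) to a letter grade.
    Both the data layout and the consecutive-year reading of the rule
    are assumptions made for illustration.
    """
    # Collect the years in which the school received an F, in order.
    f_years = sorted(year for year, grade in grades_by_year.items() if grade == "F")
    # Two F years share a `window`-year period when they are fewer than
    # `window` years apart; checking adjacent F years is sufficient.
    return any(later - earlier < window
               for earlier, later in zip(f_years, f_years[1:]))
```

For example, a hypothetical school graded F in both 1998 and 1999 would be flagged, while one whose only two F grades came in 1995 and 1999 would not, because those years span five calendar years.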
    The FCAT was first administered in the spring of 1998. So far, only two schools in the state, both located in Escambia County, have received two failing grades, the second coming during the 1999 round of testing in both cases. Students in both schools were offered vouchers, and nearly 50 students and their families chose to attend one of a handful of nearby private schools, most of which were religiously affiliated. No additional schools were subject to the voucher provision after the 2000 administration of the FCAT because none failed for a second time.
    The theory undergirding this system is that schools in danger of failing will improve their academic performance to avoid the political embarrassment and potential loss in revenues from having their students depart with tuition vouchers. Whether the theory accords with the evidence is the issue addressed here. Perhaps the threat of vouchers being offered to students will provide the impetus for reform. But it is also plausible that schools will develop strategies

                          by JAY P. GREENE

76    E D U C AT I O N N E X T / W I N T E R 2 0 0 1                 
for improving the grade they receive from the state without actually improving the academic performance of students. Perhaps schools will not have the resources or flexibility to adopt necessary reforms even if they have the incentives to do so. Perhaps the incentives of the accountability system interact with the incentives of school politics to produce unintended outcomes.
    The evidence suggests that the theory holds true: that the A-Plus program has been successful at motivating failing schools to improve their academic performance. The gains, moreover, seem to reflect real improvement rather than a mere manipulation of the state’s testing and grading system.

The Literature
The question of whether testing and accountability systems are an effective reform tool has seldom been the subject of rigorous research. Most research attention has been devoted to evaluations of the accountability system in Texas. The Texas Assessment of Academic Skills (TAAS) has been in existence for a decade and is the most comprehensive of all state testing systems. Students in Texas are tested in 3rd through 8th grades in math and reading. In addition, students must pass an exit exam first offered in 10th grade in order to graduate. The state is also phasing in requirements that students pass exams in order to be promoted to the next grade.
    The comprehensive nature of Texas’s accountability system and the fact that its governor was a candidate for the presidency attracted considerable attention to the TAAS. The most systematic research on TAAS appeared in two somewhat contradictory reports issued by the RAND Corporation (for a critique of both reports, see Eric Hanushek’s “Deconstructing RAND” in the Spring 2001 issue, available on-line). In the first report, released in July of 2000, David Grissmer and his colleagues analyzed scores from the National Assessment of Educational Progress (NAEP), a test administered by the U.S. Department of Education, in order to identify state policies that may contribute to higher academic performance. They found that states like Texas and North Carolina, with extensive accountability systems, were among the highest-scoring and fastest-improving states after demographic factors were controlled for. The report featured a lengthy comparison of student performance in California, which has an underdeveloped accountability system and weak academic performance, and Texas to highlight the importance of TAAS in improving academic achievement, as measured by NAEP.
    The second report, released in October of 2000 by Stephen Klein and his colleagues, cast doubt on the validity of TAAS scores by suggesting that the results do not correlate with the test results of other standardized tests. Because the other standardized tests are “low-stakes tests,” without any reward or punishment attached to student or school performance, the authors reason that there are few incentives to manipulate the results or cheat, making the low-stakes test results a reliable measure of student performance (although it is also possible that schools and students won’t prepare enough for a low-stakes test to demonstrate their true abilities). By contrast, schools and students might have incentives and opportunities to manipulate the results of high-stakes tests, like TAAS. The dissonance between the different tests, the authors argue, should at least raise a red flag regarding the gains observed on TAAS. Klein and his colleagues also analyzed NAEP results in Texas, and, contrary to the findings of Grissmer and his colleagues, concluded that Texas’s performance on NAEP was not exceptionally strong.
    Klein and his colleagues, however, cannot rule out alternative explanations for the weak correlation between TAAS

                                                     VOUCHERS IN FLORIDA GREENE

results and the results of low-stakes standardized tests. It is possible that TAAS, which is based on the mandated Texas curriculum, tests different skills than those tested by the national standardized tests. Both could produce valid results and still be weakly correlated with one another if they are testing different things. It is also possible that the pool of standardized tests that was available to the RAND researchers was not representative of Texas as a whole. The standardized test results that were compared with TAAS results were only from 2,000 non-randomly selected 5th-grade students from one part of Texas. If this limited group of students were not representative of all Texas students, it would be inaccurate to draw any conclusions about TAAS as a whole.
    Another examination of NAEP scores in Texas, which I conducted, showed that NAEP improvements were exceptionally strong in Texas while the TAAS accountability system was in place. The disparate findings regarding the relationship between Texas’s scores on TAAS and NAEP can be partially explained by differences in the time periods and grade levels examined, and by the presence or absence of controls for student demographics. For now it is enough to say that there is some ambiguity regarding any conclusions that can be drawn from a comparison of NAEP and TAAS results. This ambiguity is in part a result of the fact that NAEP is administered infrequently and only in certain grade levels.
    A more recent collection of studies edited by Martin Carnoy of Stanford University and issued by the Economic Policy Institute finds that the accountability systems in Texas, North Carolina, and Florida (before the adoption of A-Plus) all motivated failing schools to produce significant gains. Unfortunately, none of the studies released by the Economic Policy Institute confirm the validity of the state testing results by comparing them to the results on national exams. It is possible that the critics of testing are right, that some or even all of the gains measured only by state tests are the product of teaching to the test, cheating, or other manipulations of the testing system. In addition, the pre-A-Plus Florida analysis reported by the Economic Policy Institute is plagued by several research design flaws. For example, the study compares results from schools that took several different standardized tests without making any effort to ensure that the results are comparable. And because only pass rates were available, the scale scores analyzed were estimated based on a series of assumptions.
    The research presently available on the potential of vouchers to improve achievement in public schools is also less than conclusive. Recent studies by economist Caroline Minter Hoxby, as discussed in this issue, have attempted to address this question by examining the consequences of variation in the extent of choice currently available in the United States. They suggest that areas with more choice and competition experience better academic outcomes than areas with less choice and competition. While these results support the contention that vouchers would improve the quality of education for the entire education system, it remains to be seen whether even the prospect of competition can provoke a public school response.
    Studying the A-Plus accountability and choice system in Florida therefore offers the possibility of some important contributions to the existing research literature. An evaluation of A-Plus can reveal whether the prospect of competition, in the form of vouchers offered to students at chronically failing schools, represents an effective incentive for improvement. Unlike other studies of accountability systems, the ability to validate the scores used in the A-Plus system by comparing them with performance on nationally normed exams offers the possibility of dispelling concerns about whether the observed gains are real or the products of teaching to the test, cheating, or manipulation of the testing system.

Validating the FCAT Results
The first section of the analysis addresses the question of whether Florida’s test is a valid test of students’ academic abilities. Given the concerns raised by the Klein study regarding the validity of the TAAS exams in Texas, I decided to use the same analytical technique as Klein: comparing results on the FCAT with results on low-stakes standardized tests given at around the same time and in the same grade.
    During the spring of 2000, Florida schools administered both the FCAT and a version of the Stanford 9, which is a widely used and respected nationally normed standardized test. Performance on the FCAT determined a school’s grade from the state and therefore determined whether students would receive vouchers. Performance on the Stanford 9 carried no
similar consequences, so schools and students had little reason to manipulate, cheat, or teach to the Stanford 9. If the results of the Stanford 9 are similar to the results of the FCAT, the FCAT is likely to be a valid measure of academic achievement. If the results are not similar, it is possible that the FCAT results are not a valid measure of student performance.
    The results of this analysis suggest that the FCAT results are valid measures of student achievement. Schools with the highest scores on the FCAT also had the highest scores on the Stanford 9 tests that were administered around the same time in the spring of 2000. Likewise, schools with the lowest FCAT scores tended to have the lowest Stanford 9 scores. If the correlation were 1.00, the results from the FCAT and Stanford 9 test would be identical. As it turns out, the correlation coefficient was 0.86 between the 4th-grade FCAT and Stanford 9 reading test results. In 8th grade the correlation between the high-stakes FCAT and low-stakes standardized reading test was 0.95. In 5th-grade math, the correlation coefficient was 0.90; in 8th-grade math, the correlation was 0.95; and in 10th-grade math, the correlation was 0.91. In other words, the results of the two tests are quite similar. (It was not possible to verify the validity of the FCAT writing test with this technique because no Stanford 9 writing test was administered.)
    In the second RAND study of TAAS in Texas, Klein and his colleagues never found a correlation of more than 0.21 between the school-level results from TAAS and the school-level results from low-stakes standardized tests. In this analysis there was never a correlation between FCAT and the Stanford 9 below 0.86.
    To exclude the possibility that teaching to the test, cheating, or manipulation occurred only among schools that were previously failing, I also examined the correlations between the FCAT and Stanford 9 results among this subset of schools. This revealed that even among previously failing schools the correlations between the two test results remain very high, ranging from 0.77 to 0.99. It appears as if the pressures placed on previously failing schools did not lead them to distort their test results.

The Prospect of Vouchers
Now that the validity of the FCAT as a measure of student performance has been established, the question of whether vouchers inspired improvement among Florida’s failing schools can be studied. The greatest improvements should be seen among schools that had already received one F grade from the state, since their students would become eligible for vouchers if they received a second F. To test this hypothesis, average FCAT scale-score improvements for schools were broken out by the grade they received the year before.

Feeling the Pressure (Figure 1)
Schools earning F's from Florida's accountability system, and thus facing the threat of vouchers, made major gains in reading, math, and writing from 1999 to 2000.

Gains in test scores from 1999 to 2000 (in FCAT scale points), by quality of school (1999 grade)

Grade    Reading    Math     Writing
A          1.9      11.0      0.36
B          4.9       9.3      0.39
C          4.6      11.8      0.45
D         10.0      16.1      0.52
F         17.6*     25.7*     0.87*

* Change for F schools compared to schools with higher grades is significant at p < .01. Math and reading scales run from 100 to 500. The writing scale runs from 0 to 6.
SOURCE: Author's estimates based on data from the Florida Department of Education.

    In fact, the incentives appear to operate as expected. Schools that had received F grades in 1999 experienced the largest gains on the FCAT between 1999 and 2000. The year-to-year changes in school-level FCAT results did not differ systematically according to whether the school had received a grade of A, B, or C from the state. Schools that had received D grades and were close to the failing grade that could precipitate vouchers’ being offered to their students, by contrast, appear to have achieved somewhat greater improvements than those achieved by the schools with higher state grades. Schools that received F grades in 1999 experienced increases in test scores


that were more than twice as large as those experienced by schools with higher state-assigned grades.
    On the FCAT reading test, which uses a scale with results between 100 and 500, schools that had received an A grade from the state in 1999 improved by an average of 2 points between 1999 and 2000 (see Figure 1). Schools that had received a B grade improved by 5 points. Those earning a C in 1999 increased by 5 points. By contrast, schools with a D grade in 1999 improved by 10 points. Schools with F grades in 1999 showed an average gain of 18 points, equal to 0.8 standard deviations. In other words, the lower the grade in 1999, the greater the improvement in 2000.
    A similar pattern emerged in the FCAT math results. Schools earning an A grade experienced an average 11-point gain. Schools with a B gained 9 points. Schools with C grades in 1999 showed gains of 12 points, on average, between 1999 and 2000. Schools earning D grades improved by 16 points, while schools that received F grades in 1999 made gains of 26 points, equal to 1.25 standard deviations.
    The FCAT writing exam, whose scores range from 0 to 6, also shows larger gains for schools earning an F grade in 1999. Schools with an A grade in 1999 improved by 0.4 points on the writing test; B schools had an average gain of 0.4 points; and C schools gained 0.5 points. D schools improved 0.5 points, while F schools demonstrated an average gain of 0.9 points, equal to an astounding 2.2 standard deviations.

Alternative Explanations?
The fact that gains among schools facing the prospect of vouchers were nearly twice as large as the gains achieved by other schools might be at least partially attributable to other factors. One possible factor is regression to the mean, the statistical tendency for very low or very high scores to move closer to the group average when retested. This common dynamic could account for at least some of the extraordinary gains realized by previously failing schools. It is also plausible that the extraordinary gains of failing schools were the result of their being provided with additional resources not available to other schools. And some observers have speculated that the exceptional gains observed in Florida could be explained by a change in rules regarding the test scores of high-mobility students who move in and out of schools and districts often.
    To test these alternative explanations I compared the improvements recorded by F-level schools that had above-average initial scores for their category with D-level schools that had below-average initial scores for their category. The intuition here is that high-scoring F schools and low-scoring D schools are very much alike initially, yet one group is subject to the accountability system’s punishments (the F label and the prospect of vouchers), while the other group of schools is not. In many ways this comparison approximates a randomized experiment. Because the two groups were so close to the threshold dividing D and F schools, chance may explain to a fair degree why these schools received one grade or the other. This is not to say that grading systems are inherently arbitrary; it is only recognizing the reality that luck is an important factor at the margins.
    The initial similarity between the two groups of schools allows us to be confident that any difference in the gains realized by high-scoring F schools and low-scoring D schools is the result of the accountability system and not other factors. Regression to the mean cannot explain the gains of high-scoring F schools relative to low-scoring D schools because both groups begin with similarly low scores. In fact, because the letter grade is based on the percentage of students scoring above certain thresholds and not on the average score in each school, the high-scoring F schools actually have slightly higher initial reading and math scores than do the low-scoring D schools. In addition, statistical techniques can control for the influence of differences in the background of students in each group or in the additional resources provided to each group.
    Comparing the demographic characteristics of high-scoring F schools and low-scoring D schools confirms that the two groups are quite similar. They also do not differ significantly in their initial per-pupil spending, average class size, percentage of students receiving subsidized school lunches, percentage of students with limited English proficiency or disabilities, and the mobility of their student populations.
    Note that the comparison between high-scoring F schools and low-scoring D schools is likely to underreport the true effect of labeling schools as failing and forcing them to face the prospect of vouchers. The comparison only measures the amount by which certain F schools outperform certain D schools, ignoring the possibility that D schools are also inspired to improve for fear of failing for the first time. Indeed, simply assigning grades to schools may inspire them to improve in order to get better grades. All schools face this incentive to some degree.
    Nevertheless, high-scoring F schools did experience gains larger than their low-scoring D counterparts. After controlling for average class size, per-pupil spending in 1998–99, the percentage of students with disabilities, the percentage of students receiving a free or reduced-price school lunch, the percentage of students with limited English proficiency, and student mobility rates, high-scoring F schools achieved gains that were 2.5 points greater than their below-average D counterparts in reading (see Figure 2). The math results show that the prospect of vouchers inspired additional gains of 5.2 points. On the writing test, which has a scale of 1 to 6, the effect was 0.2 points, although, since the validity of the FCAT writing test cannot be confirmed, this finding is less definitive. Therefore, schools that received an F grade (and faced the prospect of vouchers should they receive another F) experienced gains superior to those made by schools
at a similar level of performance but that did not face the threat of vouchers.

Isolating the Voucher Effort (Figure 2)
To test whether it was the threat of vouchers that motivated schools to improve, compare the gains in lower-scoring D schools with the gains in higher-scoring F schools. The only real difference between the two is that F schools were faced with the threat of vouchers, yet their gains were larger than those in D schools.

Gains on standardized tests from 1999 to 2000 (in FCAT scale points), Reading and Math panels
Math: lower-scoring D schools 20.7; higher-scoring F schools 25.9*

* Statistically significant at p < .05
SOURCE: Author's estimates based on elementary-school data from the Florida Department of Education

    The larger gains made by schools facing the threat of vouchers cannot be explained by spending increases. While F schools did receive additional resources (about $600 per pupil in additional funding, compared with about $200 per pupil in D schools), taking this

precisely because they were taking action to avoid receiving a second F. The fact that including additional resources in the analysis does not diminish the magnitude of the motivational effect of vouchers suggests that the results are quite robust. Furthermore, the fact that controlling for the rate of student mobility does not have any effect on the results

additional spending it would take to produce gains as large as those produced by labeling schools and threatening them with vouchers. According to the models comparing high-scoring F schools with low-scoring D schools, to achieve the same 5-point gain in math that the threat of vouchers accomplished, Florida schools would need to increase per-pupil spending by $3,484 at previously failing schools. This would be an increase of more than 60 percent in education spending. To realize the same gain as the A-Plus program accomplished in reading, Florida schools would need to spend $888 more per pupil, more than a 15 percent increase in per-pupil spending. To produce the same gain in writing scores, per-pupil spending would have to be increased by $2,805, more than a 50 percent increase.
    For many years policymakers have focused on providing schools with enough resources to educate students. The evidence from the A-Plus accountability and choice program suggests that policymakers must also ensure that schools are provided with the appropriate incentives to use their resources effectively. Grading schools and using vouchers as a sanction for repeated failure inspires improvement at
vouchers accomplished, Florida schools would need to                                                                                    schools in a way that simply pro-
                                                                                                                                               viding additional resources
increase per-pupil spending by $3,484 at schools                                                                                                 cannot. The evidence from
                                                                                                                                                 Florida also suggests that
that had earned an F. This would be an increase                                                                                                 the gains produced by such
                                                                                                                                               an accountability system are
of more than 60 percent in education spending.                                                                                                 real indicators of improve-
                                                                                                                                               ment in learning, and not sim-
additional spending into account does                                               suggests that the exceptional                             ply teaching to the test, cheat-
not alter the extra gains achieved by                                               gains achieved by F schools were                          ing, or other manipulations of
schools that faced the prospect of vouch-                                           not caused by a change in the                           the testing system. Whether the
ers. This is an especially important find-                                           rules concerning the treatment of                      same gains could have been pro-
ing because the additional resources                                                high-mobility students.                                duced using alternative sanctions
obtained by F schools may have been at                                                                                                     is unknown. But the fact is that
least partially the result of the threat of                                                                                              vouchers were used, and they were
vouchers. That is, school districts may                                             Conclusion                                      unquestionably effective.
have allocated more money to failing                                                To put the magnitude of the voucher
schools or failing schools may have been                                            effect into perspective, the same mod-        –Jay P. Greene is a senior fellow at the
more aggressive in their grant writing                                              els can be used to calculate how much         Manhattan Institute for Policy Research.
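
The spending equivalents quoted in the conclusion can be checked against one another with simple arithmetic. The sketch below is an illustration only and is not part of the author's models: it takes the three dollar figures and the "more than X percent" statements from the text and works out the largest baseline per-pupil spending consistent with each. The $5,500 baseline used at the end is a hypothetical value chosen for the example, not a figure from the article.

```python
# Dollar increase and the "more than X percent" share quoted for each subject.
# If increase / baseline > pct, then baseline < increase / pct.
EQUIVALENTS = {
    "math": (3484, 0.60),
    "reading": (888, 0.15),
    "writing": (2805, 0.50),
}


def implied_baseline_ceiling(increase, pct):
    """Upper bound on baseline per-pupil spending implied by 'more than pct'."""
    return increase / pct


def percent_increase(increase, baseline):
    """Spending increase expressed as a percentage of the baseline."""
    return 100.0 * increase / baseline


ceilings = {
    subject: implied_baseline_ceiling(increase, pct)
    for subject, (increase, pct) in EQUIVALENTS.items()
}

# All three statements hold together only below the smallest ceiling.
tightest_ceiling = min(ceilings.values())

# HYPOTHETICAL baseline, used only to illustrate the calculation.
ASSUMED_BASELINE = 5500
illustration = {
    subject: percent_increase(increase, ASSUMED_BASELINE)
    for subject, (increase, _) in EQUIVALENTS.items()
}

for subject in EQUIVALENTS:
    print(subject, round(ceilings[subject]), round(illustration[subject], 1))
```

Under these assumptions the writing figure gives the tightest bound, so the three percentage statements are mutually consistent for any baseline below roughly $5,610 per pupil.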

EDUCATION NEXT / WINTER 2001
