The emphasis placed on standardized tests in elementary and secondary education has been steadily increasing over the past decade. Recent research examines how the pressure to boost test scores can lead to cheating by teachers and administrators.
Standardized test scores are designed to provide an objective measure of student achievement. Consequently, the use of test outcomes to punish or reward schools and students, a practice referred to as "high-stakes testing," is now extremely widespread.
Proponents of high-stakes testing argue that the stronger incentives associated with tangible, quantitative measures of performance will lead teachers and students to work harder. Opponents have worried that the emphasis on standardized tests will lead teachers to "teach to the test," or cut out subjects such as social sciences to emphasize reading and mathematics.
"In the debate over high-stakes testing, neither side has expressed concern that teachers may respond to these stronger incentives in more diabolical ways, such as outright cheating," says University of Chicago economics professor Steven Levitt.
In the recent study, "Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating," Levitt and coauthor Brian A. Jacob of Harvard University's John F. Kennedy School of Government find that as incentives for high test scores increase, unscrupulous teachers may be more likely to engage in a range of illicit activities, including changing student responses on answer sheets, providing correct answers to students, obtaining copies of exams illegitimately prior to the test date, and teaching students the answers to precise test questions.
"The way teachers respond to incentives is just human nature," says Levitt. "To quote W. C. Fields: 'Anything worth winning is worth cheating for.'"
Levitt and Jacob developed a set of tools that they refer to as a "cheating detection algorithm," which uses economic theory, statistical measures, and available data to uncover outright teacher cheating on standardized tests.
The study identifies the overall prevalence of teacher cheating and the factors that predict cheating. Levitt and Jacob's results highlight the fact that incentive systems with fixed rules often induce behavioral distortions such as cheating.
For standardized tests, Levitt and Jacob find that cheating classrooms will systematically differ from other classrooms because they show unusually large fluctuations in test scores and suspicious patterns of answers.
Using data from the Chicago Public Schools (CPS), the authors estimate that serious cases of teacher or administrator cheating on standardized tests occur in a minimum of 4 to 5 percent of elementary school classrooms annually. The frequency of cheating appears to be strongly linked to minor changes in incentives.
"Incentives are powerful, but they are a double-edged sword," says Levitt. "Incentives can change behavior for the better and for the worse."
For one week each May, third through eighth grade students in the Chicago Public Schools take a standardized, multiple-choice achievement exam known as the Iowa Test of Basic Skills (ITBS), a national exam with a reading comprehension section and three math sections.
Students mark their responses on answer sheets, which are scanned to determine each student's score. Before the sheets are scanned, teachers or administrators "clean" them, erasing stray pencil marks, removing dirt or debris from the form, and darkening item responses that were only faintly marked by the student.
Prior to 1996, Iowa Test scores were mainly used to provide teachers and parents with a sense of how a student was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of the Chicago Public Schools, CPS launched an initiative designed to hold students and teachers accountable for student learning.
Elements of the reform included putting schools on "probation," a highly undesirable circumstance, if fewer than 15 percent of students scored at or above national norms on the ITBS reading exams. Probation schools that do not show enough improvement may be reconstituted, which involves closing the school and dismissing or reassigning the staff.
The second part of the reform included an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in the third, sixth, and eighth grades must meet minimum standards on ITBS in both reading and math in order to be promoted to the next grade.
The authors use detailed administrative data from 1993 to 2000, including question-by-question answers given by every student taking the ITBS. The authors also have access to each student's full academic record, including past test scores, the school and room to which a student was assigned, and extensive demographic and socio-economic characteristics. Their final data set includes approximately 20,000 students per grade per year distributed across approximately 1,000 classrooms.
Given that the aim of cheating is to raise test scores, one signal of cheating is unusually large gains in test scores for students in the year the cheating occurs, followed by very small test score gains (or even declines) the following year. Since test score gains that result from cheating do not represent real gains in knowledge, there is no reason to expect gains to be sustained on future exams taken by these students. In contrast, if large test score gains are due to a talented teacher, student gains are likely to be more permanent.
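The logic of this first indicator can be illustrated with a minimal Python sketch. It is not the authors' actual measure: the function names, the percentile-rank scoring rule, and the sample scores are all assumptions made for illustration. Each classroom is ranked by its average gain this year and by its average gain the following year, and a classroom scores highest when a large gain is followed by a small gain or a decline.

```python
from statistics import mean


def percentile_ranks(values):
    """Map each value to a rank in [0, 1]; the smallest value gets 0."""
    ordered = sorted(values)
    n = len(values)
    if n == 1:
        return [0.0]
    return [ordered.index(v) / (n - 1) for v in values]


def fluctuation_scores(classrooms):
    """Score classrooms on 'big gain now, little or no gain next year'.

    classrooms: dict mapping a classroom name to a tuple of three lists
    of student scores (prior year, current year, following year).
    The scoring rule is an illustrative assumption, not the study's formula.
    """
    names = list(classrooms)
    gain_now = [mean(classrooms[n][1]) - mean(classrooms[n][0]) for n in names]
    gain_next = [mean(classrooms[n][2]) - mean(classrooms[n][1]) for n in names]
    rank_now = percentile_ranks(gain_now)
    rank_next = percentile_ranks(gain_next)
    # High rank on this year's gain and low rank on next year's gain
    # both push the score up.
    return {
        name: rank_now[i] ** 2 + (1 - rank_next[i]) ** 2
        for i, name in enumerate(names)
    }


# Hypothetical grade-equivalent scores for three classrooms.
example = {
    "room_a": ([3.0, 3.2, 2.9], [3.9, 4.1, 4.0], [4.9, 5.0, 5.1]),  # steady growth
    "room_b": ([3.1, 3.0, 2.8], [5.2, 5.0, 5.1], [4.9, 5.0, 5.1]),  # spike, then decline
    "room_c": ([3.0, 3.1, 3.0], [4.0, 4.2, 4.1], [5.2, 5.3, 5.1]),  # steady growth
}
scores = fluctuation_scores(example)
most_suspicious = max(scores, key=scores.get)
print(most_suspicious, scores)
```

In this made-up data, the classroom whose scores jump and then fall back stands out, while the two classrooms with steady growth do not.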
A second indicator is distinctive patterns of suspicious answer strings. The crudest form of teacher cheating is likely to leave tell-tale signs in the form of blocks of identical answers for many students in the classroom. Teachers can quickly and easily alter test forms by erasing answers and filling in the correct responses. Another sign of a suspicious answer string is a classroom in which many students answer most of the easy questions wrong and get most of the hard questions right.
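A search for blocks of identical answers might look something like the following Python sketch. The function name, the fixed window length, and the sample answer strings are hypothetical, and the approach shown (counting the most widely shared run of consecutive answers) is a simplification rather than the procedure used in the study.

```python
from collections import Counter


def largest_identical_block(answers, window):
    """Find the run of `window` consecutive answers shared by the most students.

    answers: list of equal-length answer strings, one per student.
    Returns (count, start, block): how many students gave the block,
    the index of its first question, and the shared answers themselves.
    """
    n_items = len(answers[0])
    best = (0, 0, "")
    for start in range(n_items - window + 1):
        counts = Counter(a[start:start + window] for a in answers)
        block, count = counts.most_common(1)[0]
        if count > best[0]:
            best = (count, start, block)
    return best


# Hypothetical answer strings for five students on a ten-question section.
classroom = [
    "ABCDABCDAB",
    "BBCDABCACD",
    "CACDABCBBA",
    "DDCDABCDCC",
    "ABADCBADCB",
]
result = largest_identical_block(classroom, window=5)
print(result)
```

Here four of the five students share the same answers on questions 3 through 7 even though their other answers differ. A shared block is not proof of tampering on its own, since strong students will naturally agree on correct answers, so a fuller analysis would presumably weigh how likely each shared answer is for the students involved.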
The authors find that the introduction of the policies ending social promotion and placing schools on probation is positively associated with the likelihood of classroom cheating, and that cheating rates in the lowest performing classrooms are the most sensitive to changes in incentives.
A Unique Policy Intervention
In the spring of 2002, Arne Duncan, the new CEO of the Chicago Public Schools, having read Levitt and Jacob's research, invited the authors to work with CPS administrators to design and implement auditing and retesting procedures using their cheating detection algorithm. Levitt and Jacob detail the results of this experiment in their follow-up study, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory."
A few weeks after the initial ITBS exam was administered, the authors retested 117 classrooms under controlled circumstances that precluded cheating. Of the classrooms retested, 76 were suspected of cheating during the initial exam: 51 with suspicious answer strings and large test score gains, 21 with only suspicious answer strings, and 4 with anonymous allegations of cheating made to CPS officials.
They also retested two sets of classrooms as control groups: the first control group was classes with large test score gains, but no evidence of cheating in answer strings; the second control group consisted of randomly selected classrooms.
The classrooms identified as likely to have cheated experienced test score gains on the initial spring 2002 tests that were nearly twice as large as those of a typical CPS classroom. Consistent with their hypothesis, the authors found that on the retest, gains for those classrooms completely disappeared, most notably in the reading scores.
One subset of classrooms suspected of cheating had only average test score gains on the initial test, even with the suspected cheating. This implies that the teachers taught almost nothing and cheated to raise their classrooms' scores up to the average. These classrooms were expected to have large test score declines on the retest, which proved to be the case.
In contrast, classrooms identified as having good teachers who had not cheated scored even higher on the reading retest, while their math scores fell slightly. The randomly selected classrooms maintained nearly all of their gains when retested.
In 29 classrooms, the test score declines averaged more than one grade-equivalent across the subjects tested. CPS staff investigated these 29 classrooms further, analyzing erasure patterns and conducting on-site investigations. A substantial number of cheating teachers were disciplined for their actions, and some were dismissed.
The data generated by the auditing experiment also provided the authors with a unique opportunity to improve their cheating detection techniques.
"During the retest, we found that the most effective way of catching cheating was looking at cases where students get the easy questions wrong and the hard questions right," says Levitt.
The results of the experiment provide compelling evidence that these methods can successfully identify cheating classrooms, as well as identify classrooms with good teachers whose gains are legitimate and possibly deserving of rewards.
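The pattern Levitt describes, in which students miss easy questions while getting hard ones right, can be illustrated with a short Python sketch. Everything specific in it is an assumption for illustration: the function name, the cutoffs that define "easy" and "hard," and the sample data are hypothetical, and this is not the authors' implementation.

```python
def inversion_rate(correct, difficulty, easy_cutoff=0.7, hard_cutoff=0.4):
    """Share of students who do better on hard items than on easy ones.

    correct: one list of booleans per student (True = answered correctly).
    difficulty: for each item, the share of a reference population that
        answers it correctly (high = easy, low = hard).
    The cutoffs are illustrative assumptions.
    """
    easy = [i for i, p in enumerate(difficulty) if p >= easy_cutoff]
    hard = [i for i, p in enumerate(difficulty) if p <= hard_cutoff]
    if not easy or not hard:
        raise ValueError("need at least one easy and one hard item")
    inverted = 0
    for row in correct:
        easy_rate = sum(row[i] for i in easy) / len(easy)
        hard_rate = sum(row[i] for i in hard) / len(hard)
        if hard_rate > easy_rate:
            inverted += 1
    return inverted / len(correct)


# Hypothetical six-item test: the first three items are easy, the last three hard.
difficulty = [0.9, 0.85, 0.8, 0.35, 0.3, 0.2]

typical_class = [
    [True, True, True, False, False, False],
    [True, True, False, True, False, False],
    [True, False, True, False, False, False],
    [True, True, True, True, False, False],
]
suspect_class = [
    [False, True, False, True, True, True],
    [True, False, False, True, True, True],
    [True, True, True, False, False, False],
    [False, False, True, True, True, False],
]
typical_rate = inversion_rate(typical_class, difficulty)
suspect_rate = inversion_rate(suspect_class, difficulty)
print(typical_rate, suspect_rate)
```

In the made-up data, no student in the typical classroom shows the inverted pattern, while three of the four students in the suspect classroom do.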
Costs and Benefits of Incentives
Levitt and Jacob argue that the obvious benefits of high-stakes tests as a means of providing incentives must be weighed against possible distortions that these measures induce. Explicit cheating is just one form of distortion.
There are several ways to reduce cheating on standardized tests: lower the payoff (incentive), make it more difficult to cheat, or increase the severity of punishment.
The kind of cheating that the authors focus on could be reduced at relatively low cost through the implementation of proper safeguards, such as those used by the Educational Testing Service on the SAT and GRE exams, which require independent proctors.
However, even if this type of cheating is eliminated, the study highlights the nearly unlimited capacity of human beings to distort behavior in response to incentives.
"Ultimately, the aim of public policy should be to design rules and incentives that provide the most favorable trade-off between the real benefits of high-stakes testing, and the real costs associated with behavioral distortions aimed at artificially gaming the system," says Levitt.