Tuesday, September 20, 2016

Louisiana Accountability System Changes to Reflect Political Goals

Note to Readers: The following post by Herb Bassett traces the history of test-based accountability in Louisiana from 1999 to the present. Mr. Bassett's analysis demonstrates that the grading and ranking of our schools and teachers based on student test scores is purely arbitrary and is constantly being manipulated by our State Superintendent to produce whatever results suit his objective at the time.

Several years ago reformers chose to portray a large number of our schools as failures so that there would be an excuse for "reforms" such as school privatization and reductions in teacher job protections. Now that the reformer policies have been in effect for a number of years, the testing data is being manipulated to demonstrate "improvement". But such alleged "improvement" is not supported by the results of the National Assessment of Educational Progress (NAEP) tests. These test results show that since 2005, Louisiana students have achieved only small improvements in national test results, and have actually lost ground compared to all other states in three of the four categories tested.

Test-based accountability in Louisiana has failed to make any significant improvement in student performance. At the same time our students' futures have become more limited because of the increasing emphasis on college-prep-for-all, at the expense of vocational prep opportunities.

The scapegoating of teachers for low student performance that has been an inherent part of the reforms has demoralized and de-professionalized the teaching profession in Louisiana. Many highly respected teachers have retired early. Meanwhile the natural joy of teaching and learning has been replaced by a constant, dull, nonproductive test-prep exercise with no end in sight.

Mr Bassett's analysis

The Every Student Succeeds Act (ESSA) requires us to revamp our school accountability system. As we decide what changes to make, we should examine our current status, take lessons from the history of our statewide tests, and recognize how politicized our accountability system has become. It is imperative that we bring nuanced understanding to the decision making process.

History warns that the results of a single summative measure can be shaped by political considerations.

My hope here is to restore history that John White has removed from public view and reveal the current strategy to produce an illusion of ever-improving student performance by changing the metrics. This link is to the GEE 21, LEAP/iLEAP, LAPARCC and EOC data I compiled for this study.

I. The politicization of the accountability system:

Louisiana set up its accountability system even before No Child Left Behind. The first school performance scores were given in 1999. In 2003, the Louisiana Legislature created the Recovery School District to take over failing schools. The definition of failure was a school performance score of below 60. Only five schools had been taken over for their failing status before Hurricane Katrina.

Shortly after Katrina, the Louisiana legislature arranged for the takeover of the majority of the New Orleans schools. Act 35 allowed the Recovery School District to take over schools with school performance scores simply up to the state average of 87.4. Only New Orleans schools were targeted at the time. Afterward, school performance scores became increasingly politicized.

A star rating system of schools was in use from 2003 through 2010. A push for reforms began in the fall of 2011, when school letter grades were instituted to create a sense of urgency. Forty-four percent of schools were suddenly declared to be failing because they were graded D or F. This provided momentum for the passage of education reforms in Acts 1 and 2 of the 2012 legislative session.

Acts 1 and 2 tied tenure and teacher evaluations to student test scores, and expanded charter schools and vouchers. Would the changes spur increased student achievement? The reformers would soon need evidence of improvement.

While the reform legislation was being passed, John White, the then-new state superintendent, pushed new school accountability formulas for 2013 through BESE. The new formulas were virtually guaranteed to yield higher overall school letter grades.

Meanwhile, the requirement for a D was raised by 10 points in 2012 in order to produce more Fs. This would help to maintain the sense of urgency created the year before. But instead, good news came at an embarrassing time for the reformers.

When the 2012 letter grades were released, over one-third of our high schools suddenly were "A" schools. This was due to the changeover from using the old GEE 21 to the new End-Of-Course tests in the accountability formulas. The astounding illusion of a turnaround appeared before the new legislation had gone into effect.

The 2012 high school letter grade inflation was rectified by the new accountability formulas in 2013. The new formulas put the high school letter grades back in line with the 2011 results, but the K-8 schools got a huge boost. The new formulas were completely different from the old, and the grading scale was changed from a 200 point system to a 150 point system. Bonus points were added.

At the time, I ran the 2011 test data for each school through the old and new formulas and found that - based on exactly the same test data - the new formulas alone would yield much higher K-8 letter grades. This LDOE file confirms the inflationary shifts in the 2013 results.

The 2013 LDOE press release, however, attributed the improved letter grades to a slight improvement in student test scores that year, despite the reality that most of the letter grade gains came directly from the new formulas.

"Letter grade outcomes improved overall for the 2012-2013 school year because of record setting student achievement: 71 percent of students tested at Basic or above this year in ELA and math, graduation rates are at an all-time high, a record number of students earned college-entry ACT scores this year, and Louisiana students increased their AP participation more than any other state."

After 2013, to avoid political fallout during a time of test transition, LDOE promised to curve the school letter grades as needed to maintain the letter grade distribution.

The bigger picture is this. The accountability formula changes were just one part of a planned series of changes to guarantee rising measures that would shield the reforms from any real scrutiny. Changes to the tests themselves were the next step.

Our test history shows that we can expect test scores to rise simply because the tests are new. There also are tweaks that can be made to raise or lower the scores as the political winds change.

By instituting the reforms hand-in-hand with radical changes to the accountability formulas, standards, and the statewide tests, the reformers ensured there would be no consistent measure of actual student improvement.

The tests that existed before 2012 show slowed gains or reversals after 2012 until they were phased out. Growth on our grade 3-8 tests slowed after 2012, and the proficiency rates on three of our four oldest high school EOCs are currently in decline.

The ACT is another longstanding test on which we are in a slight decline. White's spin is that slightly more students earned an 18 this year than ever before. That is true of the raw number, but the percentage of students scoring 18 fell slightly, by half a percentage point. According to the official October 2014 and 2015 multi-stats reports, this year's senior class was enough larger than last year's to account for the difference.
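The arithmetic behind this point can be sketched in a few lines. The class sizes and counts below are made-up numbers chosen only for illustration; they are not the figures from the multi-stats reports. They simply show how a count can reach a record high while the percentage slips when the class itself grows.

```python
# Hypothetical figures for illustration only (not actual Louisiana data).
last_year = {"seniors": 40_000, "scored_18": 24_000}
this_year = {"seniors": 41_000, "scored_18": 24_395}

def percent_scoring_18(year):
    """Share of the senior class that reached an 18, in percent."""
    return 100 * year["scored_18"] / year["seniors"]

# The raw count goes up...
count_change = this_year["scored_18"] - last_year["scored_18"]
# ...while the rate goes down, because the class grew faster than the count.
pct_change = percent_scoring_18(this_year) - percent_scoring_18(last_year)

print(f"Change in number of students: {count_change:+d}")
print(f"Change in percentage points:  {pct_change:+.1f}")
```

With these assumed numbers, more students reach the score even though a smaller share of the class does.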

And this decline comes after schools have increasingly pushed students to repeat the test until they score an 18 or higher. Some schools recently have even required students to take an ACT prep elective until the score of 18 is reached.

The consistent data paint a picture of decline and it explains why the reformers would coordinate a series of test changes and accountability formula changes with the reform legislation. With enough changes to the metrics, there would be no way left to tell if students really benefited from the reforms.

Soon we will have only the reformers' spin on our statewide tests.

II. Our current status no longer can be compared with the past:

The August 4 press release, "Louisiana Students Show Improvement as Schools Adjust To Higher Expectations," told us that "raised expectations" put more students at the Mastery or above (Mastery+) achievement level. It vaguely reviewed changes to the tests between 2012 and 2016, but did not clarify that there are simply no statistical connections between those tests.

The "improvement" is much better explained as a by-product of setting different curves on the totally different tests given in those years.

At the same time state superintendent John White released that spin, he cleansed years of test data from LDOE's current LouisianaBelieves website. Gone are the LEAP and iLEAP results from 2008-2013. Gone are the files showing the year-to-year progress at the Basic+ level that we tracked for a decade.

The files were already up; White removed them from the website.

This was nothing new. Superintendent White began purging test and demographic data when he replaced LDOE's former website with LouisianaBelieves in January, 2013. At that time, fourteen years of records were removed from public view. The limited amount of data that was re-posted on LouisianaBelieves had much of the important information redacted.

Taking down data from public view protects his spin from scrutiny. The latest data he removed - and then referenced in the press release - would remind us that the test score shifts of the last two years are totally out of line with the rest of our test history.

From 2006 to 2014, statistical test equating ensured the year-to-year comparability of the LEAP/iLEAP tests.  But in 2015, the new LAPARCC test could not be equated to the old tests. The ELA and Math LAPARCC tests were based on new questions developed by the PARCC consortium; a committee met to determine how many students scored Basic and Mastery.

Still, White's press release prominently featured a comparison of 2012 to 2016 at the Mastery+ level. The current percentage of students at that level in ELA and Math is up 14 points statewide.

On the other hand, my saved files show that Basic+ dropped by three points overall from 2012 to 2016.

He does not want us to see that Basic+ fell at the same time Mastery+ rose. It would highlight the incompatibility of the 2012 and 2016 measurements.

Since there is no statistical connection between the 2012 and 2016 tests, it is not my point here to claim that our students are actually doing worse. Rather, I am putting forth an example of one way John White ensures that we get only his spin. He makes inconvenient data disappear.

The 2015 LAPARCC was more like a second opinion than a follow-up examination. If you are diagnosed with a terrible disease and then get a second opinion that says you only have a head cold, the second opinion is not evidence that you were miraculously cured.

And our 2015 LAPARCC results were questionable at best. One of the selling points for the switch to the PARCC test was that our scores would be compatible with other states'. It turns out they were not. PARCC tacitly admitted that our students had an advantage.

The official report showed us doing better than Arkansas, New Mexico, and Mississippi, running slightly ahead of Maryland, and on several individual tests, exceeding Illinois. But we were the only state with a footnote clarifying that only the pencil-and-paper version was administered. Most students in other states took the computer-based test. Comparisons showed that, overall, students who took the pencil-and-paper version scored higher than expected when compared to similar students taking the computer-based version. White remained silent about this.

Now, the legislature required a new test for 2016 on which fewer than half of the questions would be PARCC questions. Despite our 2015 results being unjustifiably high in comparison to other states, scores rose even higher this year on our new LEAP tests. Whether the gains came from level setting or merely because the tests were shortened and scheduled later, the results were too erratic to take year-to-year comparisons seriously as a sign of actual student improvement.

When a new standardized test is introduced, a committee meets after the test is given. It has raw score data in hand as it makes the judgment calls that ultimately set the curve. Existing tests then have their curves adjusted each year through a statistical process of test equating. The 2015 and 2016 tests introduced new curves and the changes were extreme.

Consider the eighth grade tests. On the 2016 tests, ELA Mastery+ was 27 points above the 2014 level. Math was 18 points higher at Mastery+ but eight points lower at the Basic+ level. These shifts are totally out of line with our testing history.

There are six grade levels of ELA and Math tests - twelve in total. In 2015 and 2016, record year-to-year Mastery+ gains were set on nine of those tests and two tied for record gains when compared with the years 2006 to 2014. At the Basic+ level record losses were set on nine tests and one tied the record.

White's spotlighting of Mastery+ while removing Basic+ data sustains an illusion of improvement.

This goes to show how much changing tests allows the results to be reshaped.

I should note that reshaping the results also invalidates the year-to-year tracking of achievement gaps between subgroups.

Suppose teacher Smith and teacher Jones gave the same test. In teacher Smith's class, several students scored just below an A, but only one actually made an A. In teacher Jones' class, three students made an A, but the next highest grade was a C. By the standard of making A's, Jones' class was better.

Smith and Jones decide to curve the results. The curve raises four more students in Smith's class to an A, but none in Jones' class. The grade distribution now is different; Smith's class has more A's. This change, however, does not show improvement in Smith's class - the raw results did not change. The curving of the grades, not anything the students did, produced the changes.

Now say that they decided not to curve this test, but agreed in advance to curve the next test. On the next test, the highest scorer was still the highest, the second highest remained the second highest, and so on. Since teachers Smith and Jones already agreed to a curve, now Smith's class has more A's than Jones'. But again there was no real change in performance; the results were changed by a curve which affected Smith's and Jones' classes differently.

This would not be evidence Smith's class closed the achievement gap, since the new curve, not a change in student performance, made the difference.
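The Smith and Jones example can be worked through with numbers. The scores below are invented to match the description (one A and several near-misses in Smith's class; three A's and then a C in Jones' class), and the "curve" is modeled, as an assumption, by lowering the A cutoff from 90 to 88.

```python
# Hypothetical raw scores, invented to fit the example in the text.
smith = [91, 89, 89, 88, 88, 80, 75]
jones = [95, 93, 92, 75, 72, 70, 65]

def count_as(scores, cutoff):
    """Number of students at or above the cutoff for an A."""
    return sum(score >= cutoff for score in scores)

ORIGINAL_CUTOFF = 90
CURVED_CUTOFF = 88  # assumed curve: the A line drops two points

raw = {"Smith": count_as(smith, ORIGINAL_CUTOFF),
       "Jones": count_as(jones, ORIGINAL_CUTOFF)}
curved = {"Smith": count_as(smith, CURVED_CUTOFF),
          "Jones": count_as(jones, CURVED_CUTOFF)}

print("A's before the curve:", raw)
print("A's after the curve: ", curved)
```

No score in either list changes between the two counts. Only the cutoff moves, yet the comparison between the classes flips.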

White erred when he compared improvement of districts at Mastery+ from 2012 to 2016 in his press release. While I intend no disrespect to the districts cited, the data better justifies the conclusion that the changes came from the different curves on the different tests rather than real changes in student performance.

John White showed a similar misunderstanding about the reduced number of failing schools when he provided this quote for the release of the 2013 school letter grades (where the results had just been inflated through totally new formulas, a new grading scale, and bonus points):

"Changes made to the formula have led to real increases in student achievement," said Superintendent John White.

I documented in part one that the formula change by itself yielded higher letter grades.

So these programmed changes to the tests and formulas are part of a strategy that yields an illusion of improved student achievement. (Let me note here that BESE approved these changes.)
Our grade 3-8 ELA and Math tests have now been completely changed since 2014. A new Social Studies test was field tested this year. Science will follow in 2018-19.

White has now proposed to change the scoring system for our high school EOC tests in the near future. This, too, can result in a new curve that gives whatever illusion he desires.

White needs history to disappear because it shows how perpetual proficiency rate gains can be created through systematic changes to the tests. He wants us to forget these lessons so that future gains will not be questioned.

The ruse is to change the tests and put students at maximum disadvantage at first, then provide tweaks as necessary to progressively increase student advantage on subsequent administrations.

III. What does our test history tell us to expect in the future?

Tweaks to the tests and outside influences can increase proficiency rates. To understand this, consider the many ways the volume I hear from my stereo can be raised.

On my home system I can turn the volume knob, choose to play it through different speakers, or adjust the output level of my iPod playing into the system. I can even move closer to the speakers. An adjustment to any one affects what I hear. The original recording is unchanged, but what I perceive is louder.

Likewise, a test has a series of "knobs" that can be tweaked to affect the results. As long as the "knobs" are turned down on the first administration, one or more can be turned up on subsequent administrations to continually increase proficiency rates. Arguably, test scores can be made to rise without students truly knowing more at the end of the year.

New-test score-rise effect: Common sense reasons why proficiency rates on new tests rise initially then level off:

Our test history shows that proficiency rates tend to rise for the first three to six years before leveling off. After that, a tweak of some sort or a new test is required to spur increased rates.

Redesigned in 2001 and 2002, the GEE 21 proficiency rates (Basic+) on each subject rose initially, but reached a peak within three to six years (aside from a third-year stumble in ELA). The gains between the initial administration and the initial peaks for the four subjects ranged from 8 to 15 points.

Afterward, although there were ups and downs for individual subject tests, overall proficiency peaked in 2004 and slowly declined until 2009. (Figure: GEE 21 proficiency rates by subject, with the initial gains highlighted.)

There are some common-sense reasons for the new-test score-rise effect:
· Teachers become familiar with which material is emphasized on the test and stress it more in class at the expense of covering other material.
· Teachers/students develop test-taking strategies specific to the new tests over the first few years.
· Schools may institute test-prep strategies such as trading class time for special test prep periods in the weeks before the test or requiring struggling students to take remediation in place of other elective classes.
· The state or district may tie test performance to school letter grades or individual course grades, or the school might provide trips or other rewards for desirable test scores to motivate students to simply try harder on the test days.

The question is, do these things make students smarter overall, or just better prepared and/or more motivated test-takers?

The scheduling of the tests in the school year affects the proficiency rates:

After proficiency rates had slowly declined since 2004, the 2009 GEE 21 was scheduled about three weeks later in the school year. Teachers and students had more time to prepare. Proficiency rates rose on three subject tests and one held steady. On average, the increase was four points.

Likewise, from 2006 to 2008, the LEAP/iLEAP Basic+ rates were stagnant; the 2009 tests were moved later in the year and the test schedule was altered to start mid-week so that students no longer had five consecutive testing days without a weekend break. Basic+ performance rose four points, the largest single-year gain I could find.
Now, were the students really smarter at the end of that year than previous years, or did taking the tests later and with a weekend break account for the higher scores?

If there is any comparison to be made between the 2016 LEAP and the 2015 LAPARCC tests, the later testing dates and condensed schedule in 2016 would have to be taken into account. This is another reason why any comparison between them is not valid.

Our End-Of-Course (EOC) test history confirms the new-test score-rise effect:

After the GEE 21 proficiency rates stagnated, they were replaced by the EOCs, which were phased in one test per year from 2008 to 2013. The four oldest tests, Algebra I, English II, Geometry, and Biology, have peaked and declined since their introduction. English III and US History are still in the period of initial gains.

Scores awarded on the EOCs are Needs Improvement, Fair, Good, and Excellent. Fair+ is the graduation requirement. Good+ is considered proficiency.

Thanks to the new-test score-rise effect, phasing in one test per year ensured long-term overall average proficiency gains. As the first tests hit peaks and began to decline, later tests made gains to offset those losses. (However, overall proficiency now has been unmoved since 2014.) Part of White's spin is averaging the results of several different subject tests to show sustained overall growth.
This explains how the staggered schedule of the redesigns of the LEAP ELA and Math, Science, and Social Studies tests will capitalize on the new-test score-rise effect to produce sustained overall proficiency rate gains.
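A small simulation makes the staggering effect concrete. Every number here is an assumption for illustration: each hypothetical test starts at 40 percent proficient, gains three points a year for four years, then goes flat, and one new test is introduced each year for six years. None of these values are actual EOC or LEAP results.

```python
# All parameters are hypothetical, chosen only to illustrate the mechanism.
START_RATE = 40.0              # assumed proficiency rate in a test's first year
GAIN_PER_YEAR = 3.0            # assumed yearly gain while the test is "new"
RISE_YEARS = 4                 # assumed years of gains before the plateau
INTRO_YEARS = [0, 1, 2, 3, 4, 5]  # one new test phased in per year

def rate(age):
    """Proficiency rate of a single test `age` years after introduction."""
    return START_RATE + GAIN_PER_YEAR * min(age, RISE_YEARS)

def overall_average(year):
    """Average over every test that exists in the given year."""
    rates = [rate(year - intro) for intro in INTRO_YEARS if intro <= year]
    return sum(rates) / len(rates)

averages = [overall_average(year) for year in range(11)]
for year, avg in enumerate(averages):
    print(f"year {year:2d}: overall average {avg:.1f}")
```

In this sketch each individual test stops improving after four years, but the combined average keeps climbing for nine, because newer tests are still in their rising phase while older ones sit on their plateaus.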

Test outcomes have been affected by test equating in a curious relationship:

The EOCs have evolved, but the development of the test questions has remained under the control of Louisiana and the testing company, Pacific Metrics. This has allowed test equating to be performed, but some of the outcomes are interesting.

Algebra I hit a relative peak in 2012 at 56 then declined in 2013 by a point. It hit a new peak at 58 this year, but its new peak came along with a lowered raw-score requirement.

While the Good+ rate was up, fewer correct answers were required for Good (25 out of 50 in 2012; 23 out of 50 in 2016).

Similarly, Good+ on the Geometry EOC rose every year from 2010 to 2015. Each year from 2011 to 2015, the number of correct answers required for Good was lowered. In 2016, the required number of correct answers was finally left unchanged; Good+ dropped by a point.

Now, there were content changes to the tests at the same time, so the lowered requirements may have been justified. However, it is fair to ask - are the students getting smarter, or is it simply getting easier to make the required scores? Can we be sure that test equating makes the year-to-year comparisons valid?
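To see why the raw-score requirement matters, consider a sketch with an invented set of raw scores on a 50-question test. Only the two cut scores (25 and 23 correct answers) come from the Algebra I figures above; the score list is hypothetical, and the sketch deliberately ignores any change in test difficulty that equating is meant to account for.

```python
# Hypothetical raw scores (correct answers out of 50) for 20 students.
raw_scores = [12, 15, 18, 20, 21, 22, 23, 23, 24, 24,
              25, 26, 27, 28, 30, 32, 35, 38, 41, 45]

def good_plus_rate(scores, cut):
    """Percent of students at or above the cut score for Good."""
    return 100 * sum(score >= cut for score in scores) / len(scores)

rate_cut_25 = good_plus_rate(raw_scores, 25)  # cut reported for 2012
rate_cut_23 = good_plus_rate(raw_scores, 23)  # cut reported for 2016

print(f"Good+ with a cut of 25: {rate_cut_25:.0f}%")
print(f"Good+ with a cut of 23: {rate_cut_23:.0f}%")
```

The same answers produce a higher Good+ rate under the lower cut. Whether a lower cut is warranted depends on whether the test actually became harder, which is exactly the question raised above.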

Student and school-level motivation affects scores:

In fact, the EOC's executive summary does caution against comparing results across years due to policy changes connecting stakes to student performance (see page 6, paragraph 3). The testing company warned that outside factors can indeed influence proficiency rates.

Yet White touted year-to-year overall gains while the scores rose; he remained silent this year as overall Good+ showed no improvement for the third consecutive year.

State policy linked EOC results to students' grades and graduation eligibility beginning in 2010-2011. Needs Improvement rates dropped by nine points that year on each of the three existing tests and have remained close to those levels ever since. Good+ rose by seven or more points for each test that year.

Did these gains actually reflect increased student knowledge or did the students merely take the tests more seriously? What gains should we expect to see if schools develop better ways to motivate students to put forth more serious effort on test days?

Initial proficiency rates may be set for expedience rather than actual student achievement:

When a new test is introduced, the cut scores are set after the test is administered and the raw scores are in. But how do we know what percentage of students should rate Basic or Mastery on a new test?

Committees meet and set the achievement levels using human judgment. So, how do we know that outside influence does not creep into the score setting process?

The first three EOCs entered with very different initial proficiency rates (Good+). Algebra I - 36 percent; English II - 50 percent; Geometry - 34 percent.

The initial proficiency rates of subsequent tests, however, closely tracked the previous years' average proficiency rates. This would be an excellent strategy to prevent a new test from pulling the average significantly up or down.

The 2010 average proficiency rate of the three tests then in existence - Algebra I, English II, and Geometry - was 43 percent. In 2011, the Biology EOC was introduced at 43 percent proficiency.

The 2011 average proficiency rate rose to 49 percent in part due to the student motivation discussed above. In 2012, the English III EOC was introduced at 50 percent. Because of the limited number of questions on the test, that was as close as the new proficiency rate could have been set to match the previous year's average.

The 2012 average proficiency rate was 55 percent. In 2013, the U.S. History EOC was introduced at 53 percent. Again, it was as close as possible to the previous year's average.
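The pattern can be laid out side by side using only the percentages quoted above. This is a simple tabulation of those figures, not a statistical test, and it says nothing about intent.

```python
# Figures as quoted in the text: (year introduced, test, entry rate,
# previous year's average proficiency rate), all in percent.
later_tests = [
    (2011, "Biology", 43, 43),
    (2012, "English III", 50, 49),
    (2013, "U.S. History", 53, 55),
]

# Initial Good+ rates of the first three EOCs, as quoted in the text.
first_three_entry_rates = {"Algebra I": 36, "English II": 50, "Geometry": 34}

# How far each later test's entry rate sat from the prior year's average.
gaps = {test: entry - prior for _, test, entry, prior in later_tests}
largest_gap = max(abs(gap) for gap in gaps.values())

# Spread among the first three tests, for contrast.
first_three_spread = (max(first_three_entry_rates.values())
                      - min(first_three_entry_rates.values()))

print("Entry rate minus prior-year average:", gaps)
print("Largest gap for later tests:", largest_gap, "points")
print("Spread among the first three tests:", first_three_spread, "points")
```

The later tests all entered within two points of the previous year's average, while the first three tests entered as much as 16 points apart from one another.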

I have no "smoking gun" to prove that this was intentional, but we have been focused on ever-rising proficiency rates since the inception of No Child Left Behind. It also is consistent with White's method of combining results from different subjects into one average. I find the close correlation to be curious.

This level setting could have been done without penalty to students because the graduation requirement is only Fair, not the Good+ required to earn points toward the school performance scores.

The question it raises is - how much outside influence can be exerted on the setting of the initial proficiency rates when new tests are implemented? To what extent can the initial proficiency rates be shaped to produce a desired political result?

Redefining achievement levels opens the door for outside influence:

The EOC results will be redistributed in the future. LDOE has announced plans to switch from the current four level system to a five level system (Unsatisfactory, Approaching Basic, Basic, Mastery, Advanced). The EOCs are the end of the line of the standardized tests. Students do not need a five level system to predict performance on a future test.

This unnecessary change could raise otherwise stubbornly immovable proficiency rates. Will the raw score required for the new Mastery level match that required for the old Good level? The cut points will have to be redefined, again raising the question of what political influence might be exerted.

Furthermore, the English III EOC will be replaced by an English I EOC, introducing yet another new test and level-setting.

How will these changes affect overall proficiency rates?

Switching from pencil-and-paper tests to computer-based tests will require adjustment for our younger students:

Different formats pose different test-taking challenges, even if the questions are the same. One pencil-and-paper specific strategy is underlining key words or phrases in the test booklet to refer back to. Computer-based math equation editors can be awkward for students to use. I invite teachers to point out more examples in the comments.

It was shown that the PARCC pencil-and-paper version yielded higher scores than the computer-based version nationwide. However, the current disadvantage of the computer-based test may dissipate as students and teachers become familiar with the format.

It likely will take several computer-based administrations for teachers to fully form new computer-specific strategies and find the best ways to teach them to their students. As students and teachers become more familiar with the format, scores should rise beyond any actual improvement in student learning.

We are scheduled to move to computer-based tests exclusively for grades 5-8 this year. What shifts should we expect in this transition? Should we make different allowances for schools that did not administer the computer-based tests in 2016 than for those that did? What will initially rising proficiency rates really reflect - improved test taking skills or greater knowledge?

Final questions:

History shows that scores will rise on new tests for reasons other than actual improvements in student learning. How much score rise should be required to show that the reforms are working and that students have actually become smarter and are able to do more things?

How can we ensure that level setting and adjusting is untainted by political considerations?

What safeguards should we put in place to ensure that adjustments to the accountability formulas do not falsely promote improved student achievement?

Should we require LDOE to maintain historical records in easy public view to ensure transparency?

And most importantly, given that there are so many ways the measures can be tweaked to produce continued averaged gains, should we continue with a single summative measure of a school at all?

Herb Bassett, Grayson, LA

1 comment:

Anonymous said...

Many of my English III students have expressed concern regarding the computer-based EOC because of the limited numbers of lines of text they can view at a time within the text box. Additionally, many of them use the geography of the traditional page to help them establish landmarks for information they believe is important and may need to reference later. This is in addition to the physical marks they often make on the page to help them make meaning of the text.

Another point regarding student motivation when taking EOC tests is that not all EOCs must be passed. For example, students must pass EITHER the English II OR the English III EOC in order to graduate. As an English III teacher, I can say with a fair degree of certainty that roughly 10% of my English III students each year who have already passed the English II EOC (much thanks to the level setting described in your article) have no intention of putting forth the mental energy on yet ANOTHER standardized test. I have only received two students who have NOT already passed the required ONE EOC for English......TWO.....out of a couple hundred kids.