Computers scoring Texas students’ STAAR essay answers, state officials say

Texas students’ written responses on the STAAR test will most likely be scored by a computer, rather than a person.

Some education leaders are confused by the change and question how using this technology to assess essays will impact students and teachers. State officials say this system is not the same as the generative artificial intelligence that powers programs like ChatGPT, but a tool with narrow abilities that can improve scoring efficiency.

The Texas Education Agency quietly rolled out a new model for evaluating student answers on the State of Texas Assessments of Academic Readiness, or STAAR, in December. Roughly three-quarters of written responses are scored by an “automated scoring engine.”

Officials emphasized that these engines don’t learn beyond a single question and are programmed to emulate how humans would score an essay. The computer determines how to assess written answers after analyzing thousands of students’ responses that were previously scored by people.

The automated scoring engine is “programmed by humans, overseen by humans, and is analyzed at the end by humans,” said Jose Rios, director of the student assessment division.

The new scoring method comes amid a broader STAAR redesign. There’s now a cap on multiple-choice questions. The new test – which launched last year – includes essays at every grade level.

“These questions are very time consuming and laborious to score,” Rios said. “At the same time, you need to balance that with our commitment to produce results as fast as we can for districts. We needed to find a way to be more efficient.”

Agency officials estimated the new test format would require four to five times the number of human scorers, costing an extra $15 million to $20 million per year if they were to exclusively use people.

Only about one in four student responses will end up in front of a person’s eyes.

The rollout confused some education leaders, who said agency officials could have announced the move with more transparency.

“At the very least, they should do a pilot or study for a pretty long time,” State Board of Education member Pat Hardy, R-Fort Worth, said. “It’s an area that needs more exploration. … It just seems so cold.”

Dallas schools Superintendent Stephanie Elizalde said she only recently learned about the change and is left with questions about how the system was created and potential biases within it.

“It’s the same lesson, certainly, that I learn and want to improve on: That the more information we provide our communities, the better trusting relationships we build,” Elizalde said.

TEA officials say a technical report, with a detailed overview of the system, will be available later this year.

Other states have used this type of model for years, though not without criticism. In Ohio, for example, some districts said they spotted irregularities after tests were scored by computers, The Plain Dealer reported in 2018.

Cleveland’s newspaper reported that district officials began asking questions about scoring after a larger-than-expected number of student answers received zero points.

A similar question emerged in Texas, where large numbers of high schoolers received zeroes during the most recent STAAR test. State officials insist the automated scoring engines are not the reason for this.

Agency officials say they’re confident in their program.

The scoring engines were “successful in recreating the Spring 2023 results and were shown to be as accurate as human scorers,” they said. TEA officials did not provide information showing this result to The Dallas Morning News.

Humans validate about 25% of the answers scored by the computer. Essays are routed to human scorers under certain conditions or when the scoring engine expresses low confidence in its determination. A random sample of computer-scored responses is also audited by people.

“The low confidence responses are often those responses that are on the border between two score points,” according to a state document outlining the method. “The purpose of this routing is to ensure that unusual or borderline responses receive fair and accurate scores.”

All Spanish STAAR tests are scored by people. Agency officials said their automated scoring engines don’t work for languages other than English.

Les Perelman, a former Massachusetts Institute of Technology associate dean and longtime critic of automated essay scoring, said this differentiation concerns him.

“It’s inherently unequal,” he said.

The shift in scoring STAAR tests is part of a larger conversation about the role of technology in classrooms. How can teachers catch students using ChatGPT to write essays? What could be accomplished with AI tutors?

“Where I really see AI going in education is moving towards: how do we give really timely, useful feedback that is going to allow students to learn better?” said Peter Foltz, a University of Colorado, Boulder professor and director of the Institute for Student-AI Teaming.

Teacher concerns

Some educators were taken aback by the quiet introduction of this new scoring method. Schools are graded under the state’s academic accountability system largely based on how students perform on STAAR.

During the latest round of STAAR testing in the fall, a huge number of high school students scored poorly on the written questions. Roughly 8 in 10 written responses on the English II End of Course exam received zero points.

In the spring – the first iteration of the redesigned test, but scored only by humans – roughly a quarter of responses scored zero points in the same subject.

Many students who take STAAR in the fall are “re-testers” who did not meet grade level on a previous test attempt. Spring testers tend to perform better, according to agency officials who were asked to explain the spike in low scores in the fall.

Chris Rozunick, the director of the state’s assessment development division, said she understands why, given the timing, people connect the spike in zeroes to the rollout of automated scoring. But she insists that the two are unrelated.

“It really is the population of testers much more than anything else,” Rozunick said.

Observers’ skepticism may be fueled by Texas’ previous problems with STAAR technology.

In 2016, thousands of Texas students had difficulties logging in and staying online during writing exams, prompting state officials to void those results.

The TEA fined testing vendor ETS $5.7 million in damages and ordered the company to spend more than $15 million on improvements to its online system and test shipping.

In 2018, the state was forced to throw out 71,000 online STAAR exams after server problems caused crashes during April and May testing windows.

Three years later, the state saw more technology flare-ups during testing, with students in various districts kicked out of tests and unable to log back in. ETS’ contract ended that year.

TEA officials said they worked with their assessment vendors, Cambium and Pearson, to develop the automated scoring engines.

Perelman said one of his concerns with the trend toward machine scoring is that it “teaches students to be bad writers,” with teachers incentivized to instruct children on how to write to a computer rather than to a human. The problem, he said, is machines are “really stupid” when it comes to ideas.

About a decade ago, he made waves when he and others developed the “BABEL Generator,” which spit out incoherent essays that nonetheless scored well when evaluated by automated scoring engines.

Texas Education Agency officials said their scoring engines look for anomalies, such as if a student has not responded in English or wrote an answer of “unexpected length.” Those responses are sent for human scoring.

Foltz said automated scorers must be built with strong guardrails. He added that it’s not easy to coach students on how to game a scoring engine.

Texas’ standard of checking roughly 25% of essays with a human scorer provides “a pretty good margin of safety … to know that things are working,” he added.

More information about how Texas’ new system operates is expected in the coming months. Education observers are likely to look for whether the system appears biased toward any student group.

“There are also human biases there, and the computers may learn any biases that the humans may have,” Foltz said. “If there’s certain kinds of phrases that humans value more, the computer will tend to pick up on that.”

Staff writer Ari Sen contributed to this article.

The DMN Education Lab deepens the coverage and conversation about urgent education issues critical to the future of North Texas.

The DMN Education Lab is a community-funded journalism initiative, with support from Bobby and Lottye Lyle, Communities Foundation of Texas, The Dallas Foundation, Dallas Regional Chamber, Deedie Rose, Garrett and Cecilia Boone, The Meadows Foundation, The Murrell Foundation, Solutions Journalism Network, Southern Methodist University, Sydney Smith Hicks and the University of Texas at Dallas. The Dallas Morning News retains full editorial control of the Education Lab’s journalism.

©2024 The Dallas Morning News. Visit dallasnews.com. Distributed by Tribune Content Agency, LLC.
