A new study led by researchers at Wharton and Penn finds that generative AI can boost student performance while the tool is in hand, but it also makes it harder for students to learn and acquire new skills.
The researchers designed an experiment with nearly 1,000 high school math students in Turkey to determine whether large language models help or harm their education. One group of students was given GPT Base, a GPT-4-powered chat interface similar to ChatGPT, to help them during practice sessions. A second group was given GPT Tutor, a similar interface but with safeguards: it incorporates input from teachers and is designed to guide students with hints rather than giving them answers directly (a rough sketch of such a safeguard appears below).
The third group — the control group — had no technology assistance and relied only on traditional resources such as the textbook and notes.
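The paper does not publish GPT Tutor’s actual prompt or code, but safeguards of this kind are commonly implemented as instructions layered on top of the underlying model. The Python sketch below is purely illustrative of that pattern; the prompt wording, the model choice, and the tutor_reply helper are hypothetical assumptions, not the researchers’ implementation.

```python
# Illustrative sketch of a "tutor-mode" guardrail: a system prompt that tells
# the model to give hints instead of answers. The study does not publish GPT
# Tutor's implementation; the prompt wording, model name, and helper function
# here are all hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TUTOR_SYSTEM_PROMPT = """\
You are a math tutor helping a high school student practice.
Never state the final answer to a problem.
Instead, point out the relevant concept, ask a guiding question,
or reveal the next single step, then wait for the student to try.
If the student's work contains an error, identify where it occurs
without correcting it for them."""

def tutor_reply(student_message: str) -> str:
    """Return a hint-style response to one student message."""
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical choice; the study used a GPT-4-based interface
        messages=[
            {"role": "system", "content": TUTOR_SYSTEM_PROMPT},
            {"role": "user", "content": student_message},
        ],
    )
    return response.choices[0].message.content

print(tutor_reply("Solve 2x + 6 = 14. Is x = 4?"))
```

The design choice doing the work here is the system prompt: the model itself is unchanged, and the safeguard lives entirely in the instructions that frame each student message.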
During the AI-assisted practice session, the GPT Base group performed 48% better than the control group. But when AI assistance was taken away from the Base group and they were given an exam on the material, they performed 17% worse than the control group.
“We’ve been really interested in how humans interact with algorithms for a while.” — Hamsa Bastani
The GPT Tutor group performed an astonishing 127% better in the AI-assisted practice session, yet scored about the same on the exam as the control group.
According to the paper, the results suggest that the Base group depended on the software to solve the problems and didn’t learn the underlying mathematical concepts deeply enough to do well on the exam. In contrast, the Tutor group’s performance shows that these harms can be mitigated when AI is deployed with teacher-designed guardrails and limits.
“We’re really worried that if humans don’t learn, if they start using these tools as a crutch and rely on it, then they won’t actually build those fundamental skills to be able to use these tools effectively in the future,” said Hamsa Bastani, a Wharton professor of operations, information and decisions who co-authored the paper. “As educators, we worry about that.”
Bastani spoke to Wharton Business Daily about the paper, “Generative AI Can Harm Learning.” (Listen to the podcast.) The co-authors are Osbert Bastani, computer and information science professor at Penn Engineering; Alp Sungu, operations, information and decisions professor at Wharton; Haosen Ge, data scientist at the Wharton AI and Analytics Initiative; Özge Kabakcı, math teacher at Budapest British International School; and independent researcher Rei Mariman.
The Generative AI Paradox and How It Impacts Education
The paper’s finding is consistent with similar studies, and Hamsa Bastani said it reflects the paradox of generative AI: It can make tasks easier for people while simultaneously eroding their ability to learn the skills required to solve those tasks.
“We’ve been really interested in how humans interact with algorithms for a while. But I think it gets really interesting with large language models just because of the extent of their reach and the number of people who are using them with such a diversity of tasks,” she said. “One thing that really drew us to this conversation was a lot of teachers are struggling with students copying answers from homework, and they were worried that this would negatively impact their skill-building and their fundamental understanding of concepts. That’s why we decided to dig into this.”
“If we use it sort of lazily and … completely trust the machine learning model, then that’s when we could be in trouble.” — Hamsa Bastani
The study also found that students who used AI assistance were overly optimistic about how much they had learned, even the high-achieving students. Teachers, on the other hand, seem overly concerned and tend to dismiss the advantages of AI. Bastani thinks that’s because neither students nor teachers have yet been trained on how to use AI effectively to augment traditional teaching methods.
Bastani and her colleagues said the study is a “cautionary tale” about deploying AI in educational settings, and they remind everyone that the software still has significant limitations. ChatGPT, for example, is known to produce false information, called hallucinations, which can also potentially harm student learning.
Just like in a workplace setting, generative AI in the classroom still requires a lot of human finesse and fact-checking to make it valuable, Bastani said.
“If we are thinking of this tool as an assistant and doing the higher-level tasks, checking its outputs and so on, it can be a huge benefit,” she said. “But if we use it sort of lazily and kind of outsource the work that we’re supposed to be doing and completely trust the machine learning model, then that’s when we could be in trouble.”