September 24, 2024
What happens when law students go head-to-head with GenAI?
New research reveals academics should not be overly confident in their ability to identify AI-generated work
Can generative artificial intelligence (GenAI) outsmart 90 per cent of law students? It’s a claim OpenAI made in 2023 when it announced that GPT-4 scored higher than 90 per cent of human test takers on a simulated version of the US bar exam.
It’s an interesting concept and something that Dr Armin Alimardani, a lecturer in Law and Emerging Technologies at the University of Wollongong School of Law, and a consultant at OpenAI, wanted to investigate further.
His findings were surprising and form the basis of his new paper, Generative Artificial Intelligence vs. Law Students: An Empirical Study on Criminal Law Exam Performance, published today in the Journal of Law, Innovation and Technology.
“The OpenAI claim was impressive and could have significant implications in higher education, for instance, does this mean students can just copy their assignments into generative AI and ace their tests?” Dr Alimardani said.
But Dr Alimardani had his doubts about OpenAI’s claims.
“Many of us have played around with generative AI models and they don’t always seem that smart, so I thought why not test it out myself with some experiments.”
The experiment
In the second semester of 2023, Dr Alimardani was the subject coordinator of Criminal Law. He prepared the end-of-semester exam question and generated five AI answers using different versions of ChatGPT, then another five using a variety of prompt engineering techniques to enhance the responses.
“My research assistant and I handwrote the AI-generated answers in different exam booklets under fake student names and numbers. These booklets were indistinguishable from the real ones,” Dr Alimardani said.
After the Criminal Law exam was held at the end of the semester, Dr Alimardani mixed the AI-generated papers in with the real student papers and handed them to tutors for grading.
“Each tutor unknowingly received and marked two AI papers and my mission impossible was accomplished.”
The results
Dr Alimardani said the exam was marked out of 60 and that 225 students took the test. The average mark was about 40 (around 66 per cent).
“For the first lot of AI papers, which didn’t use any special prompting techniques, only two barely passed and the other three failed,” Dr Alimardani said.
“The best-performing paper beat only 14.7 per cent of the students. So this small sample suggests that if students simply copied the exam question into one of the OpenAI models, they would have about a 40 per cent chance of passing.”
The other five papers, which used prompt engineering techniques, performed better.
“Three of the papers weren’t that impressive but two did quite well. One of the papers scored 44 (about 73 per cent) and the other scored 47 (about 78 per cent),” Dr Alimardani explained.
“Overall, these results don’t quite match the glowing benchmarks from OpenAI’s United States bar exam simulation and none of the 10 AI papers performed better than 90 per cent of the students.”
What does this mean for students and educators?
Dr Alimardani said none of the tutors suspected any of the papers were AI-generated, and they were genuinely surprised when they found out.
“Three of the tutors admitted that even if the submissions had been online, they wouldn’t have caught it. So if academics think they can spot an AI-generated paper, they should think again.”
The other issue Dr Alimardani initially thought might come up was ‘hallucination’ – the presence of fabricated information. However, the research found the models stayed on track with the legal principles and facts provided in the exam.
Instead, Dr Alimardani said the real problem was ‘alignment’ – the degree to which AI-generated outputs match the user’s intentions.
“The AI-generated answers weren’t as comprehensive as we expected. It seemed to me that the models were fine-tuned to avoid hallucination by playing it safe and providing less detailed answers,” Dr Alimardani said.
“My research shows that people shouldn’t get too excited about the performance of GenAI models in benchmarks. The reliability of benchmarks may be questionable, and the way they evaluate models could differ significantly from how we evaluate students.”
But Dr Alimardani said his findings imply that graduates who know how to work with AI could have an advantage in the job market.
“Prompt engineering can significantly enhance the performance of GenAI models, and therefore it is more likely that future employers would have higher expectations regarding students' GenAI proficiency.
“It’s likely students will be increasingly assessed on their ability to collaborate with AI to complete tasks more efficiently and with higher quality.”
Dr Alimardani said his research has wider implications for educators.
“Law schools and educators in other disciplines must focus on developing critical analysis skills so students can collaborate more effectively with AI. But it’s ultimately up to universities to make sure students learn the necessary knowledge and skills before they collaborate with AI.”
About the research
Generative Artificial Intelligence vs. Law Students: An Empirical Study on Criminal Law Exam Performance, by Dr Armin Alimardani, was published today in the Journal of Law, Innovation and Technology.