AI has outscored law professors at answering legal questions

A Stanford-led study found law professors preferred AI answers to peer-written legal responses roughly three times out of four. That is no small result for law schools, legal tech startups, or any firm still treating professional AI as a distant risk.

The uncomfortable part of the study is not that AI did well. Most people watching legal tech expected that. The uncomfortable part is that the margin was so wide, and that the humans judging the answers were law professors themselves.

According to the newly published Stanford Law School paper led by Julian Nyarko, 16 instructors from 14 U.S. law schools took part in a blind evaluation of short-answer tutoring in first-year Contracts. They reviewed anonymized pairs of responses to student-style legal questions and chose which answer they would rather give to a student. The AI responses won at a median rate of 75.81% in human-versus-LLM trials.

That is a serious result because contract law is not a clean multiple-choice exercise. The study used 40 questions across recall of cases or code, doctrinal recall, hypotheticals, and policy. Some questions had clear answers. Others required judgment, weighing competing arguments, or explaining why a legal rule works the way it does. This is exactly the kind of work many professionals assumed would remain protected because it depends on nuance.

The researchers tested whether the AI advantage came from writing polish rather than substance. That matters, because large language models often sound confident even when they are thin on reasoning. The paper looked at answer length, structure, legal anchors, confidence, clarity, reasoning nuance, and pedagogical support. Those measured features did not fully explain why professors preferred the AI responses.

Gemini 2.5 Pro and NotebookLM were the main systems in the human-judged comparison. NotebookLM outperformed every participating instructor except one, with whom it tied, while Gemini ranked ahead of NotebookLM in the pooled statistical model. In a scaled model-judged analysis, Claude Opus 4.7, ChatGPT 5.4, and Gemini 2.5 Pro ranked above human instructors on average, and every AI model evaluated beat the instructor group.

The study also found that professors flagged AI answers as pedagogically harmful at much lower rates than peer-written answers. Gemini was flagged at 3.41% and NotebookLM at 3.64%, while human instructors showed wider dispersion, with rates ranging from 1.00% to 39.75%. That does not mean AI tutoring is ready to replace faculty judgment. It does mean the old warning that AI is uniquely risky for students now has to compete with evidence that human answers can be uneven too.

Legal tech investors will notice

For startups, the most important signal is where the models performed well. The questions were not about filing forms or summarizing boilerplate. They were about office-hours-style legal explanation, the kind of interaction that sits close to teaching, training, research support, and early advisory work. If AI can answer those questions at a level professors prefer, then the addressable market for legal AI is larger than document automation alone.

That points directly at products for law firms, in-house legal departments, bar preparation, continuing legal education, and law school support. A legal tech startup no longer has to argue only that AI saves time on low-value tasks. It can argue that well-designed systems may improve access to high-quality legal reasoning, especially for users who cannot get immediate access to a senior lawyer or professor.

There is still a difficult commercial lesson inside the paper. The stock Gemini 2.5 Pro outperformed NotebookLM even though NotebookLM had access to the casebook. A commercial AI tutor built on Gemini 2.5 Pro and grounded in the casebook also ranked below the stock model in the scaled analysis. That should make founders cautious. Adding retrieval, course materials, and product layers does not automatically improve output. Sometimes it adds noise.

This is where legal AI companies will have to be sharper than the pitch deck version of the market. The winners will not simply wrap a model and call it a tutor. They will need tight evaluation, careful prompting, good source handling, and workflows that know when to defer to a human. The opportunity is real, but the product discipline has to be real too.

Law schools cannot ignore this

Law schools and bar associations have been cautious for good reasons. Hallucinated cases, weak citations, student overreliance, and privacy concerns are not imaginary problems. The profession also has duties that do not disappear because a model performs well in a blind comparison. Lawyers still owe competence, confidentiality, and judgment to clients.

But blanket skepticism is becoming harder to defend. If students are already using general-purpose tools, the practical question is not whether AI enters legal education. It already has. The question is whether institutions teach students how to use it well, test them in ways that still measure their own judgment, and update professional rules before the market writes the rules for them.

The next phase will not be a simple fight between professors and machines. It will be about which parts of legal work become cheaper, faster, and more widely available, and which parts become more valuable because they still require human accountability. For legal tech investors, that is the market to watch. For law schools, it is the curriculum problem arriving sooner than expected.
