title: "AI Agents in Legal: What Works, What Fails, and What the Sanctions Data Actually Shows"
slug: ai-agents-legal-tech-guide
date: 2026-05-01
type: guide
category: real-world-ai
tags: [guides, real-world-ai, enterprise-deployment, legal-tech]
status: draft
In 2023, attorneys Steven Schwartz and Peter LoDuca submitted a brief in a federal case citing six cases that did not exist. ChatGPT had invented them. When the opposing party asked for copies, the attorneys submitted fabricated pages. That June, a judge sanctioned them $5,000 and required them to personally notify each judge whose name had been forged onto fictional rulings.
That case, Mata v. Avianca, is now the canonical warning label on AI in legal work. The problem is that more than two years have passed since that ruling, and the warning has not slowed adoption. As of October 2025, a database maintained by Arizona State University's Cronkite News had documented 486 cases of AI-generated hallucinated citations in legal filings (324 in U.S. courts alone), with at least 200 new incidents recorded in 2025. Two judges are among the 128 implicated parties.
At the same time, 74,000 attorneys are on Harvey AI's platform. Corporate legal departments' AI adoption rate jumped from 23% to 52% in a single year. The legal AI market doubled from $1.5 billion in 2024 to over $3 billion in 2025.
Both narratives are true. The question for any legal team evaluating AI agents is which one applies to their situation, and whether the difference is about the task, the tool, or the workflow.
The Legal AI Stack in 2026
The tools that law firms and legal departments actually use today fall into three categories:
Specialized legal AI platforms. Harvey AI, Thomson Reuters CoCounsel (originally developed by Casetext), and LexisNexis Protégé sit in this tier. They integrate with proprietary legal databases, are trained on case law and contract data, and are sold with enterprise data-handling agreements. According to Thomson Reuters case studies, CoCounsel's agentic workflows can complete tasks that "previously took an hour in five minutes or less." LexisNexis Protégé deploys four specialized agents (an orchestrator, a legal research agent, a web search agent, and a customer document agent) that work together on complex multi-step tasks.
Task-specific document tools. Kira (now part of Litera) is trained on 1,400 contract provision types and claims 90% or higher accuracy on clause extraction, refined over a decade of proprietary training data. Relativity aiR handles eDiscovery document review. Everlaw processes up to 900,000 documents per hour for upload and search. These tools are narrower than the platform plays, but they have longer track records in specific workflows.
General-purpose AI tools. ChatGPT remains the most-used AI tool among legal professionals: the ABA's 2024 Technology Survey found that 52.1% of lawyers who use AI report using ChatGPT, compared with 26% for CoCounsel and 24.3% for Lexis+ AI. This is where most of the sanctions cases originate.
A&O Shearman, one of the ten largest law firms in the world, was Harvey's launch partner. Its 3,500 lawyers across 43 offices ran 40,000 queries during the initial trial phase. In 2025, the firm announced a second phase targeting complex agentic workflows: multi-step tasks requiring sustained reasoning across large document sets.
JPMorgan has 450 or more agentic AI workflows running daily in its legal and compliance operations. Norm AI launched an AI-native law firm in 2025 where licensed attorneys supervise agents that execute the substantive legal work.
What Legal AI Agents Actually Do
The ABA survey found that document review and eDiscovery is the most common use case among firms that have adopted AI (77% of AI users). That makes sense: document review is high-volume, repetitive, and expensive. An experienced associate billing $400 per hour reviewing 300 documents a day is an obvious automation target.
Here is a task-by-task breakdown of where AI is replacing human labor, where it is assisting, and where it remains genuinely unreliable.
Contract Review and Extraction
What works: Clause extraction and comparison against standard templates. Kira's 90%+ accuracy on standard provisions reflects the state of the art for trained, narrow models. On the Contract Understanding Atticus Dataset (CUAD), the best models reach 95-97% accuracy on clause identification. These numbers hold up reasonably well for standard commercial contracts.
What doesn't: Materiality assessment. An AI can flag that a limitation of liability clause is absent or non-standard; it cannot tell you whether that absence matters given the client's risk tolerance, the counterparty's creditworthiness, or the deal dynamics. That judgment layer remains human work.
Practical ceiling: LinkSquares reports customers processing NDAs approximately 400% faster and cutting outside counsel costs by 40%. The Association of Corporate Counsel (ACC) finds that AI-assisted review is 3-5x faster than manual review for typical contracts. These figures are credible for standardized, high-volume agreements. For complex bespoke agreements, gains are smaller and the risk of missed nuance is higher.
Legal Research
What works: Retrieving cases, locating statutes, summarizing holdings, identifying circuit splits, and mapping citation networks. CoCounsel's case studies consistently show research tasks compressed from an hour to under ten minutes. Harvey processes 700,000 or more tasks daily across its platform, and legal research is the highest-volume category.
What doesn't: Novel questions. When a legal question lacks direct precedent (a first-impression issue, a cross-jurisdictional regulatory question, an emerging technology dispute), AI performance degrades fast. The model is pattern-matching against existing law; when the pattern doesn't exist, the output becomes unreliable. This is the exact scenario in which Mata v. Avianca happened: the attorney was looking for cases in a niche area of aviation law, and ChatGPT generated plausible-sounding results for cases that did not exist.
eDiscovery and Document Review
What works: Relevance coding, privilege flagging, deduplication, document clustering, and issue tagging. Everlaw and Relativity operate at scales no human team can match. For productions involving hundreds of thousands of documents, AI first-pass review followed by attorney spot-checking is now standard practice at large firms.
What doesn't: Final privilege calls and context-dependent responsiveness decisions. AI can flag a document as potentially privileged; only an attorney can confirm that the privilege applies and decide whether to withhold or produce. Errors in this area can result in inadvertent waiver.
Due Diligence
What works: Systematic extraction of defined terms, change of control provisions, assignment restrictions, and key dates across large contract volumes. In M&A transactions where a target may have thousands of vendor, customer, and employment agreements, AI can produce a structured issues summary faster than a team of associates.
What doesn't: Significance weighting. The AI will flag every indemnification clause; it cannot rank them by business risk exposure. It will identify every termination right; it cannot tell you which ones will actually be exercised by the counterparty.
The Accuracy Math You Need to Do
A 95% accuracy rate on contract clause extraction sounds impressive until you run the numbers at scale.
A typical M&A due diligence exercise might involve 1,500 agreements. At 20 clauses per agreement reviewed, that's 30,000 data points. At 95% accuracy, 1,500 of those data points are wrong. At 97%, the number is 900 errors. In a transaction context where a single missed assignment restriction can block a deal, this is not an acceptable error rate without attorney review of the flagged items.
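The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming errors are spread evenly across all extracted clauses; the function name `expected_errors` is illustrative and does not come from any vendor tool.

```python
def expected_errors(agreements: int, clauses_per_agreement: int, accuracy: float) -> int:
    """Expected number of wrong data points at a given per-clause accuracy."""
    data_points = agreements * clauses_per_agreement
    # Round to the nearest whole error to absorb floating-point noise.
    return round(data_points * (1 - accuracy))


# The due diligence example from the text: 1,500 agreements, 20 clauses each.
for accuracy in (0.95, 0.97):
    errors = expected_errors(1500, 20, accuracy)
    print(f"{accuracy:.0%} accuracy: {errors} expected errors out of {1500 * 20} data points")
```

Changing the inputs to match a real matter gives a rough count of how many extracted items still need attorney eyes.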
The accuracy figures in vendor marketing materials are measured against standardized benchmark datasets, not against audited production results. Independent, peer-reviewed accuracy benchmarks for production legal AI deployments are rare. The Cronkite News database of 486 hallucination incidents is the closest thing to independent performance data available, and it covers only the cases that resulted in court filings and sanctions: the visible tip of the failure distribution.
This does not mean the tools are not useful. It means the right workflow is AI-assisted, not AI-autonomous. AI reduces the volume of material that needs attorney attention; it does not eliminate the need for attorney attention.
Privilege, Confidentiality, and Professional Responsibility
These three constraints are structural, not temporary.
Privilege: The International Bar Association has formally flagged that sharing client materials with public AI tools may constitute disclosure to a "digital stranger," which could waive attorney-client privilege. Privilege can be lost the moment material is shared with a party not covered by the privilege relationship. Enterprise AI deployments with proper data-processing agreements substantially reduce this risk; consumer tools (including much of the ChatGPT use reported by 52.1% of AI-using lawyers) do not.
Confidentiality: Model providers' terms of service vary significantly on data retention and training use. Lawyers using general-purpose tools without enterprise agreements may be inadvertently training future model versions on client data. This is a compliance question in every jurisdiction with active bar rules on confidentiality.
Professional responsibility: This is the hardest constraint. Competency obligations are non-delegable. A lawyer cannot shift blame for a hallucinated citation to the AI tool. The ABA Model Rules and state equivalents require attorneys to supervise all work product, understand the tools they use (including their limitations), and verify all factual and legal assertions before filing. Twenty-one percent of firms had formal AI adoption policies as of 2025 reporting; the other 79% are exposed.
Where to Start: Highest-ROI, Lowest-Risk Entry Points
Not all legal AI is equally risky. The highest-ROI, lowest-risk entry points share two properties: the output is verifiable, and errors are catchable before they cause harm.
NDA and standard agreement review. Volume is high, variation is limited, and templates give reviewers a clear baseline. Errors in first-pass review are caught in subsequent attorney review. This is the clearest win case.
eDiscovery first-pass coding. The stakes are high enough that human review of AI output is standard practice anyway. AI accelerates the process; attorneys retain final authority. Firms that have standardized on Relativity aiR or Everlaw report substantial reductions in review hours without corresponding increases in error rates.
Contract data extraction for CLM systems. Extracting dates, parties, and defined terms for entry into contract lifecycle management databases is mechanical work. AI handles this well; the extracted data is auditable before it enters the system.
Legal research summarization. Use AI to locate and summarize cases; use a lawyer to verify every citation against the primary source before any filing. The verification step costs minutes and removes the fabricated-citation risk behind these sanctions.
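The verification step in the last item can be enforced as a simple gate before filing. The sketch below is one hypothetical way to model it: the `Citation` type, the `verified_by` field, and the `ready_to_file` check are assumptions made for illustration, not features of any product named here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Citation:
    """A citation pulled from an AI-assisted draft (hypothetical structure)."""
    case_name: str
    reporter_cite: str
    # Name of the attorney who checked the primary source; None means unchecked.
    verified_by: Optional[str] = None


def unverified(citations: List[Citation]) -> List[Citation]:
    """Return every citation that no attorney has checked against the source."""
    return [c for c in citations if not c.verified_by]


def ready_to_file(citations: List[Citation]) -> bool:
    """A draft is fileable only when every citation has a named verifier."""
    return not unverified(citations)


# Placeholder data, not real cases.
draft = [
    Citation("Example v. Placeholder", "000 F.0d 000", verified_by="A. Attorney"),
    Citation("Sample v. Illustration", "000 F.0d 001"),
]

for c in unverified(draft):
    print(f"Blocked: {c.case_name} has not been checked against the primary source")
print("Ready to file:", ready_to_file(draft))
```

The design choice is that verification is recorded per citation with a named reviewer, so an unchecked citation blocks the whole filing rather than slipping through on a sampled review.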
The Mata-type cases share a different profile: a practitioner in a niche area, under time pressure, using a consumer tool without enterprise safeguards, who treats the AI's output as finished work product. The tool, the task, and the workflow all fail together.
See "When NOT to Use an Agent" for the broader pattern of how automation failures cluster in exactly these high-pressure, low-oversight scenarios.
The Agentic Turn and Its New Risks
Everything above describes what might be called first-generation legal AI: tools that complete discrete tasks in response to discrete prompts. CoCounsel summarizes a document. Kira extracts a clause. Harvey answers a research question.
Agentic legal AI is different in kind. A&O Shearman's 2025 agentic rollout targets complex multi-step workflows: a single instruction triggers an agent that plans a sequence of subtasks, executes them across multiple document sets and data sources, and returns a synthesized output. LexisNexis Protégé's four-agent architecture is the production version of this approach.
Multi-step agent chains introduce failure modes that single-step tools do not have:
Compounding errors. An error in step two of a five-step workflow propagates through the remaining steps. In a research task, a misidentified case in the initial search shapes every subsequent analytical step.
Harder audit trails. When a lawyer reviews a document summary, the review path is clear. When an agent has orchestrated ten subtasks across two databases and three document collections, tracing why a specific conclusion was reached requires examining the full chain, and most current agent frameworks do not make that easy.
Out-of-scope actions. Autonomous agents that take actions in external systems (filing, sending communications, updating databases) create exposure that read-only tools do not. The legal sector has not yet produced binding professional responsibility guidance on autonomous agent actions; the liability question is open.
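The compounding-error point can be made concrete with a rough calculation. This sketch assumes every step succeeds independently with the same probability, which real agent chains will not satisfy exactly, and the 95% per-step figure is purely illustrative rather than a measured value.

```python
def chain_success_rate(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return step_accuracy ** steps


# Illustrative per-step accuracy of 95% across chains of different lengths.
for steps in (1, 5, 10):
    rate = chain_success_rate(0.95, steps)
    print(f"{steps:>2} steps at 95% each: {rate:.1%} chance of an error-free run")
```

Under those assumptions, a five-step chain of 95%-reliable steps finishes without any error only about 77% of the time, which is why per-step review checkpoints matter more as workflows get longer.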
"Agent Reliability Scores Are Getting Worse, Not Better" covers the general reliability picture across production agent deployments. The legal context adds professional responsibility exposure to every failure mode.
The Insurance Gap
Professional liability policies were written before AI was a significant legal tool. Most do not explicitly address AI-generated errors, and insurers have begun exploiting that ambiguity.
The ABA Journal's 2025 analysis found that insurers are adding AI exclusions, verification requirements, and competency maintenance clauses to new policies. A lawyer who uses AI without adequate verification and faces a malpractice claim may find their policy does not cover the loss. The threshold question is whether the lawyer exercised competent supervision of the AI output, and it is the same question courts are already answering in sanctions proceedings.
This is not a reason to avoid AI. It is a reason to document the verification workflow, maintain clear records of which outputs were AI-assisted and what review they received, and check current policy terms before expanding AI use.
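One lightweight way to keep such records is an append-only log with one entry per reviewed output. The following is a sketch under stated assumptions: the `ReviewRecord` fields, the `make_record` helper, and the JSON-lines format are all hypothetical choices, not requirements from any insurer or bar rule.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ReviewRecord:
    """One entry describing an AI-assisted output and the review it received."""
    matter_id: str
    document: str
    ai_tool: str
    reviewed_by: str
    review_scope: str
    reviewed_at: str  # ISO 8601 timestamp in UTC


def make_record(matter_id: str, document: str, ai_tool: str, reviewed_by: str,
                review_scope: str, now: Optional[datetime] = None) -> ReviewRecord:
    """Build a record, stamping it with the current UTC time unless one is given."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return ReviewRecord(matter_id, document, ai_tool, reviewed_by, review_scope, timestamp)


def to_log_line(record: ReviewRecord) -> str:
    """Serialize a record as one JSON line, suitable for an append-only log file."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values for illustration only.
record = make_record(
    matter_id="M-0001",
    document="draft_brief_v3.docx",
    ai_tool="research assistant",
    reviewed_by="A. Attorney",
    review_scope="all citations checked against primary sources",
)
print(to_log_line(record))
```

Appending each line to a file or a database table gives a dated trail of who reviewed what, which is the kind of evidence a supervision question would turn on.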
What's Next in Legal AI
The direction is clear. Harvey's $100 million in annual recurring revenue and 4x year-on-year user growth reflect a category that has moved past early adoption. The ABA survey's jump from 11% to 30% AI adoption in a single year, with large firms approaching 50%, means the profession is already stratifying between AI-equipped practitioners and those who aren't.
The leading indicator is Norm AI's AI-native law firm model: licensed attorneys supervising agents that execute the substantive work. This is not a prediction about where most law firms will be in five years. It is the architecture that the most productive legal operations will converge on, because the economics are compelling. An attorney supervising ten concurrent agent workflows is providing coverage that would require a team of associates under traditional staffing models.
The firms that will get this right are the ones that treat AI output as evidence to be verified, not conclusions to be filed, and that build workflows making verification fast enough that it happens every time. The 486 documented hallucination cases show what happens when verification is skipped. The Harvey and CoCounsel adoption numbers show what happens when it is not.
The gap between those two outcomes is workflow design, not model capability.
See "Deploying AI Agents to Production" for the general framework, and "AI Agent ROI: The Calculator and Framework That Cuts Through Vendor Math" for how to evaluate the cost structure before committing to a legal AI platform.
Sources: Harvey AI; A&O Shearman agentic AI announcement; ABA 2024 Technology Survey (LawSites); 486 AI hallucination cases (Cronkite News/ASU); Mata v. Avianca sanctions (ABA Journal); Kira contract AI (Litera); Relativity aiR; Thomson Reuters CoCounsel; IBA on privilege and AI; LexisNexis Protégé; California $10,000 sanction (CalMatters); Lawyers dinged for opponents' fake citations (LawNext); AI malpractice insurance gaps (ABA Journal); DLA Piper on agentic AI risks; Enterprise AI pilot failure rate context