In June 2023, a federal judge sanctioned attorneys Steven Schwartz and Peter LoDuca for submitting a brief that cited cases that did not exist. ChatGPT had invented them. When opposing counsel could not locate the cases and the court ordered copies, the attorneys submitted fabricated excerpts. The sanctions order in Mata v. Avianca documents that failure; it imposed a $5,000 penalty and required the attorneys to notify the judges who had been falsely named as authors of the fake opinions.

The CSV download from Damien Charlotin's AI Hallucination Cases database, last updated June 11, 2026, lists well over a thousand cases so far. Adoption has continued alongside the warnings, and that database shows the problem is not isolated to a single filing. Vendor case studies also report strong usage figures, but those should be treated as vendor-reported examples rather than universal adoption rates.

Both narratives are true: the sanctions and the adoption. The question for any legal team evaluating AI agents is which one applies to their situation, and whether the difference is about the task, the tool, or the workflow.


The tools that law firms and legal departments actually use today fall into three categories:

Specialized legal AI platforms. Harvey AI, Thomson Reuters CoCounsel (originally developed by Casetext), and LexisNexis Protégé sit in this tier. Harvey describes itself as AI software for legal and professional services, and A&O Shearman said its Harvey rollout targets complex legal workflows. Vendor case studies describe productivity gains on selected tasks, but those claims should be evaluated as vendor-reported examples rather than universal performance guarantees. LexisNexis describes Protégé as using multiple specialized agent roles for complex research and document workflows.

Task-specific document tools. Kira (now Litera), Relativity aiR, and Everlaw are narrower than the platform plays, but they have longer track records in specific workflows such as contract review and eDiscovery. Vendor materials cite high throughput and accuracy in defined settings; teams should validate those claims on their own matter types before relying on them.

General-purpose AI tools. ChatGPT remains prominent in legal AI usage surveys, while CoCounsel and Lexis+ AI appear as common specialist alternatives. This general-purpose tier is where many of the sanctions cases originate, because consumer tools are easier to use without legal-specific safeguards.

A&O Shearman was an early Harvey partner. The firm later announced a second phase targeting complex agentic workflows: multi-step tasks requiring sustained reasoning across large document sets.

JP Morgan has many agentic AI workflows running daily in its legal and compliance operations. Norm AI launched an AI-native law firm where licensed attorneys supervise agents that execute the substantive legal work.


The ABA survey found that document review and eDiscovery is the most common use case among firms that have adopted AI. That makes sense: document review is high-volume, repetitive, and expensive. An experienced associate billing high hourly rates while reviewing large document volumes is an obvious automation target.

Here is a task-by-task breakdown of where AI is replacing human labor, where it is assisting, and where it remains genuinely unreliable.

Contract Review and Extraction

What works: Clause extraction and comparison against standard templates. Kira's reported accuracy on standard provisions reflects the state of the art for trained, narrow models. On the Contract Understanding Atticus Dataset (CUAD), the best models reach high clause-identification accuracy. These results hold up reasonably well for standard commercial contracts.

What doesn't: Materiality assessment. An AI can flag that a limitation of liability clause is absent or non-standard; it cannot tell you whether that absence matters given the client's risk tolerance, the counterparty's creditworthiness, or the deal dynamics. That judgment layer remains human work.

Practical ceiling: LinkSquares reports customers cutting NDA processing time and outside counsel costs substantially. The ACC finds that AI-assisted review is multiple times faster than manual review for typical contracts. These figures are credible for standardized, high-volume agreements. For complex bespoke agreements, gains are smaller and the risk of missed nuance is higher.

Legal Research

What works: Retrieving cases, locating statutes, summarizing holdings, identifying circuit splits, and mapping citation networks. CoCounsel's case studies consistently show research tasks compressed from an hour to a much shorter workflow. Harvey processes very high daily task volumes across its platform, and legal research is the highest-volume category.

What doesn't: Novel questions. When a legal question lacks direct precedent (a first-impression issue, a cross-jurisdictional regulatory question, an emerging technology dispute), AI performance degrades fast. The model is pattern-matching against existing law; when the pattern doesn't exist, the output becomes unreliable. That is the scenario behind Mata v. Avianca: the attorney was researching a niche area of aviation law, and ChatGPT generated plausible-sounding citations to cases that did not exist.

eDiscovery and Document Review

What works: Relevance coding, privilege flagging, deduplication, document clustering, and issue tagging. Everlaw and Relativity operate at scales no human team can match. For productions involving hundreds of thousands of documents, AI first-pass review followed by attorney spot-checking is now standard practice at large firms.

What doesn't: Final privilege calls and context-dependent responsiveness decisions. AI can flag a document as potentially privileged; only an attorney can confirm that the privilege applies and decide whether to withhold or produce. Errors in this area can result in inadvertent waiver.
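The first-pass-then-spot-check pattern described above can be sketched in a few lines. This is a minimal illustration, not any vendor's method: the function names, document IDs, sample size, and review results are all hypothetical, and a real protocol would set the sample size with statistical and case-specific guidance.

```python
import random


def draw_spot_check(doc_ids: list[str], sample_size: int, seed: int) -> list[str]:
    """Pick a reproducible random sample of AI-coded documents for attorney review."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-created later
    return rng.sample(doc_ids, min(sample_size, len(doc_ids)))


def observed_error_rate(attorney_overturned: dict[str, bool]) -> float:
    """Share of sampled documents where the attorney overturned the AI's coding."""
    if not attorney_overturned:
        raise ValueError("no reviewed documents")
    return sum(attorney_overturned.values()) / len(attorney_overturned)


# Hypothetical example: 10,000 documents the AI coded as not responsive.
ai_not_responsive = [f"DOC-{n:05d}" for n in range(10_000)]
sample = draw_spot_check(ai_not_responsive, sample_size=400, seed=7)

# In practice these values come from the attorney's review; here they are made up.
review = {doc_id: False for doc_id in sample}
review[sample[0]] = True  # pretend the attorney overturned one call

print(f"{observed_error_rate(review):.2%} of the sample was overturned")
```

The seed is recorded so the sample can be reproduced if the review is later challenged. If the observed rate comes back higher than the team is willing to accept, the usual response is to widen attorney review rather than trust the remaining AI calls.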

Due Diligence

What works: Systematic extraction of defined terms, change of control provisions, assignment restrictions, and key dates across large contract volumes. In M&A transactions where a target may have thousands of vendor, customer, and employment agreements, AI can produce a structured issues summary faster than a team of associates.

What doesn't: Significance weighting. The AI will flag every indemnification clause; it cannot rank them by business risk exposure. It will identify every termination right; it cannot tell you which ones will actually be exercised by the counterparty.


The Accuracy Math You Need to Do

A high accuracy rate on contract clause extraction sounds impressive until you run the numbers at scale.

A typical M&A due diligence exercise can involve a very large number of agreements and clause checks. Even a small miss rate can leave many errors on the table, and in a transaction context where a single missed assignment restriction can block a deal, that is not acceptable without attorney review of the flagged items.
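To make that arithmetic concrete, here is a minimal sketch. The agreement count, checks per agreement, and miss rates are hypothetical placeholders, not figures from any vendor or benchmark, and the calculation assumes misses are independent and evenly spread, which real document sets may not honor.

```python
def expected_misses(agreements: int, checks_per_agreement: int, miss_rate: float) -> float:
    """Expected number of missed clause checks under a uniform miss rate."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return agreements * checks_per_agreement * miss_rate


# Hypothetical deal: 2,000 agreements with 20 clause checks each.
AGREEMENTS = 2_000
CHECKS_PER_AGREEMENT = 20

for miss_rate in (0.01, 0.03, 0.05):
    misses = expected_misses(AGREEMENTS, CHECKS_PER_AGREEMENT, miss_rate)
    total = AGREEMENTS * CHECKS_PER_AGREEMENT
    print(f"miss rate {miss_rate:.0%}: roughly {misses:,.0f} missed checks out of {total:,}")
```

Even at the lowest of these assumed rates, the number of missed checks is in the hundreds, which is why the flagged items still need attorney review.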

The accuracy figures in vendor marketing materials are measured against standardized benchmark datasets, not against audited production results. Independent, peer-reviewed accuracy benchmarks for production legal AI deployments are rare. The AI Hallucination Cases database is the closest thing to independent performance data available, and it covers only the cases that resulted in court filings and sanctions: the visible tip of the failure distribution.

This does not mean the tools are not useful. It means the right workflow is AI-assisted, not AI-autonomous. AI reduces the volume of material that needs attorney attention; it does not eliminate the need for attorney attention.


Privilege, Confidentiality, and Professional Responsibility

These three constraints are structural, not temporary.

Privilege: The International Bar Association has formally flagged that sharing client materials with public AI tools may constitute disclosure to a "digital stranger," which could waive attorney-client privilege. Privilege can be lost the moment material is shared with a party not covered by the privilege relationship. Enterprise AI deployments with proper data-processing agreements substantially reduce this risk; consumer tools do not.

Confidentiality: Model providers' terms of service vary significantly on data retention and training use. Lawyers using general-purpose tools without enterprise agreements may be inadvertently training future model versions on client data. This is a compliance question in every jurisdiction with active bar rules on confidentiality.

Professional responsibility: This is the hardest constraint. Competency obligations are non-delegable. A lawyer cannot shift blame for a hallucinated citation to the AI tool. The ABA Model Rules and state equivalents require attorneys to supervise all work product, understand the tools they use (including their limitations), and verify all factual and legal assertions before filing. Some firms have formal AI adoption policies; others still need to build them.


Where to Start: Highest-ROI, Lowest-Risk Entry Points

Not all legal AI is equally risky. The highest-ROI, lowest-risk entry points share two properties: the output is verifiable, and errors are catchable before they cause harm.

NDA and standard agreement review. Volume is high, variation is limited, and templates give reviewers a clear baseline. Errors in first-pass review are caught in subsequent attorney review. This is the clearest win case.

eDiscovery first-pass coding. The stakes are high enough that human review of AI output is standard practice anyway. AI accelerates the process; attorneys retain final authority. Firms that have standardized on Relativity aiR or Everlaw report substantial reductions in review hours without corresponding increases in error rates.

Contract data extraction for CLM systems. Extracting dates, parties, and defined terms for entry into contract lifecycle management databases is mechanical work. AI handles this well; the extracted data is auditable before it enters the system.

Legal research summarization. Use AI to locate and summarize cases; use a lawyer to verify every citation against the primary source before any filing. The verification step costs minutes and removes the fabricated-citation risk behind the sanctions cases.
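For the research entry point, the verify-before-filing rule can be enforced as a simple gate. The sketch below is illustrative only: the `Citation` type, the function names, and the placeholder entries are assumptions, and a real system would tie the sign-off to the firm's document management and identity tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    text: str                          # citation string as it appears in the draft
    verified_by: Optional[str] = None  # attorney who read the primary source, if any


def unverified(citations: list[Citation]) -> list[Citation]:
    """Return every citation that no attorney has signed off on."""
    return [c for c in citations if not c.verified_by]


def ready_to_file(citations: list[Citation]) -> bool:
    """A draft is fileable only when every citation has a named verifier."""
    return not unverified(citations)


# Placeholder entries, not real citations.
draft = [
    Citation("Placeholder Case A", verified_by="Reviewing Attorney"),
    Citation("Placeholder Case B"),  # surfaced by the AI tool, not yet checked
]

if not ready_to_file(draft):
    for c in unverified(draft):
        print(f"BLOCKED: verify against the primary source before filing: {c.text}")
```

The design choice is that verification is recorded per citation with a named person, so the gate cannot be satisfied by a blanket "reviewed" checkbox.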

The Mata-type cases share a different profile: a practitioner in a niche area, under time pressure, using a consumer tool without enterprise safeguards, who treats the AI's output as finished work product. The tool, the task, and the workflow all fail together.

See When NOT to Use an Agent for the broader pattern of how automation failures cluster in exactly these high-pressure, low-oversight scenarios.

The same architecture lesson shows up outside legal work: The Agent Project That Should Have Been One LLM Call explains when verification, permissions, and deterministic workflow control should beat autonomy.


The Agentic Turn and Its New Risks

Everything above describes what might be called first-generation legal AI: tools that complete discrete tasks in response to discrete prompts. CoCounsel summarizes a document. Kira extracts a clause. Harvey answers a research question.

Agentic legal AI is different in kind. A&O Shearman's agentic rollout targets complex multi-step workflows: a single instruction triggers an agent that plans a sequence of subtasks, executes them across multiple document sets and data sources, and returns a synthesized output. LexisNexis Protégé's multi-agent architecture is the production version of this approach.

Multi-step agent chains introduce failure modes that single-step tools do not have:

Compounding errors. An error in step two of a five-step workflow propagates through the remaining steps. In a research task, a misidentified case in the initial search shapes every subsequent analytical step.

Harder audit trails. When a lawyer reviews a document summary, the review path is clear. When an agent has orchestrated ten subtasks across two databases and three document collections, tracing why a specific conclusion was reached requires examining the full chain, and most current agent frameworks do not make that easy.

Out-of-scope actions. Autonomous agents that take actions in external systems (filing, sending communications, updating databases) create exposure that read-only tools do not. The legal sector has not yet produced binding professional responsibility guidance on autonomous agent actions; the liability question is open.
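The three failure modes above can be illustrated together in one short sketch. Everything here is an assumption made for illustration: the per-step accuracy, the action names, and the `AgentRun` class do not describe any real agent framework, and the compounding estimate assumes steps fail independently.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step is right, assuming independent steps of equal accuracy."""
    return per_step_accuracy ** steps


@dataclass
class AgentRun:
    # Hypothetical action names; anything not listed is treated as an external action.
    read_only_actions: frozenset = frozenset({"search", "summarize", "extract"})
    audit_log: list = field(default_factory=list)

    def attempt(self, action: str, detail: str, approved_by: Optional[str] = None) -> bool:
        """Allow read-only actions; anything else needs a named human approver."""
        allowed = action in self.read_only_actions or approved_by is not None
        # Log every attempt, including blocked ones, so the chain can be traced later.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "detail": detail,
            "approved_by": approved_by,
            "allowed": allowed,
        })
        return allowed


# Five steps at an assumed 95% accuracy each leave roughly a 77% chance of a clean run.
print(f"{chain_success_probability(0.95, 5):.0%}")

run = AgentRun()
run.attempt("search", "initial case search")
run.attempt("file", "submit brief to court")  # blocked: no approver named
run.attempt("file", "submit brief to court", approved_by="Supervising Attorney")
for entry in run.audit_log:
    print(entry["action"], entry["allowed"], entry["approved_by"])
```

The log captures blocked attempts as well as permitted ones, because a reviewer tracing a conclusion needs to see what the agent tried to do, not only what it was allowed to do.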

The Accountability Gap When AI Agents Act covers that liability problem outside legal work, where vendors, deployers, and users all have incentives to point elsewhere. If the agent stores matter context across sessions, How Agent Memory Got an Architecture is the engineering side of the same supervision problem.

Agent Reliability Scores Are Getting Worse, Not Better covers the general reliability picture across production agent deployments. The legal context adds professional responsibility exposure to every failure mode.


The Insurance Gap

Professional liability policies were written before AI was a significant legal tool. Most do not explicitly address AI-generated errors, and insurers have begun resolving that ambiguity in their own favor.

The ABA Journal's analysis found that insurers are adding AI exclusions, verification requirements, and competency maintenance clauses to new policies. A lawyer who uses AI without adequate verification and faces a malpractice claim may find their policy does not cover the loss. The threshold question is whether the lawyer exercised competent supervision of the AI output, and it is the same question courts are already answering in sanctions proceedings.

This is not a reason to avoid AI. It is a reason to document the verification workflow, maintain clear records of which outputs were AI-assisted and what review they received, and check current policy terms before expanding AI use.
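One way to keep that documentation is a small, structured record per work product. The sketch below is an assumption about what such a record might contain; the field names and sample values are hypothetical, and an insurer or firm policy may require different or additional fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReviewRecord:
    matter_id: str
    document: str
    ai_tool: str                 # tool that produced the draft or first pass
    reviewed_by: str             # attorney responsible for the review
    reviewed_on: str             # ISO date of the review
    checks_performed: list[str]  # what the reviewer actually verified


def to_json_line(record: ReviewRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log file."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical values for illustration only.
record = ReviewRecord(
    matter_id="MATTER-0001",
    document="research-memo-draft",
    ai_tool="example-research-tool",
    reviewed_by="Reviewing Attorney",
    reviewed_on="2025-01-15",
    checks_performed=["citations read in primary source", "quotations compared"],
)
print(to_json_line(record))
```

Writing one JSON line per reviewed output keeps the log easy to append to and easy to search when a carrier or court asks what review a given document received.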


The direction is clear enough without leaning on market-size precision. Vendor growth claims and legal-industry surveys both point to a category that has moved past early experimentation, especially in larger firms and corporate legal departments. The profession is already stratifying into practitioners who are AI-equipped and those who are not.

The leading indicator is Norm AI's AI-native law firm model: licensed attorneys supervising agents that execute portions of the substantive work. This is not a prediction that most law firms will look the same in five years. It is one plausible architecture for legal operations where tasks are repeatable, outputs are verifiable, and attorney supervision is built into the workflow.

The firms that will get this right are the ones that treat AI output as evidence to be verified, not conclusions to be filed, and that build workflows making verification fast enough that it happens every time. The sanctions cases document what happens when verification is skipped. The Harvey and CoCounsel adoption patterns document what happens when it is not.

The gap between those two outcomes is workflow design, not model capability.

For the model ownership side of that workflow decision, see Open Source AI Impact: Who Wins When Models Get Cheap.

See Deploying AI Agents to Production for the general framework, and AI Agent ROI: The Calculator and Framework That Cuts Through Vendor Math for how to evaluate the cost structure before committing to a legal AI platform.



Sources: AI Hallucination Cases database; Harvey AI; A&O Shearman agentic AI announcement; ABA technology survey — LawSites; AI hallucination cases — Cronkite News/ASU; Mata v. Avianca sanctions — ABA Journal; Kira contract AI — Litera; Relativity aiR; Thomson Reuters CoCounsel; IBA on privilege and AI; LexisNexis Protégé; California sanction — CalMatters; Lawyers dinged for opponents' fake citations — LawNext; AI malpractice insurance gaps — ABA Journal; DLA Piper on agentic AI risks; Enterprise AI pilot failure rate context