How I evaluate legal RAG systems

1st May 2026

Most RAG projects skip evaluation. This is how I do it on legal builds — the three axes that matter, how to construct a test set without going crazy, and how to use LLM-as-judge without fooling yourself.

Why most RAG projects skip evaluation

The reasons are familiar to anyone who has shipped a RAG system. The demo looks plausible, the partner is impressed, the team is moving on to the next feature. Evaluation feels like the slow, unglamorous work that delays launch. There is also a quieter reason: many teams do not actually know what to measure. "Accuracy" is undefined in legal RAG until you decide which axes you care about, and most teams discover too late that their definition was the wrong one.

The cost of skipping evaluation is paid at the wrong moment. The first time a fee-earner pastes a hallucinated citation into client advice is the last time the system gets used. By then the team has shipped four months of changes — new chunking, new embeddings, new prompts — and cannot tell which change introduced the regression. Without a harness, every retrospective on a failure is guesswork.

The fix is straightforward and unfashionable. Build the eval harness first. Run it on every change. Treat it as the contract with the client about what "improvement" means.

The three things worth measuring

Three axes carry most of the signal in legal and compliance RAG. Optimise these first; everything else is secondary.

1. Citation accuracy

Did the system cite the correct source location? In legal work, the citation is the trust contract. A wrong citation is worse than no citation, because a wrong citation looks authoritative. Score this with partial credit: full credit for correct section and subsection, half credit for correct section but wrong subsection, zero for wholly wrong or missing. Where the answer should reference a schedule or cross-reference, score those independently — a system that gets the section right but misses the schedule is half-doing the job.
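To make the partial-credit scheme concrete, here is a minimal sketch. It assumes citations have already been normalised to a (section, subsection) pair plus an optional set of schedule references — the field names and the normalisation step are assumptions for illustration, not the scoring code from this build.

```python
from typing import Optional


def score_citation(pred_section: str, pred_subsection: Optional[str],
                   exp_section: str, exp_subsection: Optional[str]) -> float:
    """Partial credit: 1.0 for correct section and subsection,
    0.5 for correct section but wrong subsection, 0.0 otherwise."""
    if pred_section != exp_section:
        return 0.0
    if exp_subsection is None or pred_subsection == exp_subsection:
        return 1.0
    return 0.5


def score_schedules(pred_schedules: set[str], exp_schedules: set[str]) -> float:
    """Schedules and cross-references are scored independently of the section score."""
    if not exp_schedules:
        return 1.0  # nothing to cite; vacuously correct
    return len(pred_schedules & exp_schedules) / len(exp_schedules)
```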

2. Answer faithfulness

Does the answer follow from the cited source, or has the model paraphrased its way into unsupported territory? This is the axis that catches the most insidious failure: an answer with a correct citation but a subtly wrong claim. Faithfulness is hard to score deterministically, so I use LLM-as-judge with a locked rubric. The judge sees only the cited source — not the broader corpus — which prevents it from rationalising support that the system did not actually surface.
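A sketch of what that input restriction looks like in the harness: the judge prompt is built from the chunks the system actually cited, and nothing else. `judge_llm` is a placeholder for whatever judge-model call you use, not a real API.

```python
def judge_faithfulness(answer: str, cited_chunks: list[str], judge_llm) -> str:
    """Ask the judge whether the answer follows from the cited source text only.

    cited_chunks must be exactly what the system retrieved and cited for this
    answer -- the judge never sees the wider corpus, so it cannot rationalise
    support the system did not actually surface.
    """
    source_block = "\n\n---\n\n".join(cited_chunks)
    prompt = (
        "You are grading whether ANSWER is fully supported by SOURCE.\n"
        "Use only SOURCE. Do not rely on outside knowledge.\n\n"
        f"SOURCE:\n{source_block}\n\nANSWER:\n{answer}\n"
    )
    return judge_llm(prompt)  # raw judge output; scored against the rubric separately
```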

3. Refusal correctness

When the corpus does not support a confident answer, did the system refuse? When it did support one, did the system answer? Both directions count. A system that always refuses is useless; a system that always answers is dangerous. Refusal is an engineered behaviour, not an emergent property, so it deserves an explicit axis. The simplest scoring is binary: did the refusal decision match the ground-truth label?
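Binary scoring against the ground-truth label is a one-liner per question, but it is worth reporting the two error directions separately, because they carry different costs. A sketch, assuming each test item carries a should-refuse label and each system output carries a refused flag:

```python
def refusal_report(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """results is a list of (should_refuse, refused) pairs, one per question."""
    n = len(results)
    correct = sum(should == did for should, did in results)
    false_refusals = sum(did and not should for should, did in results)  # refused when an answer existed
    false_answers = sum(should and not did for should, did in results)   # answered when it should have refused
    return {
        "refusal_correctness": correct / n,
        "false_refusal_rate": false_refusals / n,
        "false_answer_rate": false_answers / n,
    }
```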

You could measure other things — latency, cost per query, corpus coverage — but in regulated work these three are the axes that determine whether the system is fit for use. If the system performs well on these and badly on latency, you have an optimisation problem. If it performs well on latency and badly on these, you have a system that should not ship.

Building a test set without going crazy

Writing the test set is the hardest part. Not because the work is technically difficult, but because the discipline of "what does a real user actually ask?" is harder than it looks. Most engineers, left to themselves, write questions they find interesting. Real users ask questions that are messier, narrower, and shaped by the situation in front of them.

Sources I have used, in rough order of value:

A test set of fifty to two hundred questions is enough to catch every regression that matters. More questions are better but have diminishing returns. The labelling itself is valuable: writing down expected citations and key facts often surfaces ambiguities in the corpus that the team had been silently papering over.
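For the record format, a sketch of what one labelled test item looks like — the field names are my convention for illustration, not a standard schema:

```python
from dataclasses import dataclass


@dataclass
class TestItem:
    question: str                   # phrased the way a real user would ask it
    expected_citations: list[str]   # e.g. ["s. 14(2)", "Schedule 1"]
    key_facts: list[str]            # claims the answer must contain to be correct
    should_refuse: bool = False     # True when the corpus cannot support an answer
    notes: str = ""                 # ambiguities surfaced while labelling
```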

Using LLM-as-judge responsibly

Faithfulness scoring needs an LLM judge. There is no clean deterministic way to ask "does this paraphrase preserve the meaning of the source?" But LLM-as-judge has well-known failure modes — positional bias, length bias, overgenerous scoring, drift over time — and the harness has to defend against them. Three rules:

The judge sees only the cited source, not the broader corpus. This is the most important rule. If the judge can search the corpus, it will find support that the system did not surface and will give credit the system has not earned. The judge's job is to evaluate whether the answer follows from what the system actually used, not from what was theoretically available.

The rubric is explicit and frozen. Define score bands. Pick one set of words for each band and never change them between runs. The system prompt does not get tweaked between evaluations. Without this, scores drift and the harness becomes unreliable as a comparison tool.

The judge's output includes a one-sentence justification. When a score looks wrong on review, the justification points at the disagreement. This is how you catch judge-prompt bugs — and you will have judge-prompt bugs. The first time the judge gives a 0.9 to an obviously wrong answer, the justification will tell you why.
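Putting the rules together: the rubric lives in the harness as a frozen constant, and the judge is forced to return a score band plus a one-sentence justification in a fixed format. The band wording and JSON fields below are illustrative, not the rubric from this build:

```python
import json

# Frozen rubric: the wording of each band never changes between runs.
FAITHFULNESS_RUBRIC = """\
Score the answer against the source using exactly one of these bands:
1.0 - every claim in the answer is directly supported by the source
0.5 - the answer is partially supported; at least one claim goes beyond the source
0.0 - the answer is unsupported by, or contradicts, the source
Return JSON: {"score": <band>, "justification": "<one sentence>"}"""


def parse_judge_output(raw: str) -> tuple[float, str]:
    """Reject any judge output missing a valid score band or a justification."""
    parsed = json.loads(raw)
    score = float(parsed["score"])
    justification = parsed["justification"].strip()
    if score not in (0.0, 0.5, 1.0) or not justification:
        raise ValueError(f"judge output violates rubric: {raw!r}")
    return score, justification
```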

LLM-as-judge is not free of bias. It is, in my experience, more reliable than spot-checking by a busy engineer and less reliable than a labelled human-graded gold set. For most projects it is the right point on the cost-quality curve.

What the numbers told me about my Sarawak system

I will not publish absolute scores here, because they are not the point. The harness is the point. What I can say honestly is that the eval has caught three regressions that would have shipped silently otherwise: a chunking change that broke schedule retrieval, a prompt change that increased hallucination rate by several percentage points, and a reranker tweak that degraded refusal correctness. None of those would have surfaced from spot-checking by hand. The system felt fine in casual use even with the regressions in place.

The eval also surfaced two failure modes I now document openly. The system performs less well on questions that require synthesising across multiple Parts of the Ordinance, and on questions whose practitioner answer depends on case-law interpretation that is not in the corpus. A system that is honest about its weaknesses is more useful than a system that hides them.

One result I did not expect: the metric that moved most under iteration was refusal correctness, not citation accuracy. Citation accuracy was the easier dial to turn — a structural-chunking change moved it from the high-fifties to the low-nineties in a single afternoon. Refusal correctness took weeks of iteration on the agent prompt, the relevance threshold, and the verifier rubric, with each change moving the dial only a few percentage points. Looking back, that was the right ratio. The system that always answers is the one that gets pulled from production.

Costs and trade-offs of running the harness

The harness is not free. A full run on the current test set costs roughly one US dollar in API spend (LLM-as-judge calls dominate) and takes two to three minutes wall-clock. That is cheap on a per-run basis but compounds across a month of iteration. Two practical adjustments make this manageable. First, run a fast subset on every change — ten to twenty questions stratified across the corpus — and run the full set only on changes that pass the subset. Second, cache judge calls keyed by (question, answer, source) so identical inputs do not re-bill. Both are simple to implement and saved real money during the build.
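A sketch of the second adjustment — caching judge calls on disk, keyed by a hash of (question, answer, source). The file name and layout are assumptions for illustration:

```python
import hashlib
import json
from pathlib import Path

CACHE_PATH = Path("judge_cache.json")  # hypothetical location


def _cache_key(question: str, answer: str, source: str) -> str:
    payload = json.dumps([question, answer, source], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_judge(question: str, answer: str, source: str, judge_call) -> str:
    """Only call the judge (and pay for it) when this exact input has not been seen."""
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    key = _cache_key(question, answer, source)
    if key not in cache:
        cache[key] = judge_call(question, answer, source)
        CACHE_PATH.write_text(json.dumps(cache, ensure_ascii=False, indent=2))
    return cache[key]
```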

The other cost worth naming is engineer time on test-set maintenance. As the corpus changes — new amendments, new schedules — the labels on the test set drift. A question whose ground-truth citation was Cap. 76 s. 14(2) before A1754 may now correctly cite a different subsection. The harness assumes the test set is a fixed contract; in practice it is a living artefact and needs deliberate maintenance. Budget for it.

For a worked example showing this evaluation methodology applied to a real legal corpus, see the Sarawak Labour Law case study.