How I think about agentic RAG
Six principles. They are not theoretical positions. They are the rules I run my own builds against, and they came from getting things wrong on real corpora before getting them right. If a principle fails the eval, it gets revised; the six below are the ones that have held up.
1. Retrieval is a search problem before it is an ML problem.
Most RAG failures trace back to bad chunking and missing metadata, not to the embedding model or the LLM. There is a strong industry instinct to reach for a bigger model when answers go wrong, but the failure mode is usually upstream: the right chunk was never in the candidate set. No model can reason its way out of an empty retrieval window.
The first thing I do on every build is look at the failures and ask whether the answer-bearing text was in the top-k retrieved chunks. If it wasn't, the embedding model is not the problem. The chunking is, or the metadata is, or the structural representation of the corpus is. Fix those and the rest of the system gets noticeably easier.
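That check is cheap to automate. A minimal sketch, assuming a labelled question set where each question records the text that should answer it; retrieve_top_k, LabelledQuestion, and the field names are placeholders for whatever the actual stack exposes, not a specific library:

```python
from dataclasses import dataclass

@dataclass
class LabelledQuestion:
    question: str
    answer_span: str       # text that must appear in some retrieved chunk
    expected_source: str   # e.g. a section reference

def retrieval_hit_rate(questions, retrieve_top_k, k=10):
    """Fraction of questions whose answer-bearing text appears in the top-k chunks.

    A low number points at chunking, metadata, or corpus structure,
    not at the embedding model or the LLM.
    """
    hits, misses = 0, []
    for q in questions:
        chunks = retrieve_top_k(q.question, k=k)
        if any(q.answer_span.lower() in c.text.lower() for c in chunks):
            hits += 1
        else:
            misses.append(q)   # these are the failures worth reading by hand
    return hits / len(questions), misses
```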
From the Sarawak build: the largest single jump in retrieval accuracy came from re-chunking on section boundaries instead of fixed token windows. The embedding model never changed.
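Re-chunking on structural boundaries is mostly a matter of splitting where the corpus already splits itself. A sketch for statute-like text, assuming section headings that a simple regex can find; the pattern and the metadata fields are illustrative, not the exact ones from that build:

```python
import re

# Matches headings like "10. Interpretation" or "10A. Savings" at the start of a line.
SECTION_RE = re.compile(r"^(\d+[A-Z]?\.)\s+(.+)$", re.MULTILINE)

def chunk_by_section(act_text: str, act_id: str):
    """Split on section headings so each chunk is one self-contained provision."""
    matches = list(SECTION_RE.finditer(act_text))
    chunks = []
    for i, m in enumerate(matches):
        start = m.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(act_text)
        chunks.append({
            "text": act_text[start:end].strip(),
            "metadata": {
                "act": act_id,
                "section": m.group(1).rstrip("."),
                "heading": m.group(2).strip(),
            },
        })
    return chunks
```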
2. Every answer must cite; every citation must verify.
In regulated domains, an unverifiable claim is worse than no claim at all. A wrong answer caught early is recoverable; a confidently wrong answer with no traceability destroys trust permanently. The first time a partner pastes a hallucinated section number into client advice is the last time the system gets used.
Two things follow from this. First, every answer the system produces must include the specific source location it draws from — section, subsection, schedule, page, paragraph. Second, the citations are not styling. They are checked: a verifier confirms that the cited source actually contains the asserted facts before the answer reaches the user. If it doesn't, the system either retries or refuses. There is no path where unverified text reaches a fee-earner.
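The shape of that gate is simple even if the judge behind it is not. A sketch, assuming some claim_supported check (an LLM-as-judge call or an NLI model) and a generator that returns claims with their citations attached; every helper name here is invented for illustration:

```python
def answer_with_verified_citations(question, generate_answer, fetch_source,
                                   claim_supported, max_retries=2):
    """Release an answer only after every cited source is checked against its claim."""
    feedback = None
    for _ in range(max_retries + 1):
        answer = generate_answer(question, feedback)   # answer.claims: each has .text and .citation
        unsupported = [c for c in answer.claims
                       if not claim_supported(c.text, fetch_source(c.citation))]
        if not unsupported:
            return answer                              # verified: safe to show a fee-earner
        feedback = unsupported                         # retry with the failed claims as context
    return None                                        # refuse: unverified text never ships
```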
3. Evaluation comes before optimisation.
You cannot improve what you cannot measure. The first deliverable in any engagement I take is the eval harness, not the chatbot. Once a harness exists, every later decision — chunk size, embedding model, reranker, prompt, agent loop — is a measured delta. Without one, "improvement" is a story people tell themselves on the way to launch.
The harness does not need to be elaborate. Fifty to two hundred carefully written questions, labelled with expected citations and key facts, scored on three axes: citation accuracy, answer faithfulness, and refusal correctness. That is enough to catch every regression that matters and to make every architectural decision defensible.
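A sketch of what that scoring can look like, assuming each case carries its expected citations, a few key facts, and a flag for questions the system should refuse; the per-case checks shown here (string containment, set inclusion) are stand-ins for whatever judge the build actually uses:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    expected_citations: list
    key_facts: list
    should_refuse: bool = False

def run_harness(cases, ask):
    """Score on the three axes: citation accuracy, faithfulness, refusal correctness."""
    citation_ok = faithful = refusal_ok = 0
    for case in cases:
        result = ask(case.question)   # result has .refused, .citations, .text
        if case.should_refuse:
            refusal_ok += bool(result.refused)
            continue
        refusal_ok += not result.refused
        citation_ok += set(case.expected_citations) <= set(result.citations)
        faithful += all(f.lower() in result.text.lower() for f in case.key_facts)
    answerable = sum(not c.should_refuse for c in cases) or 1
    return {
        "citation_accuracy": citation_ok / answerable,
        "faithfulness": faithful / answerable,
        "refusal_correctness": refusal_ok / len(cases),
    }
```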
The Sarawak harness has caught three regressions that would have shipped silently otherwise. None of them showed up in spot-check testing.
4. Agentic reasoning is for ambiguity, not for show.
Agent loops have become a default for anyone using a modern framework, and most of the time they are decoration. Each tool call costs latency, tokens, and failure surface area. The right test for adding an agent loop is empirical: does it measurably improve the eval score on the questions where one-shot retrieval is failing? If the answer is no, single-shot is the better architecture.
Where agent loops earn their place is on questions that genuinely require multi-hop reasoning — "what does the law say about X, given the amendment in Y" — or on ambiguous queries where the agent should ask one clarifying question rather than guess. Those are the cases I instrument and measure. Everything else stays single-shot.
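The decision itself can be made with the same harness: take the cases single-shot fails, run the agent loop over just those, and keep the loop only if the recovery rate justifies the latency and token cost. A sketch, with case_passes standing in for whatever per-case pass/fail check the harness applies and the threshold chosen per build:

```python
def agent_loop_worth_it(cases, single_shot, agent_loop, case_passes, min_recovery=0.25):
    """Keep the agent loop only if it recovers enough of the hard subset.

    case_passes(case, pipeline) -> bool is the per-case check the harness
    already applies (citations verified, key facts present).
    """
    hard = [c for c in cases if not case_passes(c, single_shot)]
    if not hard:
        return False   # single-shot already covers everything measured
    recovered = sum(case_passes(c, agent_loop) for c in hard)
    return recovered / len(hard) >= min_recovery
```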
5. Domain structure beats model size.
A well-modelled corpus served by a 7B-class model usually outperforms a poorly modelled corpus served by a frontier model. Modelling the corpus means encoding what a human expert knows about how it is organised: which sections amend which, which definitions apply where, which cross-references must be followed for a complete answer.
This is the work that cannot be outsourced to the LLM. The model does not know that a section in Cap. 76 has been silently overridden by Act A1754; it does not know that the definition of "workman" is jurisdictionally narrower than the federal definition; it does not know that a schedule applies only to certain employer classes. That knowledge has to be in the retrieval layer or the answers will be plausible and wrong.
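In practice that knowledge lives as explicit metadata on the chunks and as expansion rules in the retrieval layer, not as prompt text. A sketch of the idea; the field names and the specific values are invented for illustration, not a real schema for the Acts mentioned above:

```python
# Illustrative only: field names and values are invented, not a production schema.
chunk = {
    "text": "...",   # the provision itself
    "metadata": {
        "act": "Cap. 76",
        "section": "10",
        "amended_by": ["Act A1754"],        # must be followed for a current answer
        "definitions_in_scope": {"workman": "state definition, narrower than the federal one"},
        "cross_references": ["Cap. 76, Schedule"],
        "applies_to": ["certain employer classes"],
    },
}

def expand_candidates(chunk, fetch_by_reference):
    """Pull amendments and cross-references into the candidate set before answering."""
    refs = chunk["metadata"]["amended_by"] + chunk["metadata"]["cross_references"]
    return [chunk] + [fetch_by_reference(ref) for ref in refs]
```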
6. The corpus belongs to the client.
I build on the client's own licensed and proprietary content. That sidesteps copyright exposure, builds the kind of trust that lets a firm put the system in front of fee-earners, and creates defensible long-term value. A system trained on a firm's own decade of contracts, memoranda, and case files is a different — and stronger — product than anything that scrapes the public internet.
This is also a positioning choice. Public-internet RAG is a commodity; private-corpus RAG is a craft. The latter is what I sell because the latter is where the actual value lives for the kind of buyer who hires me.
These principles inform every system I build. Next: the eight-step methodology that puts them into practice.