How I build agentic RAG systems

Eight steps, in order. Each step has a defined input, a defined output, and a checkpoint with the client. The methodology is repeatable because the failure modes are repeatable: the same things go wrong in the same places, and a disciplined sequence catches them before they become expensive. I do not skip steps to save time. Skipping is what makes RAG projects miss their accuracy targets six weeks in, with the team unable to explain why.

1. Domain discovery

Three to five interviews with the people who will actually use the system, plus one with the partner or executive who is funding it. The goal is not to collect a wishlist of features. It is to identify the decision points the system must support — the moments where a user is sitting in front of the corpus trying to answer a specific question for a specific purpose.

Output: a one-page scope describing the user, the decisions, the corpus, and what "good" looks like. No architecture decisions yet. If a buyer wants to skip discovery and go straight to build, that is a strong signal that the project will fail, and I say so.

In practice: For Sarawak Labour Law, discovery surfaced that the dominant user task was not "look up a section" but "given this employment situation, which provisions apply across Cap. 76 and the federal Act?" That changed the architecture before a line of code was written.

2. Corpus analysis

I read the corpus. All of it, or a representative slice if it is very large. I map structure (parts, sections, schedules, amendments, cross-references), citation patterns (how lawyers or auditors actually refer to it), and ambiguity hotspots (where a generic chunker will produce nonsense). This is hands-on work, not delegated to a script.

Output: a corpus shape document, including the kinds of queries that will be hardest to serve. This is where most of the engineering risk gets surfaced. Most projects skip this and pay for the skip in week six.

In practice: Reading Cap. 76 end-to-end revealed that several sections were silently overridden by Act A1754 amendments without the print version showing the overlay. That alone defined the metadata schema.

3. Chunking and metadata strategy

Decide chunk boundaries (rule-based on legal or structural markers vs semantic), chunk size, overlap, and metadata schema. The metadata schema is often more important than the chunk boundary — section number, part, definitions present, cross-references, amendment status, jurisdiction. These fields drive filtering, re-querying, and citation rendering.
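
To make that concrete, here is a minimal sketch of a rule-based chunker that carries the metadata alongside each chunk. It assumes a statute-style corpus where every section opens with a marker like "Section 7."; the marker pattern, the Chunk fields, and the "original" default for amendment status are illustrative, not the production schema.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    section: str                          # e.g. "Section 7"
    part: str                             # e.g. "Part II"
    amendment_status: str                 # e.g. "amended by Act A1754" or "original"
    jurisdiction: str                     # e.g. "Sarawak"
    cross_references: list[str] = field(default_factory=list)

SECTION_RE = re.compile(r"(?m)^Section\s+\d+[A-Z]?\.")    # illustrative marker pattern
XREF_RE = re.compile(r"[Ss]ection\s+\d+[A-Z]?")

def chunk_statute(raw_text: str, part: str, jurisdiction: str) -> list[Chunk]:
    """Rule-based chunking: one chunk per section, metadata extracted alongside the text."""
    starts = [m.start() for m in SECTION_RE.finditer(raw_text)]
    chunks = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(raw_text)
        body = raw_text[start:end].strip()
        section_id = re.match(r"Section\s+\d+[A-Z]?", body).group(0)
        refs = sorted({r.title() for r in XREF_RE.findall(body)} - {section_id})
        chunks.append(Chunk(
            text=body,
            section=section_id,
            part=part,
            amendment_status="original",  # overlaid later from the amendment mapping
            jurisdiction=jurisdiction,
            cross_references=refs,
        ))
    return chunks
```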

Output: a chunker, a metadata extractor, and a populated vector store. Always test retrieval against a small handwritten query set before moving on. If recall on the smoke set is poor, do not advance.
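
The smoke test itself stays small. The sketch below assumes only a retrieval callable and a handful of handwritten queries, each labelled with the section that must appear in the top k; the example queries and the dummy retriever are placeholders.

```python
from typing import Callable

# Each smoke query is labelled with the section that must be retrieved for it.
SMOKE_SET = [
    ("Is a probationer covered by the termination-notice provisions?", "Section 10"),
    ("What counts as wages for overtime calculation?", "Section 2"),
]

def smoke_recall(retrieve: Callable[[str, int], list[str]], k: int = 5) -> float:
    """Fraction of smoke queries whose expected section appears in the top-k results."""
    hits = 0
    for query, expected_section in SMOKE_SET:
        if expected_section in retrieve(query, k):
            hits += 1
    return hits / len(SMOKE_SET)

# Example wiring with a trivial stand-in retriever; the real one queries the vector store.
if __name__ == "__main__":
    def dummy_retrieve(query: str, k: int) -> list[str]:
        return ["Section 2", "Section 10"][:k]
    print(f"smoke recall@{5}: {smoke_recall(dummy_retrieve):.2f}")
```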

4. Embedding model selection

Choose the embedding model based on the corpus and the queries, not on benchmarks alone. For legal and compliance work, a domain-specific model usually outperforms a general embedding model on jurisdictionally loaded queries. The choice is informed: read the model's training disclosure, test against the client's actual question style, and fall back to general models only when the domain model offers no measurable lift.
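
Testing against the client's actual question style is a side-by-side recall measurement on a labelled held-out set. The sketch below is deliberately provider-agnostic: it takes an embedding callable (a thin wrapper around whichever candidate model is being tested, returning unit-normalised vectors), so no provider API is assumed.

```python
import numpy as np
from typing import Callable

Embed = Callable[[list[str]], np.ndarray]   # texts -> (n, d) matrix of unit-normalised vectors

def recall_at_k(embed: Embed,
                chunks: list[str],
                chunk_sections: list[str],                 # section label per chunk
                queries: list[tuple[str, str]],            # (query, expected section)
                k: int = 5) -> float:
    """Recall@k for one embedding model over a held-out, labelled query set."""
    chunk_vecs = embed(chunks)
    query_vecs = embed([q for q, _ in queries])
    sims = query_vecs @ chunk_vecs.T                       # cosine similarity on unit vectors
    hits = 0
    for row, (_, expected) in zip(sims, queries):
        top_k = np.argsort(-row)[:k]
        if expected in {chunk_sections[i] for i in top_k}:
            hits += 1
    return hits / len(queries)
```

Run it once per candidate model on the same held-out set; the lift, or the absence of one, is what decides.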

Output: a chosen embedding model with the rationale documented. The decision goes in the engineering log so the next person can audit it.

In practice: For Sarawak Labour Law I selected Voyage's voyage-law-2 over OpenAI's text-embedding-3-large, based on retrieval lift on a small held-out set of jurisdictionally loaded queries. The general model retrieved adjacent-but-wrong sections more often.

5. Retrieval architecture

Layered retrieval: hybrid search (dense + sparse) with a reranker, optional metadata filters, and a multi-hop loop only where the question topology demands it. Tune in this order: recall first, then precision, then latency. Add complexity only when the eval shows the simpler approach failing.
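
As a sketch of the fusion layer, assuming the dense and sparse retrievers already exist as callables returning ranked chunk ids: reciprocal rank fusion, an optional jurisdiction filter on metadata, and a reranker hook that stays unused until the eval justifies it. The fusion constant and the function shapes are assumptions, not a prescribed stack.

```python
from typing import Callable, Optional

def rrf_fuse(dense_ranked: list[str], sparse_ranked: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion of two ranked lists of chunk ids."""
    scores: dict[str, float] = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, chunk_id in enumerate(ranked):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query: str,
             dense: Callable[[str], list[str]],
             sparse: Callable[[str], list[str]],
             metadata: dict[str, dict],                        # chunk id -> metadata fields
             rerank: Optional[Callable[[str, list[str]], list[str]]] = None,
             jurisdiction: Optional[str] = None,
             top_n: int = 10) -> list[str]:
    """Hybrid search, metadata filter, then optional reranking. Recall first, precision second."""
    fused = rrf_fuse(dense(query), sparse(query))
    if jurisdiction is not None:
        fused = [c for c in fused if metadata[c].get("jurisdiction") == jurisdiction]
    if rerank is not None:
        fused = rerank(query, fused)
    return fused[:top_n]
```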

Output: a retrieval pipeline with measured recall and precision against the eval set. Numbers go in the engineering log next to the architecture decision.

6. Agentic layer design

Define the decision points: when the agent re-queries, when it fetches additional context, when it calls a tool, when it refuses. Refusal behaviour is deliberately engineered, not an emergent property. The agent should answer plainly and cite cleanly when it can, and refuse cleanly with a reason when it cannot — no apologetic hedging, no plausible-sounding guesses.
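
A minimal sketch of that decision loop, with the refusal path written out explicitly rather than left to the model. The coverage check, the re-query budget, and the helper names (reformulate, covers, generate) are assumptions standing in for the real components.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AgentAnswer:
    text: str
    citations: list[str]
    refused: bool = False
    refusal_reason: Optional[str] = None

def answer_query(question: str,
                 retrieve: Callable[[str], list[dict]],       # returns chunks with metadata
                 reformulate: Callable[[str, list[dict]], str],
                 generate: Callable[[str, list[dict]], AgentAnswer],
                 covers: Callable[[str, list[dict]], bool],    # "is this context enough?"
                 max_hops: int = 2) -> AgentAnswer:
    """Retrieve, re-query within a fixed budget, then answer with citations or refuse with a reason."""
    query = question
    for _ in range(max_hops + 1):
        context = retrieve(query)
        if covers(question, context):
            return generate(question, context)       # plain answer, clean citations
        query = reformulate(question, context)       # targeted re-query, not a blind retry
    return AgentAnswer(
        text="",
        citations=[],
        refused=True,
        refusal_reason="Retrieved provisions do not answer the question as asked.",
    )
```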

Output: an agent specification and a working implementation, evaluated against a refusal-test subset of the eval harness. If refusal correctness is below the agreed threshold, the agent is not shipped.

7. Evaluation harness

A reusable test set of fifty to two hundred realistic questions, each labelled with expected citations and key facts. Three scoring axes: citation accuracy, answer faithfulness, and refusal correctness. Where LLM-as-judge is used, the rubric is explicit and the prompt is locked. The harness runs in CI on every change, and the deltas are recorded.
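
The harness itself can stay small. The sketch below scores the two deterministic axes directly and delegates faithfulness to a judge callable with its locked rubric; it reuses the AgentAnswer shape from the step-6 sketch, and the item fields are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalItem:
    question: str
    expected_citations: list[str]          # e.g. ["Section 10"]
    key_facts: list[str]
    should_refuse: bool = False

def citation_accuracy(expected: list[str], cited: list[str]) -> float:
    """Fraction of expected citations the answer actually cites."""
    if not expected:
        return 1.0
    return len(set(expected) & set(cited)) / len(expected)

def run_harness(items: list[EvalItem],
                agent: Callable[[str], "AgentAnswer"],
                judge_faithfulness: Callable[[str, list[str], str], float]) -> dict[str, float]:
    """Three axes: citation accuracy, answer faithfulness (LLM-as-judge), refusal correctness."""
    cit, faith, refusal = [], [], []
    for item in items:
        answer = agent(item.question)
        refusal.append(answer.refused == item.should_refuse)
        if not item.should_refuse and not answer.refused:
            cit.append(citation_accuracy(item.expected_citations, answer.citations))
            faith.append(judge_faithfulness(item.question, item.key_facts, answer.text))
    return {
        "citation_accuracy": sum(cit) / max(len(cit), 1),
        "faithfulness": sum(faith) / max(len(faith), 1),
        "refusal_correctness": sum(refusal) / len(refusal),
    }
```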

Output: a runnable harness, baseline metrics, and a regression-detection threshold the client agrees to before launch.
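
The regression check in CI is then a straight comparison against that baseline. The numbers below are placeholders for whatever the client signs off on.

```python
# Agreed before launch: how far each metric may drop before CI fails the change.
# Baseline values here are illustrative placeholders.
BASELINE = {"citation_accuracy": 0.92, "faithfulness": 0.90, "refusal_correctness": 0.95}
MAX_DROP = 0.02

def regression_gate(current: dict[str, float]) -> list[str]:
    """Return the metrics that regressed past the agreed threshold; empty means the change passes."""
    return [
        metric for metric, baseline_value in BASELINE.items()
        if current.get(metric, 0.0) < baseline_value - MAX_DROP
    ]
```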

8. Deployment and handover

Deploy to infrastructure the client controls — typically a small VM or a managed Kubernetes cluster on their cloud, with the vector store and the LLM endpoint configured for their compliance posture. Set up logging, basic monitoring, and a feedback channel. Train two members of the client's team on the codebase, the harness, and the runbook.

Output: a production system, documentation, and an internal team that can keep it running without me. That is the actual deliverable. A system the client cannot maintain is a worse outcome than no system at all.


See this methodology applied end-to-end: the Sarawak Labour Law case study.