
Sarawak Labour Law — an agentic RAG system over Cap. 76 and Act A1754

A worked example of building a citation-grounded legal AI assistant for a complex jurisdictional domain. The chunking, the embedding choice, the agent loop, the evaluation — with the trade-offs and the failures included.

Reading time: about 18 minutes · Open the live demo →

Why this domain matters

Sarawak's labour-law regime is unusual in Malaysia. Where most of the country sits under the federal Employment Act 1955 (Akta 265, since restructured under Akta 732), Sarawak has its own statute — the Sarawak Labour Ordinance, Cap. 76 — with its own definitions, schedules, and enforcement framework. In 2022 the Ordinance was substantially amended by Act A1754, bringing a number of federal alignments into Sarawak law while preserving several jurisdictional differences.

The result is a corpus that lawyers and HR practitioners have to read as an overlay rather than as a single document. Reading Cap. 76 alone gives you the wrong answer. Reading A1754 alone gives you a fragment. Reading them together correctly — section by section, with cross-references to the federal regime where relevant — is the actual professional task.

The consequences of getting it wrong are not academic. An employer who applies the federal maximum-hours rule to a Sarawak-employed workman has just created a basis for a labour-court complaint. An HR officer who quotes the wrong notice-of-termination provision has just exposed the company to a wrongful-dismissal claim. The reason this domain is interesting from a RAG-engineering point of view is the same reason it is risky from a legal point of view: the answers are not on the surface, and a system that pretends they are will produce confidently wrong outputs that look indistinguishable from competent legal analysis.

This is the kind of corpus where a generic legal chatbot built on a public-internet crawl will perform somewhere between badly and dangerously. It is also the kind of corpus where a purpose-built, citation-grounded RAG system can demonstrably help — if the engineering is honest about the structure of the underlying material.

Domain discovery and corpus analysis

The build began with the source materials, not with the model. I read Cap. 76 end to end in its current published form, then read A1754 end to end against the original. The amendment Act is an overlay: it inserts, replaces, and deletes specific subsections of Cap. 76, but the print version of Cap. 76 in common circulation does not always render the overlay clearly. Several sections in the unamended print are silently overridden by A1754, and the practitioner is expected to know.

That observation, more than any architectural decision, defined the chunking strategy. Any chunker that treated Cap. 76 as a static document would happily retrieve sections that were no longer the current law. The retrieval layer needed to know about amendment status as a first-class metadata field, and the corpus needed to be rebuilt as an amended view of Cap. 76 with explicit pointers from each affected subsection back to the A1754 provision that changed it.

Beyond the amendment overlay, the corpus had a number of structural features that drive retrieval quality:

  • Definitions in Section 2. Most legal questions hinge on whether the person in question is a "workman" under the Ordinance. Section 2 carries the definitional weight for the entire statute, and any retrieval over a question involving a worker classification needs Section 2 in the candidate set even when the user has not explicitly asked about definitions.
  • Cross-references between sections. Section 104 on hours of work cross-references Section 60 on overtime; Section 14 on termination cross-references the schedule on minimum notice periods; the schedules carry their own jurisdictional carve-outs. A chunk that contains "see Section 60" without knowing which Section 60 it is referring to is a chunk that produces wrong answers.
  • Schedules with jurisdictional carve-outs. Several schedules apply only to specific employer classes (e.g. estates, manufacturing). A retrieval over "what is the maximum hours per week" that does not surface the relevant schedule will give a federal-aligned answer when a schedule-specific answer is correct.
  • The federal interplay. A growing number of practitioner questions are framed as "does the federal Akta 732 rule apply in Sarawak, or is there a different Sarawak rule?" For those questions, the corpus has to include both, and the agent has to handle the comparison explicitly.

Mapping these structural features took roughly four working days. It was the highest-leverage time spent on the project. Every architectural decision downstream — chunk boundaries, metadata schema, embedding choice, agent loop — was shaped by what came out of the corpus map.

Chunking and metadata strategy

The chunking strategy is rule-based, not semantic. Each chunk corresponds to a single subsection of Cap. 76 (after the A1754 overlay is applied), with the section header and any defined terms used in the subsection prepended as context. Where a subsection is short, sibling subsections are concatenated up to a soft cap of around 800 tokens; where a subsection is long, it is left as a single chunk. Schedules are chunked at the schedule-item boundary, not by token windows.
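
A minimal sketch of that chunker, under the assumption that the amended text has already been parsed into ordered (subsection id, text) pairs per section; names such as Chunk, token_len, and the SOFT_CAP constant are illustrative, not the production code:

from dataclasses import dataclass

SOFT_CAP = 800  # soft token cap when concatenating short sibling subsections

@dataclass
class Chunk:
    section: str
    subsections: list[str]
    text: str

def token_len(text: str) -> int:
    return len(text.split())  # crude whitespace proxy for a real tokenizer

def render_subsection(sub_id: str, body: str, definitions: dict[str, str]) -> str:
    # Prepend any Section 2 definitions the subsection relies on.
    used = [term for term in definitions if term in body.lower()]
    context = "".join(f'"{term}" means {definitions[term]}\n' for term in used)
    return context + sub_id + " " + body

def chunk_section(section: str, header: str,
                  subsections: list[tuple[str, str]],
                  definitions: dict[str, str]) -> list[Chunk]:
    # One chunk per subsection, prefixed with the section header; short sibling
    # subsections are concatenated up to SOFT_CAP, long ones stand alone.
    chunks: list[Chunk] = []
    buffer: list[str] = []
    ids: list[str] = []
    for sub_id, body in subsections:
        piece = render_subsection(sub_id, body, definitions)
        if buffer and token_len(header + "\n".join(buffer) + piece) > SOFT_CAP:
            chunks.append(Chunk(section, ids, header + "\n" + "\n\n".join(buffer)))
            buffer, ids = [], []
        buffer.append(piece)
        ids.append(sub_id)
    if buffer:
        chunks.append(Chunk(section, ids, header + "\n" + "\n\n".join(buffer)))
    return chunks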

This was not the first chunking strategy I tried. The first attempt used a 512-token window with 64-token overlap — the framework default. On a smoke set of fifteen handwritten questions, that approach achieved roughly 60% recall on the answer-bearing chunks. After re-chunking on subsection boundaries, the same smoke set climbed to over 90% on the same retrieval pipeline. The embedding model never changed. The win came entirely from giving the retriever chunks that respected the document's natural structure.

One worked example shows why. Consider Section 14 on termination of contract:

Section 14. Termination of contract of service.
(1) Either party to a contract of service may at any time give to the
    other party notice of his intention to terminate such contract.
(2) The length of such notice shall be the same for both employer and
    employee and shall be determined by any provision made for such
    notice in the terms of the contract of service or, in the absence
    of such provision in writing, shall not be less than the period
    specified in the First Schedule.
(3) [as amended by Act A1754] ...

A naive 512-token chunker will happily split this section in half mid-subsection. The retrieval layer will then return a chunk that says "shall not be less than the period specified in the First Schedule" with no awareness that the First Schedule needs to be in the candidate set. The agent will paper over the gap with a confident-sounding generalisation, and the user will get a wrong answer.

The rule-based chunker, by contrast, keeps Section 14 intact, prepends the section header, and tags the chunk with a metadata field cross_refs: ["Schedule 1"]. When the retriever sees that field, it pulls the relevant schedule item into the candidate set as well. The agent answers with both pieces grounded.
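
A sketch of that expansion step, assuming chunks come back from the vector store with a payload attached (Qdrant-style) and that lookup_by_ref is a thin filtered search over the same collection; both names are illustrative:

def expand_cross_refs(candidates, lookup_by_ref, max_extra=5):
    # Pull cross-referenced schedules, sections, and definitions into the
    # candidate set whenever a chunk's metadata names something retrieval missed.
    present = {c.payload["section"] for c in candidates}
    extras = []
    for c in candidates:
        for ref in c.payload.get("cross_refs", []):
            if ref not in present and len(extras) < max_extra:
                extras.extend(lookup_by_ref(ref, limit=1))
                present.add(ref)
    return candidates + extras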

The metadata schema settled at six fields:

  • section — the section number, e.g. "14".
  • subsection — the subsection identifier, e.g. "(2)".
  • part — the Part of the Ordinance the section sits in.
  • amended_by — the A1754 provision that changed this subsection, or null.
  • cross_refs — an array of cross-referenced sections, schedules, or definitions extracted at chunking time.
  • defines — an array of terms whose definitions are present in the chunk (used to ensure Section 2 chunks score higher on definitional queries).

The metadata schema does more retrieval work than the embedding model. Get the metadata right and a mid-tier embedding model will outperform a frontier model fed sloppy chunks.
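
As a concrete illustration, the stored payload for the Section 14(2) chunk above would look roughly like this (values are indicative, not a dump from the production index):

section_14_2_payload = {
    "section": "14",
    "subsection": "(2)",
    "part": "...",                 # Part label taken from the corpus map
    "amended_by": None,            # subsection (3) carries the A1754 pointer instead
    "cross_refs": ["Schedule 1"],  # extracted at chunking time from "the First Schedule"
    "defines": [],                 # only Section 2 chunks populate this field
}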

Embedding model selection

The embedding model used in production is Voyage AI's voyage-law-2. The choice was informed rather than benchmarked: the corpus is jurisdictionally loaded legal text, and the model is one of the small number of embedding models explicitly trained on legal content. Before settling on it, I evaluated three candidates against a thirty-question held-out set written from realistic HR scenarios:

  • voyage-law-2 — legal-domain Voyage model.
  • text-embedding-3-large — OpenAI's general-purpose strong baseline.
  • text-embedding-3-small — smaller OpenAI baseline, included to check whether the domain model was earning its cost.

The general-purpose models retrieved adjacent-but-wrong sections more often on jurisdictionally loaded queries. A question like "what counts as a workman in Sarawak" would, on the general models, retrieve federal-Act definitions or labour-law commentary alongside the Cap. 76 definition. The legal-domain model was visibly better at preferring the in-corpus definition. That difference, even on a small held-out set, was enough to make the choice.
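
The comparison was scored along the lines of the recall-at-k sketch below, where retrieve wraps one candidate embedding model over the same chunks and gold_sections holds the hand-labelled answer-bearing sections; the names are stand-ins for the actual harness:

def recall_at_k(questions, gold_sections, retrieve, k=5):
    # Fraction of questions whose answer-bearing section appears in the top k.
    hits = 0
    for question, gold in zip(questions, gold_sections):
        top = retrieve(question, k=k)  # ranked chunk payloads
        if any(chunk["section"] in gold for chunk in top):
            hits += 1
    return hits / len(questions)

# same thirty questions, one number per candidate model
# for name, retrieve in retrievers.items():
#     print(name, recall_at_k(questions, gold_sections, retrieve))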

This is an honest-but-limited claim. I did not run a formal benchmark on a published legal-IR dataset. The held-out set was thirty questions, written by me, scored by a rubric I designed. The decision is reproducible by anyone who wants to repeat the experiment, but it is not authoritative in the way a peer-reviewed benchmark would be.

The trade-off worth naming: voyage-law-2 ties the system to a paid embedding API, with the latency and dependency cost that implies. For a client whose compliance posture forbids sending content to third-party embedding endpoints, the right choice is a self-hosted general embedding plus more aggressive chunking and reranking work. The methodology does not change; the models do.

Agentic retrieval architecture

The retrieval architecture is a small agent loop sitting on top of a hybrid-search pipeline. The loop is small on purpose. Most user questions are served in a single pass. The agent only does additional work where the question topology demands it — multi-hop questions, cross-reference resolution, and refusal cases.

Figure: high-level architecture. The user query enters the agent controller, which orchestrates the retriever (Qdrant) and the LLM (Anthropic Claude); the citation verifier checks that the answer is grounded before a response is returned.

The flow (a condensed code sketch follows the multi-hop diagram below):

  1. Query rewrite. The user's question is rewritten into one or more retrieval queries that surface the structural features the corpus is keyed on. A question like "can my boss in Kuching fire me without notice?" is rewritten into queries that target Cap. 76 termination provisions, notice-of-termination schedules, and definitional questions about whether the user is a workman within the Ordinance.
  2. Hybrid retrieval. Each rewritten query runs through both a dense retriever (Qdrant, voyage-law-2 embeddings) and a sparse retriever (BM25 over the same chunks). The candidate sets are merged and reranked.
  3. Agent decision. The agent inspects the merged candidates and the rewritten queries. It decides one of three things: (a) generate an answer with the candidates as context; (b) re-query, because a cross-referenced section is missing from the candidate set and the metadata says it should be there; (c) refuse, because no candidate is a credible match for the question.
  4. Generation. If (a) or after a successful (b), Anthropic Claude generates the answer. The system prompt requires the answer to cite specific sections and forbids answers that cannot be tied to a retrieved chunk.
  5. Citation verification. Before the answer is returned to the user, a verifier confirms that each cited source actually contains the asserted fact. Where the verifier finds an unsupported claim, the answer is regenerated with stricter grounding instructions, or, if regeneration fails twice, the system refuses.

Figure: multi-hop retrieval. Dense and sparse retrieval converge into a reranker; the agent decides whether to re-query before generation, and the verifier grounds the final answer. The dashed loop is taken only when retrieval is judged insufficient.
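
The condensed sketch of the loop. The collaborators (rewrite_queries, hybrid_search, missing_cross_refs, generate, verify_citations, refuse) are passed in as assumptions rather than shown; the production controller carries more state, but the shape is the same:

def answer(question, *, rewrite_queries, hybrid_search, missing_cross_refs,
           generate, verify_citations, refuse, relevance_threshold=0.35):
    queries = rewrite_queries(question)                        # 1. query rewrite
    candidates = hybrid_search(queries)                        # 2. dense + sparse, merged and reranked

    missing = missing_cross_refs(candidates)                   # 3. agent decision: re-query when the
    if missing:                                                #    metadata says a cross-ref is absent
        candidates = hybrid_search(queries, must_include=missing)

    if not candidates or max(c.score for c in candidates) < relevance_threshold:
        return refuse(question, reason="no credible match in the corpus")

    for attempt in range(3):                                   # 4.-5. initial generation plus up to
        draft = generate(question, candidates, strict=attempt > 0)  # two stricter regenerations
        if not verify_citations(draft, candidates):            # empty list: every claim is grounded
            return draft
    return refuse(question, reason="the answer could not be grounded in the cited sources")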

Refusal behaviour is treated as a feature, not a fallback. The agent has a refusal rubric: refuse when (i) no candidate scores above the relevance threshold, (ii) the question is outside Sarawak labour law, or (iii) the candidates are mutually contradictory in a way the agent cannot resolve. Refusals carry a short explanation and, where possible, a pointer to where the user might look. The refusal rate on the eval set is in the high single digits and is, deliberately, non-zero.
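
The rubric itself is small; in_scope and contradictory are stand-ins for the actual checks and are passed in here as assumptions:

def refusal_reason(question, candidates, *, in_scope, contradictory, relevance_threshold=0.35):
    # Returns a refusal reason, or None when the agent should answer.
    if not candidates or max(c.score for c in candidates) < relevance_threshold:
        return "no sufficiently relevant provision found in the corpus"
    if not in_scope(question):
        return "the question falls outside Sarawak labour law"
    if contradictory(candidates):
        return "the retrieved provisions conflict in a way that cannot be resolved safely"
    return None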

Refusal is engineered, not emergent. A system that always answers is a system that lies sometimes. A system that refuses cleanly is a system that practitioners can trust.

Citation and grounding

Every answer the system produces ends with a citation block. Each citation gives the section number, the subsection, the part of the Ordinance, and, where applicable, the A1754 provision that amended the cited subsection. Where the answer draws on a schedule, the schedule item is named explicitly. Where the answer touches the federal-Sarawak interplay, the corresponding federal section is named alongside.

The verifier, mentioned above, is the part of the system that turns this from a presentation choice into a correctness check. The verifier runs as a separate LLM call with a locked rubric that asks, for each cited claim in the answer, whether the cited source contains the claim. The verifier has access only to the cited chunks — not to the broader corpus — so it cannot rationalise its way to a positive answer. Where a claim fails verification, the system regenerates; where regeneration fails twice, the system refuses cleanly.
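
A minimal sketch of that verifier call, assuming the Anthropic Python SDK; the rubric wording and model id are placeholders, not the production prompt:

import json
import anthropic

client = anthropic.Anthropic()

VERIFIER_RUBRIC = (
    "For the claim below, answer SUPPORTED only if the quoted source text contains it; "
    "otherwise answer UNSUPPORTED. Do not rely on anything outside the provided source."
)

def find_unsupported_claims(claims_with_sources: list[dict]) -> list[dict]:
    # Each item: {"claim": ..., "citation": "s. 14(2)", "source_text": ...}.
    # The verifier sees only the cited chunks, never the wider corpus.
    unsupported = []
    for item in claims_with_sources:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder model id
            max_tokens=16,
            system=VERIFIER_RUBRIC,
            messages=[{"role": "user", "content": json.dumps(
                {"claim": item["claim"], "source": item["source_text"]})}],
        )
        if "UNSUPPORTED" in response.content[0].text.upper():
            unsupported.append(item)
    return unsupported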

"No answer" is handled honestly. The user is told, in plain language, that the corpus does not support a confident answer, and is given a one-line reason: the question is outside the corpus's coverage, or the candidates conflict, or the available material is too sparse. The system never produces a "best-effort guess" presented as an answer. That single design choice is the largest single source of practitioner trust.

Evaluation methodology

The evaluation harness is the part of the system I am most confident about and the part I would be most embarrassed for a client to see omitted. The harness runs against a curated test set of realistic Sarawak labour-law questions, each labelled with expected citations and key facts. It scores three axes:

  • Citation accuracy. Did the system cite the correct section and subsection? Partial credit is given for correct section but wrong subsection, and for correct subsection but missing cross-reference.
  • Answer faithfulness. Does the generated answer follow from the cited source? Scored by an LLM-as-judge call with a rubric that explicitly tests for unsupported claims, hallucinated case law, and overgeneralisation.
  • Refusal correctness. Was a refusal called for, and if so, did the system refuse? Was an answer called for, and if so, did the system answer? Both directions are scored, because over-refusal is as bad as over-answering.

The test set was constructed from realistic HR scenarios — the kind of question an HR officer at a Sarawak-based company actually asks. Sources included publicly available HR forum threads, employment-law handbooks, and a small number of scenarios I wrote by hand to deliberately test the corpus's edge cases (definitional ambiguity, A1754 overlays, federal-Sarawak interplay, schedule-specific carve-outs).

The harness runs in CI on every meaningful change. Each run produces a per-question CSV and a summary table; both are committed alongside the change so the delta is reviewable. The harness has caught three regressions that would have shipped silently otherwise: a chunking change that broke schedule retrieval, a prompt change that increased hallucination rate by several percentage points, and a reranker tweak that degraded refusal correctness.
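
A sketch of the per-question scoring and report step, using the partial-credit rule from the citation-accuracy axis above; the field names and credit weights are illustrative:

import csv

def citation_score(predicted: set[str], expected: set[str]) -> float:
    # Full credit for an exact section+subsection match, half credit when only
    # the section matches (e.g. "14(1)" cited where "14(2)" was expected).
    if not expected:
        return 1.0 if not predicted else 0.0
    score = 0.0
    for exp in expected:
        if exp in predicted:
            score += 1.0
        elif any(p.split("(")[0] == exp.split("(")[0] for p in predicted):
            score += 0.5
    return score / len(expected)

def write_report(rows: list[dict], path: str = "eval_report.csv") -> None:
    # One row per question; committed alongside the change so the delta is reviewable.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["question", "citation", "faithfulness", "refusal"])
        writer.writeheader()
        writer.writerows(rows)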

Honest results: the system performs well on the questions it was designed for — clear, single-section questions about hours, leave, termination, and definitions. It performs less well on questions that require synthesising across multiple Parts of the Ordinance, or where the practitioner answer depends on case-law interpretation that is not in the corpus. Those failure modes are documented in the eval-harness README and on the live demo page.

What this approach generalises to

The Sarawak Labour Law system is a worked example of a methodology, not a one-off. The same architecture — structural chunking, metadata-driven retrieval, agentic re-query loop, citation verification, an evaluation harness that ships in week one — applies directly to:

  • Industry sustainability standards. Certification frameworks are structured, citation-rich documents with the same kind of schedule overlays and audit-evidence requirements that legal text carries. The retrieval architecture transfers without modification.
  • Stock exchange listing requirements. Exchange rulebooks are large, amended, cross-referenced documents whose practitioner use case — "what does the exchange require for X situation" — is a near-perfect fit for this architecture.
  • Financial regulator guidelines. Compliance officers in regulated financial institutions deal with the same kind of overlay-and-cross-reference structure across regulator circulars and policy documents.
  • Internal corporate policy bases. Employee handbooks, procurement policies, and large engineering runbooks all share the structural-document character that this architecture is built for.

What changes in each domain is the corpus shape and the metadata schema. What does not change is the discipline: read the corpus, model its structure, build the harness first, add agentic complexity only where the eval shows the simpler approach failing.

Try the demo, or get in touch

The system is live at the demo page. A few example prompts are listed there.

If your firm or team is weighing a similar build — legal, sustainability, exchange compliance, financial regulation, or any structured-document domain — I would like to hear about it. Get in touch →