
Chunking Malaysian statutes for retrieval: what worked and what didn't

15th April 2026

The single largest jump in retrieval accuracy on the Sarawak Labour Law system came from re-chunking. The embedding model never changed. Here is what I tried first, why it failed, what I changed, and what I would still do differently.

Why chunking matters more than embedding choice

There is a strong industry instinct, when retrieval quality is poor, to reach for a better embedding model. It is the visible, branded part of the stack. The vendor publishes benchmarks, the framework makes the swap easy, and the change is satisfying because it feels like progress. Most of the time it is not progress. The retrieval was not failing because the embedding was weak; it was failing because the answer-bearing text never made it into the candidate set.

You can prove this to yourself with one diagnostic. For every failure case in your eval set, look at the top-k retrieved chunks and ask: was the answer-bearing text somewhere in those chunks? If the answer is no, the embedding model is not the problem; the chunking, the metadata, or the corpus's structural representation is. No embedding upgrade will save you, because a better model can only rank the chunks that exist; it cannot recover text that was split apart or never indexed as a coherent unit. The problem is upstream.
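In code, that diagnostic is nothing more than a hit-rate check. The sketch below is illustrative rather than the harness's actual API: eval_cases, retrieve, and the gold_section/gold_subsection fields are all stand-in names.

# illustrative hit-rate diagnostic; 'eval_cases' and 'retrieve' are stand-ins
def retrieval_hit_rate(eval_cases, retrieve, k=10):
    hits = 0
    for case in eval_cases:
        candidates = retrieve(case["question"], k=k)
        gold = (case["gold_section"], case["gold_subsection"])
        hits += any(
            (c["metadata"]["section"], c["metadata"]["subsection"]) == gold
            for c in candidates
        )
    return hits / len(eval_cases)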

Once you internalise that, chunking stops looking like a preprocessing detail and starts looking like the highest-leverage decision in the build.

The structure of Malaysian statutes

A Malaysian statute is a structured document. At the top level it is divided into Parts. Each Part contains numbered Sections. Each Section may contain numbered Subsections, which may themselves contain paragraphs labelled (a), (b), (c) and so on. Schedules at the back of the statute carry their own numbering and often apply only to specific subject matter or employer classes.
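That hierarchy maps onto a small data model. The sketch below is illustrative only; the field names are assumptions chosen to line up with the chunker pseudocode later in this post.

from dataclasses import dataclass, field

# illustrative data model for the statute hierarchy; field names are assumptions
@dataclass
class Subsection:
    number: str                       # e.g. "(2)"
    text: str
    amendment_ref: str | None = None  # e.g. "A1754" if the overlay touched it

@dataclass
class Section:
    number: str                       # e.g. "14"
    header: str                       # e.g. "Termination of contract"
    part: str
    subsections: list[Subsection] = field(default_factory=list)

@dataclass
class Schedule:
    name: str                         # e.g. "First Schedule"
    items: list[str] = field(default_factory=list)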

That hierarchy is not decorative. It is how lawyers read the document, how citations are constructed, and how amendments are surgically applied. When the Sarawak Labour Ordinance is amended by Act A1754, the amendment is keyed to specific subsections of Cap. 76 — not to chunks of arbitrary text. A chunker that ignores the hierarchy is a chunker that ignores the actual semantics of the corpus.

There is one further wrinkle. The "current" form of a Malaysian statute is rarely a single document. It is the original statute plus a stack of amendment Acts, each of which inserts, deletes, or replaces specific subsections. The print version of Cap. 76 in common circulation does not always render the A1754 overlay clearly. Practitioners are expected to know which sections are silently overridden. Any retrieval pipeline that is not aware of amendment status will happily surface the wrong-but-printed text.
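Applying the overlay can be pictured as a keyed patch. The sketch below assumes the amendments have already been normalised into (section, subsection, new text) records; the actual amendment language in A1754 is considerably messier, and insertions of entirely new subsections are ignored here.

# illustrative overlay step: each normalised amendment record targets a
# specific subsection of the principal Act and replaces its text
def apply_amendments(sections, amendments):
    index = {
        (s.number, sub.number): sub
        for s in sections
        for sub in s.subsections
    }
    for amd in amendments:                    # e.g. records derived from A1754
        target = index.get((amd["section"], amd["subsection"]))
        if target is None:
            continue                          # insertions handled elsewhere
        target.text = amd["new_text"]
        target.amendment_ref = amd["act"]
    return sections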

Naive chunking and why it fails

The first chunking attempt used the framework default: a 512-token window with 64-token overlap, walking the document linearly. On a smoke set of fifteen handwritten questions, recall on the answer-bearing chunks landed somewhere around 60%. The failures were not random. They had a clear shape.
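For reference, the framework default boils down to something like the sliding window below; tokenize and detokenize are stand-ins for whatever tokenizer the embedding model ships with.

# the naive baseline: fixed-size token windows walked linearly over the text
def sliding_window_chunks(text, tokenize, detokenize, size=512, overlap=64):
    tokens = tokenize(text)
    step = size - overlap
    return [detokenize(tokens[i:i + size]) for i in range(0, len(tokens), step)]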

The most damaging failure mode was mid-section splits. Section 14 of Cap. 76 covers termination of contract. Subsection (1) describes the right to give notice; subsection (2) sets the notice length and points at the First Schedule; later subsections deal with summary dismissal. A 512-token window, depending on where it lands, can split this section in half. When a user asks "what is the minimum notice period for termination?", the retrieved chunk contains the words "shall not be less than the period specified in the First Schedule" with no awareness that the First Schedule needs to be in the candidate set. The agent answers with a confident-sounding generalisation, and the user gets a wrong answer.

The second failure mode was definitional starvation. Section 2 carries the definitions for the entire Ordinance. A naive chunker will chunk Section 2 like any other section, and the definitions will compete with hundreds of other chunks for retrieval. When a user asks "what counts as a workman in Sarawak?", the retriever may surface a section that uses the word "workman" rather than the section that defines it. Lawyers reading the answer will spot the gap immediately.

The third failure mode was schedule blindness. Several schedules apply only to specific employer classes — estate workers, manufacturing, domestic helpers. The schedules are short, structurally distinct, and full of jurisdictional carve-outs. A token-window chunker treats them like any other paragraph and they retrieve poorly on questions that should explicitly route to them.

The chunking strategy that worked

The strategy that worked is rule-based and structural. Each chunk corresponds to a single subsection of Cap. 76 (after the A1754 overlay has been applied), with the section header and any defined terms used in the subsection prepended as context. Where a subsection is short, sibling subsections are concatenated up to a soft cap of around 800 tokens. Where a subsection is long, it is left intact as a single chunk. Schedules are chunked at the schedule-item boundary.

# pseudocode for the chunker (helper functions are assumed, not shown)

# amended_corpus is Cap. 76 with the A1754 overlay already applied
for section in parse_sections(amended_corpus):
    for subsection in section.subsections:
        chunk = render_chunk(
            # prepending the header keeps the chunk retrievable on its own
            section_header=section.header,
            text=subsection.text,
            metadata={
                "section": section.number,
                "subsection": subsection.number,
                "part": section.part,
                "amended_by": subsection.amendment_ref,   # e.g. "A1754"
                "cross_refs": extract_refs(subsection.text),
                "defines": extract_defined_terms(subsection.text),
            },
        )
        store(chunk)
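
The pseudocode leans on two steps it does not show: how a chunk is rendered and how short siblings get merged up to the soft cap. Both are sketched below under stated assumptions; render_chunk's handling of defined terms is simplified here, and count_tokens stands in for the embedding model's tokenizer.

# illustrative versions of the two helpers the pseudocode glosses over

def render_chunk(section_header, text, metadata):
    # prepend the section header (and, in simplified form, the defined terms
    # the subsection relies on) so the chunk stands alone at embedding time
    header = section_header
    if metadata["defines"]:
        header += " [defined terms: " + ", ".join(metadata["defines"]) + "]"
    return {"text": header + "\n" + text, "metadata": metadata}

def merge_short_siblings(chunks, count_tokens, soft_cap=800):
    # concatenate adjacent subsections of the same section until the soft cap;
    # long subsections pass through untouched, and for brevity the merged
    # chunk keeps only the first sibling's metadata
    merged, buffer = [], None
    for chunk in chunks:
        same_section = (
            buffer is not None
            and buffer["metadata"]["section"] == chunk["metadata"]["section"]
        )
        if same_section and count_tokens(buffer["text"] + chunk["text"]) <= soft_cap:
            buffer["text"] += "\n" + chunk["text"]
        else:
            if buffer is not None:
                merged.append(buffer)
            buffer = chunk
    if buffer is not None:
        merged.append(buffer)
    return merged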

The win came from three things, in roughly this order of impact. First, chunks now respect the document's natural structure, which means a question that should retrieve Section 14 retrieves all of Section 14 as a single coherent chunk. Second, the metadata schema gives the retriever something to filter and re-query against — cross-references and defined terms become first-class hints, not buried strings. Third, prepending the section header ensures every chunk carries enough context to be retrievable on its own; an embedding model cannot read the surrounding section just because it is "nearby" in the corpus.

On the same fifteen-question smoke set, recall went from roughly 60% to over 90%. The embedding model never changed. That is the reason chunking is a higher-leverage decision than embedding choice on most legal corpora.

Metadata schema decisions

The metadata schema settled at six fields. They are not all equally important. section and subsection drive citation rendering; part rarely affects retrieval but improves human-readable output; amended_by is critical because it lets the retriever de-prioritise stale subsections; cross_refs is what enables the agent to fetch a missing schedule when a section points at it; defines is the field that fixes definitional starvation.
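The cross_refs field in particular is easiest to see in code. A hedged sketch, with lookup_by_citation standing in for whatever chunk-store lookup the agent actually calls:

# illustrative query-time expansion: anything a retrieved chunk points at is
# pulled into the candidate set, so a "notice period" question surfaces the
# First Schedule alongside Section 14(2)
def expand_cross_refs(candidates, lookup_by_citation):
    expanded = list(candidates)
    fetched = set()
    for c in candidates:
        for ref in c["metadata"]["cross_refs"]:
            if ref not in fetched:
                expanded.extend(lookup_by_citation(ref))
                fetched.add(ref)
    return expanded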

I had a longer schema at first — effective dates, gazette references, internal commentary tags — and almost all of it turned out to be either unused or harmful. Unused metadata is fine. Harmful metadata is the kind that introduces noise into hybrid retrieval (BM25 starts matching against tag strings, and ranking changes for the wrong reasons). The discipline is to add metadata fields only when an eval-driven decision needs them.

One field I am still unsure about: a structured representation of cross-references rather than a flat array. The flat array works for single-hop cross-references; for chained ones (Section 14 points to the First Schedule, which points to Section 60), a graph representation might pay for itself. I have not measured this yet.

Testing chunking decisions against the eval harness

Every chunking change ran through the eval harness before it was kept. The discipline was straightforward: cut a branch, change one thing, run the harness, look at the deltas on citation accuracy and faithfulness, decide. The eval set was deliberately small at first — about thirty questions — because the goal was fast iteration rather than statistical significance. Once the chunking strategy stabilised, the test set grew to its current size.
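The compare step itself stays small. A sketch, with run_harness and the chunker arguments as placeholders; the metric names are the ones discussed in this post:

# illustrative shape of a chunking-change comparison: run the harness on both
# variants and look at per-metric deltas before keeping anything
def compare_runs(baseline_chunker, candidate_chunker, eval_set, run_harness):
    before = run_harness(baseline_chunker, eval_set)
    after = run_harness(candidate_chunker, eval_set)
    return {
        metric: after[metric] - before[metric]
        for metric in ("citation_accuracy", "faithfulness", "refusal_correct_rate")
    }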

One pattern emerged that I had not expected. Chunking changes that improved citation accuracy sometimes degraded faithfulness, because the larger, structurally aware chunks gave the LLM more room to paraphrase its way into unsupported territory. The fix was a tighter generation prompt that required the model to quote-and-cite rather than summarise-and-cite. That change would not have been visible without the harness; the casual eye sees a system that "found the right section" and assumes correctness has improved.

A second pattern: chunking changes that helped on the easy questions sometimes hurt on the refusal cases. Bigger, richer chunks meant the system was more confident on borderline questions where it should have refused. The harness caught this on the refusal-correct rate — a metric that has no equivalent in spot-check testing. This is the value the harness earns: invisible regressions become visible deltas.

What I'd do differently next time

Two things. First, I would invest earlier in the amendment overlay pipeline. I started with the print Cap. 76 and added A1754 awareness later, which meant the first three weeks of evaluation were on a corpus that had subtly wrong text in places. A clean amended corpus from the start would have saved real time, and it would have made the early eval results more trustworthy as a baseline. Second, I would write the chunker in a way that lets multiple chunking strategies coexist behind a feature flag, with the eval harness running both and reporting the delta. I ended up doing that anyway, just later than I should have.
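The feature-flag version is not much code. A sketch, assuming each strategy is just a function from corpus to chunks and run_harness returns a metrics dict; every name here is a placeholder:

# illustrative registry so chunking strategies coexist behind a flag and the
# harness can report deltas between them
def run_strategies(strategies, corpus, eval_set, run_harness):
    return {
        name: run_harness(chunker(corpus), eval_set)
        for name, chunker in strategies.items()
    }

# e.g. run_strategies({"naive_512_64": sliding_window, "structural": structural_chunker}, ...)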

A third thing, smaller but worth naming: I underestimated how much of the chunking work is parsing, not chunking. The first version of the parser handled about 80% of the structural cases correctly. The remaining 20% — nested clauses, unconventional schedule formats, definitional sub-paragraphs — took roughly as long to handle correctly as the first 80% did. This is a familiar tax on legal-text engineering and worth budgeting for explicitly the next time.

For the broader picture — how this chunking strategy fits into the rest of the system — see the Sarawak Labour Law case study.