Skip to content
Writing
Build in public11 min read28 May 2026

We built a private document assistant for a law firm in three weeks. Here's what broke.

by Sid

Published with the client's permission; details anonymized.

Most write-ups about AI builds are advertisements wearing a case-study costume — everything went perfectly, the client was thrilled, contact us today. This isn't that. Here's a real one, including the parts that went sideways, because the parts that go sideways are exactly what you're hiring an engineer to have already encountered.

The brief

A mid-sized law firm with two decades of accumulated work — contracts, advice memos, prior matters — wanted to make that archive answerable. A lawyer should be able to ask "have we dealt with a clause like this before, and what did we advise?" and get an answer grounded in the firm's own documents, with citations.

The hard constraint: none of it could leave the firm's control. Privileged client material going to a public AI API was a non-starter, full stop. So the entire thing had to run in their own cloud tenant, in-region, with no third party in the data path. Three-week target.

What we scoped

Deliberately narrow. One workflow — retrieval-augmented question answering over the existing document set — and nothing else. No drafting, no automation, no integrations beyond reading the archive. We've learned that scope creep is what turns a three-week build into a three-month one, so we wrote down what we were not doing as carefully as what we were.

The stack: an open model served privately in their tenant, a retrieval layer over their documents, access controls tied to their existing logins, and audit logging on every query. And the deliverable that comes with every private build we do — a written data-flow document mapping exactly where each piece of data sat and how it satisfied their obligations.

What went well

The infrastructure went up cleanly. Serving an open model privately in a cloud tenant is well-trodden ground for us, and the access-control and logging work was unglamorous but routine. By the end of week one we had a private endpoint answering questions. It felt ahead of schedule.

It was not ahead of schedule.

What broke

The retrieval problem that ate two days. The first version retrieved confidently and answered wrongly — not hallucinating from nothing, but pulling the almost-right document. Ask about an indemnity clause and it would surface a different matter's indemnity clause and answer from that, with total confidence and a citation that looked legitimate. For a law firm, "confidently wrong with a real-looking citation" is worse than "I don't know." It's the failure mode that destroys trust.

The cause was that legal documents are full of near-identical language. Standard clauses repeat across matters, so naive semantic search clustered things that read alike but were legally distinct. The fix took two days of work we hadn't fully budgeted: chunking the documents along their actual structure instead of by arbitrary length, adding metadata so the system knew which matter a chunk came from, and tuning retrieval to prefer precision over recall — better to return three highly relevant passages than ten loosely related ones. We also made the system explicit about uncertainty: when nothing was a strong match, it said so rather than reaching.

The "obvious" question we got wrong. We'd assumed lawyers would ask well-formed questions. They asked the way busy people actually ask — terse, with internal shorthand and matter numbers we hadn't accounted for. Our first version handled clean questions and fumbled real ones. We caught it only because we sat with actual users in week two instead of demoing to the partner who'd signed off. Lesson re-learned: test with the people who'll use it, not the person who bought it.

What we'd do differently

  1. Budget for retrieval tuning from the start. On any document set with repetitive language — legal, medical, financial — naive retrieval will look fine in the demo and fail in the field. We now treat tuning as a planned phase, not a surprise.
  2. Sit with real users in week one, not week two. The gap between how the buyer describes the work and how the users actually do it is where projects quietly miss.
  3. Make uncertainty a feature, not a bug. The single most trust-building change was getting the system to say "I don't have a strong match for that" instead of guessing. We now build that in from day one.

The result

The firm has a private assistant, running entirely in their own environment, that lawyers actually use — and a written compliance document they can show a client or a regulator. It shipped in three weeks and a couple of days. The couple of days were the retrieval fix, and they were the difference between a tool people trust and a tool people quietly abandon.

That's the honest version. If you want this kind of build — and the honest conversation about what might break in yours — book a free AI Review. If the data-can't-leave constraint is your whole reason for being here, start with our Private & Secure AI work.

Want this kind of build — and the honest version of what might break in yours?

Read more