AI & ENGINEERING · 12 min read · May 8, 2026

We replaced a $40k/yr support team with a RAG bot. Here's exactly how.

A 12-week build that started as an experiment and ended up handling 71% of tickets — with the eval framework, prompt versions, and three mistakes we wish we'd avoided.

The brief

Our client, a B2B SaaS company in the legal-tech space, was paying ~$40,000 a year for a small offshore support team to answer the same 60 questions over and over. Their docs were excellent but never read. Their power users wanted faster answers; their new users wanted smarter onboarding.

The pitch from us was simple: "What if 70% of these tickets never needed a human?" Twelve weeks later, that's exactly what we shipped.

Architecture, in plain English

We built a retrieval-augmented generation (RAG) system on top of Claude Sonnet 4.6, with the company's docs, ticket history, and product knowledge base as the corpus. The full pipeline:

Ingest: nightly crawl of Notion docs, Zendesk tickets, and product release notes.
Chunk & embed: semantic chunks with overlap, embedded via text-embedding-3-large, stored in pgvector.
Retrieve: hybrid search — BM25 lexical match + dense semantic — re-ranked with Cohere's reranker.
Generate: answer composed with strict instructions to cite sources and refuse out-of-corpus questions.
Eval: a golden test set of 240 historical tickets gates every deploy.
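
To make the chunking and hybrid-retrieval steps concrete, here is a minimal sketch in Python. It is not our production code. It assumes fixed-size word windows in place of true semantic chunking, and it assumes reciprocal rank fusion (RRF) as the way to merge the BM25 and dense result lists before they go to the reranker. The function names, the window sizes, the constant k=60, and the doc ids are all hypothetical.

```python
from collections import defaultdict


def chunk_with_overlap(words, size=200, overlap=40):
    """Split a list of words into windows of `size` that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def rrf_fuse(rankings, k=60, top_n=10):
    """Merge several ranked lists of doc ids with reciprocal rank fusion.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on doc id so the order is stable.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical result lists from the lexical and the semantic search.
bm25_hits = ["doc-billing", "doc-sso", "doc-export"]
dense_hits = ["doc-sso", "doc-roles", "doc-billing"]

candidates = rrf_fuse([bm25_hits, dense_hits])
print(candidates)  # ['doc-sso', 'doc-billing', 'doc-roles', 'doc-export']
```

Documents that both searches agree on rise to the top, and the fused shortlist is what a reranker would then reorder.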

"Half the work was the model. The other half was the eval suite. Without the latter you're flying blind on every prompt change."

The three things we got wrong

1. We trusted vibes over the eval suite (briefly)

Around week 4 we shipped a prompt tweak that "felt obviously better": clearer instructions, more polite refusals. Accuracy on the eval suite quietly dropped from 84% to 71%. We caught it in CI because we had the gate, but for a panicked 20 minutes we thought the regression was in retrieval, not the prompt. Lesson: every prompt PR runs the full eval, no exceptions.
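
A deploy gate of this kind can be sketched in a few lines of Python. This is an illustration, not our framework. It assumes an answer counts as correct when it contains an expected phrase, and it assumes the gate tolerates a drop of up to two points against the last accepted baseline. `GoldenCase`, `accuracy`, `gate`, and `max_drop` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    question: str
    expected: str  # phrase the answer must contain to count as correct (assumed grading rule)


def accuracy(cases, answer_fn):
    """Fraction of golden cases whose answer contains the expected phrase."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for case in cases
        if case.expected.lower() in answer_fn(case.question).lower()
    )
    return passed / len(cases)


def gate(score, baseline, max_drop=0.02):
    """Return True when the candidate prompt is allowed to deploy."""
    return score >= baseline - max_drop


# The week-4 regression: baseline 84%, candidate prompt 71%.
print(gate(0.71, baseline=0.84))  # False
```

In CI, a False result would fail the job (for example by exiting non-zero), so the prompt change cannot merge until the score recovers.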

2. We under-invested in refusals

The first version was eager to please — it would happily answer questions about the law itself, when the corpus only covered the product. Users loved it for a week, then started filing tickets when answers were wrong. We rebuilt the system prompt with explicit refusal rules and a quick classifier in front to catch out-of-scope questions early.
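
The shape of that front-door check looks roughly like the Python sketch below. It is a simplification. A keyword list stands in for the classifier, and the marker words, the refusal text, and the function names are all hypothetical. The point is the control flow: out-of-scope questions get a refusal before retrieval or generation ever runs.

```python
REFUSAL = (
    "I can only help with questions about the product. "
    "For legal questions, please consult a qualified professional."
)

# Hypothetical markers of legal-advice questions. A real classifier would be
# a trained or prompted model rather than a keyword list.
OUT_OF_SCOPE_MARKERS = ("statute", "lawsuit", "legal advice", "liable")


def is_in_scope(question):
    """Return False when the question looks like a request for legal advice."""
    text = question.lower()
    return not any(marker in text for marker in OUT_OF_SCOPE_MARKERS)


def answer(question, rag_answer, in_scope=is_in_scope):
    """Refuse early when out of scope; otherwise call the RAG pipeline."""
    if not in_scope(question):
        return REFUSAL
    return rag_answer(question)
```

Because `in_scope` is a parameter, the keyword check can be swapped for a model-based classifier without touching the rest of the flow.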

3. We forgot about the long tail of source freshness

The product team shipped a major UI change in week 9. The bot kept citing the old docs for three days because our re-embed cadence was daily, not on-publish. We added a Notion webhook to invalidate & re-embed within minutes.
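
The on-publish path can be sketched as follows, in Python. This is a sketch under assumptions, not our integration. An in-memory class stands in for the vector store, and the event shape (`type`, `page_id`) is hypothetical rather than taken from Notion's webhook payload. What it shows is the invariant that matters: every stale chunk for a page is dropped before the new ones are embedded, so the bot cannot cite an old version.

```python
class ChunkIndex:
    """In-memory stand-in for the vector store, keyed by source page id."""

    def __init__(self, embed_fn):
        self._embed = embed_fn
        self._by_page = {}  # page_id -> list of (chunk_text, vector)

    def reindex_page(self, page_id, chunks):
        """Drop every stale chunk for the page, then embed the new ones."""
        self._by_page.pop(page_id, None)
        self._by_page[page_id] = [(chunk, self._embed(chunk)) for chunk in chunks]

    def delete_page(self, page_id):
        self._by_page.pop(page_id, None)

    def chunks_for(self, page_id):
        return [text for text, _ in self._by_page.get(page_id, [])]


def handle_doc_event(index, event, fetch_chunks):
    """Apply one publish or delete event (hypothetical event shape)."""
    page_id = event["page_id"]
    if event["type"] == "deleted":
        index.delete_page(page_id)
    else:
        index.reindex_page(page_id, fetch_chunks(page_id))
```

A webhook endpoint would call `handle_doc_event` once per incoming event, with `fetch_chunks` pulling and chunking the current version of the page.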

What it cost, and what it saved

Build cost: ~$58k (12 weeks, two senior engineers + one ML eng + one PM).
Run cost: ~$340/mo (LLM tokens, embeddings, hosting). Negligible.
Replaced: ~$40k/yr in support headcount, plus the head of support reclaimed her time for higher-value work.
Speed and CSAT: first-response time dropped from 4 hours to 8 seconds, and CSAT held steady.
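
For a rough payback estimate, the figures above can be plugged into a few lines of Python. The calculation assumes the only saving is the ~$40k/yr in support headcount and the only ongoing cost is the ~$340/mo run cost. It ignores the head of support's reclaimed time and any maintenance effort.

```python
build_cost = 58_000          # one-off build cost, USD
run_cost_per_month = 340     # USD
replaced_per_year = 40_000   # support spend replaced, USD

run_cost_per_year = run_cost_per_month * 12
net_saving_per_year = replaced_per_year - run_cost_per_year
payback_months = build_cost / net_saving_per_year * 12

print(run_cost_per_year)         # 4080
print(net_saving_per_year)       # 35920
print(round(payback_months, 1))  # 19.4
```

On these numbers alone, the build pays for itself in a little over a year and a half.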

Would we recommend it?

Yes — but only if you have well-maintained docs. RAG is a multiplier on documentation quality. If your knowledge base is a mess, the bot will confidently parrot the mess. Fix the docs first; the bot pays for itself second.

If you're thinking about something similar, book a call — we'll show you the eval framework and the prompt repository, and you can decide whether to build it yourselves or whether it's worth bringing us in.

Rajesh K.

Founder & CEO · Skill Horizon Technologies