What We Learned Building a Citation Engine for Email AI
Building Email AI That You Can Actually Trust
Most email AI demos look great for about 30 seconds.
You open a thread. The tool produces a polished reply. It sounds helpful, calm, competent. Then the real question hits: would you actually send this to a customer without checking it first?
Our early drafts were good enough to be dangerous. They read well. But users still had to verify every factual claim — pricing, plan limits, security details, contract terms. The exact stuff you can't afford to get wrong.
So instead of saving time, we'd built a fact-checking job.
That forced us to get honest about what the product actually was. Not an email writer. A trust system. And trust in email has almost nothing to do with whether a draft sounds human. It comes down to one question: can you defend what it says?
From that point on, we had one rule: if a draft makes a factual claim, it cites its source. Not "based on your knowledge base." The actual doc, the actual passage, the exact line you'd check before hitting send. That decision shaped almost everything else.
Generation is easy. Retrieval is hard.
Writing fluent email is a solved problem. Any solid model can handle grammar, tone, and a friendly sign-off.
The hard problem is different: given one messy inbound email, can you find the exact evidence inside a company's docs that supports a safe reply?
A customer asks whether they can use SSO on the Pro plan and what happens to their data after cancellation. The answer might be spread across a pricing page, a security FAQ, a retention policy, a contract, a help center article, and an internal note that probably should have been a real doc months ago.
The model doesn't need more eloquence. It needs better evidence.
Bad retrieval plus a fluent model is worse than no AI at all. A weak source becomes a very confident sentence. That's exactly how trust gets destroyed.
This is where the evidence chain usually breaks: retrieval.
Keyword search wasn't enough. Neither was semantic search.
Our first instinct was simple: search for the words in the email. That works when customers use the same language your docs do. They usually don't.
They ask: "Do you support SSO?" Your docs say: "SAML-based single sign-on." They ask: "What happens to our data if we churn?" Your docs say: "retention and deletion policy." Pure keyword search rewards literal overlap and misses intent.
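The mismatch is easy to demonstrate. This minimal sketch (not our production code) scores a passage by literal term overlap with the query, the way naive keyword search does, and comes up empty even when the passage answers the question exactly:

```python
def keyword_overlap(query: str, passage: str) -> float:
    """Fraction of query terms that literally appear in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().replace("-", " ").split())
    if not q_terms:
        return 0.0
    return len(q_terms & p_terms) / len(q_terms)

query = "do you support sso"
doc = "SAML-based single sign-on is available on Enterprise plans"

# The passage answers the question, but literal overlap is zero.
score = keyword_overlap(query, doc)
```

The customer's words and the doc's words share no terms at all, so overlap-based ranking scores the right answer at zero.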
So we added semantic retrieval. That helped — and created a new problem.
Semantic search is good at finding passages that feel related. Related is not the same as strong enough to support an answer. A passage about enterprise security can look relevant to a data retention question. A roadmap note may match a feature question even though it absolutely shouldn't be treated as current product truth.
Semantic similarity is not factual support.
So we stopped treating retrieval as a single step and built it as a pipeline: retrieve broad candidates, filter weak or risky evidence, re-rank for answerability, pass only supportable passages into the draft. That shift mattered more than any model upgrade.
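The shape of that pipeline can be sketched in a few lines. The scores and thresholds here are illustrative stand-ins; in a real system, `relevance` would come from an embedding model and `answerability` from a re-ranker:

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    source: str
    relevance: float      # broad retrieval score (e.g. vector similarity)
    answerability: float  # re-ranker score: can this support a claim?

def retrieval_pipeline(candidates, min_relevance=0.3,
                       min_answerability=0.7, top_k=5):
    # Stage 1: keep broad candidates above a loose relevance floor.
    pool = [p for p in candidates if p.relevance >= min_relevance]
    # Stage 2: drop passages too weak to support a factual claim.
    strong = [p for p in pool if p.answerability >= min_answerability]
    # Stage 3: re-rank by answerability, not raw similarity.
    strong.sort(key=lambda p: p.answerability, reverse=True)
    # Stage 4: only the survivors reach the drafting model.
    return strong[:top_k]
```

The key design choice is the second stage: a passage can be highly similar and still get filtered because it can't actually support an answer.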
The boring infrastructure work had the biggest impact.
Chunking, for example. If chunks are too small, you lose context — "available on paid plans" is useless if the feature name and caveats got split away. If chunks are too large, everything matches a little and nothing matches well.
Better results came when chunks followed meaning instead of arbitrary token counts: pricing sections stayed with pricing tables, policy language stayed with definitions and exceptions, help center steps stayed with prerequisites.
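As a toy illustration of what "chunks follow meaning" looks like, this sketch splits a Markdown-style doc at section headings instead of a fixed token count, so a caveat like "available on paid plans" stays attached to the feature it qualifies (real chunkers handle nesting, overlap, and size caps; this is the idea only):

```python
def chunk_by_section(doc: str) -> list[str]:
    """Split a Markdown-style doc into one chunk per heading-led section."""
    chunks, current = [], []
    for line in doc.splitlines():
        # A new heading closes the previous section's chunk.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```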
We also got stricter about what not to cite. Every company has messy knowledge — canonical docs, outdated summaries, half-baked internal speculation with an official-looking title. If the source hierarchy is sloppy, "grounded" doesn't mean much. So we bias hard toward stronger sources: canonical over notes, newer over older, direct product docs over duplicate mentions, customer-safe documentation over internal brainstorms.
Some content is useful as context. That doesn't mean it belongs in a customer-facing claim.
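One way to encode that hierarchy is a priority score combining source tier, freshness, and customer-safety. The tiers and weights below are purely illustrative, not tuned values from our system:

```python
from datetime import date

# Hypothetical tier weights: canonical beats notes, and internal
# material is context-only, never a customer-facing citation.
TIER_WEIGHT = {
    "canonical_doc": 1.0,
    "help_center": 0.9,
    "contract": 0.9,
    "internal_note": 0.3,
}

def source_priority(tier: str, last_updated: date, customer_safe: bool) -> float:
    age_days = (date.today() - last_updated).days
    freshness = max(0.0, 1.0 - age_days / 365)  # linear decay over a year
    safety = 1.0 if customer_safe else 0.2      # heavy internal-only penalty
    return TIER_WEIGHT.get(tier, 0.1) * (0.5 + 0.5 * freshness) * safety
```

With a scheme like this, a fresh canonical doc always outranks an internal note, and a stale copy of the same doc ranks below a current one.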
We got the citation UI wrong at first.
Our first version put citations at the bottom of the draft. Clean design. Bad product decision.
People don't review drafts like research papers. They scan line by line, looking for risk. If a sentence feels off, they want the source right there.
So we moved citations inline. Behavior changed immediately. When a draft says "the Pro plan includes 500 drafts per month" and the source is attached to that sentence, review gets faster — not because users love citations, but because we'd already answered the question forming in their head: where did this come from?
Trust UI has to show up at the moment doubt shows up.
Trustworthy drafts often say less.
This is probably the most important thing we got right.
When evidence is weak, the right move is usually not to write more. It's to narrow the answer — cover only the supported part, avoid invented specifics, flag uncertainty, leave a placeholder for review.
This looks less impressive in a demo. It works much better in production.
If your system can't distinguish between "I found something related" and "I found enough evidence to say this to a customer," writing quality barely matters. Restraint is a product feature.
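The restraint rule reduces to a gate: state a claim only when its evidence clears a support threshold, and leave an explicit review placeholder otherwise. The threshold and scores below are illustrative:

```python
def draft_claims(claims_with_evidence, support_threshold=0.75):
    """claims_with_evidence: list of (claim_text, support_score) pairs.

    Claims below the threshold become review placeholders instead of
    plausible-sounding invention.
    """
    lines = []
    for text, support in claims_with_evidence:
        if support >= support_threshold:
            lines.append(text)
        else:
            lines.append("[NEEDS REVIEW: insufficient evidence for this point]")
    return lines
```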
The model is not the product. The evidence chain is.
Anyone can generate a smooth email now. The real bar is whether a draft is trustworthy enough to use in an actual business conversation — pricing, policies, security, contracts, edge cases, commitments.
Sounding right is easy. Being verifiably right is the part that matters.
That's what we built toward with Inbox SuperPilot: a retrieval pipeline that biases toward canonical sources, a strict source hierarchy, and KB-grounded drafts with inline citations inside Gmail — so you can verify every claim before you send. The evidence chain isn't a feature. It's the product.
FAQ
What is a citation engine in the context of email AI?
A citation engine links each factual claim in a draft email to the specific source document it came from — a pricing page, a help center article, a contract. The goal is to make every assertion in a draft verifiable before you send it, rather than relying on the model's general knowledge.

What's the difference between semantic search and KB-grounded retrieval?
Semantic search finds passages that are conceptually similar to a query. KB-grounded retrieval goes further: it filters candidates for evidential strength, re-ranks them for answerability, and only passes passages that can actually support a factual claim into the draft. Similarity is a starting point, not an endpoint.

Why does chunking strategy matter so much?
Chunks that split across logical units — separating a feature name from its caveats, or a policy from its exceptions — lose the context that makes them useful. A chunk that says "available on paid plans" with no surrounding context is technically retrieved but practically useless. Meaning-based chunking keeps related information together so retrieval returns something the model can actually use.

Why do you show citations inline rather than at the bottom?
People review email drafts line by line, not as a whole document. Doubt about a specific sentence surfaces at that sentence — not at the end. Inline citations answer the question ("where did this come from?") exactly when it forms, which makes review faster and builds more trust per second of attention.

What happens when there isn't enough evidence to answer a question?
The right behavior is to narrow the draft — answer only the part that's supported, flag what's uncertain, and leave a placeholder for human review. A draft that says less but says it accurately is more useful than one that fills the gap with plausible-sounding invention.

Does this approach work for small teams without a formal knowledge base?
Yes. A Google Drive folder with a pricing doc and a help center counts as a knowledge base. The retrieval pipeline works proportionally — more sources produce better coverage, but even a small, well-organized set of canonical docs produces meaningfully more accurate drafts than generic AI with no grounding.
Further Reading & References
From the Inbox SuperPilot Blog
- Why Generic AI Fails in Customer Support Email Workflows
- ChatGPT vs. Gemini vs. Claude for Email: Where Generic AI Falls Short
- 5 Email Mistakes AI Catches That Humans Miss
On Retrieval-Augmented Generation (RAG)
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al., 2020. The foundational RAG paper. Explains why grounding generation in retrieved documents improves factual accuracy.
- RAGAS: Automated Evaluation of Retrieval Augmented Generation — Es et al., 2023. Framework for evaluating retrieval quality and answer faithfulness — the two metrics that matter most in a citation engine.
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al., 2023. Why the position of evidence in a retrieved context window affects whether models actually use it.
On Trust and Verification in AI Systems
- Factuality Challenges in the Era of Large Language Models — Augenstein et al., 2023. A thorough survey of hallucination types and mitigation strategies in LLMs.
- Constitutional AI: Harmlessness from AI Feedback — Anthropic, 2022. Relevant background on building AI systems that can identify and flag uncertain or unsupported claims.
Ready to try Inbox SuperPilot?
Get AI-powered email drafts grounded in your knowledge base. Start for free, no credit card required.
Free plan includes 50 drafts/month.