Poor Retrieval Quality Is Killing Enterprise AI Projects in Pharma and Finance (But Fine-Tuned Embedding Models Can Help)

Enterprises across industries are building Retrieval-Augmented Generation (RAG) systems to help employees query their internal knowledge bases and get precise, context-aware answers. But a critical bottleneck continues to derail many of these projects — poor retrieval quality.

When a RAG system retrieves irrelevant or incomplete context, the large language model (LLM) generates inaccurate or vague responses. The result: frustration, misinformation, and failed adoption.

At SmartBots, our intern, Sanjana Srikanth, conducted a research study on fine-tuning embedding models for domain-specific RAG systems. Its findings point to a powerful solution, one that can salvage real-world enterprise AI initiatives that would otherwise falter on weak retrieval foundations.

Why Fine-Tuned Embeddings Matter

Every RAG system relies on embeddings: mathematical representations of text that help the model identify relevant documents. Off-the-shelf embedding models, such as OpenAI’s text-embedding-ada-002 or Microsoft’s MPNet, are trained on broad, generic data. They understand everyday language, but not the dense, domain-specific vocabulary found in clinical reports or financial disclosures.

Fine-tuning embeddings on an enterprise’s own corpus — say, its internal research papers, compliance manuals, or risk reports — dramatically improves retrieval accuracy.
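Under the hood, retrieval ranks chunks by the similarity of their embedding vectors to the query’s. A minimal sketch using cosine similarity, with toy three-dimensional vectors standing in for real embeddings (production models output hundreds or thousands of dimensions; the numbers here are purely illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the standard relevance score between embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings (illustrative only).
query = np.array([0.9, 0.1, 0.3])
relevant_chunk = np.array([0.8, 0.2, 0.4])    # same topic as the query
unrelated_chunk = np.array([0.1, 0.9, 0.2])   # different topic

print(cosine_similarity(query, relevant_chunk))   # high score: retrieved
print(cosine_similarity(query, unrelated_chunk))  # low score: skipped
```

Fine-tuning does not change this scoring step; it changes the vectors themselves, so that domain-related query/chunk pairs land closer together.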

In our study, fine-tuning improved context recall from 0.45 to 0.57 and faithfulness from 0.63 to 0.75 — enough to shift a RAG system from “confusing” to “commercially reliable.”
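As a rough intuition for what the context recall metric measures (evaluation frameworks such as RAGAS compute it more carefully, over claims in the answer rather than raw chunk IDs; the study’s exact method is not restated here), a minimal sketch over hypothetical chunk IDs:

```python
def context_recall(retrieved_ids: set, relevant_ids: set) -> float:
    """Fraction of the ground-truth relevant chunks that made it into
    the retrieved context. 1.0 means nothing relevant was missed."""
    if not relevant_ids:
        return 0.0
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)

# Hypothetical example: 3 chunks are relevant, the retriever found 2 of them.
print(context_recall({"c1", "c2", "c9"}, {"c1", "c2", "c3"}))
```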

Let’s look at how this could transform outcomes in two high-stakes industries: pharma and finance.

Pharma: Fixing the Clinical Document Chaos

The Problem:

A major pharmaceutical company deploys a knowledge assistant to help R&D teams retrieve insights from thousands of clinical trial PDFs. Scientists are expected to ask questions like, “Which compounds showed improved efficacy in patients with Stage 3 melanoma?”

But the system retrieves irrelevant sections — sometimes referencing unrelated compounds or pre-clinical studies. Why? The embedding model doesn’t “understand” medical structure. It treats “melanoma” and “carcinoma” as roughly similar. Internal adoption stalls because users stop trusting the results.

The Solution:

By applying domain-specific fine-tuning (as described in the research), the company could train embeddings on its exact trial corpus, aligning the model’s language understanding with how its scientists actually write and query data.

  • Using Multiple Negatives Ranking Loss (MNRL) training, the model could learn to distinguish between closely related conditions or compounds.
  • With data cleaning, charts and tables embedded in PDFs would be parsed as text for retrieval.
  • Fine-tuned embeddings would ensure that a query about melanoma efficacy retrieves only relevant trial arms and patient cohorts.
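The intuition behind MNRL can be sketched directly: each (query, relevant passage) pair in a training batch treats every other passage in the batch as a negative, and a softmax cross-entropy pushes the true pair’s similarity above the rest. Below is an illustrative numpy version of that objective; in practice one would typically use a library implementation such as sentence-transformers’ MultipleNegativesRankingLoss on real (question, passage) pairs:

```python
import numpy as np

def mnrl_loss(anchors: np.ndarray, positives: np.ndarray,
              scale: float = 20.0) -> float:
    """Multiple Negatives Ranking Loss, sketched in numpy.

    Row i of `positives` is the correct match for row i of `anchors`;
    every other row in the batch serves as an in-batch negative.
    """
    # L2-normalise so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)              # (batch, batch) similarity matrix
    # Log-softmax with max-subtraction for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (true pairs) as the correct labels.
    return float(-np.mean(np.diag(log_softmax)))
```

Minimising this loss pulls each query toward its own passage and away from the rest of the batch, which is exactly the “distinguish closely related conditions” behaviour described above.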

The Result:

Instead of generic, unreliable responses, the RAG assistant could surface precise study findings (confidence intervals, patient response rates, compound identifiers) within seconds. Clinical reviewers could shorten literature review cycles from weeks to days, directly accelerating drug discovery.

Finance: Preventing a Risk Intelligence Failure

The Problem:

A regional bank builds an internal “Risk Intelligence Bot” to summarize exposure across regulatory filings, vendor contracts, and credit reports. Early demos impress leadership, but the rollout fails because analysts find the answers shallow and inconsistent: queries like “What are our top counterparties with exposure to LIBOR risk?” produce partial or irrelevant summaries.

The culprit? The embedding model cannot distinguish between general financial jargon and the bank’s proprietary risk terminology. For example, “exposure,” “liability,” and “counterparty risk” are treated interchangeably. The model often pulls data from the wrong filing sections, leading to inaccurate risk summaries.

The Solution:

Using full fine-tuning on the bank’s own corpus of risk and compliance documents could solve this.

  • Embeddings would learn from real exposure tables, regulatory language, and internal credit risk memos.
  • A fine-tuned RAG pipeline would retrieve relevant chunks from hundreds of filings based not just on keyword proximity but on semantic precision.
  • Analysts could query nuanced questions (e.g., “Which counterparties have exposure over $5M with LIBOR-linked derivatives?”) and receive grounded, data-backed responses.
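As a sketch of the retrieval step described above, the snippet below ranks toy chunk embeddings against a query embedding by cosine similarity. The chunk texts and vectors are hypothetical; a production pipeline would use a fine-tuned embedding model plus a vector database rather than in-memory numpy arrays:

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray,
                 chunk_texts: list, k: int = 2) -> list:
    """Return the k chunks whose embeddings are most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per chunk
    order = np.argsort(-scores)[:k]     # highest scores first
    return [(chunk_texts[i], round(float(scores[i]), 3)) for i in order]

# Toy vectors standing in for embeddings of hypothetical filing chunks.
texts = ["LIBOR-linked derivative exposure table",
         "Office lease renewal terms",
         "Counterparty credit risk memo"]
vecs = np.array([[0.9, 0.1, 0.2],
                 [0.1, 0.9, 0.1],
                 [0.7, 0.2, 0.5]])
query = np.array([0.8, 0.1, 0.4])
for text, score in top_k_chunks(query, vecs, texts):
    print(score, text)
```

Fine-tuning shifts where the chunk vectors sit, so that a LIBOR-exposure query outranks superficially similar but irrelevant sections.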

The Result:

Audit-ready, explainable retrieval — where every answer is supported by precise document context. This would transform the bot from a demo tool into a decision-support system, potentially saving millions in compliance rework and missed risk signals.

The Broader Lesson for Enterprises

The research offers three critical takeaways for CXOs planning or troubleshooting enterprise AI deployments:

  1. Fix retrieval before scaling generation.
    Even the best LLM can’t compensate for weak retrieval. Start by fine-tuning your embeddings on your domain corpus.
  2. Full fine-tuning beats shortcuts.
    Parameter-efficient fine-tuning (PEFT) methods improve training speed but lag significantly behind full fine-tuning in recall and precision.
  3. Treat fine-tuning as continuous maintenance.
    As new documents enter the enterprise knowledge base, periodic re-finetuning keeps retrieval quality high without retraining the entire model.

Closing Thoughts

In both pharma and finance, RAG projects often fail not because the technology is flawed, but because the retrieval layer misunderstands domain context.

Fine-tuned embeddings change that. They make AI retrieval systems context-aware, compliant, and truly enterprise-ready.

For enterprises building next-generation AI assistants, this is the foundation that ensures accuracy, trust, and long-term ROI. And at SmartBots, we can help organizations fine-tune their AI foundations — embedding intelligence that makes every answer count.
