
How to Train an LLM for your Enterprise


Alex Lipinski

Don't. Leave full LLM training for Google, OpenAI, and Anthropic. Select a model; fine-tune only if it suits you; and improve results by limiting the scope of what the model sees with RAG.

In our blog, the Document Data Crisis, we explained that bad responses are not a model problem – they are a context problem. In this blog we’ll dig a bit deeper into how your content and data context is ingested into a general or foundational language model, why you wouldn’t want to train your own from scratch, and how to enhance results with Retrieval Augmented Generation (RAG).


The issues with training an LLM or Domain Specific Language Model

There are many types of models, and within each type, several ways to train them. We’re going to stick to what everyone thinks about when they hear AI (a foundational model), a model type that showed great promise in the business/enterprise world (a domain-specific language model, or DSLM), and Retrieval Augmented Generation (RAG) — which isn’t so much a model type as it is a strategy.

Foundational models

When you think of AI today, you likely envision what are called foundational language models – ChatGPT, Claude, Grok, etc. More specifically, foundational models are the engines behind these widely popular interfaces, for example, GPT-5, Opus 4.6, Grok 4.

Foundational models have hundreds of billions to trillions of parameters and are trained on an enormous amount of text to learn language behavior based on probability rather than an explicit knowledge base. Give a model an input, and it will generate an output based on the likelihood of all possible next tokens. Training a foundational model yourself would be extremely expensive and hard to maintain. So it’s best to choose a foundational model as your base, and work to improve its visibility into your desired enterprise data.
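The “likelihood of all possible next tokens” idea can be sketched in a few lines. The distribution below is invented purely for illustration – a real model computes these probabilities over its entire vocabulary:

```python
import random

# Toy next-token distribution after a prompt like "The capital of France is"
# (probabilities are invented for illustration, not from any real model).
next_token_probs = {"Paris": 0.92, "a": 0.05, "Lyon": 0.03}

def sample_next(probs: dict[str, float]) -> str:
    """Sample one token in proportion to its probability."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

token = sample_next(next_token_probs)  # usually "Paris", occasionally not
```

That occasional low-probability pick is one reason outputs can drift when the model lacks grounding context.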

Within the LLM family there are also Small Language Models (SLMs), which can be as simple in design as scaled-down versions of larger foundational models. But even an SLM’s parameters number in the billions, and adjusting them and their weights as the business changes is incredibly tedious.

Fine-tuned models

Domain-specific language models require a greatly reduced level of training compared with building your own foundational model, but they remain an expensive and inflexible alternative. DSLMs require fine-tuning, a method of adjusting the probabilities (weights) of specific tokens so the model produces outputs that are more relevant to your industry or vertical. But there are still a few problems.

Fine-tuning a DSLM can be an acceptable approach for businesses with little fluctuation and highly specialized industry knowledge. But for enterprises with frequently changing policies, procedures, and processes, fine-tuning is not recommended. 

What is Retrieval-Augmented Generation (RAG)?

Retrieval-augmented generation (RAG) connects a generative AI model to external documents, enabling the model to retrieve relevant information from approved documents at the time of a user prompt and use that information as context when generating its response. 

Instead of relying only on model weights, RAG:

- retrieves the most relevant chunks from approved documents at prompt time,
- injects those chunks into the prompt as context, and
- generates a response grounded in that retrieved material.

The result is improved accuracy and relevancy of responses without the need to adjust weights, and the ability to track answers back to the source documents that fueled them. 
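A minimal sketch of that retrieve-then-prompt loop, using bag-of-words cosine similarity as a stand-in for the vector embeddings a production RAG pipeline would use (the chunk texts and sources are invented examples):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[dict], k: int = 2) -> list[dict]:
    """Rank document chunks by similarity to the query; return the top k."""
    q = Counter(query.lower().split())
    return sorted(
        chunks,
        key=lambda c: cosine(q, Counter(c["text"].lower().split())),
        reverse=True,
    )[:k]

chunks = [
    {"text": "Employees accrue 1.5 vacation days per month.", "source": "hr-policy.pdf"},
    {"text": "The cafeteria is open from 8am to 3pm.", "source": "facilities.pdf"},
]

top = retrieve("How many vacation days do employees get?", chunks, k=1)
# The retrieved chunk becomes grounding context for the model's answer,
# and its "source" field is what lets you trace the answer back.
prompt = "Answer using only this context:\n" + "\n".join(c["text"] for c in top)
```

Because each retrieved chunk carries its source, the generated answer can be traced back to the document that fueled it.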

RAG improves answers but also exposes weaknesses in your content, moving the onus for poor GenAI results further away from the model and back onto the document data crisis.

Your AI is only as good as the context it retrieves

Poor data quality can result from data hidden in documents that aren’t properly captured and indexed, leading to missing, duplicate, or incorrect information. That spells hallucinations, no matter how good your model is. 

A data quality problem needs to be solved upstream. To do that, it’s important to consider how RAG sources data.  

The RAG process

A RAG process doesn’t retrieve documents in their entirety to feed to a model. Instead, it relies on chunks. Chunking is a step in RAG that involves breaking down large documents that have already been captured and structuring them into smaller, manageable snippets to fit within the model’s context window (the maximum amount of information the model can process during a single interaction). 
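A simple word-window chunker illustrates the idea; the window and overlap sizes below are arbitrary illustrations, and real pipelines typically split on sentence or section boundaries instead:

```python
def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split a long document into overlapping word-window chunks.

    The overlap keeps sentences that straddle a chunk boundary
    retrievable from at least one chunk.
    """
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk must fit comfortably inside the model’s context window, with room left for the user prompt and the generated answer.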

The necessary structure for chunking is created during an Intelligent Document Processing (IDP) process, which takes unstructured data from many sources and formats and converts it into clean, labeled, structured content with preserved layout and metadata. The IDP stage is necessary for an LLM to reliably chunk and embed data into a vector database for fast, accurate retrieval by RAG. 
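The structured output IDP hands to the embedding step might look like the record below. The field names are assumptions for illustration, not any specific product’s schema:

```python
# Illustrative record an IDP stage might emit for one chunk
# (field names are assumptions, not a real product's schema).
chunk_record = {
    "text": "Refunds are issued within 14 business days.",
    "source": "returns-policy.pdf",  # enables answer traceability
    "page": 3,
    "section": "Refund Timing",      # preserved layout/heading metadata
    "ingested_at": "2025-01-15",     # supports freshness and governance checks
}
```

It is this preserved metadata – not the raw text alone – that lets RAG filter, govern, and cite what it retrieves.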

IDP is the data quality input control. Chunking is the handoff. RAG surfaces the data for the user.

Turn LLM training into context engineering

In their market trends report, Deep Analysis states that enterprise-grade AI governance is set to become non-negotiable. The capabilities of RAG, while helpful, mean that teams need to be highly conscious and organized about how and what data is being converted into GenAI responses – not only to protect against bad results, but also to keep highly sensitive data from being exposed.

Context engineering an LLM means shaping...

Foundational vs Fine-tune vs RAG

Sometimes, training an LLM from scratch is justified for very large organizations with vast resources available. But rarely. Fine-tuning is more common, but even then, tuning weights at the speed the business changes is its own kind of burden. For most enterprises, the preferred pattern will likely be selecting a foundation model and improving results with RAG: good document control upstream and governance downstream.
