LLM Options for Startups: Focus on the Middle of the Gen AI Stack with Fine Tuning and RAG


4 min read

Every company, large and small, is integrating artificial intelligence into its offerings. Most of the AI advancements we read about come from two ends of the AI spectrum: foundation model building (by Google, Microsoft/OpenAI, Facebook, etc.) and prompt engineering. Foundation models are extremely expensive to build and operate (ChatGPT reportedly costs around $700k/day to run, according to one source), whereas prompt engineering only gets you so far beyond what the foundation model already does.

The Middle of LLM Customization Holds the Opportunity for Startups

Foundation model building will be done by the largest companies - mostly Big Tech, perhaps with some collaborations between Big Tech and domain experts. Prompt engineering is unlikely to build a hard-to-replicate moat - useful, perhaps, for getting started or for sprinkling a little AI on top of your startup, but not enough to provide a sustainable engine for your startup. If you are looking for a sustainable engine, look to the middle.

Fine Tuning Primer

Fine tuning is the process of further training an existing pre-trained model on a custom dataset. It results in a new model that is based on the original model (specifically, some of the weights in the new further-trained model have been adjusted).
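The mechanics can be illustrated with a deliberately tiny sketch - plain NumPy, not a real LLM. We "pre-train" a linear model on a broad dataset, then continue training those same weights on a small custom dataset and observe that the weights shift. The datasets and weight values here are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def train(w, X, y, lr=0.1, steps=200):
    """Plain gradient descent on mean-squared error."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# "Pre-training": learn weights from a broad dataset
X_pre = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_pre = X_pre @ w_true
w_pretrained = train(np.zeros(3), X_pre, y_pre)

# "Fine tuning": continue training the SAME weights on a small
# custom dataset whose target behavior is slightly different
X_ft = rng.normal(size=(10, 3))
y_ft = X_ft @ np.array([1.2, -2.0, 0.5])
w_finetuned = train(w_pretrained.copy(), X_ft, y_ft, lr=0.05, steps=100)

# The fine-tuned model starts from the pretrained weights, then adjusts them
print(w_pretrained)
print(w_finetuned)
```

The point of the toy: fine tuning does not start from scratch - it inherits the pretrained weights and nudges some of them toward the custom dataset's behavior.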

The process of fine tuning is as much art as it is science. For a walkthrough, you can see my Colab notebook that fine-tunes the "distilbert-base-uncased" model on the "squad" dataset: https://colab.research.google.com/drive/1U9HVmoczzVFvF-ibkNyS_y46A-oZf16V?usp=sharing. This notebook was based on HuggingFace's Fine Tuning with Custom Datasets tutorial at https://huggingface.co/docs/transformers/v4.15.0/custom_datasets

One thing to keep in mind: fine tuning is less about learning new facts and more about learning new rules (including style). So if you fine-tuned a model on some internal documents, the model is likely to learn the style in which your documents are written (and potentially a process or rule for parsing out answers - as shown in the HuggingFace tutorial), but fine tuning is not the right tool for the job if you want to make specific facts available to your model. Which leads us to...

Retrieval Augmented Generation Primer

Retrieval Augmented Generation (RAG) is the process of leveraging a (large) database in advance of submitting a query to an LLM so that you can gather and append relevant context from that database. The database needs to store a special kind of data, called an embedding, which is essentially a large set of numbers that represents the meaning of some text in a multi-dimensional space. By translating text to embeddings, it is possible to determine the similarity between two sequences of text (e.g., using cosine similarity - check out a simple, low-dimensionality explainer video on cosine similarity if you are not familiar).
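If cosine similarity is new to you, it is easy to compute directly. Here is a toy example with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, but the math is identical):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" - the numbers are invented for illustration
dog   = np.array([0.9, 0.8, 0.1])
puppy = np.array([0.85, 0.9, 0.15])
car   = np.array([0.1, 0.2, 0.95])

print(cosine_similarity(dog, puppy))  # high: similar meaning
print(cosine_similarity(dog, car))    # much lower: different meaning
```

Because cosine similarity measures the angle between vectors rather than their length, two passages of very different word counts can still score as near-identical in meaning.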

So what is the basic process for RAG? In short:

  • Gather a whole bunch of content, encode it as vectors and store it in a vector database

  • For each query, encode the query as a vector and then find the existing embeddings that are most similar.

  • Append the content associated with the embedding to the prompt, perhaps with some prompt engineering that explains that this is additional context for the LLM to consider.

If you are looking for a vector database, there are a number of options, including Pinecone (the one I first heard about), MongoDB Atlas, or PostgreSQL with pgvector on Vercel. Because I am building on Vercel, I have started with the pgvector option and I will let you know how it works out. Note that you may get throttled if you deploy the Vercel demo project as is - their demo is currently offline because it is being throttled. Let me know if you want a workaround (basically, just increase the delay before sending queries to OpenAI - it currently fires with almost every character typed).


Leveraging a combination of fine tuning (to get the model to understand the style and rules of a particular domain) and RAG (to get the model to understand specific facts) is a powerful approach that startups can use to adapt LLMs to specific use cases. If you are working on such an approach, I would love to learn more about what you are doing. Feel free to reach out to me on LinkedIn https://www.linkedin.com/in/mattdyor/.