We’ve been working with LLMs in production for over a year. Here are my learnings, plus what I think is an open problem still waiting to be solved!
Some context: Clueso (YC W23) is used by over 160 companies to create content around their SaaS products. One of our hero features, AI Rewrite, utilises LLMs to transform a rough video transcript into a technically sound how-to article.
v1: An f-string
When we started, this feature was shipped in a day. It was a direct API request to OpenAI, and the prompt was simply an f-string in our Python codebase. This worked for about 2 to 3 months as we got our initial customers and began to grow.
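For a sense of how bare-bones this was, here's a minimal sketch of the v1 approach using the current OpenAI Python SDK. The prompt wording and model name are illustrative, not our actual production values:

```python
# v1 in a nutshell: the prompt is an f-string in the codebase,
# and the transcript goes straight to the OpenAI API.
from openai import OpenAI

client = OpenAI()

def rewrite_transcript(transcript: str) -> str:
    # Illustrative prompt; the real one encodes a lot more product context.
    prompt = (
        f"Rewrite the following rough video transcript into a clear, "
        f"step-by-step how-to article:\n\n{transcript}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```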
v2: LangSmith
The first set of hurdles was the one most LLM tools aim to solve: observability, iterating on the prompt, and exploring newer LLMs across providers.

Without tooling, our workflow was horrendous. We’d copy-paste user requests, try iterating on the prompt on our local machines, and then try again with different user requests.

So we started using LangSmith (well before public access!), which hits exactly these use cases with its request logging and prompt management solution. Small updates went from taking hours to minutes. For example, when a customer complained that the word 'Next' was appearing too often, we could instantly ship an update by fixing the prompt.
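To show what the logging side looks like, here's a minimal sketch using LangSmith's current Python SDK (the pre-public-access version we started on looked a bit different). It assumes the LangSmith API key and tracing are configured via environment variables; the prompt and model are placeholders:

```python
# Minimal request-logging sketch with the LangSmith Python SDK.
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrapping the client logs every OpenAI call (inputs, outputs, latency) to LangSmith.
client = wrap_openai(OpenAI())

@traceable(name="ai_rewrite")  # groups the whole rewrite into one trace
def rewrite_transcript(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": f"Rewrite this transcript as a how-to article:\n\n{transcript}",
            }
        ],
    )
    return response.choices[0].message.content
```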
v3: Vellum
The deeper problem here is that, in our case, the LLM output is intricately linked to our post-processing routines in the code. We use a rich text editing framework, so LLM output needs to be converted from a raw string into a structure supported by this framework.

Any significant enhancement to the output (for example, Markdown support for headings) necessitates an update to the post-processing to handle this case (see the sketch at the end of this section). This meant the really meaningful updates were still being done with the same workflow as the f-string.

That’s when we discovered Vellum. Vellum’s approach is to handle your entire LLM workflow from their platform: managing which model to use, managing your pre- and post-processing, alongside the other expectations from LLM tooling.

We’re very early in our journey using Vellum, but I’m definitely more bullish on this approach.
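To make the coupling concrete, here's a rough, hypothetical sketch of that post-processing step. The node shape and function name are illustrative only, not our real editor schema; the point is that supporting Markdown headings in the prompt also means new parsing branches in code:

```python
# Hypothetical post-processing: turn raw LLM output into rich-text editor nodes.
# Every new output feature (here, Markdown headings) needs a matching parsing branch.
def to_editor_nodes(llm_output: str) -> list[dict]:
    nodes = []
    for line in llm_output.splitlines():
        if not line.strip():
            continue
        if line.startswith("# "):  # branch added only once the prompt asked for headings
            nodes.append({"type": "heading", "level": 1, "text": line[2:].strip()})
        elif line.startswith("## "):
            nodes.append({"type": "heading", "level": 2, "text": line[3:].strip()})
        else:
            nodes.append({"type": "paragraph", "text": line.strip()})
    return nodes
```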
v?: The unsolved problem
Any improvement to an AI feature involves testing against diverse sets of data. Right now, I manually categorise the test data we create to hit various use cases. Since I’m not consistent with this, our data picks up a recency bias every time we try to improve the way the feature works.

Something that could hook onto our dev environment traffic and automatically curate datasets out of LLM requests would be incredibly useful. All tools right now offer only manual ways to add requests to datasets.
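For context, this is roughly what the manual path looks like today with the LangSmith SDK: hand-picking individual request/response pairs into a dataset. The dataset name and example contents below are made up; the traffic-driven, automatic version of this is the part that doesn't exist yet:

```python
# Roughly the manual dataset-curation path with the LangSmith SDK:
# each example has to be hand-picked and added one at a time.
from langsmith import Client

client = Client()
dataset = client.create_dataset(dataset_name="ai-rewrite-edge-cases")

# One hand-picked request/response pair; in practice this gets repeated
# for every interesting transcript someone happens to stumble across.
client.create_example(
    inputs={"transcript": "um so first you click the, uh, settings icon..."},
    outputs={"article": "1. Click the Settings icon."},
    dataset_id=dataset.id,
)
```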
PS: More tools out there need to put security first! That’s a big factor when we make a decision on anything that directly deals with customer data, and it’s what got me to book a demo with the Vellum team in the first place.