I dabbled in data science years ago but it’s just not my thing.
The end product of creating a model is incredible, but long ago I conceded that the day-to-day of training and building one is too mundane for me.
LLMs brought back that early 2016 AI excitement that made me buy a statistics course.
I never managed to get too excited about probabilities, but I can immediately see opportunities where LLMs can be used to build better products.
I know very little about the technical architecture behind an LLM. My layman’s understanding of them stretches as far as embeddings go, and I won’t try to explain something I can’t really comprehend.
I’ll only share my experience integrating these models into products, and how I think about them.
The Lack of Determinism
Validating external input like forms and request payloads is a regular part of the job. But LLMs introduce a level of indeterminism inside the application that I’m just not used to.
Usually, I know that if I validate external data against a schema, I can trust it from there on. Most of my logic can be written in the form of pure functions and simple objects that pass data between each other.
That’s one of the main principles I follow and it’s helped me a lot with maintainability.
But a call to OpenAI’s API is anything but deterministic. Call it ten times and you will get ten different answers. Ask the LLM to return JSON and you will get it… most of the time. You can pass it a schema and it will follow it… unless it hallucinates another field.
This is not a criticism of LLMs, that’s just how they work for now. But having unreliable behavior in the heart of a product’s business logic is something I’m still wrapping my head around.
You could compare calls to LLMs to database queries or 3rd party service requests - there’s a chance that they fail.
But failures with those are usually about not being able to respond at all. If the database returns data, it provides it in the structure you expect. An LLM can return a response, but its level of correctness can vary.
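One way to cope is to treat the LLM’s response like any other untrusted input and validate it at the boundary. Here’s a minimal sketch of that idea, with zod as an illustrative choice of validation library and made-up field names:

```typescript
// A minimal sketch: validate the LLM's output like any other external payload.
// zod is an illustrative choice; the schema and field names are made up.
import { z } from "zod";

// The structure we ask the model to follow in the prompt.
const AnswerSchema = z.object({
  summary: z.string(),
  tags: z.array(z.string()),
});

type Answer = z.infer<typeof AnswerSchema>;

function parseLlmAnswer(raw: string): Answer | null {
  try {
    // safeParse rejects responses that don't match the expected shape
    // instead of letting them leak into the business logic.
    const result = AnswerSchema.safeParse(JSON.parse(raw));
    return result.success ? result.data : null;
  } catch {
    // The model didn't even return valid JSON.
    return null;
  }
}
```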
Velocity
By the time I wrote the section above, it was already out of date.
OpenAI’s API now has additional parameters that can guarantee a JSON response, and a seed setting that lets you get the same response given the same prompt.
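For the curious, here’s roughly what those options look like with the OpenAI Node SDK; the model name and prompt are just placeholders:

```typescript
// A rough sketch of JSON mode and the seed parameter, assuming the OpenAI Node SDK.
// The model name and prompt are placeholders.
import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4-1106-preview",
  // Ask for a JSON object instead of free-form text.
  response_format: { type: "json_object" },
  // A fixed seed makes repeated calls with the same prompt (mostly) reproducible.
  seed: 42,
  messages: [
    { role: "system", content: "Reply with a JSON object." },
    { role: "user", content: "Summarize this article in three bullet points." },
  ],
});

console.log(completion.choices[0].message.content);
```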
Things are running with a “2016 JavaScript” pace right now.
There’s this popular framework called Langchain which is used to create more complex LLM flows. It helps you with things like feeding your model with data, refining questions by running them through multiple models, and creating agents.
I was reading the docs, and when I looked up an issue on GitHub, the methods I was using were already deprecated.
Looking for help and direction, I got on a call with a consultant and ended up giving them pointers. I keep looking for fundamental principles in everything I do, but the field seems too new to have these formed.
RAGs seem to be the way to go
Retrieval-Augmented Generation (RAG) pipelines are one of the common and sensible patterns companies use when building LLM applications. It’s a way to provide a model with additional context without going through the arduous process of training one.
Despite their vast knowledge, LLMs are limited to what data they’re trained on, and when they don’t have an answer they will gladly hallucinate one for you.
A RAG pipeline sounds more daunting than it is. Essentially you’re storing your text data in the form of vectors and when you have a request for the LLM, you first look up relevant documents by doing a vector search. Then you pass them as context to the model so it doesn’t hallucinate things it doesn’t know.
This is an excellent in-depth article about them.
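To make the flow concrete, here’s a stripped-down sketch. The vector store is a stand-in for whatever database you pick (Pinecone, pgvector, etc.), and the model names are just examples:

```typescript
// A stripped-down RAG flow: embed the question, fetch similar documents,
// then pass them to the model as context. The vector store is a stand-in.
import OpenAI from "openai";

const openai = new OpenAI();

// Stand-in for your vector database client (Pinecone, pgvector, etc.).
declare const vectorStore: {
  similaritySearch(vector: number[], k: number): Promise<string[]>;
};

async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: text,
  });
  return res.data[0].embedding;
}

async function answerWithContext(question: string): Promise<string | null> {
  const queryVector = await embed(question);

  // Fetch the 3 most relevant documents for the question.
  const documents = await vectorStore.similaritySearch(queryVector, 3);

  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: `Answer using only the context below.\n\n${documents.join("\n---\n")}`,
      },
      { role: "user", content: question },
    ],
  });

  return completion.choices[0].message.content;
}
```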
Training models isn’t the way
At least for most small teams.
The main reason for that is the slower iteration cycle. Training a model is a notoriously slow operation and everyone who’s worked with data science teams can attest to that.
You can fine-tune your own model on top of GPT, but this is more expensive, slower, and harder to fix than using a prompt or a RAG pipeline. You get a much faster iteration cycle by providing the model with specific data upon every request.
Changing the prompt is a matter of a single pull request. Re-training the model seems like a more complicated process.
Errors & Testing
Every request sent to the model is an API call and there’s the non-zero possibility that it fails because of a network error. It happens.
But this opens up more errors and behaviors to think about. When should you retry the request? How many times? What kind of failure was it? Was the model’s answer too long for a single request?
If you execute an operation and the LLM fails in a subsequent step you might have to roll back the previous data that you have stored.
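A small retry helper covers the plain network-failure case, but it deliberately leaves the harder questions above unanswered; the attempt count and backoff here are arbitrary:

```typescript
// A rough retry-with-backoff sketch for plain network failures.
// It doesn't decide what to roll back or whether the answer itself is usable.
async function withRetries<T>(call: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Exponential backoff between attempts: 1s, 2s, 4s...
        await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```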
Another question I asked a lot of people is how they test LLMs. If your business is built around one, you need some guarantee that your prompts are producing the expected result. But because of their non-determinism, I wasn’t sure how this could happen.
Most people I asked told me they were vectorizing what they would consider a “correct” LLM response and doing cosine similarity checks against the actual one. That’s a lot different from the tests I’m used to writing.
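In practice that kind of test embeds an expected answer and the actual answer, then asserts they’re close enough. A sketch in a Jest-style test; the threshold is arbitrary and the embed and runSummaryPrompt helpers are assumptions:

```typescript
// A sketch of the "embed and compare" style of test people described to me.
// The 0.9 threshold is arbitrary; embed() and runSummaryPrompt() are assumed helpers.
declare function embed(text: string): Promise<number[]>;
declare function runSummaryPrompt(): Promise<string>;

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const magB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (magA * magB);
}

test("summary prompt stays on topic", async () => {
  const expected = await embed("The article explains how RAG pipelines work.");
  const actual = await embed(await runSummaryPrompt());

  expect(cosineSimilarity(expected, actual)).toBeGreaterThan(0.9);
});
```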
Streaming Structured Responses
Waiting for the full response of the LLM to return before you display something on the screen doesn’t lead to the best user experience. Naturally, we want to stream the response word by word, just like OpenAI is doing.
But if the LLM responds with JSON, we’re in trouble. It returns it character by character so until the response completes, the structure is broken and it will lead to an error.
To solve that I had to resort to optimistically closing the JSON with a library.
This solves the problem when the property we’re reading is a string. But if we want the model to return an array of references, for example, parsing gets even harder and we’re forced to add additional checks in our code.
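A naive version of that “optimistic closing” idea looks something like this; a real library handles far more edge cases (braces inside strings, escape sequences, and so on):

```typescript
// A naive sketch of "optimistically closing" a streamed JSON chunk so it can be
// parsed before the response is complete. Real libraries handle far more edge cases.
function closeOpenJson(partial: string): unknown | null {
  let closed = partial;

  // Close a dangling string first, then any open arrays and objects.
  if ((partial.match(/"/g) ?? []).length % 2 !== 0) closed += '"';

  const openBrackets =
    (partial.match(/\[/g) ?? []).length - (partial.match(/]/g) ?? []).length;
  const openBraces =
    (partial.match(/{/g) ?? []).length - (partial.match(/}/g) ?? []).length;

  closed += "]".repeat(Math.max(openBrackets, 0));
  closed += "}".repeat(Math.max(openBraces, 0));

  try {
    return JSON.parse(closed);
  } catch {
    // Still mid-token; wait for more characters before rendering.
    return null;
  }
}
```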
Good use cases, bad use cases
Maybe the marketing got to me, but I expected the LLM to be a computational wonder you can hand off any task to. I learned the hard way that they’re especially bad at computation after I spent a few hours asking ChatGPT to count the items in an array.
As far as I understand it, they predict the answer they need to give based on your prompt. So they’ll be close with their estimation but not reliably correct.
They’re very good at recognizing intent, though.
I had a case where I wanted the user to submit a command written in plain English that would result in an element getting created in the UI. Essentially, I needed to map text to an object I could render.
The object had to follow a specific schema, though. Although ChatGPT got quite good at producing objects with this structure, the hallucinations regularly put the UI in a broken state.
That forced users to create the element using the regular controls, which defeated the whole purpose of having an AI assistant.
But I realized that creating the JSON wasn’t the hard part. You don’t need AI for that. You need it to recognize the intent of the user and map it to a function in your codebase you can then call to reduce the non-determinism.
Turns out this is an actual approach. It’s called function calling and is frequently used for such cases.
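Here’s roughly what that looks like with the OpenAI Node SDK; the tool definition is made up for the UI element example, and createElement stands in for an existing function in the codebase:

```typescript
// A rough sketch of function calling with the OpenAI Node SDK.
// The tool definition is made up for the "create a UI element" example.
import OpenAI from "openai";

const openai = new OpenAI();

// Stand-in for the existing, deterministic function in our codebase.
declare function createElement(args: unknown): void;

const completion = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [{ role: "user", content: "Add a red button labeled 'Save'" }],
  tools: [
    {
      type: "function",
      function: {
        name: "create_element",
        description: "Create a UI element from the user's description",
        parameters: {
          type: "object",
          properties: {
            kind: { type: "string", enum: ["button", "input", "label"] },
            label: { type: "string" },
            color: { type: "string" },
          },
          required: ["kind"],
        },
      },
    },
  ],
});

// The model only picks the function and its arguments; our code does the creating.
const toolCall = completion.choices[0].message.tool_calls?.[0];
if (toolCall) {
  createElement(JSON.parse(toolCall.function.arguments));
}
```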
Prompt engineering will be a skill rather than a discipline
I think prompt engineering will go the way testing is going.
Maybe there will be a few professionals who will focus entirely on that in the future, maintaining agents, fleshing out prompts, and finding the best way to present data to the LLM.
But I see it as a skill that every engineer will need to learn. The same way we have tests living next to our React components, we will have prompts.