Stop Letting LLMs Do What Code Can Do
Here’s a workflow I keep seeing. Someone has a database full of rankings data. They point an LLM at it with a schema file and ask “what changed week over week?” The model reads the tables, compares the numbers, and spits out a summary.
This works. For a while.
Then the model gets updated and the output format shifts. Or it hallucinates a trend that isn’t there. Or you can’t reproduce last week’s analysis because the model was feeling different that day. Or you’re paying per-token for what amounts to arithmetic.
The right way to do this: have AI help you write the SQL query once. Run that query every week, forever. It gives you the same structured output every time. Then hand those final numbers to an LLM and say “explain what happened here.” The model does the one part that actually requires language ability. Everything else is code.
This seems obvious when you spell it out. But almost nobody builds this way.
LLMs Everywhere, For No Good Reason
The rankings example is one instance of a pattern that’s everywhere right now. People are building multi-step workflows where every single step is an LLM call. Summarize the input. Classify the results. Compare two datasets. Format the output. Generate a report. Five steps, five model calls, five chances to introduce variance.
Here’s the math on that. If each LLM step is 95% reliable (generous, honestly), five steps compounds to about 77% reliability. Ten steps gets you to 60%. That’s a coin flip with extra steps. And you’re paying for every one of those flips.
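The compounding is nothing more than multiplication. A two-line sketch, if you want to check the figures yourself:

```python
# Reliability of a chain of independent steps is the product of
# the per-step reliabilities.
def chain_reliability(per_step: float, steps: int) -> float:
    return per_step ** steps

print(round(chain_reliability(0.95, 5), 2))   # 0.77
print(round(chain_reliability(0.95, 10), 2))  # 0.6
```

The model doesn't need to fail often for the chain to fail regularly; it just needs enough links.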
The compounding problem gets worse as models change. I wrote about the “pragmatism” rebrand a few weeks ago, and one thing I didn’t get into is what model updates do to workflows that depend on specific model behavior. Every time a provider ships a new version or adjusts weights, your carefully tuned pipeline can shift. Not break dramatically. Just… drift. Output formats change slightly. Classification thresholds move. The summary that used to be three paragraphs is now five. Your downstream steps, also LLM calls, now have different inputs than they were tuned for.
The whole chain is only as reliable as its least stable step, and every step is a model call. This is fragile by design.
Meanwhile, most of these steps are doing things that code handles perfectly. String matching. Date comparison. Arithmetic. Filtering rows. Sorting by a column. Formatting output into a template. These are solved problems. A Python script does them the same way every time, runs in milliseconds, costs nothing, and won’t change behavior because someone at OpenAI tweaked a hyperparameter.
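To make that concrete, here's the filter-sort-format portion of the rankings case as plain Python. The rows are invented stand-ins for whatever your database returns:

```python
from datetime import date

# Invented sample rows -- stand-ins for real query results.
rows = [
    {"keyword": "crm pricing", "rank": 12, "checked": date(2024, 1, 8)},
    {"keyword": "best crm", "rank": 3, "checked": date(2024, 1, 8)},
    {"keyword": "what is a crm", "rank": 7, "checked": date(2024, 1, 1)},
]

# Filtering rows: keep only this week's checks.
this_week = [r for r in rows if r["checked"] >= date(2024, 1, 8)]

# Sorting by a column.
by_rank = sorted(this_week, key=lambda r: r["rank"])

# Formatting output into a template. Identical output, every run.
report = "\n".join(f"{r['keyword']}: position {r['rank']}" for r in by_rank)
print(report)
```

No prompt, no tokens, no variance. Run it a thousand times, get the same report a thousand times.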
Why People Build This Way
The easy explanation is that people can’t code. But that’s not quite right. A lot of the people building these LLM-heavy workflows are perfectly capable of using AI to write code. I can’t really code either, and I build tools all the time with AI assistance.
The real issue is proximity to software. If you’ve spent time around codebases, around databases, around the kinds of problems that developers solve daily, you develop an instinct for what can be deterministic. You hear “compare two weeks of data” and your brain says “that’s a query.” You hear “find the rows where this value dropped more than 10%” and your brain says “that’s a WHERE clause.”
If you haven’t spent time in that world, those same tasks sound like reasoning. They sound like something you need intelligence for. “Compare these two datasets and tell me what changed” feels like it requires understanding. It doesn’t. It requires subtraction.
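Here is that subtraction, spelled out on invented weekly traffic numbers:

```python
# Invented numbers: weekly clicks per keyword.
last_week = {"best crm": 1200, "crm pricing": 900, "what is a crm": 400}
this_week = {"best crm": 1150, "crm pricing": 700, "what is a crm": 410}

# "Compare these two datasets and tell me what changed" is subtraction.
delta = {k: this_week[k] - last_week[k] for k in this_week}

# "Find the rows that dropped more than 10%" is a WHERE clause --
# here, a dict comprehension with a condition.
dropped = {k: v for k, v in delta.items() if v < -0.10 * last_week[k]}
print(dropped)  # {'crm pricing': -200}
```

No understanding required, and no opportunity to hallucinate a trend that the numbers don't contain.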
And then there’s the path of least resistance. Typing “tell me what changed this week” into a chat window is easier than writing a query. It just is. The overhead of figuring out the right approach, writing the code, testing it, making sure it works. That’s effort. Asking the LLM to just do it all? That’s a sentence.
The problem is that “easier right now” and “more reliable over time” are almost always in tension. The sentence you type this week works. The sentence you type next week might give you a different format. The code you write this week works next week, and the week after that, and the week after that.
The tooling doesn’t help. n8n, Make, Zapier, LangChain - they all push you toward “just add another AI node.” That’s their business model. Every integration they build makes it easier to chain another model call into your workflow and harder to drop down to actual code. The path of least resistance runs straight through the LLM.
Where LLMs Actually Belong
So where should you actually use a model? The boundary becomes clear when you ask one question: does this task require judgment about language?
Explaining what a set of numbers means to a human. Writing a narrative summary that adapts tone for the audience. Interpreting ambiguous user input where the intent isn’t clear from the text alone. Drafting something that needs to sound like a person wrote it. These are language tasks. Models are good at them.
Everything else has a deterministic solution. Math. Comparison. Filtering. Transformation. Formatting. Deduplication. Validation against a schema. Routing based on rules. All of this can be code. All of this should be code.
The test I use: could a Python script do this if I gave it the right inputs? If yes, it should be a Python script. The LLM’s job is to help you write that script, not to be that script.
This is where the taste question comes back in. AI made building easy. But building the right way still requires understanding what you’re building. Knowing that “compare two datasets” is a code problem and “explain why this matters” is a language problem - that’s a judgment call. And the judgment is the part AI can’t hand you.
The Better Pattern
The pattern that actually works long-term is simple. Use AI to write code. Run the code. Hand the LLM only what requires language ability.
For the rankings workflow, that looks like:
- SQL query pulls the week-over-week changes. Deterministic. Repeatable. Runs in milliseconds. Free.
- Structured output - the query results land in a clean format. JSON, CSV, whatever your downstream process needs. Same format every time.
- LLM reads the numbers and writes the narrative. This is the one step that genuinely needs a model. “Rankings for [keyword cluster] dropped 12% this week, concentrated in informational intent pages. Commercial pages held steady.” That interpretation, that read on what matters, that’s what language models are for.
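Those three steps, sketched end to end with sqlite3. This is a minimal sketch, not the original workflow: the `rankings` table, its columns, and the `ask_llm` stub are all made up, and `ask_llm` stands in for whatever model client you actually use:

```python
import json
import sqlite3

# Step 1: the deterministic part. Same query, same shape, every week.
WOW_QUERY = """
SELECT cur.keyword,
       prev.rank AS last_week,
       cur.rank  AS this_week,
       cur.rank - prev.rank AS change
FROM rankings AS cur
JOIN rankings AS prev
  ON prev.keyword = cur.keyword
 AND prev.week = cur.week - 1
WHERE cur.week = ?
ORDER BY change DESC
"""

def weekly_changes(conn: sqlite3.Connection, week: int) -> list[dict]:
    cursor = conn.execute(WOW_QUERY, (week,))
    cols = [c[0] for c in cursor.description]
    return [dict(zip(cols, row)) for row in cursor.fetchall()]

def ask_llm(prompt: str) -> str:
    # Placeholder: swap in your actual model client here.
    raise NotImplementedError

def weekly_report(conn: sqlite3.Connection, week: int) -> str:
    # Step 2: structured output -- the same JSON shape every time.
    changes = json.dumps(weekly_changes(conn, week), indent=2)
    # Step 3: the one call that genuinely needs a language model.
    return ask_llm(f"Explain what happened in these rankings:\n{changes}")
```

Everything up to the last line is repeatable and testable. If the model changes tomorrow, `weekly_changes` returns exactly what it returned today.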
One model call instead of five. The deterministic parts are deterministic. The language part is language. If the model updates tomorrow, your data pipeline still works. The only thing that might change is the wording of the summary, and wording is cheap to adjust.
Yes, this is more work up front. You have to think about what the query should return. You have to get the code right. You have to test it. But you do that work once. The all-LLM approach makes you do the work every time, because you’re re-prompting the same tasks and hoping for consistent output.
And this pattern scales. Every workflow where people are chaining LLM calls can be decomposed the same way. Figure out which steps are actually language tasks. Write code for everything else. You’ll end up with fewer model calls, lower costs, faster execution, and a pipeline that doesn’t drift when Anthropic ships a new version of Claude.
The Skill That Matters
Most people draw the line between “needs a model” and “can be code” way too far toward the model. Partly because they don’t know where code’s capabilities end. Partly because the LLM is right there and it’s easy. Partly because the tooling nudges them that direction.
The skill worth developing isn’t better prompting. It’s getting close enough to software to know what’s possible with deterministic tools. Understanding that databases can compare things. That scripts can filter and format. That code can do math without hallucinating.
You don’t have to become a developer. I’m not one. But you need enough proximity to know when you’re reaching for a language model to do something a for loop handles. Build that instinct, and you’ll build things that actually hold up.