Article · 2026-07-05

How Lovable Saved $20 Million a Year by Shrinking Its System Prompt

Bora Lee

Founder, Modern Web Labs

A Lovable engineer overhauled the product's bloated system prompt over the year-end holidays. Removing duplication and tightening the language made responses about 4% faster and cut annual LLM costs by roughly $20 million. It is a case where the instruction budget principle turned into measurable results.

The principle that an agent drifts as you add more rules was covered in the instruction budget article. This time, let's look at a case where that principle was applied to a real product and delivered results. Benjamin Verbeek, an engineer at Lovable, overhauled the entire system prompt during the 2025 year-end holidays, and by the numbers he shared, responses became about 4% faster and annual LLM costs dropped by roughly $20 million.¹ The result came not from adding anything to the prompt but from taking things out.

How the prompt got bloated

Lovable's system prompt was not bloated from the start. It grew one instruction at a time as engineers optimized the features they each owned. Someone would add a sentence to 'make the model better at X'; later someone else would notice 'Y got weaker' and add another sentence. Over the holidays Benjamin read the LLM traces end to end and spotted this vicious cycle. The prompt had accumulated the engineers' thought processes as they were, and the result was duplication, contradiction, and verbosity.

Looked at feature by feature, each instruction had its own rationale. The problem was the whole. As instructions accumulate, the model's ability to follow them degrades uniformly across all of them, not just the newest ones, so the sum of the parts was eating away at overall performance. The exact pattern that turns a CLAUDE.md into a ball of mud had reproduced itself in a production system prompt.

Fundamentals, not a new technique

There was no special invention in the fix. Remove duplication, tighten the language, and keep a whole-prompt perspective so that emphasizing one feature does not break the overall balance.
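As a rough illustration of the "remove duplication" step, the sketch below flags instruction pairs that say nearly the same thing so a human can decide which one to keep. It is not the method used at Lovable; the function name `find_near_duplicates`, the 0.8 threshold, and the sample instructions are all assumptions made up for this example. Character-level similarity only catches rewordings, not instructions that contradict each other.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(instructions, threshold=0.8):
    """Return (i, j, ratio) for instruction pairs whose text similarity meets the threshold."""
    # Lowercase and collapse whitespace so formatting differences do not hide duplicates.
    normalized = [" ".join(text.lower().split()) for text in instructions]
    pairs = []
    for i, j in combinations(range(len(normalized)), 2):
        ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 2)))
    return pairs


# Hypothetical instructions, invented for illustration only.
example_instructions = [
    "Keep answers concise.",
    "Keep all answers concise.",
    "Use TypeScript for new files.",
]

duplicates = find_near_duplicates(example_instructions)
```

Here `duplicates` contains only the pair of the first two instructions. A check like this only produces candidates; deciding which wording survives remains a whole-prompt judgment call.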

One aspect of how Benjamin worked is worth noting. He cleaned up the first few paragraphs by hand to set the standard, then delegated the rest of the work to Claude Opus with the instruction 'follow the style and tone of what I have already edited'. He then reviewed the generated output line by line, manually restoring or reinforcing the parts that mattered. A human set the bar, the model handled the scale, and the human reviewed again. That flow is why the prompt could shrink substantially while preserving the core intent of the existing instructions.

How an experimental physicist ships

With a background in experimental physics, Benjamin designed the rollout like an experiment too. Even choosing the holidays was deliberate. With almost nothing else changing during that period, any regression could be pinned on the prompt change. Over the break Benjamin hand-tested edge cases against the new system prompt, ran internal benchmarks and eval sets, rolled out to a small user group first, and monitored the metrics through a gradual rollout.
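A gradual rollout like the one described above is often implemented with deterministic bucketing, so that each user consistently sees the same prompt version while the exposed share grows. The sketch below shows one common way to do that. It is an assumption about how such a rollout could work, not a description of Lovable's infrastructure; `rollout_bucket`, `choose_prompt`, and the salt value are hypothetical names.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "system-prompt-v2") -> int:
    """Map a user ID to a stable bucket from 0 to 99 using a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_prompt(user_id: str, rollout_percent: int) -> str:
    """Return "new" for users whose bucket falls inside the rollout, otherwise "old"."""
    return "new" if rollout_bucket(user_id) < rollout_percent else "old"


# Hypothetical usage: expose the new prompt to roughly 5% of users first.
variant = choose_prompt("user-123", rollout_percent=5)
```

Because the bucket depends only on the salt and the user ID, raising `rollout_percent` from 5 to 50 keeps every early user on the new prompt and only adds more users. That stability makes it easier to compare metrics between the two groups over time.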

What we should learn from this case is the verification system more than the prompt edits themselves. Without an environment where you can experiment safely, a system prompt becomes code everyone is afraid to touch, and it keeps growing with nobody ever cleaning it up.

Results and lessons

After the new system prompt rolled out to all users, the Lovable team confirmed that responses were about 4% faster, design quality visibly improved, and instruction-following accuracy improved in A/B tests as well. With token usage down, annual LLM costs fell by roughly $20 million.¹ The key point is that the prompt got 'smaller' and performance got 'better'. There is a limit to how many instructions a model can digest, so removing unnecessary instructions restores compliance quality for the ones that remain, exactly as the Lovable case shows.

Benjamin's takeaways come down to three points.

Prompt quality compounds at scale. An improvement that looks marginal at small scale becomes a large gap in cost and performance under heavy traffic.
A whole-prompt perspective beats 'instructing harder'. Removing duplication and inefficiency across the whole prompt works better than continually reinforcing individual parts.
A fast, safe experimentation environment is the best asset. Good evals and gradual rollouts are what make bold cleanup possible.
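To make the first point concrete, here is a back-of-the-envelope calculation of how a per-request token saving adds up over a year. Every input below is a made-up assumption chosen for round numbers; none of them are Lovable's actual traffic, token counts, or pricing, and the model ignores prompt caching and output tokens.

```python
def annual_savings_usd(tokens_removed_per_request: int,
                       requests_per_day: int,
                       usd_per_million_input_tokens: float) -> float:
    """Estimate yearly savings from sending fewer system-prompt tokens on every request."""
    tokens_saved_per_day = tokens_removed_per_request * requests_per_day
    usd_saved_per_day = tokens_saved_per_day / 1_000_000 * usd_per_million_input_tokens
    return usd_saved_per_day * 365


# Hypothetical inputs, not figures from the thread.
savings = annual_savings_usd(
    tokens_removed_per_request=2_000,
    requests_per_day=1_000_000,
    usd_per_million_input_tokens=3.0,
)
```

With these invented inputs, `savings` comes to 2,190,000 dollars a year. The saving is linear in request volume, which is why a trim that looks negligible in a prototype can become a large line item under heavy traffic.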

Wrapping up

A system prompt is not an artifact you write once and move on from. It is an asset you revisit as a whole on a regular basis. Especially in agentic products where the system prompt is the core of the product, the real optimization lies in removing the unnecessary and making things clear, not in adding more instructions. At scale, the effect comes back as numbers: response speed and cost.

To apply the same principle to the CLAUDE.md and AGENTS.md in your own project, see the instruction budget article.

¹ Benjamin Verbeek (@benjaminvrbk), X thread, January 2026. The response-speed and cost figures are the values he shared in the thread.
