Fine-tuning discussions are often framed as matters of taste. One camp says small fine-tuned open models are the future. Another says prompting a frontier model through an API is enough for almost everything.
That framing is usually wrong.
For most real projects, the question is much simpler: for a given task, at a given volume, with a given error profile, what setup gives you the lowest expected cost per event?
That is an economics question, not an ideology question.
There are really two separate decisions:
- Is this task worth automating at all?
- If yes, is fine-tuning the cheapest way to reach the required quality?
Once you build a cost model, the answer is often much less mysterious than people think.
The Wrong Framing
When I talk to customers, peers, or people at conferences, I keep hearing two kinds of statements:
- "This is the year of small open-source models that are fine-tuned."
- "Fine-tuning is the way to go for enterprise."
And on the other side:
- "We already run a large model via API and do huge volumes every day. Fine-tuning probably does not make sense for us, because fine-tuning is expensive and API-based is convenient."
Both camps miss the point.
The question is not whether fine-tuning is fashionable. The question is whether it pays off.
The Engineer's Lens
My background is mechanical engineering. If engineering teaches you one thing, it is reasoning from first principles. Another key lesson: once you move beyond experiments, the first serious question is always economic. Does the investment pay off?
Before switching to a new machine, new part, new production process, or even before making any development decisions, you model the economics. That is it. Simple. If you have a thoroughly vetted economic model that you can run "what if" scenarios on, then we talk. Before that, you are just guessing and wasting everybody's time.
You want to know throughput, scrap, oversight cost, downtime risk, and break-even. If you do not know a factor, you make an educated guess. You come back to the guess as time progresses and update it with measured data from the system. Your model gets better over time.
In ML and GenAI, life is often simpler than that. And still, teams regularly make decisions with less rigor.
Production engineering has already trained us to ask useful questions:
- What is the throughput?
- What is the good-part rate?
- What is the scrap rate?
- What is the cost of human oversight?
- Can the system run fully automatic or not?
- Do we need real-time processing, or can we batch?
And then the really important questions:
- What is the direct cost of scrap?
- What is the indirect cost of scrap?
- How does the cost of an error increase the longer it stays in the system?
That mindset transfers almost directly to ML systems. A wrong prediction is just another form of scrap. The mistake is that many teams still evaluate model choices through benchmark scores, gut feeling, or vendor narratives instead of expected business cost per event.
Let us make that concrete with a classification example.
Start With the Data or Accept That You Are Guessing
Before discussing prompting strategy, model choice, or fine-tuning, get the basic inputs right.
For almost any classification or extraction workflow, you need at least these four groups of numbers. The exact categories depend on your domain, but these cover most of the ground.
1. Volume Data
- How many emails, documents, pages, or events do you process?
- Is there seasonality?
- Is the workload steady, bursty, or batch-driven?
2. Time Data
- How long does a human take to classify one item on average?
- How long does a human take to extract the required information?
- How much review time remains even after automation?
Once you know average handling time and hourly wage, you already have a baseline cost per event.
3. Cost of Errors
This is the part many teams skip, and it is often where most of the value sits.
- What happens when the prediction is wrong?
- Does another human fix it later?
- Does it create delay?
- Does it trigger rework downstream?
- Does it hurt customer satisfaction?
- Does it create compliance or escalation risk?
Even if the estimate is rough, assign a number. A rough number is better than pretending the cost is zero.
4. System Cost
- API inference cost
- hosting cost
- training or fine-tuning cost
- evaluation cost
- monitoring cost
- human-in-the-loop cost
- retraining or relabeling cost
Without this, you cannot answer the actual question.
If collecting all of this feels like a lot of structure to set up from scratch, there is good news: at the end of this post, I have packaged the full question bank and the formulas into a skill you can drop into your favorite AI coding agent. It will walk you through exactly these inputs.
A Note on Human Benchmarking
In some environments, especially where worker protections are strong, teams hesitate to measure individual handling performance. Fair enough. But that does not remove the economic problem.
If you cannot measure per person, measure per team:
- total hours per week
- total volume per week
- total rework volume
A good estimate is still far better than flying blind. If you cannot get even rough numbers, that is usually not an AI problem. It is a data and operations problem.
A Note on Hard-to-Measure Metrics
Some of the most important costs are hard to measure directly: customer satisfaction, brand damage, escalation probability, regulatory risk, or the cost of making a customer wait one day too long.
That does not mean they should be ignored. It means someone has to make a decision and assign a rough value.
I often hear things like: "We want to improve NPS by 0.1 points." Fine. What is that worth? If nobody in the room can even make an approximate economic argument, then the KPI may be directionally useful, but it is not yet usable for investment decisions.
This is not KPI shaming. Soft metrics matter. But if they matter enough to drive budgets, then somebody has to put some skin in the game and translate them, however roughly, into business value.
Worked Example: Triage at a Mid-Sized Insurance Company
Let us make this more tangible.
Imagine a mid-sized insurer with a large operations team handling incoming written interactions: claims notifications, billing questions, policy changes, proof-of-insurance requests, cancellation requests, and supporting documents.
Some of these come in as plain emails. Some come through a portal. Many arrive with attachments that still need to be routed to the right queue.
Assume:
- 5,000 inbound written interactions per day
- 10 top-level classes, each with 10 subclasses (100 possible routes)
- average human handling time for classification: 30 seconds
- average hourly labor cost: €20
Now define the business impact of mistakes. In this setting, the real cost is not just labor. It is delay, rework, repeat contact, escalation risk, and in some cases service-level or compliance risk.
To keep the example concrete, use a management proxy instead of pretending this number falls out of physics.
With an average customer ARR of €600, the team might price one day of avoidable delay at roughly €4 — covering repeat contact, churn risk, and satisfaction impact. (€600/year ÷ 250 working days ≈ €2.40/day; using €4/day means assuming that the operational downside of delay is materially larger than the customer's average daily revenue contribution.)
This is not a universal truth. It is a working estimate. If you dislike the number, replace it with low, base, and high scenarios and rerun the model.
- 1 business day of avoidable delay on a customer interaction costs €4 on average
- if an interaction goes to the wrong department, it usually takes about 2 extra business days to recover
- if the top-level department is correct but the subclass is wrong, it usually creates about half a day of additional delay
That means:
- 1 day of delay = €4
- wrong department (2 days delay) = €8
- wrong subclass (0.5 day delay) = €2
Already, that tells us something important: the dominant cost may not be model inference. It may be classification mistakes.
This is a trap I see often in ML. Teams focus on inference cost because it is easy to measure, and they mentally hide everything that is messy or indirect. But the process does not care which part of the cost structure was easy to log.
ML engineers need more of a process-owner mindset. And business owners need to assign value to the hard-to-measure parts as well. Otherwise you optimize the cheapest visible part of the system and ignore the expensive invisible part.
First Baseline: Human Cost Per Event
The direct labor cost for manual classification: 30 seconds at €20 per hour is €20 × 30 / 3600 ≈ €0.167.
Call it €0.17 per event. That is your floor for manual processing before you even consider error cost.
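The same arithmetic as a Python sketch (variable names are mine, not from any standard API):

```python
# Baseline labor cost per event: handling time times hourly wage.
handling_time_s = 30      # average seconds to classify one item (assumption)
hourly_wage_eur = 20.0    # average loaded labor cost (assumption)

cost_per_event = handling_time_s / 3600 * hourly_wage_eur
print(f"€{cost_per_event:.3f} per event")  # → €0.167 per event
```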
Why Accuracy Alone Is Not Enough
A plain accuracy score is often the wrong metric. Not all mistakes cost the same.
A wrong subclass is annoying. A wrong department is much worse.
So instead of looking only at a raw confusion matrix, collapse it into business-relevant error buckets and assign cost to each bucket.
In the general case, the expected error cost is:

$$\mathbb{E}[\text{error cost}] = \sum_{y} \sum_{\hat{y}} p(y)\, p(\hat{y} \mid y)\, C(y, \hat{y})$$

Where $y$ is the true class, $\hat{y}$ is the predicted class, $p(\hat{y} \mid y)$ is the probability of predicting $\hat{y}$ when the true label is $y$ (read directly from your confusion matrix), and $C(y, \hat{y})$ is the business cost assigned to that specific mistake.

If the system also sends a share $r$ of cases to human review at cost $c_{\text{rev}}$, then the all-in expected cost per event becomes:

$$c_{\text{all-in}} = c_{\text{direct}} + \mathbb{E}[\text{error cost}] + r \cdot c_{\text{rev}}$$

where $c_{\text{direct}}$ is the direct inference or handling cost per event.
That extra review term matters because many real automations are not fully automatic. They are partially automatic with selective human fallback.
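A minimal sketch of that calculation in Python, assuming a row-normalized confusion matrix; all names and numbers here are illustrative, not an established API:

```python
# Expected cost per event from a full confusion matrix plus selective review.
def expected_cost(prior, conf, cost, review_share=0.0, review_cost=0.0):
    """prior[y] = P(true class y); conf[y][yh] = P(predict yh | true y),
    i.e. a row-normalized confusion matrix; cost[y][yh] = business cost."""
    n = len(prior)
    error = sum(prior[y] * conf[y][yh] * cost[y][yh]
                for y in range(n) for yh in range(n))
    return error + review_share * review_cost

# Tiny two-class illustration: 5% cross-routing at €8 per mistake,
# with 10% of cases sent to a €0.17 human review.
prior = [0.5, 0.5]
conf = [[0.95, 0.05], [0.05, 0.95]]
cost = [[0.0, 8.0], [8.0, 0.0]]
print(round(expected_cost(prior, conf, cost, 0.10, 0.17), 3))  # → 0.417
```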
For a full 100-class routing setup, that can become a large matrix. In practice, group cells into a few business-relevant buckets as long as the grouped mistakes really do have similar downstream cost.
| Outcome | Business meaning | Cost |
|---|---|---|
| Correct route | Correct dept and subclass | €0 |
| Wrong subclass | Correct dept, wrong subclass | €2 |
| Wrong department | Wrong top-level class | €8 |
This is the step many teams never do. But once you do it, you can compare humans, prompted models, and fine-tuned models on the same axis: expected euros per event.
Turn the Confusion Matrix Into Money
Assume the current human process:
- top-level (dept) accuracy: 95%
- full-route accuracy: 80%
That implies: 80% fully correct, 15% right dept / wrong subclass, 5% wrong department.
| Outcome | Share | Cost | Contribution |
|---|---|---|---|
| Correct route | 80% | €0 | €0.00 |
| Wrong subclass | 15% | €2 | €0.30 |
| Wrong department | 5% | €8 | €0.40 |
| Total expected error cost | 100% |  | €0.70 |
Add the direct handling cost:
- manual handling: €0.17
- expected error cost: €0.70
- total expected cost per interaction: €0.87
At 5,000 interactions per day: €4,350 / day.
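The whole baseline fits in a few lines; a sketch using the rounded €0.17 handling cost and the bucket shares above:

```python
# Collapse the confusion matrix into business buckets and price the baseline.
buckets = {                      # (share of events, cost per event in EUR)
    "correct route":    (0.80, 0.0),
    "wrong subclass":   (0.15, 2.0),
    "wrong department": (0.05, 8.0),
}
handling_cost = 0.17             # rounded manual handling cost from above
error_cost = sum(share * cost for share, cost in buckets.values())
total_per_event = handling_cost + error_cost
daily = 5000 * total_per_event
print(f"€{total_per_event:.2f} per event, €{daily:,.0f} per day")
# → €0.87 per event, €4,350 per day
```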
Do Not Tune the Inputs Until the Output Feels Nice
One thing you absolutely should not do is start changing assumptions just because the first number feels uncomfortable.
I have seen this multiple times. A team does the first pass, sees a big cost number, and immediately starts nudging the assumptions until the result looks emotionally acceptable.
That is exactly the wrong move.
If you have a good reason to change an assumption, change it. But do not lower the cost of a bad customer experience just to make the spreadsheet feel nicer.
Sometimes the right response to an uncomfortable number is to ask whether your value model is incomplete. Maybe the delay cost is lower than you thought. Maybe it is actually higher, because quick responses improve retention or satisfaction, so falling behind means leaving money on the table.
The point is to get a cost model that keeps everybody honest.
Prompted Model vs Fine-Tuned Model
At this point the first decision is already visible: any automated option that gets materially below the manual baseline of €0.87 per interaction is worth considering (remember: that baseline is handling cost plus expected error cost).
Now we can ask the second question — across all 5,000 daily interactions, since every one of them needs to be classified regardless of what happens downstream.
To keep the comparison readable, assume there is no mandatory human review on every interaction and that any selective benchmarking and evaluation workload is already folded into the ops terms below. If your process still reviews a fixed share of cases, add that residual review cost explicitly with the formula above.
Suppose you compare two candidate systems.
Option A — Prompted General Model
Assume:
- full-route accuracy: 93%
- wrong subclass: 5%
- wrong department: 2%
- inference and ops cost: €0.06 per interaction
Expected variable cost: 0.05 × €2 + 0.02 × €8 + €0.06 = €0.32 / event
Option B — Fine-Tuned Model
This is where fine-tuning tends to shine. For tasks like classification and routing, a fine-tuned model of decent size will rarely lose to a prompted frontier model. The reason is simple: the opaque decision boundaries that make routing hard — which product, which channel, which contract type routes where — are easier to learn from labeled examples than to describe in a prompt. Company-specific intricacies are easier to train than to prompt. The best part? If you introduce new classes later, you can always add a prompt escape hatch for the model even after training. In the end, it is still an LLM. Example prompt: "If the presented data does not fit any of the classes, route it to the new class |XYZ| or send it to a human via |HUMAN|."
Assume:
- full-route accuracy: 96%
- wrong subclass: 3%
- wrong department: 1%
- inference, serving, and routine ops cost: €0.015 per interaction
Expected variable cost: 0.03 × €2 + 0.01 × €8 + €0.015 = €0.155 / event
Incremental variable savings: €0.165 / event → €825 / day → €74,250 over 90 days.
If the fine-tuning project costs €40,000 all-in as a fixed investment: break-even at roughly 49 days, clearly positive at 180 days (€148,500 gross savings over 180 days, or €108,500 net after the project cost).
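A sketch of the comparison, plugging in the assumed rates and costs from Options A and B:

```python
# Prompted vs fine-tuned, compared on expected euros per event.
COST_SUBCLASS, COST_DEPT = 2.0, 8.0   # error prices from the buckets above
VOLUME_PER_DAY = 5000

def cost_per_event(p_wrong_subclass, p_wrong_dept, infra_per_event):
    return (p_wrong_subclass * COST_SUBCLASS
            + p_wrong_dept * COST_DEPT
            + infra_per_event)

prompted   = cost_per_event(0.05, 0.02, 0.06)    # Option A → €0.32
fine_tuned = cost_per_event(0.03, 0.01, 0.015)   # Option B → €0.155

daily_savings = VOLUME_PER_DAY * (prompted - fine_tuned)
print(f"savings €{daily_savings:,.0f}/day, "
      f"break-even after {40_000 / daily_savings:.1f} days")
# → savings €825/day, break-even after 48.5 days
```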
One important caveat: in many real projects, the expensive part is not the training run itself. It is data preparation, labeling cleanup, evaluation, integration, and operating the thing properly afterward. And "afterward" is not a one-time event. Someone has to monitor model quality over time, retrain when the distribution shifts, maintain the serving infrastructure, and handle the inevitable edge cases that only surface in production.
All of those ongoing costs — MLOps time, retraining cycles, serving and monitoring infrastructure — should be folded into $F$ (or amortized into the per-event cost). If you only count the initial project as your investment and ignore the running cost of keeping the system alive, your model will look too optimistic.
That is exactly why volume and horizon matter so much. If the eligible volume was much lower, or the misclassification cost was smaller, fine-tuning may not be worth the additional complexity. The answer comes from the economics, not from fashion.
The Real Decision Rule
Define:

- $N$ = number of relevant predictions over the decision horizon
- $c_{\text{prompt}}$ = all-in variable cost per event of the prompted system
- $c_{\text{ft}}$ = all-in variable cost per event of the fine-tuned system
- $F$ = fixed fine-tuning investment over the horizon

Then fine-tuning is worth it when:

$$N \cdot (c_{\text{prompt}} - c_{\text{ft}}) > F$$
That is the core inequality. Everything else is implementation detail.
If you prefer to amortize the project investment into $c_{\text{ft}}$, that is fine too. Just set $F = 0$ in the formula and make sure you do not count the same cost twice.
Reverse the Formula
Once a prompted model is already cheaper than a human, the next practical question is often not "Should we fine-tune?" but:
- What is the maximum amount we are allowed to invest in fine-tuning?
- Or, if we already know the investment budget, what is the maximum per-prediction cost difference we can tolerate between the prompted model and the fine-tuned model?
From here on, let $F$ mean the fixed fine-tuning investment that has not already been amortized into the per-event cost.
Define:

- $N$ = number of relevant predictions over the horizon
- $c_{\text{prompt}}$ = all-in variable cost per event of the prompted model
- $c_{\text{ft}}$ = all-in variable cost per event of the fine-tuned model
- $F$ = total fixed fine-tuning investment over the horizon

Break-even condition:

$$N \cdot (c_{\text{prompt}} - c_{\text{ft}}) = F$$

Maximum justified fine-tuning investment:

$$F_{\max} = N \cdot (c_{\text{prompt}} - c_{\text{ft}})$$
This gives you a hard ceiling. If the project costs more than that, the economics do not work.
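As a sketch, using a 180-day horizon at 5,000 events per day and the per-event costs from the worked example:

```python
# Hard ceiling on the fine-tuning budget: F_max = N * (c_prompt - c_ft).
def max_investment(n_events, c_prompt, c_ft):
    return n_events * (c_prompt - c_ft)

# 180 days at 5,000 events/day, per-event costs from the worked example.
f_max = max_investment(n_events=180 * 5000, c_prompt=0.32, c_ft=0.155)
print(f"maximum justified investment: €{f_max:,.0f}")
# → maximum justified investment: €148,500
```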
Same Token Cost Assumption
If you assume the prompted model and the fine-tuned model have the same direct prediction cost, then the difference is driven only by error reduction. In that case, let $e_{\text{prompt}}$ and $e_{\text{ft}}$ be the expected error costs per event:

$$F_{\max} = N \cdot (e_{\text{prompt}} - e_{\text{ft}})$$
That is often a very helpful sanity check. It tells you how much accuracy improvement is worth in euros before you even discuss infrastructure details.
Maximum Allowed Prediction Cost Difference
Now take the other direction.
Suppose you already know the total fine-tuning investment budget $F$, and you want to know how much more expensive the fine-tuned prediction is allowed to be compared with the prompted baseline.
Define:

- $e_{\text{prompt}}$ = expected error cost per event of the prompted model
- $e_{\text{ft}}$ = expected error cost per event of the fine-tuned model
- $i_{\text{prompt}}$ = direct inference cost per event of the prompted model
- $i_{\text{ft}}$ = direct inference cost per event of the fine-tuned model

Then the maximum allowed fine-tuned prediction cost premium is:

$$\Delta_{\max} = (e_{\text{prompt}} - e_{\text{ft}}) - \frac{F}{N}$$

Equivalent form:

$$i_{\text{ft}} \le i_{\text{prompt}} + (e_{\text{prompt}} - e_{\text{ft}}) - \frac{F}{N}$$
If that value is negative, the message is simple: your fine-tuned model does not just need to be better — it also needs to be cheaper to run.
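A sketch of that premium calculation, plugging in the error costs from the worked example (€0.26 and €0.14 per event), the €40,000 budget, and a 90-day horizon:

```python
# Maximum allowed inference-cost premium for the fine-tuned model:
# delta_max = (e_prompt - e_ft) - F / N
def max_premium(e_prompt, e_ft, budget, n_events):
    return (e_prompt - e_ft) - budget / n_events

delta = max_premium(e_prompt=0.26, e_ft=0.14, budget=40_000, n_events=90 * 5000)
print(f"allowed premium: €{delta:.4f} per event")
# → allowed premium: €0.0311 per event
```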
Try the Numbers on Your Own Case
Take the insurance example above as your defaults and swap in your own inputs. Training cost is amortized over all events in the evaluation horizon. Break-even is calculated on gross savings before amortization. Wrong rates are derived from the accuracy inputs: wrong department = 1 − top-level accuracy, wrong subclass = top-level accuracy − full-route accuracy. All figures are illustrative.
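If you want to run those what-ifs without a spreadsheet, here is a minimal Python sketch of the same model; all names and defaults are assumptions taken from the example above:

```python
# What-if model: derive wrong rates from accuracies, then price each option.
def cost_per_event(top_acc, full_acc, infra,
                   cost_subclass=2.0, cost_dept=8.0):
    wrong_dept = 1 - top_acc             # wrong top-level class
    wrong_subclass = top_acc - full_acc  # right dept, wrong subclass
    return (wrong_subclass * cost_subclass
            + wrong_dept * cost_dept
            + infra)

human      = cost_per_event(0.95, 0.80, infra=0.17)    # manual handling
prompted   = cost_per_event(0.98, 0.93, infra=0.06)    # Option A
fine_tuned = cost_per_event(0.99, 0.96, infra=0.015)   # Option B

volume, budget, horizon_days = 5000, 40_000, 90
daily_savings = volume * (prompted - fine_tuned)
print(f"human €{human:.3f} | prompted €{prompted:.3f} | "
      f"fine-tuned €{fine_tuned:.3f}")
print(f"gross savings over horizon: €{daily_savings * horizon_days:,.0f}, "
      f"break-even after {budget / daily_savings:.1f} days")
```

With these defaults the output matches the worked example: €0.870 / €0.320 / €0.155 per event, €74,250 gross savings over 90 days, and break-even just under 49 days.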
A Note on Organizational Readiness
The formulas above are agnostic about who does the work. But in practice, the decision to fine-tune is not just an economic one — it is also a capability question.
Running a fine-tuned model in production means somebody in your organization needs to own the pipeline: data versioning, training runs, evaluation, deployment, monitoring, and retraining when the world changes. That does not require a huge ML platform team — the tooling has gotten dramatically easier — but it does require operational discipline and at least a small team that is comfortable with the workflow.
If your organization does not have that capability today, the cost of building it becomes part of . If it does, much of that cost is already sunk and the marginal cost of one more fine-tuned model is lower than people assume.
This is why the framework in this post is more naturally suited to organizations with some MLOps maturity. If you are a two-person startup with no ML infrastructure, the overhead of standing up a fine-tuning pipeline may dominate the economics regardless of what the per-event math says. A prompted API call that costs more per event but requires zero infrastructure may still be the rational choice — for now.
The good news: this is not a permanent constraint. As tooling improves and as your volume grows, the break-even point shifts. The model tells you when.
A Note on Supply Chain Risk
There is a mirror image of the organizational readiness argument that people bring up far less often.
When teams say "We just call an API, so we do not have to manage a model," they are making an implicit assumption: that calling an API is maintenance-free. It is not.
You are still running a model. You just do not own it. And that creates its own risk profile:
- Silent distribution shifts. The provider updates or replaces the model behind the endpoint. Your prompts still run, but accuracy may change. If you are not running evaluations regularly, you will not notice until downstream metrics degrade.
- Model deprecation. Models get retired. If your workflow depends on a specific model version, you may be forced to migrate on someone else's timeline.
- Throughput constraints. You cannot scale an API endpoint indefinitely. At high volume, you need quota increases, which require lead time and sometimes negotiation. During demand spikes, you may hit rate limits you did not plan for.
- No control over the training cycle. If the model's behavior drifts in a way that hurts your use case, your only lever is prompt engineering. You cannot retrain, you cannot freeze a version indefinitely, and you cannot inject your own data to correct the drift.
In supply chain terms, this is a single-supplier dependency with limited contractual control over the product specification. In German manufacturing, we would call this a Lieferkettenrisiko — a supply chain risk. The mental model is the same: you trade operational complexity for external dependency, and that dependency has a cost that belongs in your model.
None of this means API-based inference is wrong. For many use cases, especially at lower volumes, it is the clearly rational choice. But the comparison should be honest. If you count MLOps cost for the fine-tuned path, you should also count evaluation overhead, migration risk, and throughput constraints for the API path.
Where Fine-Tuning Usually Makes Sense
The equation tells you whether fine-tuning pays off. But there are also use-case patterns where it tends to make sense more often than people realize.
1. When You Already Have Labeled Data and Did Not Notice
One of the best cases for fine-tuning is when training data already exists because humans have been labeling it for operational reasons anyway.
Email classification is a classic example. Most companies already have historical email traffic that ended up in a department, got forwarded, got corrected, or got resolved by a certain team. That operational history is often an implicit label source.
The same is true for many document-heavy workflows:
- document routing
- packet splitting
- extracting key information into UI masks
- assigning documents to specialists
If somebody in the past designed the process well enough that humans were already creating structured outcomes, you already paid a good chunk of the labeling cost. That is a serious head start.
And where the routing history is messy or ambiguous, you can still use humans as judges to clean up the tail of the distribution instead of starting from zero.
2. When the Decision Boundary Lives in People's Heads
Another strong signal is whenever you hear phrases like:
- "You just need to know."
- "It becomes obvious once you have worked here for a while."
- "This one is tricky because of how product A interacts with product B."
Those are signs that the decision boundary is real, valuable, and poorly documented.
Take email classification again. The first few classes may be easy. But then you get opaque routing logic built up over years: if it references one product but comes from another channel, and mentions a certain contract type, then it belongs somewhere unexpected.
At that point you have two choices:
- clean up the process and simplify the rules
- learn from historical labeled behavior
Sometimes the correct answer is process redesign. Sometimes the correct answer is fine-tuning on the accumulated operational data.
3. When People Fear That New Classes Mean Starting Over
"If we add a new class, do we have to retrain everything?"
Usually, no.
With LLM-based classifiers, you are often not building a rigid prediction head that must be redesigned every time the taxonomy changes. You are teaching the model to output a label or structured token sequence. That is more flexible than people assume.
Taxonomy changes are not free — they still need evaluation, data updates, and operational discipline. But they are not the catastrophic reset that some teams imagine.
4. When Batch Economics Matter
There is another case where fine-tuning can become very attractive: batch-heavy workloads.
Yes, many API providers offer batch inference. But often those windows are around 24 hours, and 24 hours is just a bit too slow for many operational processes. You may not need strict real-time, but you also do not want to wait until tomorrow.
That is where a fine-tuned model served in your own environment can get interesting. You can accumulate enough work, run batch prediction on your own schedule, and drive the cost down.
This can mean:
- a scheduled ECS task
- a containerized batch worker
- a self-hosted inference service
- a serving framework optimized for throughput and caching
The point is not that self-hosting is automatically better. The point is that once volume is high enough and latency constraints are awkward enough, the economics can shift very quickly.
Food for Thought: Multi-Layer Cost Structures
Everything in this post models a single layer: routing accuracy and its direct cost implications. But real processes rarely have just one layer.
In the insurance example, correctly classifying an email is only the first gate. After it has been routed to the correct department, a specialist in that team opens the correspondence, reads it, and handles it — finding the right template, checking eligibility, or preparing a response. Specialists typically carry a higher cost structure than the routing layer: more domain knowledge, higher hourly rates, and scarcer capacity. For a large share of incoming interactions — FAQ-style requests, standard acknowledgements, routine document requests — that handling step is largely mechanical. A human still does it, but it is the kind of work that an agent system with access to the right business logic could own. That handling cost is real, sits downstream of the routing decision, and is not modelled here.
This creates a multi-layer benefit structure: improvements in upstream classification accuracy unlock downstream automation potential. If the routing is wrong, the downstream handler — whether human or agent — starts from a bad position. If the routing is right and the intent is unambiguous, you have the option to automate the response entirely.
The broader point: upstream efficiency is a prerequisite for downstream automation. Strong routing is not proof that fully automated handling is safe, but it is one of the gates you have to pass before that conversation is even serious. Once each layer clears its own quality bar, the economics can compound.
Practical Takeaways
- Do not ask "Should we fine-tune?" before asking "What does an error cost us?"
- Do not use raw accuracy as your primary decision metric when different mistakes have different business consequences.
- Collapse the confusion matrix into economically meaningful buckets.
- Compare options on expected cost per event, not on benchmark vanity.
- Separate the decision to automate from the decision to fine-tune.
- Do not massage assumptions until the output feels emotionally acceptable.
- Look for hidden label sources in existing operational systems.
Conclusion
Apply this framework to any ML or GenAI project you are running right now.
If you can estimate:
- volume
- handling time
- hourly labor cost
- confusion matrix or error bucket rates
- business cost per error type
- system cost
then you can usually answer the fine-tune-or-not question much faster than most people expect.
And if you cannot answer it, that is useful too. It means the next problem to solve is not model architecture. It is measurement.
Grab the Skill
If you want to turn this framework into a working prep workflow, I packaged it as a two-file skill you can drop into any coding agent with skill support.
- SKILL.md — the agent instructions (drop this into your skills folder)
- REFERENCE.md — formulas, question bank, and the worked example from this post
It is designed for BDs, AI strategists, product leads, and solution architects. Instead of giving you generic AI advice, it pushes you to quantify the use case, identify missing assumptions, and generate the right questions to take back to operations, finance, compliance, or the business.
Outlook
In follow-up posts, I want to go deeper into how to estimate the true cost of fine-tuning projects (spoiler: training is rarely the expensive part) and fine-tune an LLM for information extraction to mimic a real use case, breaking down all the costs of the fine-tuning that we incurred.
Hope to see you there!