# Reference — AI Project Economics Buddy

## Baseline questions

Use these to build the current-process cost model. Work top-down: throughput first, then handling cost, then error cost, then constraints.

### Throughput

- How many events (emails, documents, tickets, etc.) per day/week/month?
- Is the workload steady, seasonal, or bursty?
- Is real-time processing required, or can you batch?

### Handling cost

- How long does a human take per event on average?
- Who does the work today (role, seniority)?
- What is the fully loaded hourly cost? If unknown, estimate from role and region.
- If individual measurement is sensitive, use a team-level average: total team hours per week divided by total volume per week.
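Both ways of estimating handling cost can be sketched as a pair of helpers (function names and the example figures are illustrative, not from the source):

```python
def handling_cost_per_event(seconds_per_event: float, hourly_cost: float) -> float:
    """Fully loaded cost of one human-handled event, measured per person."""
    return seconds_per_event / 3600 * hourly_cost

def team_level_cost_per_event(team_hours_per_week: float, events_per_week: float,
                              hourly_cost: float) -> float:
    """Fallback when per-person measurement is sensitive: average over the team."""
    return team_hours_per_week * hourly_cost / events_per_week

# 30s/event at a fully loaded rate of 20/hr ≈ 0.17/event
```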

### Error economics

First, define error buckets that reflect real business consequences — not just "correct vs wrong." Example for a routing/classification task:

| Bucket | What happens | Example cost driver |
|--------|-------------|-------------------|
| Fully correct | Nothing | €0 |
| Minor error (e.g. wrong subclass, right department) | Small rework, short delay | Rework time + minor delay cost |
| Major error (e.g. wrong department) | Re-routing, multi-day delay, possible escalation | Delay cost + escalation risk |
| Critical error (e.g. compliance failure) | Legal/regulatory exposure | Compliance penalty + remediation |

For each bucket, ask:
- What happens operationally when this error occurs?
- Who catches and fixes it?
- How long does recovery take?
- What is the approximate cost per occurrence?

**The scenario method for soft costs:** When a cost is hard to measure directly (customer satisfaction, brand damage, churn risk), ask: "If this metric got 20% worse for a quarter, what would the plausible business downside be?" That usually gets you to a better number than abstract KPI talk.

**Never default an unknown cost to zero.** Use low/base/high estimates instead. A cost silently set to zero erases that risk from the model and skews the entire comparison.
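One way to keep low/base/high estimates explicit is to carry all three scenarios through the bucket model. A minimal sketch, with made-up rates and costs:

```python
# Error buckets: (probability, low/base/high cost per occurrence) — illustrative numbers.
buckets = {
    "minor_error": (0.15, {"low": 1.0, "base": 2.0, "high": 4.0}),
    "major_error": (0.05, {"low": 5.0, "base": 8.0, "high": 15.0}),
}

def expected_error_cost(buckets: dict, scenario: str) -> float:
    """Expected error cost per event under one scenario."""
    return sum(p * costs[scenario] for p, costs in buckets.values())

# base: 0.15 * 2 + 0.05 * 8 = 0.70/event; low = 0.40; high = 1.35
```

Reporting the spread (0.40 to 1.35 here) makes it obvious when a decision flips depending on which estimate turns out to be right.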

### Oversight and residual human cost

- What share of events still needs human review after automation?
- What does that review cost per event?
- Are there compliance or audit requirements for human oversight?

### Constraints

- SLA or latency requirements?
- Privacy, data residency, or compliance constraints?
- Batch window availability vs real-time need?

---

## Decision formulas

### Cost per event (general)

```
C_total = E[error cost] + r_review * C_review + C_system
```

Where:
- `E[error cost] = sum over all (i,j) of P(true i, predict j) * cost(i,j)` — read from the confusion matrix normalized to joint probabilities, weighted by business cost
- `r_review` = share of events sent to human review
- `C_review` = cost of one human review
- `C_system` = inference + hosting + monitoring + maintenance per event

For the human baseline, `C_system` is replaced by the per-event handling cost and `r_review` is typically zero.
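A sketch of the formula in code, assuming the confusion matrix is given as joint probabilities. The label names and the split of errors between buckets below are illustrative (chosen to be consistent with the worked example later in this document):

```python
def total_cost_per_event(joint_probs: dict, costs: dict, r_review: float,
                         c_review: float, c_system: float) -> float:
    """joint_probs[i][j] = P(true i, predict j); costs[i][j] = business cost of (i, j)."""
    expected_error = sum(
        joint_probs[i][j] * costs[i][j]
        for i in joint_probs
        for j in joint_probs[i]
    )
    return expected_error + r_review * c_review + c_system

# Collapsed to three outcome buckets (the 5%/2% error split is an assumption):
joint = {"any": {"correct": 0.93, "wrong_subclass": 0.05, "wrong_dept": 0.02}}
cost = {"any": {"correct": 0.0, "wrong_subclass": 2.0, "wrong_dept": 8.0}}
total_cost_per_event(joint, cost, r_review=0.0, c_review=0.0, c_system=0.06)  # → 0.32
```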

### The core decision inequality

Fine-tuning (or any higher-investment approach) is worth it when:

```
N * (C_baseline - C_candidate) > I_candidate
```

Where:
- `N` = number of events over the decision horizon
- `C_baseline` = all-in variable cost per event of the current best option
- `C_candidate` = all-in variable cost per event of the higher-investment option
- `I_candidate` = fixed investment (data prep, training, integration, eval, monitoring setup)

This works for any pairwise comparison: human vs prompted model, prompted model vs fine-tuned, fine-tuned vs agent system, etc.
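The inequality translates directly into code. A sketch (the example figures in the comment are from the worked example later in this document):

```python
def worth_investing(n_events: float, c_baseline: float, c_candidate: float,
                    i_candidate: float) -> bool:
    """True when the higher-investment option pays back over the horizon."""
    return n_events * (c_baseline - c_candidate) > i_candidate

def break_even_events(c_baseline: float, c_candidate: float, i_candidate: float) -> float:
    """Number of events needed before the fixed investment is recovered."""
    savings = c_baseline - c_candidate
    if savings <= 0:
        return float("inf")  # candidate never pays back on variable cost alone
    return i_candidate / savings

# e.g. 0.32 vs 0.155 per event, 40k investment → ≈ 242k events to break even
```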

### Reverse: maximum justified investment

```
I_max = N * (C_baseline - C_candidate)
```

This is the hard ceiling. If the project costs more than this, the economics do not work. Bring this number to budget discussions.

### Reverse: maximum allowed cost premium

If you know the budget and want to know how much more expensive the candidate system is allowed to be per event:

```
delta_P_max = (E_baseline - E_candidate) - I_candidate / N
```

Where:
- `E_baseline`, `E_candidate` = expected error cost (plus review cost, if any) per event of each option
- `delta_P_max` = the maximum per-event system-cost premium the candidate can carry over the baseline and still break even

If negative: the candidate needs to be both more accurate AND cheaper to run.
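The rearranged formula as a small helper (a sketch; the example values in the comment are illustrative):

```python
def max_cost_premium(e_baseline: float, e_candidate: float,
                     i_candidate: float, n_events: float) -> float:
    """Maximum extra run cost per event the candidate may carry and still break even."""
    return (e_baseline - e_candidate) - i_candidate / n_events

# Error cost drops 0.26 → 0.14 per event, 40k budget over 1,825,000 events:
# (0.26 - 0.14) - 40000 / 1825000 ≈ 0.098/event allowed premium
```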

---

## Worked example pattern

**Insurance email triage** (from the blog post "To Fine-Tune or Not?"):

- 5,000 inbound interactions/day
- 100 possible routes (10 departments x 10 subclasses)
- Human handling: 30s/event at €20/hr = €0.17/event
- Error buckets: correct (€0), wrong subclass (€2), wrong department (€8)
- Human baseline: 80% correct, 15% wrong subclass, 5% wrong dept
- Human expected error cost: 0.15 * €2 + 0.05 * €8 = €0.70/event
- Human total: €0.87/event

- Prompted model: 93% correct, €0.06 system cost → €0.32/event
- Fine-tuned model: 96% correct, €0.015 system cost → €0.155/event

Fine-tuning saves €0.165/event → €825/day → break-even on a €40k project in ~49 days.

Use this as a reference pattern when helping users structure their own analysis. The specific numbers are illustrative — the structure is what matters.
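The example can be reproduced end to end. Note that the split of model errors between wrong subclass and wrong department is not given above; the 5%/2% and 3%/1% splits below are assumptions chosen to match the quoted per-event totals:

```python
EVENTS_PER_DAY = 5000
COST_SUBCLASS, COST_DEPT = 2.0, 8.0  # per-occurrence error costs (EUR)

def per_event(p_subclass_err: float, p_dept_err: float, handling_or_system: float) -> float:
    """Expected error cost plus handling (human) or system (model) cost per event."""
    return p_subclass_err * COST_SUBCLASS + p_dept_err * COST_DEPT + handling_or_system

human = per_event(0.15, 0.05, 30 / 3600 * 20)   # ≈ 0.87
prompted = per_event(0.05, 0.02, 0.06)          # 0.32 (5%/2% split assumed)
fine_tuned = per_event(0.03, 0.01, 0.015)       # 0.155 (3%/1% split assumed)

daily_saving = (prompted - fine_tuned) * EVENTS_PER_DAY   # ≈ 825/day
break_even_days = 40_000 / daily_saving                   # ≈ 48.5 days
```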

---

## Data readiness checklist

When fine-tuning or supervised approaches are on the table, check:

- [ ] Do labels already exist in operational history? (routing decisions, corrections, approvals)
- [ ] Is the ground truth trustworthy, or does it need cleanup?
- [ ] How often does the taxonomy or target schema change?
- [ ] Are there "you just need to know" rules that signal undocumented decision boundaries?
- [ ] Is there enough volume in tail classes, or is the distribution heavily skewed?
