ai-cost · cloud-waste · finops
By Shayan Ghasemnezhad · 3 min read
GPU instances and inference endpoints have reopened the cloud cost problem that FinOps was starting to solve. Governance needs to catch up.
Cloud cost management was making progress. Teams were tagging resources, right-sizing instances, and buying Savings Plans. Then AI workloads arrived—and the cost curve bent upward again. GPU instances cost 10–40x their CPU equivalents. Inference endpoints run 24/7 whether or not anyone is asking questions. Training jobs can burn through five figures in a weekend. The FinOps playbook that worked for web applications needs new chapters.
Traditional cloud cost management assumes relatively predictable, steady-state workloads. You provision a fleet of instances, they run services, and cost scales roughly with traffic. AI workloads violate every part of this assumption. Training jobs are bursty and unpredictable—a hyperparameter sweep might spin up 50 GPU instances for six hours, then nothing for two weeks. Inference demand is hard to forecast because product teams are still discovering what users do with AI features.
The unit economics are also different. A single p4d.24xlarge instance costs roughly €28 per hour. A team running fine-tuning experiments without cost guardrails can spend more in a day than their entire monthly EC2 budget for non-AI workloads. And unlike CPU instances, GPU instances have limited Savings Plan coverage and sparse spot availability.
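To make the burn rate concrete, a back-of-the-envelope sketch using the figures above (the €28/hour p4d-class rate from this article; instance counts and durations are illustrative):

```python
def training_burn(instances: int, rate_eur_per_hour: float, hours: float) -> float:
    """Estimated spend for a burst training job: instances x hourly rate x duration."""
    return instances * rate_eur_per_hour * hours

# The hyperparameter sweep described earlier: 50 GPU instances for six hours.
sweep = training_burn(instances=50, rate_eur_per_hour=28.0, hours=6)
# An unattended fine-tuning run on ten p4d-class instances over a 48-hour weekend.
weekend = training_burn(instances=10, rate_eur_per_hour=28.0, hours=48)
print(f"Sweep: €{sweep:,.0f}, weekend: €{weekend:,.0f}")  # Sweep: €8,400, weekend: €13,440
```

The weekend figure is how a team reaches five figures before anyone reviews a bill.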
Training gets the headlines, but inference is where the ongoing cost lives. A self-hosted model endpoint on a g5.2xlarge costs approximately €1.20/hour, roughly €870/month if it runs continuously. If your AI feature handles 50 requests per hour, that is €0.024 per request. At 5 requests per hour, it is €0.24—an order of magnitude difference in unit cost for the same infrastructure.
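The arithmetic behind those unit costs, as a small sketch (730 is the average number of hours in a month; the g5.2xlarge rate is the approximate figure used above):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def cost_per_request(hourly_rate_eur: float, requests_per_hour: float) -> float:
    """Unit cost of an always-on endpoint at a given request rate."""
    return hourly_rate_eur / requests_per_hour

g5_rate = 1.20  # approximate g5.2xlarge on-demand rate
print(f"Monthly: €{g5_rate * HOURS_PER_MONTH:,.0f}")            # Monthly: €876
print(f"At 50 req/h: €{cost_per_request(g5_rate, 50):.3f}/req")  # €0.024/req
print(f"At  5 req/h: €{cost_per_request(g5_rate, 5):.2f}/req")   # €0.24/req
```

The fixed cost does not move; only the denominator does, which is why utilisation dominates the unit economics.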
The decision between self-hosted inference and managed API (OpenAI, Anthropic, Bedrock) is fundamentally a utilisation question. Managed APIs charge per token with no idle cost. Self-hosted endpoints have high fixed cost and low marginal cost. The crossover point depends on volume, latency requirements, and data residency constraints.
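The crossover can be estimated directly. A minimal sketch, assuming a flat per-token API price (the €0.002 per 1K tokens below is a hypothetical number for illustration, not a quote from any provider) and ignoring the ops overhead of running your own endpoint:

```python
def breakeven_tokens_per_month(endpoint_eur_per_month: float,
                               api_eur_per_1k_tokens: float) -> float:
    """Monthly token volume above which a self-hosted endpoint beats a
    per-token API, assuming the endpoint can absorb that volume."""
    return endpoint_eur_per_month / api_eur_per_1k_tokens * 1000

# Illustrative: the €876/month g5.2xlarge endpoint vs €0.002 per 1K tokens.
tokens = breakeven_tokens_per_month(876, 0.002)
print(f"Break-even: {tokens / 1e6:.0f}M tokens/month")  # Break-even: 438M tokens/month
```

Below that volume the managed API is cheaper; above it, and absent residency or latency constraints, self-hosting starts to pay for itself.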
Build governance around three controls: budgets, approval gates, and automated shutdown.
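The budget and automated-shutdown controls reduce to a policy check run against month-to-date spend. A minimal sketch — the class and threshold names are assumptions, not a reference to any particular FinOps tool; the approval-gate control would sit in front of provisioning rather than in this loop:

```python
from dataclasses import dataclass

@dataclass
class BudgetPolicy:
    monthly_budget_eur: float
    alert_threshold: float = 0.8   # warn at 80% of budget (assumed default)
    hard_stop: bool = True         # automated shutdown when budget is exhausted

def evaluate(policy: BudgetPolicy, month_to_date_spend_eur: float) -> str:
    """Return the governance action for the current spend level."""
    if month_to_date_spend_eur >= policy.monthly_budget_eur:
        return "shutdown" if policy.hard_stop else "escalate"
    if month_to_date_spend_eur >= policy.alert_threshold * policy.monthly_budget_eur:
        return "alert"
    return "ok"

policy = BudgetPolicy(monthly_budget_eur=5000)
print(evaluate(policy, 3200))  # ok
print(evaluate(policy, 4400))  # alert
print(evaluate(policy, 5100))  # shutdown
```

In practice the same check can be wired to a billing export and a scheduler, but the decision logic stays this simple.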
For each AI workload, answer four questions:

1. What is the expected utilisation? If below 30%, use a managed API instead of self-hosting.
2. What is the data residency requirement? If data cannot leave your VPC, self-hosting or Bedrock is mandatory.
3. What is the latency budget? If sub-100ms, you need a dedicated endpoint; if 2–5 seconds is acceptable, serverless inference works.
4. What is the experimentation cadence? If the team is running daily experiments, invest in a shared training cluster with scheduling; if monthly, on-demand is fine.
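The first three questions can be encoded as a deployment-strategy rule, useful as a starting point for an internal checklist. A sketch with the thresholds as stated above; the function name and strategy labels are illustrative:

```python
def inference_strategy(utilisation: float, data_must_stay_in_vpc: bool,
                       latency_budget_ms: float) -> str:
    """Apply the decision rules: residency overrides cost, then
    utilisation, then latency budget."""
    if data_must_stay_in_vpc:
        return "self-hosted or Bedrock"   # residency is non-negotiable
    if utilisation < 0.30:
        return "managed API"              # below 30% utilisation
    if latency_budget_ms < 100:
        return "dedicated endpoint"       # sub-100ms needs reserved capacity
    return "serverless inference"

print(inference_strategy(0.10, False, 2000))  # managed API
print(inference_strategy(0.60, False, 50))    # dedicated endpoint
print(inference_strategy(0.60, True, 2000))   # self-hosted or Bedrock
```

The ordering matters: a residency constraint makes the utilisation argument moot, which is why it is checked first.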
The worst failure is invisible spend. A data scientist spins up a notebook instance on a Friday afternoon, runs a training job, and forgets to shut down the instance. It runs for three weeks. This happens in every organisation that does not enforce auto-stop policies on development instances.
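An auto-stop policy is straightforward to express: flag any development instance idle beyond a threshold. A minimal sketch over plain records — the two-hour threshold and the record shape are assumptions; in production this would read from your cloud provider's instance and metrics APIs:

```python
from datetime import datetime, timedelta

MAX_IDLE = timedelta(hours=2)  # assumed auto-stop threshold for dev instances

def instances_to_stop(instances: list[dict], now: datetime) -> list[str]:
    """Return IDs of dev instances idle longer than MAX_IDLE.
    Each record: {"id": str, "env": str, "last_activity": datetime}."""
    return [i["id"] for i in instances
            if i["env"] == "dev" and now - i["last_activity"] > MAX_IDLE]

now = datetime(2024, 6, 3, 9, 0)  # Monday morning
fleet = [
    # The notebook forgotten on Friday afternoon.
    {"id": "nb-1", "env": "dev", "last_activity": datetime(2024, 5, 31, 17, 0)},
    # A production endpoint with recent traffic: never auto-stopped.
    {"id": "ep-1", "env": "prod", "last_activity": datetime(2024, 6, 3, 8, 55)},
]
print(instances_to_stop(fleet, now))  # ['nb-1']
```

Run on a schedule, this catches the forgotten Friday notebook on Friday evening instead of three weeks later.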
Model sprawl is the AI equivalent of server sprawl. Teams deploy multiple model endpoints for different features without a shared registry. Each endpoint has its own GPU allocation. Consolidate where possible—a single endpoint serving multiple use cases with routing logic is cheaper than three endpoints at 15% utilisation each.
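The consolidation argument is just additive utilisation. A sketch under the simplifying assumptions that the endpoints have identical capacity and their peak loads do not collide:

```python
def combined_utilisation(per_endpoint_util: list[float]) -> float:
    """Utilisation if the same traffic were routed to a single endpoint
    of the same capacity (assumes loads are additive, peaks don't overlap)."""
    return sum(per_endpoint_util)

endpoints = [0.15, 0.15, 0.15]
freed = len(endpoints) - 1
print(f"Three endpoints at 15% each -> one at "
      f"{combined_utilisation(endpoints):.0%}, freeing {freed} GPU allocations")
# Three endpoints at 15% each -> one at 45%, freeing 2 GPU allocations
```

One endpoint at 45% is still under-utilised, but it pays for two fewer GPUs while leaving headroom for growth.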
AI cost governance is not optional—it is the difference between AI features that improve margin and AI features that erode it. Build the visibility and controls before the spend forces you to.