agentic-ai · ai · evals
By Shayan Ghasemnezhad4min read
AI agents degrade silently. The observability stack and eval framework that catches drift before users do.
Traditional software fails loudly. An API returns a 500, a test turns red, a metric crosses a threshold. AI agents fail quietly. The agent still responds, still takes actions, still produces output that looks plausible. But the quality degrades—subtly, gradually, and invisibly until a user reports that “the AI seems worse lately.” By that point, the drift has been compounding for weeks.
A REST API has a contract: given this input, return that output. You can write assertions against it. An AI agent has a goal, not a contract. It reasons about inputs, decides which tools to call, and produces outputs that are correct-ish rather than correct or incorrect. Standard monitoring—latency, error rate, throughput—tells you whether the agent is running. It does not tell you whether the agent is producing good outcomes.
Agents compound the problem by making multi-step decisions. An agent that retrieves documents, synthesises information, and generates a response has three points where quality can degrade: retrieval relevance, synthesis accuracy, and response quality. A drop in retrieval relevance (caused by a change in the vector index or new document types) silently degrades every downstream step.
Build observability around four layers:
Evals are automated tests for AI quality. Unlike unit tests, they produce scores rather than pass/fail. A practical eval pipeline runs on every deployment and on a daily schedule against production data.
Start with a golden dataset: 50–100 queries with known-good answers, covering the common cases and the edge cases your agent handles. Run the agent against this dataset after every model update or prompt change. Score outputs on relevance, factual accuracy, and format compliance. Track scores over time. A 5% drop in accuracy across two consecutive runs warrants investigation.
# Minimal eval runner with scoring
from dataclasses dataclass
ai · distribution
When the technology layer commoditises overnight, what separates lasting companies from wrappers? Where moats form in the AI landscape.
Drift comes from three sources: model updates (the provider ships a new version), data changes (the retrieval index is updated with new content), and prompt changes (a teammate edits the system prompt). Each source needs its own detection mechanism.
For model updates: pin model versions in production. Run evals against new versions in staging before promoting. For data changes: track retrieval quality metrics (precision@k, recall) and alert when they drop. For prompt changes: version-control prompts like code, require review, and run evals on every change.
Invest in observability proportional to the agent’s blast radius. An internal summarisation tool needs basic operational metrics and a weekly eval. A customer-facing agent that takes actions (creates tickets, sends emails, modifies data) needs real-time quality monitoring, automated evals on every deploy, and human-in-the-loop review for edge cases.
The most common failure: relying on user feedback as the primary quality signal. Users report catastrophic failures but not gradual degradation. By the time “the AI seems worse” becomes a support ticket, the quality has been declining for weeks. Proactive evals catch what users tolerate.
Another failure: evals that test the easy cases. If your golden dataset only includes straightforward queries, it will not catch degradation on the edge cases where agents struggle most. Include adversarial inputs, ambiguous queries, and multi-step reasoning tasks in your dataset.
AI observability is not a dashboard—it is a practice. Measure what matters, automate the evals, and treat quality as a metric that ships with the feature, not one that gets added after the first incident.