Agentic AI · Infra · 2025 · Big Kitty Labs

CI/CD & Eval Harness for Agents

Operationalized agent evaluation: every deploy is gated on task success, grounding, latency, and drift with LangSmith.

Role: Lead engineer
Impact: Every deploy is eval-gated
LangSmith · GitHub Actions · Python · Docker · Terraform

Context

Agent behavior is non-deterministic — a prompt tweak or model bump can silently regress grounding or blow up latency. Manual spot-checks didn't scale and didn't catch drift.

We needed reproducible, automated evaluation wired into the same CI/CD that ships the code.

Approach

I built a LangSmith-backed eval harness with curated datasets and scorers for task success, retrieval grounding, output quality, and latency.
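
To make the scorer idea concrete, here is a minimal sketch of three of the four axes in plain Python. It is not the harness's actual code and does not use the LangSmith API: `EvalCase`, `AgentOutput`, the substring and word-overlap heuristics, and the 2-second latency budget are all assumptions chosen for illustration. Output quality is left out because it would typically need a model-graded scorer.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One curated dataset row (hypothetical shape)."""
    question: str
    expected: str
    retrieved_docs: list[str]


@dataclass
class AgentOutput:
    """What the agent returned for a case (hypothetical shape)."""
    answer: str
    latency_s: float


def score_task_success(case: EvalCase, out: AgentOutput) -> float:
    # Crude check: the expected string appears in the answer, ignoring case.
    return 1.0 if case.expected.lower() in out.answer.lower() else 0.0


def score_grounding(case: EvalCase, out: AgentOutput) -> float:
    # Fraction of the answer's longer words that occur (as substrings)
    # somewhere in the retrieved context. A rough proxy, not a real
    # faithfulness metric.
    context = " ".join(case.retrieved_docs).lower()
    words = [w.strip(".,;:!?").lower() for w in out.answer.split()]
    words = [w for w in words if len(w) > 3]
    if not words:
        return 0.0
    return sum(w in context for w in words) / len(words)


def score_latency(out: AgentOutput, budget_s: float = 2.0) -> float:
    # 1.0 when within budget, then decays as budget / latency.
    if out.latency_s <= budget_s:
        return 1.0
    return budget_s / out.latency_s
```

Each scorer returns a value in [0, 1] where higher is better, so every axis can be averaged over a dataset and compared against a baseline in the same way.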

GitHub Actions runs the suite on every PR and deploy, publishes a scorecard, and blocks merges that regress beyond a set threshold. This turns responsible-AI guardrails into a continuous engineering process.

  PR / deploy
      │
      ▼
┌──────────────┐    ┌──────────────────────┐
│ GitHub Action│───▶│ LangSmith eval suite │
└──────────────┘    │  success · grounding │
      │             │  quality · latency   │
      │             └──────────┬───────────┘
      ▼                        ▼
  scorecard ◀──────────  pass / block gate
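
The pass / block gate in the diagram can be sketched as a comparison of the candidate's aggregate scores against a baseline. This is an illustrative sketch only: the function names, the metric keys, and the idea of a per-metric allowed drop are assumptions, and it presumes every metric is scored so that higher is better and that both score dictionaries share the same keys.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: dict[str, float],
) -> list[str]:
    """Return one message per metric that fell by more than its allowed margin."""
    failures = []
    for metric, allowed in max_drop.items():
        drop = baseline[metric] - candidate[metric]
        if drop > allowed:
            failures.append(
                f"{metric}: {baseline[metric]:.3f} -> {candidate[metric]:.3f} "
                f"(drop {drop:.3f}, allowed {allowed:.3f})"
            )
    return failures


def render_scorecard(baseline: dict[str, float], candidate: dict[str, float]) -> str:
    """Build a plain-text scorecard with one row per metric."""
    lines = [f"{'metric':<10} {'base':>6} {'new':>6} {'delta':>7}"]
    for metric in sorted(baseline):
        delta = candidate[metric] - baseline[metric]
        lines.append(
            f"{metric:<10} {baseline[metric]:>6.3f} "
            f"{candidate[metric]:>6.3f} {delta:>+7.3f}"
        )
    return "\n".join(lines)
```

In a CI step, the script would print the scorecard and exit with a non-zero status when `find_regressions` returns anything. A failed step can then block the merge if the check is marked as required on the branch.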

Outcome

Every change now ships with an evidence trail. Regressions are caught pre-merge, drift is tracked release-over-release, and non-AI engineers can extend the suite themselves.
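
Release-over-release tracking matters because a per-PR gate can miss slow drift: several small drops can each stay under the threshold while the total keeps growing. A minimal sketch of that bookkeeping follows, assuming a hypothetical history of `(release, score)` pairs for one metric; the function name and row shape are illustrative, not the harness's actual format.

```python
def drift_report(history: list[tuple[str, float]]) -> list[dict]:
    """For each release, report the change since the previous release
    and the cumulative change since the first release in the history."""
    if not history:
        return []
    first = history[0][1]
    prev = first
    rows = []
    for release, score in history:
        rows.append(
            {
                "release": release,
                "score": score,
                "step": round(score - prev, 4),        # vs. previous release
                "cumulative": round(score - first, 4),  # vs. first release
            }
        )
        prev = score
    return rows
```

Alerting on the `cumulative` column rather than `step` is one way to surface a gradual decline that no single PR would trip.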

4 axes: success · grounding · quality · latency
Per-PR: automated scorecards
Drift: tracked release-over-release