Anomaly Detection

Humane watches four behavioral and operational signals per tenant and fires an alert when any of them drifts more than 2σ from its rolling baseline (30 days for the per-user metrics, 24 hours for the tenant-wide ones). It's the engine that turns “nobody noticed until the churn report” into “engineering got a Slack ping within 15 minutes.”

What it watches

The first release ships four metrics, tuned to be sensitive enough to catch real problems but quiet enough to avoid alert fatigue:

Sentiment drop (per end-user): recent 24h average sentiment versus the user's own 30-day baseline. Catches users who are quietly souring on your product before they churn.
Engagement drop (per end-user): interactions in the last 24h versus the user's daily average over 30 days. Flags users who were regulars and suddenly aren't.
Latency spike (tenant-wide): last hour's P50 latency versus the last 24 hourly snapshots. Catches upstream LLM slowness, DB contention, and infrastructure hiccups.
Error-rate spike (tenant-wide): webhook 5xx rate in the last hour versus the last 24 hours. Catches a failing subscriber endpoint before it tanks your alerting pipeline.

Severity — and why we use z-scores

Every detection carries a z_score measuring how many standard deviations the observed value sits from its baseline. Severity maps directly from the absolute z:

text

|z| >= 4.0  →  critical   (+ email to tenant owner)
|z| >= 3.0  →  warning   (webhook only)
|z| >= 2.0  →  info      (dashboard only)
|z|  < 2.0  →  not persisted (normal variation)

Standard-deviation thresholds give you a principled alert cadence — no hand-tuning per-customer. A latency spike on a tenant that normally runs 100 ms ± 50 ms triggers at 200 ms. A tenant that normally runs 100 ms ± 5 ms triggers at 110 ms. Same code, right threshold for each tenant.
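The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual code: the thresholds come from the table, while the function and constant names are made up for this example.

```python
from typing import Optional

# Tiers from the severity table, checked from strongest to weakest.
# SEVERITY_TIERS, z_score and severity_for are illustrative names.
SEVERITY_TIERS = [(4.0, "critical"), (3.0, "warning"), (2.0, "info")]


def z_score(observed: float, baseline_mean: float, baseline_std: float) -> float:
    """Number of baseline standard deviations between observed and the mean."""
    return (observed - baseline_mean) / baseline_std


def severity_for(z: float) -> Optional[str]:
    """Map |z| to a severity tier; None means normal variation (not persisted)."""
    for threshold, name in SEVERITY_TIERS:
        if abs(z) >= threshold:
            return name
    return None


# The two tenants from the paragraph above: both sit exactly at z = 2.0.
noisy_tenant = z_score(200, 100, 50)
steady_tenant = z_score(110, 100, 5)
print(noisy_tenant, severity_for(noisy_tenant))
print(steady_tenant, severity_for(steady_tenant))
```

Both tenants land on the same tier because the comparison is made in units of each tenant's own σ, not in milliseconds.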

Direction matters

Sentiment going up isn't an anomaly; it's great news. The engine only fires on deviations in the “bad” direction for each metric: sentiment/engagement only fire on drops, latency/error-rate only fire on spikes. You won't get paged when things improve.
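One way to express the direction rule is to store a sign per metric and compare the signed z-score against the threshold. The sketch below is an assumption about how such a check could look; apart from sentiment_drop, the metric keys are guessed names, and is_bad_deviation is hypothetical.

```python
# -1 means "a drop is bad", +1 means "a spike is bad".
# Only "sentiment_drop" appears in the documented payload; the other keys are assumed.
BAD_DIRECTION = {
    "sentiment_drop": -1,
    "engagement_drop": -1,
    "latency_spike": +1,
    "error_rate_spike": +1,
}


def is_bad_deviation(metric: str, z: float, min_abs_z: float = 2.0) -> bool:
    """True only when z is at least min_abs_z sigmas away in the bad direction."""
    return z * BAD_DIRECTION[metric] >= min_abs_z


print(is_bad_deviation("sentiment_drop", -2.8))  # a drop in sentiment
print(is_bad_deviation("sentiment_drop", 3.0))   # sentiment improved
print(is_bad_deviation("latency_spike", -2.5))   # latency improved
```

Multiplying by the sign keeps the rule in one place, so adding a metric only requires a new dictionary entry.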

Baselines + sample-size guards

The engine refuses to fire when the baseline is too thin to be trustworthy:

Minimum 20 baseline samples — below this, the stats are too noisy and we skip rather than alarm.
Baseline standard deviation must be > 0 — a flat metric can't have anomalies; dividing by zero σ would produce infinite z-scores.
Webhook error rate needs ≥ 5 deliveries in the window — one failure on a quiet tenant doesn't deserve a 100%-error-rate alarm.

These guards mean brand-new tenants see "baseline too small" in their audit log until real traffic accumulates. Intentional: we'd rather be silent than cry wolf.
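The guards above could be sketched as follows. The limits (20 samples, σ > 0, 5 deliveries) come from the list; the function names, the "flat baseline" reason string, and the choice of sample standard deviation are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple

MIN_BASELINE_SAMPLES = 20   # from the guard list
MIN_WEBHOOK_DELIVERIES = 5  # from the guard list


def guarded_z(observed: float, baseline: Sequence[float]) -> Tuple[Optional[float], str]:
    """Return (z, reason); z is None whenever a guard refuses to score."""
    if len(baseline) < MIN_BASELINE_SAMPLES:
        return None, "baseline too small"
    sigma = stdev(baseline)  # assumption: sample std; the engine may use population std
    if sigma == 0:
        return None, "flat baseline"  # hypothetical reason string
    return (observed - mean(baseline)) / sigma, "ok"


def webhook_error_rate(failures: int, deliveries: int) -> Optional[float]:
    """5xx rate for the window, or None when there are too few deliveries to judge."""
    if deliveries < MIN_WEBHOOK_DELIVERIES:
        return None
    return failures / deliveries


print(guarded_z(0.3, [0.6] * 10))   # too few samples
print(guarded_z(0.3, [0.6] * 20))   # zero variance
print(webhook_error_rate(1, 1))     # one failure on a quiet tenant
```

Returning a reason alongside the score makes it easy to write the skip into an audit log instead of dropping it silently.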

Deduplication

Without dedup, the 15-minute sweep would create a new row on every tick for as long as a problem persists, flooding the dashboard and the webhook. A 2-hour cooldown per (tenant, metric_name, end_user_id) stops the flood:

text

sweep t=0   sentiment_drop on user_42 z=-2.8  → CREATE anomaly
sweep t=15  sentiment_drop on user_42 z=-3.1  → skip (cooldown)
sweep t=30  sentiment_drop on user_42 z=-2.9  → skip (cooldown)
...
sweep t=120 sentiment_drop on user_42 z=-2.7  → CREATE anomaly #2

If a condition is still unresolved after 2 hours, the engine fires again. The new row may carry a different severity (warning → critical), and that shift is itself a signal.
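The cooldown can be pictured as a small gate keyed by (tenant, metric_name, end_user_id). The sketch below keeps state in memory purely for illustration; the real engine presumably checks persisted anomaly rows, and CooldownGate is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Tuple

COOLDOWN = timedelta(hours=2)  # from the text above
Key = Tuple[str, str, Optional[str]]  # (tenant, metric_name, end_user_id)


class CooldownGate:
    """Allows at most one anomaly per key per cooldown window."""

    def __init__(self) -> None:
        self._last_created: Dict[Key, datetime] = {}

    def should_create(self, key: Key, now: datetime) -> bool:
        last = self._last_created.get(key)
        if last is not None and now - last < COOLDOWN:
            return False  # still cooling down: skip this sweep
        self._last_created[key] = now
        return True


# Replay the timeline above: sweeps at t = 0, 15, 30 and 120 minutes.
gate = CooldownGate()
start = datetime(2026, 4, 25, 10, 0, tzinfo=timezone.utc)
key = ("tenant-uuid", "sentiment_drop", "user_42")
for minutes in (0, 15, 30, 120):
    created = gate.should_create(key, start + timedelta(minutes=minutes))
    print(f"t={minutes:<3} -> {'CREATE' if created else 'skip (cooldown)'}")
```

Because end_user_id is part of the key, a second user with the same problem is not suppressed by the first user's cooldown.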

How alerts reach you

Dashboard: Engine Monitor renders the detector's last 7 days with severity badges, baseline stats, and a one-click “Acknowledge” button.
Webhook event anomaly.detected: fires on every severity level. Payload includes metric name, z-score, observed + baseline, and the scope (system-wide or specific end-user id). Wire to Slack, PagerDuty, Opsgenie, whatever.
Email to tenant owner: only for critical severity. Controlled by your notify_audit_warnings preference, with one-click unsubscribe per RFC 8058.

Webhook payload shape

json

{
  "id": "5b8d3c... (anomaly UUID)",
  "metric": "sentiment_drop",
  "severity": "warning",
  "z_score": -3.44,
  "observed_value": 0.31,
  "baseline_mean": 0.62,
  "baseline_std": 0.09,
  "sample_size": 420,
  "description": "Sentiment drop: observed 0.310 vs baseline 0.620 ± 0.090 (z=-3.44, n=420)",
  "end_user_id": "user_42",          // null when system-wide
  "user_id": "tenant-uuid",
  "detected_at": "2026-04-25T10:45:12Z"
}
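A handler that forwards these events to chat might reduce the payload to one line. The sketch below only uses field names shown in the payload above; format_alert and the sample values are illustrative, not part of Humane.

```python
import json
from typing import Any, Dict


def format_alert(payload: Dict[str, Any]) -> str:
    """Turn an anomaly.detected payload into a one-line chat message."""
    # end_user_id is null (None after JSON decoding) for system-wide anomalies.
    scope = payload.get("end_user_id") or "system-wide"
    return (
        f"[{payload['severity'].upper()}] {payload['metric']} ({scope}): "
        f"observed {payload['observed_value']:g} vs baseline "
        f"{payload['baseline_mean']:g} ± {payload['baseline_std']:g}, "
        f"z={payload['z_score']:+.2f}"
    )


# Made-up system-wide event, for illustration only.
example = json.loads("""
{
  "metric": "latency_spike",
  "severity": "critical",
  "z_score": 4.5,
  "observed_value": 1900,
  "baseline_mean": 1000,
  "baseline_std": 200,
  "end_user_id": null
}
""")
print(format_alert(example))
```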

Same HMAC signing as every other Humane webhook

Every anomaly.detected event is signed with your webhook's HMAC-SHA256 secret. Verify it with the same X-Humane-Signature parser as your other handlers; see the Webhooks guide.
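For readers without an existing handler, a generic HMAC-SHA256 check looks roughly like this. It assumes the X-Humane-Signature header carries a bare hex digest of the raw request body, which this page does not confirm; the Webhooks guide is the authority on the real header format, and verify_signature is a hypothetical helper.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, signature_header: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw body and compare in constant time.

    Assumption: the header value is a plain hex digest. Adjust the parsing if
    the Webhooks guide specifies a prefix or a timestamped format.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_header.strip().encode())


# Self-check with a made-up secret and body.
secret = b"whsec_example"
body = b'{"metric": "sentiment_drop"}'
good_signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
print(verify_signature(secret, body, good_signature))
print(verify_signature(secret, body + b" ", good_signature))
```

Always verify against the raw bytes as received; re-serializing the parsed JSON can change whitespace and break the digest.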

REST API

Dashboards, Slack bots, or on-call triage flows can query anomalies directly. All endpoints are tenant-scoped via the API key.

http

GET  /api/anomalies?severity=warning&since_hours=24&resolved=false
GET  /api/anomalies/stats?since_hours=168
GET  /api/anomalies/{id}
POST /api/anomalies/{id}/resolve   { "note": "upstream LLM recovered" }
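As a rough illustration, the first and last calls above could be built with the Python standard library as shown below. The host name and the Bearer authorization scheme are assumptions (this page only says requests are scoped by API key), and both helper functions are hypothetical. The sketch builds the requests without sending them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://humane.example.com"  # hypothetical host


def list_anomalies_request(api_key: str, severity: str = "warning",
                           since_hours: int = 24, resolved: bool = False) -> Request:
    """Build GET /api/anomalies with the documented query parameters."""
    query = urlencode({
        "severity": severity,
        "since_hours": since_hours,
        "resolved": str(resolved).lower(),  # "true" / "false"
    })
    return Request(
        f"{BASE_URL}/api/anomalies?{query}",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        method="GET",
    )


def resolve_request(api_key: str, anomaly_id: str, note: str) -> Request:
    """Build POST /api/anomalies/{id}/resolve with a JSON note."""
    return Request(
        f"{BASE_URL}/api/anomalies/{anomaly_id}/resolve",
        data=json.dumps({"note": note}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = list_anomalies_request("YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Pass the resulting object to urllib.request.urlopen to actually send it once the host and authorization header match your deployment.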

Retention

Anomaly rows live for 90 days then are pruned by a nightly sweep (04:19 UTC). Resolved or not — 90 days is plenty for postmortems and audit review. If you need longer retention (regulated industries sometimes do), export via the API to your own data warehouse.

Tuning thresholds

The defaults — 20 samples, z≥2/3/4 tiers, 2-hour cooldown, 15-minute sweep — are intentionally conservative. If you find the engine too quiet, contact support and we'll adjust per-tenant. In the open-source community tier, edit app/anomaly_detector.py directly — the tunables are constants at the top of the file.

Philosophy — silence over noise

Every missed anomaly costs you one user who could have been saved. Every false alarm costs you a tiny piece of operator trust. We err hard toward “quiet unless the signal is strong” because rebuilding trust in an alerting system takes months.