THE PRIMARY METRIC · PLATFORM IMPACT

The one number that measures AI interaction quality.

Behavior Score combines five behavioral signals into a single 0-100 composite. It measures the thing your product team actually cares about: whether users come back, go deeper, and feel heard.


ILLUSTRATIVE · REFERENCE DATA

From 42 to 78 in four weeks.

A typical AI product starts at 35-45 on Behavior Score — instant replies, no memory, generic tone. After wiring in Humane's behavioral engines, the composite climbs as retention recovers, depth grows, and proactive nudges start earning replies.

Week 1 baseline: 42
Week 2 (memory on): 58
Week 3 (safety gates): 68
Week 4 (proactive): 78

THE FIVE COMPONENTS

Five behavioral signals. One weighted composite.

Retention Delta (Day 7)

WEIGHT 30%

What % of new users return in the week after first use. The single clearest signal that the AI interaction is valuable enough to come back to.

BASELINE: 30-40% for typical LLM apps
EXCEPTIONAL: 70%+ means the product is habit-forming

CONTRIBUTES UP TO 30 OF 100
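As a rough illustration of the retention definition above, here is a minimal Python sketch. The function name `day7_retention`, the input shapes, and the rule that any later session inside the seven-day window counts as a return are assumptions made for this example, not Humane's published implementation.

```python
from datetime import datetime, timedelta

def day7_retention(first_seen, later_sessions, window_days=7):
    """Percent of new users with at least one later session within the window.

    first_seen:     dict user_id -> datetime of first use
    later_sessions: dict user_id -> list of datetimes of subsequent sessions
    (Both input shapes are hypothetical.)
    """
    if not first_seen:
        return 0.0
    window = timedelta(days=window_days)
    returned = 0
    for user, start in first_seen.items():
        # A return counts only if it is after first use and inside the window.
        if any(start < ts <= start + window for ts in later_sessions.get(user, [])):
            returned += 1
    return 100.0 * returned / len(first_seen)

t0 = datetime(2024, 1, 1, 9, 0)
first_seen = {"a": t0, "b": t0, "c": t0, "d": t0}
later_sessions = {"a": [t0 + timedelta(days=2)], "b": [t0 + timedelta(days=10)], "c": []}
print(day7_retention(first_seen, later_sessions))  # 25.0
```

Only user "a" comes back inside the window, so one of four new users is retained.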

Conversation Depth

WEIGHT 25%

Average messages per session. Shallow conversations mean users hit a ceiling fast — deep ones mean the AI knows enough to stay useful.

BASELINE: 3-5 messages per session without behavioral memory
EXCEPTIONAL: 10+ messages signals sustained engagement

CONTRIBUTES UP TO 25 OF 100
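Average messages per session can be sketched in a few lines of Python. The function name and the input (one session id per logged message) are hypothetical, and whether AI replies count as messages is left to whoever builds the log.

```python
from collections import Counter

def conversation_depth(message_session_ids):
    """Average messages per session, given one session id per logged message."""
    per_session = Counter(message_session_ids)
    if not per_session:
        return 0.0
    return sum(per_session.values()) / len(per_session)

# Three sessions with 4, 2, and 6 messages.
log = ["s1"] * 4 + ["s2"] * 2 + ["s3"] * 6
print(conversation_depth(log))  # 4.0
```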

Proactive Response Rate

WEIGHT 20%

Of the AI-initiated nudges sent (reminders, check-ins, follow-ups), what % get a reply within 2 hours. Measures whether the AI is earning attention without being prompted.

BASELINE: N/A — most LLM apps never initiate. 0 by default.
EXCEPTIONAL: 50%+ reply rate is industry-leading

CONTRIBUTES UP TO 20 OF 100
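The two-hour reply window described above could be computed along these lines. This is a sketch under assumptions: the function name, the `(sent_at, replied_at)` pair format, and treating a reply at exactly two hours as inside the window are all choices made for the example.

```python
from datetime import datetime, timedelta

def proactive_response_rate(nudges, reply_window=timedelta(hours=2)):
    """Percent of AI-initiated nudges that got a reply within `reply_window`.

    nudges: list of (sent_at, replied_at) pairs; replied_at is None if no reply.
    """
    if not nudges:
        return 0.0  # no nudges sent, so the component is 0 by default
    answered = sum(
        1
        for sent_at, replied_at in nudges
        if replied_at is not None
        and timedelta(0) <= replied_at - sent_at <= reply_window
    )
    return 100.0 * answered / len(nudges)

sent = datetime(2024, 1, 1, 12, 0)
nudges = [
    (sent, sent + timedelta(minutes=30)),  # replied in time
    (sent, sent + timedelta(hours=3)),     # replied too late
    (sent, None),                          # never replied
    (sent, sent + timedelta(hours=2)),     # exactly on the boundary, counted
]
print(proactive_response_rate(nudges))  # 50.0
```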

Abandonment (inverse)

WEIGHT 15%

% of engaged users who stopped mid-conversation and haven't returned in the window. Lower is better — the score inverts this so it contributes positively.

BASELINE: 40-60% abandonment is standard
EXCEPTIONAL: <20% abandonment is category-leading

CONTRIBUTES UP TO 15 OF 100
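One simple way to read "inverts" is 100 minus the abandonment percentage. The sketch below assumes exactly that; the page does not spell out the inversion, and the published normalization against baselines may differ. How "engaged" users and the return window are defined is left to the caller.

```python
def abandonment_inverse(abandoned_users, engaged_users):
    """Return 100 minus the abandonment percentage, so less abandonment scores higher."""
    if engaged_users <= 0:
        raise ValueError("engaged_users must be positive")
    abandonment_pct = 100.0 * abandoned_users / engaged_users
    return 100.0 - abandonment_pct

print(abandonment_inverse(20, 100))  # 80.0
print(abandonment_inverse(50, 100))  # 50.0
```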

Sentiment Trend

WEIGHT 10%

Rolling 7-day average of user message sentiment (0-1). Captures whether the interaction is getting better or worse for real users, rather than only how much they use it.

BASELINE: ~0.5 (flat) for typical apps
EXCEPTIONAL: 0.85+ means the relationship is warming

CONTRIBUTES UP TO 10 OF 100
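A rolling 7-day average can be sketched as a plain mean over the messages that fall inside the window. The function name, the `(day, score)` input format, and averaging per message rather than per day are assumptions for illustration. The sentiment scores themselves are assumed to come from some upstream classifier.

```python
from datetime import date, timedelta

def rolling_sentiment(messages, as_of, window_days=7):
    """Mean sentiment (0-1) of user messages in the `window_days` ending on `as_of`.

    messages: list of (day, score) pairs, one per user message.
    Returns None when the window contains no messages.
    """
    start = as_of - timedelta(days=window_days - 1)
    scores = [score for day, score in messages if start <= day <= as_of]
    if not scores:
        return None
    return sum(scores) / len(scores)

messages = [
    (date(2024, 1, 1), 0.25),  # outside the window ending Jan 8
    (date(2024, 1, 5), 0.5),
    (date(2024, 1, 7), 0.75),
    (date(2024, 1, 8), 1.0),
]
print(rolling_sentiment(messages, as_of=date(2024, 1, 8)))  # 0.75
```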

HOW IT'S COMPUTED

A transparent weighted sum. No black box.

score = 0.30 × retention_delta_7d
      + 0.25 × conversation_depth
      + 0.20 × proactive_response_rate
      + 0.15 × abandonment_inverse
      + 0.10 × sentiment_trend

Every component is first normalized to 0-100 using published baselines. We publish the weights because investors and CTOs deserve to audit the math. Weights are not tuned per customer — you can benchmark apples to apples across any Humane-powered product.
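To make the arithmetic concrete, here is a minimal Python sketch of the weighted sum. It uses the weights and component names from the formula above and assumes every input has already been normalized to 0-100. The function name `behavior_score` and the rounding to one decimal are choices made for this example.

```python
WEIGHTS = {
    "retention_delta_7d": 0.30,
    "conversation_depth": 0.25,
    "proactive_response_rate": 0.20,
    "abandonment_inverse": 0.15,
    "sentiment_trend": 0.10,
}

def behavior_score(components):
    """Weighted sum of five components, each already normalized to 0-100."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= components[name] <= 100:
            raise ValueError(f"{name} must be within 0-100")
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return round(total, 1)  # rounding is an illustrative choice

perfect = {name: 100 for name in WEIGHTS}
print(behavior_score(perfect))  # 100.0

# With no proactive messages, that component is 0 and the ceiling is 80.
no_proactive = dict(perfect, proactive_response_rate=0)
print(behavior_score(no_proactive))  # 80.0
```

Because the weights sum to 1.0, the composite stays on the same 0-100 scale as its inputs.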

Refreshed hourly
Scoped to your tenant
Exportable as JSON/CSV
A/B-testable per variant

BENCHMARK TIERS

Where does your score land?

Range	Tier	What it means
0–30	Poor	Users bounce; AI feels broken.
30–50	Below Average	Signals exist but most users leave.
50–70	Industry Average	Matches typical LLM app baselines.
70–85	Significant Improvement	Users stay, come back, engage.
85–100	Exceptional	Category-leading retention and depth.

The median unoptimized LLM app scores around 42. Most well-tuned products land in the high 60s. Crossing 85 typically signals an interaction so good users describe it unprompted to others.
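The tier table can be expressed as a small lookup. The ranges in the table share their endpoints (30, 50, 70, 85), so this sketch assumes a boundary score belongs to the higher tier. That rule, and the function name `tier_for`, are assumptions rather than documented behavior.

```python
# (lower bound, tier name), checked from the top tier down.
TIERS = [
    (85, "Exceptional"),
    (70, "Significant Improvement"),
    (50, "Industry Average"),
    (30, "Below Average"),
    (0, "Poor"),
]

def tier_for(score):
    """Map a 0-100 Behavior Score to its benchmark tier."""
    if not 0 <= score <= 100:
        raise ValueError("score must be within 0-100")
    for lower_bound, name in TIERS:
        if score >= lower_bound:
            return name

print(tier_for(42))  # Below Average
print(tier_for(78))  # Significant Improvement
print(tier_for(85))  # Exceptional
```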

ANTICIPATED QUESTIONS

The questions we hear most.

How is this different from NPS or CSAT?
NPS and CSAT measure what users say. Behavior Score measures what users do. If someone rates your AI 10/10 but never returns, NPS says great and Behavior Score says the relationship is dying. The gap is usually where the truth lives.

Can I A/B test against Behavior Score?
Yes. Every Humane experiment reports per-variant Behavior Score with statistical significance. Growth and Scale tiers include this in the dashboard; Community users can pull the raw per-variant metrics via the SDK.

How often is it calculated?
The dashboard score refreshes hourly. The API endpoint computes on demand over a configurable window (default 7 days). Trends are smoothed across rolling windows to avoid noise from individual outliers.

Why weighted sum instead of a neural net?
Because investors, CTOs, and compliance officers need to audit the number. A weighted sum is transparent; a neural net on five inputs is theatre. The weights come from reviewing which components had the strongest correlation with 30-day retention across the pilot cohort.

What if my app has no proactive messages?
The proactive_response_rate component contributes zero. Your ceiling drops from 100 to 80. Most Humane customers turn on proactive within the first two weeks once they see how much headroom that 20% contribution represents.

My app is brand new. What do I see?
The score publishes once you have at least 10 end users, 100 messages, and 14 days of history. Before that, the dashboard shows a progress bar with what's left until the metric becomes trustworthy. We'd rather show an honest 'not yet' than flatter with thin data.
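The publishing thresholds for new apps (10 end users, 100 messages, 14 days of history) amount to a simple gate. The sketch below shows one way such a check and its "what's left" summary could work. The function name and return shape are hypothetical and do not describe the dashboard's actual logic.

```python
MIN_USERS = 10
MIN_MESSAGES = 100
MIN_DAYS = 14

def publish_progress(users, messages, days_of_history):
    """Return (ready, remaining), where remaining lists what is still needed."""
    remaining = {
        "users": max(0, MIN_USERS - users),
        "messages": max(0, MIN_MESSAGES - messages),
        "days": max(0, MIN_DAYS - days_of_history),
    }
    ready = all(value == 0 for value in remaining.values())
    return ready, remaining

print(publish_progress(12, 40, 14))
# (False, {'users': 0, 'messages': 60, 'days': 0})
print(publish_progress(10, 100, 14))
# (True, {'users': 0, 'messages': 0, 'days': 0})
```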

GO MEASURE

See your product's real Behavior Score in <10 minutes.

Works with any LLM — OpenAI, Anthropic, Bedrock, Gemini, local
No code changes beyond one API call wrapping your existing client
Self-hosted community tier is free forever
Honest empty-state: we tell you when data is still thin

