THE PRIMARY METRIC · PLATFORM IMPACT

The one number that measures AI interaction quality.

Behavior Score combines five behavioral signals into a single 0-100 composite. It measures the thing your product team actually cares about: whether users come back, go deeper, and feel heard.

ILLUSTRATIVE · REFERENCE DATA

From 42 to 78 in four weeks.

A typical AI product starts at 35-45 on Behavior Score — instant replies, no memory, generic tone. After wiring in Humane's behavioral engines, the composite climbs as retention recovers, depth grows, and proactive nudges start earning replies.

Week 1 (baseline): 42
Week 2 (memory on): 58
Week 3 (safety gates): 68
Week 4 (proactive): 78
BEHAVIOR SCORE: 78

THE FIVE COMPONENTS

Five behavioral signals. One weighted composite.

01

Retention Delta (Day 7)

WEIGHT 30%

What % of new users return in the week after first use. The single clearest signal that the AI interaction is valuable enough to come back to.

BASELINE: 30-40% for typical LLM apps
EXCEPTIONAL: 70%+ means the product is habit-forming
CONTRIBUTES UP TO 30 OF 100
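As a rough illustration of the return-rate measurement behind this component, here is a minimal sketch. It assumes a "return" means any activity on days 1 through 7 after the first-use date; the function name and input shapes are hypothetical, and how the raw rate becomes a "delta" is not specified here, so the sketch stops at the raw percentage.

```python
from datetime import date, timedelta


def day7_return_rate(first_use: dict[str, date],
                     activity: dict[str, list[date]]) -> float:
    """Percent of new users with any activity 1-7 days after first use.

    Assumption: same-day activity does not count as a return.
    """
    if not first_use:
        return 0.0
    returned = 0
    for user, start in first_use.items():
        window_end = start + timedelta(days=7)
        # Count the user once if any activity date falls in (start, start + 7d].
        if any(start < day <= window_end for day in activity.get(user, [])):
            returned += 1
    return 100.0 * returned / len(first_use)
```

With four new users of whom two come back inside the window, the function returns 50.0, which sits above the 30-40% baseline quoted for typical LLM apps.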
02

Conversation Depth

WEIGHT 25%

Average messages per session. Shallow conversations mean users hit a ceiling fast — deep ones mean the AI knows enough to stay useful.

BASELINE: 3-5 messages per session without behavioral memory
EXCEPTIONAL: 10+ messages signals sustained engagement
CONTRIBUTES UP TO 25 OF 100
03

Proactive Response Rate

WEIGHT 20%

Of the AI-initiated nudges sent (reminders, check-ins, follow-ups), what % get a reply within 2 hours. Measures whether the AI is earning attention without being prompted.

BASELINE: N/A. Most LLM apps never initiate, so 0 by default.
EXCEPTIONAL: 50%+ reply rate is industry-leading
CONTRIBUTES UP TO 20 OF 100
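The 2-hour reply window can be sketched as follows. This is an illustrative calculation, not the engine's code: the function name and the (sent, reply) tuple shape are assumptions, and treating a reply at exactly 2 hours as in-window is a choice made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional

REPLY_WINDOW = timedelta(hours=2)


def proactive_response_rate(
    nudges: list[tuple[datetime, Optional[datetime]]]
) -> float:
    """Percent of AI-initiated nudges that got a reply within 2 hours.

    Each item is (sent_at, replied_at); replied_at is None if no reply.
    """
    if not nudges:
        return 0.0  # nothing sent: consistent with the "0 by default" baseline
    replied = sum(
        1
        for sent_at, replied_at in nudges
        if replied_at is not None
        and timedelta(0) <= replied_at - sent_at <= REPLY_WINDOW
    )
    return 100.0 * replied / len(nudges)
```

An app that sends no nudges returns 0.0, which is why this component contributes nothing until proactive messaging is switched on.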
04

Abandonment (inverse)

WEIGHT 15%

% of engaged users who stopped mid-conversation and haven't returned in the window. Lower is better — the score inverts this so it contributes positively.

BASELINE: 40-60% abandonment is standard
EXCEPTIONAL: <20% abandonment is category-leading
CONTRIBUTES UP TO 15 OF 100
05

Sentiment Trend

WEIGHT 10%

Rolling 7-day average of user message sentiment (0-1). Captures whether the interaction is getting better or worse for real users, not just whether usage is going up.

BASELINE: ~0.5 (flat) for typical apps
EXCEPTIONAL: 0.85+ means the relationship is warming
CONTRIBUTES UP TO 10 OF 100
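A rolling 7-day average of this kind might look like the sketch below. It assumes the input is one mean sentiment value per day, already on the 0-1 scale, and that early days average over whatever history exists so far; both are assumptions for illustration, as is the function name.

```python
def rolling_sentiment(daily: list[float], window: int = 7) -> list[float]:
    """Trailing mean of daily sentiment values over `window` days.

    Assumption: before `window` days exist, average over the days available.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out: list[float] = []
    running = 0.0
    for i, value in enumerate(daily):
        running += value
        if i >= window:
            running -= daily[i - window]  # drop the day that left the window
        out.append(running / min(i + 1, window))
    return out
```

A week at 0.5 followed by a week at 0.85 produces a trend that climbs day by day from the flat baseline toward the exceptional mark, rather than jumping on a single good day.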

HOW IT'S COMPUTED

A transparent weighted sum. No black box.

score = 0.30 × retention_delta_7d
      + 0.25 × conversation_depth
      + 0.20 × proactive_response_rate
      + 0.15 × abandonment_inverse
      + 0.10 × sentiment_trend

Every component is first normalized to 0-100 using published baselines. We publish the weights because investors and CTOs deserve to audit the math. Weights are not tuned per customer — you can benchmark apples to apples across any Humane-powered product.
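To make the arithmetic concrete, here is a minimal sketch of the weighted sum using the published weights. It assumes each component has already been normalized to 0-100 upstream (the normalization step itself is not reproduced here), and the function name, the clamping, and the rounding to one decimal are illustrative choices.

```python
WEIGHTS = {
    "retention_delta_7d": 0.30,
    "conversation_depth": 0.25,
    "proactive_response_rate": 0.20,
    "abandonment_inverse": 0.15,
    "sentiment_trend": 0.10,
}


def behavior_score(components: dict[str, float]) -> float:
    """Weighted sum of five components, each expected on a 0-100 scale."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Clamp defensively so a bad input cannot push the score out of range.
        value = min(100.0, max(0.0, components[name]))
        total += weight * value
    return round(total, 1)


example = {
    "retention_delta_7d": 50,
    "conversation_depth": 40,
    "proactive_response_rate": 0,
    "abandonment_inverse": 60,
    "sentiment_trend": 50,
}
print(behavior_score(example))  # 15 + 10 + 0 + 9 + 5 = 39.0
```

Because the weights sum to 1.0, five components at 100 give exactly 100, and each component can add at most its weight in points.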

Refreshed hourly · Scoped to your tenant · Exportable as JSON/CSV · A/B-testable per variant

BENCHMARK TIERS

Where does your score land?

Range    Tier                     What it means
0-30     Poor                     Users bounce; AI feels broken.
30-50    Below Average            Signals exist but most users leave.
50-70    Industry Average         Matches typical LLM app baselines.
70-85    Significant Improvement  Users stay, come back, engage.
85-100   Exceptional              Category-leading retention and depth.
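The tier lookup can be sketched in a few lines. The published ranges share their boundary values (30, 50, 70, 85), so this example assumes a score that lands exactly on a boundary belongs to the higher tier; that rule and the function name are assumptions, not documented behavior.

```python
from bisect import bisect_right

# Lower bounds of every tier after the first, in ascending order.
BOUNDS = [30, 50, 70, 85]
TIERS = [
    "Poor",
    "Below Average",
    "Industry Average",
    "Significant Improvement",
    "Exceptional",
]


def tier_for(score: float) -> str:
    """Map a 0-100 Behavior Score to its benchmark tier."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # bisect_right counts how many bounds are <= score, which is the tier index.
    return TIERS[bisect_right(BOUNDS, score)]


print(tier_for(42))  # Below Average
print(tier_for(78))  # Significant Improvement
```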

The median unoptimized LLM app scores around 42. Most well-tuned products land in the high 60s. Crossing 85 typically signals an interaction so good that users describe it to others unprompted.

ANTICIPATED QUESTIONS

The questions we hear most.

How is this different from NPS or CSAT?

NPS and CSAT measure what users say. Behavior Score measures what users do. If someone rates your AI 10/10 but never returns, NPS says great and Behavior Score says the relationship is dying. The gap is usually where the truth lives.

Can I A/B test against Behavior Score?

Yes. Every Humane experiment reports per-variant Behavior Score with statistical significance. Growth and Scale tiers include this in the dashboard; Community users can pull the raw per-variant metrics via the SDK.

How often is it calculated?

The dashboard score refreshes hourly. The API endpoint computes on demand over a configurable window (default 7 days). Trends are smoothed across rolling windows to avoid noise from individual outliers.

Why weighted sum instead of a neural net?

Because investors, CTOs, and compliance officers need to audit the number. A weighted sum is transparent; a neural net on five inputs is theatre. The weights come from reviewing which components had the strongest correlation with 30-day retention across the pilot cohort.

What if my app has no proactive messages?

The proactive_response_rate component contributes zero. Your ceiling drops from 100 to 80. Most Humane customers turn on proactive within the first two weeks once they see how much headroom that 20% contribution represents.
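The ceiling follows directly from the weights. A small sketch, using the weights as whole percentage points to keep the arithmetic exact (the function name is hypothetical):

```python
# Published weights, expressed as the maximum points each component can add.
WEIGHT_POINTS = {
    "retention_delta_7d": 30,
    "conversation_depth": 25,
    "proactive_response_rate": 20,
    "abandonment_inverse": 15,
    "sentiment_trend": 10,
}


def score_ceiling(active: set[str]) -> int:
    """Highest reachable score when only `active` components can contribute."""
    return sum(pts for name, pts in WEIGHT_POINTS.items() if name in active)


without_proactive = set(WEIGHT_POINTS) - {"proactive_response_rate"}
print(score_ceiling(without_proactive))  # 80
```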

My app is brand new — what do I see?

The score publishes once you have at least 10 end users, 100 messages, and 14 days of history. Before that, the dashboard shows a progress bar with what's left until the metric becomes trustworthy. We'd rather show an honest 'not yet' than flatter you with thin data.
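The publishing gate described above could be checked along these lines. The three thresholds come from the answer itself; the `TenantStats` type, the field names, and the `readiness` function are hypothetical names for the sketch.

```python
from dataclasses import dataclass

MIN_END_USERS = 10
MIN_MESSAGES = 100
MIN_DAYS_OF_HISTORY = 14


@dataclass
class TenantStats:
    end_users: int
    messages: int
    days_of_history: int


def readiness(stats: TenantStats) -> tuple[bool, dict[str, int]]:
    """Return (ready, remaining), where remaining is what is still needed."""
    remaining = {
        "end_users": max(0, MIN_END_USERS - stats.end_users),
        "messages": max(0, MIN_MESSAGES - stats.messages),
        "days_of_history": max(0, MIN_DAYS_OF_HISTORY - stats.days_of_history),
    }
    ready = all(value == 0 for value in remaining.values())
    return ready, remaining
```

A tenant with 12 users, 80 messages, and 14 days of history is not ready yet: the `remaining` dictionary reports 20 messages still to go, which is the kind of figure a progress bar could display.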

GO MEASURE

See your product's real Behavior Score in <10 minutes.

  • Works with any LLM — OpenAI, Anthropic, Bedrock, Gemini, local
  • No code changes beyond one API call wrapping your existing client
  • Self-hosted community tier is free forever
  • Honest empty-state: we tell you when data is still thin