Concepts

Safety Gate

The Safety Gate is a pre-LLM classifier. It evaluates every incoming message before any LLM call, returns a PROCEED / HOLD / BLOCK decision, and exposes exactly which phrases drove the verdict. Blocks fire a webhook to a human, not into a void.

Why before the LLM?

If you only moderate after the LLM responds, you've already spent tokens and (worse) the model has already entered a problematic state. Humane's gate runs on the incoming message plus the user's behavioral profile, so BLOCK decisions short-circuit the whole LLM call.

The three actions

  • PROCEED — normal flow, LLM is called.
  • HOLD — surfaces the message to a clinician / operator but does not block. Use for ambiguous language.
  • BLOCK — refuse the interaction, fire safety.blocked, return a policy message. LLM is never called.
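
The three actions map onto a small dispatch in the calling application. The sketch below is a minimal illustration, assuming the gate decision has already been parsed into a dict shaped like the `safety` object shown in this document. `handle_message`, `call_llm`, `notify_operator`, and `POLICY_MESSAGE` are hypothetical names chosen for the example, not part of Humane's API, and the gate may well surface HOLD messages on its own.

```python
from typing import Callable

# Hypothetical wording; the real policy message is not given in this document.
POLICY_MESSAGE = "This message can't be processed. A human has been notified."


def handle_message(
    message: str,
    safety: dict,
    call_llm: Callable[[str], str],
    notify_operator: Callable[[str, dict], None],
) -> str:
    """Route one message according to the gate's action."""
    action = safety["action"]
    if action == "BLOCK":
        # BLOCK short-circuits: the LLM is never called.
        return POLICY_MESSAGE
    if action == "HOLD":
        # HOLD surfaces the message to a human but does not stop the flow.
        notify_operator(message, safety)
    elif action != "PROCEED":
        raise ValueError(f"unknown safety action: {action!r}")
    return call_llm(message)
```

Checking BLOCK first keeps the "LLM is never called" guarantee obvious at a glance: every path that reaches `call_llm` has passed the gate.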

How it decides

The gate uses an LLM-powered classifier with a heuristic fallback (category patterns + safe-phrase overrides to prevent false positives). You can inspect exactly what matched:

json
"safety": {
  "action": "BLOCK",
  "risk_score": 0.92,
  "flags": ["self_harm"]
},
"analysis": {
  "safety_matched": [
    { "phrase": "kill myself", "category": "self_harm" }
  ],
  "safety_override": null,
  "safety_confidence": "heuristic"
}

Explainability is the point

Raw LLM moderation APIs return a boolean. Humane returns the list of exact phrases that triggered the block, plus the category hierarchy they matched. Compliance officers love this.
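
To show how the matched phrases can feed an audit trail, the sketch below turns a response into a one-line explanation. It assumes the `safety` and `analysis` fields sit at the top level of a JSON object, which this document does not confirm; `explain` is a hypothetical helper name.

```python
import json


def explain(response: dict) -> str:
    """Build a one-line, human-readable summary of a gate verdict."""
    safety = response["safety"]
    analysis = response.get("analysis", {})
    matched = analysis.get("safety_matched") or []
    phrases = ", ".join(f"'{m['phrase']}' ({m['category']})" for m in matched)
    line = f"{safety['action']} (risk {safety['risk_score']:.2f})"
    if phrases:
        line += f": matched {phrases}"
    override = analysis.get("safety_override")
    if override:
        line += f"; override: {override}"
    return line


# Sample values copied from the response excerpt in this document.
sample = json.loads("""
{
  "safety": {"action": "BLOCK", "risk_score": 0.92, "flags": ["self_harm"]},
  "analysis": {
    "safety_matched": [{"phrase": "kill myself", "category": "self_harm"}],
    "safety_override": null,
    "safety_confidence": "heuristic"
  }
}
""")
print(explain(sample))  # BLOCK (risk 0.92): matched 'kill myself' (self_harm)
```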

Built-in categories

The default heuristic includes:

  • self_harm — explicit + euphemistic patterns
  • violence — credible threats and plans
  • illegal — hacking, money laundering, counterfeiting
  • pii_extraction — SSN, credit cards, passwords, etc.
  • sexual_minors — zero tolerance

Each vertical template tightens or loosens these defaults. For example, the therapy-coach template sets self_harm sensitivity to 0.95 and enables mandatory_escalation.

False-positive overrides

"I killed it on the test!" should not trigger a block. The heuristic layer runs a safe-overrides pass that neutralises known idiomatic uses. When an override fires you'll see it in the response:

json
"analysis": {
  "safety_override": "'killed it' recognised as a benign idiom"
}
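
One way a safe-overrides pass could work is to blank out known idioms before the category patterns run. The sketch below illustrates that idea only. The pattern tables are deliberately crude and hypothetical, since Humane's real lists and matching logic are not shown in this document.

```python
import re

# Hypothetical pattern tables for illustration only.
CATEGORY_PATTERNS = {
    "self_harm": [r"\bkill myself\b"],
    "violence": [r"\bkilled\b"],
}
SAFE_OVERRIDES = {
    r"\bkilled it\b": "'killed it' recognised as a benign idiom",
}


def heuristic_check(message: str) -> dict:
    """Return matched phrases and any override note for one message."""
    text = message.lower()
    override = None
    # Pass 1: blank out known-benign idioms so they cannot match a category.
    for pattern, note in SAFE_OVERRIDES.items():
        if re.search(pattern, text):
            override = note
            text = re.sub(pattern, " ", text)
    # Pass 2: collect category matches on whatever text remains.
    matched = []
    for category, patterns in CATEGORY_PATTERNS.items():
        for pattern in patterns:
            for hit in re.finditer(pattern, text):
                matched.append({"phrase": hit.group(0), "category": category})
    return {"safety_matched": matched, "safety_override": override}
```

With these tables, `heuristic_check("I killed it on the test!")` returns no matches and reports the override, while a message containing "kill myself" still matches `self_harm`.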

Policy-level overrides

Your own YAML policies can override the gate decision, e.g. force HOLD on messages that touch PII even if the heuristic said PROCEED:

yaml
version: 1
rules:
  - name: pii_guard
    priority: 98
    when:
      message_contains: ["my ssn", "social security", "credit card"]
    actions:
      - type: override_safety
        action: HOLD
      - type: fire_event
        event: proactive.pii_attempt
        cooldown: 1m
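
To make the rule's effect concrete, the sketch below evaluates it against a message in plain Python. The rule is written as a dict mirroring the YAML above. The matching semantics are assumptions: case-insensitive substring matching for `message_contains`, higher `priority` evaluated first, and the first `override_safety` winning. `cooldown` is ignored, and `apply_rules` is a hypothetical name.

```python
# The pii_guard rule from the YAML above, as a Python dict.
RULES = [
    {
        "name": "pii_guard",
        "priority": 98,
        "when": {"message_contains": ["my ssn", "social security", "credit card"]},
        "actions": [
            {"type": "override_safety", "action": "HOLD"},
            {"type": "fire_event", "event": "proactive.pii_attempt", "cooldown": "1m"},
        ],
    }
]


def apply_rules(message: str, gate_action: str, rules: list) -> tuple:
    """Return (final_action, fired_events) after applying policy rules."""
    text = message.lower()
    action, events, overridden = gate_action, [], False
    # Assumption: higher priority runs first and the first override wins.
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        needles = rule["when"].get("message_contains", [])
        if not any(needle in text for needle in needles):
            continue
        for act in rule["actions"]:
            if act["type"] == "override_safety" and not overridden:
                action, overridden = act["action"], True
            elif act["type"] == "fire_event":
                events.append(act["event"])
    return action, events


print(apply_rules("Here is my SSN: 123", "PROCEED", RULES))
# ('HOLD', ['proactive.pii_attempt'])
```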

Webhooks on block

Every block fires safety.blocked with the matched flags and a preview of the message. Connect a Slack / Discord integration and your clinician sees it in the channel within 2 seconds.
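
On the receiving side, a handler might turn the event into a chat message. The sketch below assumes a payload with `event`, `flags`, and `message_preview` keys. Those field names are guesses based on the description above, not a documented schema, and `to_slack_payload` is a hypothetical helper.

```python
def to_slack_payload(event: dict) -> dict:
    """Convert an assumed safety.blocked payload into a Slack-style message body."""
    if event.get("event") != "safety.blocked":
        raise ValueError("not a safety.blocked event")
    flags = ", ".join(event.get("flags", [])) or "none"
    preview = event.get("message_preview", "")
    return {"text": f"safety.blocked | flags: {flags} | preview: {preview}"}


# Hypothetical payload for illustration.
example = {
    "event": "safety.blocked",
    "flags": ["self_harm"],
    "message_preview": "I want to...",
}
print(to_slack_payload(example)["text"])
# safety.blocked | flags: self_harm | preview: I want to...
```

The returned dict could then be sent as the JSON body of a POST to a Slack incoming-webhook URL, which accepts a `text` field.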