Concepts
Safety Gate
The Safety Gate is a pre-LLM classifier. It evaluates every incoming message before any LLM call, returns a PROCEED / HOLD / BLOCK decision, and exposes exactly which phrases drove the verdict. Blocks fire a webhook to a human, not into a void.
Why before the LLM?
If you only moderate after the LLM responds, you've already spent tokens and (worse) the model has already entered a problematic state. Humane's gate runs on the incoming message plus the user's behavioral profile, so a BLOCK decision short-circuits the whole LLM call.
The three actions
PROCEED: normal flow, LLM is called.
HOLD: surfaces the message to a clinician / operator but does not block. Use for ambiguous language.
BLOCK: refuse the interaction, fire safety.blocked, return a policy message. LLM is never called.
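As a minimal sketch of how a caller might route on these three actions: the function name handle_message and the gate, call_llm, and notify_operator callables are hypothetical stand-ins, not part of the Humane API. Only the action names and the "safety" / "action" response fields come from this page.

```python
from typing import Callable


def handle_message(
    message: str,
    gate: Callable[[str], dict],
    call_llm: Callable[[str], str],
    notify_operator: Callable[[str, dict], None],
    policy_message: str = "This conversation can't continue here.",
) -> dict:
    """Route one message according to the gate's action (illustrative only)."""
    verdict = gate(message)  # hypothetical callable returning the gate response
    action = verdict["safety"]["action"]

    if action == "BLOCK":
        # Short-circuit: the LLM is never called on a BLOCK.
        return {"action": action, "reply": policy_message, "llm_called": False}

    if action == "HOLD":
        # Surface the message to a human, but do not stop the conversation.
        notify_operator(message, verdict)
    elif action != "PROCEED":
        raise ValueError(f"unexpected gate action: {action!r}")

    return {"action": action, "reply": call_llm(message), "llm_called": True}
```

The point of the ordering is that the BLOCK branch returns before call_llm is ever reached, so a blocked message costs no tokens.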
How it decides
The gate uses an LLM-powered classifier with a heuristic fallback (category patterns + safe-phrase overrides to prevent false positives). You can inspect exactly what matched:
"safety": {
  "action": "BLOCK",
  "risk_score": 0.92,
  "flags": ["self_harm"]
},
"analysis": {
  "safety_matched": [
    { "phrase": "kill myself", "category": "self_harm" }
  ],
  "safety_override": null,
  "safety_confidence": "heuristic"
}
Built-in categories
The default heuristic includes:
self_harm: explicit + euphemistic patterns
violence: credible threats and plans
illegal: hacking, money laundering, counterfeiting
pii_extraction: SSN, credit cards, passwords, etc.
sexual_minors: zero tolerance
Each vertical template tightens or loosens these defaults. For example, the therapy-coach template sets self_harm sensitivity to 0.95 and enables mandatory_escalation.
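To make the heuristic fallback more concrete, here is a rough sketch of how category patterns and safe-phrase overrides could interact. The pattern tables and the heuristic_check function are invented for illustration; the actual built-in patterns and matching logic are not documented on this page. Only the output field names mirror the response shown earlier.

```python
import re

# Hypothetical pattern tables, kept tiny on purpose.
CATEGORY_PATTERNS = {
    "self_harm": [r"\bkill myself\b"],
    # Deliberately over-broad, to show why a safe-phrase pass is needed.
    "violence": [r"\bkilled\b"],
}
SAFE_PHRASES = {
    r"\bkilled it\b": "'killed it' recognised as a benign idiom",
}


def heuristic_check(message: str) -> dict:
    """Return matched phrases after masking known benign idioms."""
    text = message.lower()
    override = None
    for pattern, note in SAFE_PHRASES.items():
        if re.search(pattern, text):
            override = note
            # Mask the idiom so it cannot trigger a category pattern.
            text = re.sub(pattern, " ", text)

    matched = []
    for category, patterns in CATEGORY_PATTERNS.items():
        for pattern in patterns:
            hit = re.search(pattern, text)
            if hit:
                matched.append({"phrase": hit.group(0), "category": category})

    return {
        "safety_matched": matched,
        "safety_override": override,
        "safety_confidence": "heuristic",
    }
```

In this sketch an override masks only the idiom itself, so a message such as "I killed it, but I still want to kill myself" would still report the self_harm match alongside the override note.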
False-positive overrides
"I killed it on the test!" should not trigger a block. The heuristic layer runs a safe-overrides pass that neutralises known idiomatic uses. When an override fires, you'll see it in the response:
"analysis": {
  "safety_override": "'killed it' recognised as a benign idiom"
}
Policy-level overrides
Your own YAML policies can override the gate decision. For example, you can force HOLD on messages that touch PII even if the heuristic said PROCEED:
version: 1
rules:
  - name: pii_guard
    priority: 98
    when:
      message_contains: ["my ssn", "social security", "credit card"]
    actions:
      - type: override_safety
        action: HOLD
      - type: fire_event
        event: proactive.pii_attempt
    cooldown: 1m
Webhooks on block
Every block fires safety.blocked with the matched flags and a preview of the message. Connect a Slack / Discord integration and your clinician sees it in the channel within 2 seconds.
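If you would rather relay the event yourself than use the built-in integration, a receiver could look roughly like the sketch below. The payload field names flags and message_preview are assumptions based on the description above, not a documented schema, and both function names are hypothetical. The {"text": ...} body is the standard shape accepted by Slack incoming webhooks.

```python
import json
import urllib.request


def format_block_alert(payload: dict) -> dict:
    """Build a Slack message body from an assumed safety.blocked payload."""
    flags = ", ".join(payload.get("flags", [])) or "none"
    preview = payload.get("message_preview", "")  # assumed field name
    return {"text": f"Safety block ({flags}): {preview}"}


def forward_to_slack(payload: dict, slack_webhook_url: str) -> None:
    """POST the formatted alert to a Slack incoming-webhook URL."""
    body = json.dumps(format_block_alert(payload)).encode("utf-8")
    request = urllib.request.Request(
        slack_webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()
```

Keeping the formatting separate from the network call makes the message text easy to check without contacting Slack.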