Concepts

Safety Gate

The Safety Gate is a two-stage pipeline. An input-side heuristic runs before the LLM is called. A Gate Stack of six post-generation gates then evaluates the proposed AI response. Every decision is explainable, and every BLOCK fires a webhook to a human rather than into a void.

Decisions

  • PROCEED — normal flow. The response is delivered.
  • HOLD — surfaces the exchange to a clinician / operator but does not block. Used for ambiguous language.
  • BLOCK — refuse the interaction, fire safety.blocked, return a policy fallback. The LLM output never reaches the end-user.
  • FLAG — soft annotation. Decision stays PROCEED; the flag is recorded for analytics and downstream policy use.

Input-side check (pre-LLM)

heuristic_safety_check in behavioral_analyzer.py runs on the incoming message. If a category fires, the LLM call is skipped and a category-specific referral fallback is emitted directly.

  • self_harm — explicit + euphemistic patterns
  • violence — credible threats and plans
  • illegal — hacking, money laundering, counterfeiting
  • pii_extraction — attempts to elicit SSN, cards, passwords
  • sexual_minors — zero tolerance
  • legal_advice_request (VB-002) — "am I going to be sued", "is this contract enforceable", etc.
  • financial_advice_request (VB-003) — "should I buy TSLA", specific investment advice

The two newest categories, legal_advice_request and financial_advice_request, fall back to a referral rather than silence:

text
legal_advice_request →
  "I can't offer legal advice. A lawyer in your jurisdiction will give
   you a much better read on this than I can."

financial_advice_request →
  "I can't recommend specific investments. A licensed financial advisor
   can help you weigh your situation properly."
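
As a rough illustration of the input-side flow described above, the sketch below matches a message against per-category patterns and returns the referral fallback without calling the LLM. The patterns, the function name input_side_check, and the shape of the returned dict are assumptions made for this example; the actual rules inside heuristic_safety_check are not shown in this document.

```python
import re
from typing import Optional

# Hypothetical patterns, loosely based on the example phrases listed above.
CATEGORY_PATTERNS = {
    "legal_advice_request": [
        re.compile(r"\bam i going to be sued\b", re.IGNORECASE),
        re.compile(r"\bis this contract enforceable\b", re.IGNORECASE),
    ],
    "financial_advice_request": [
        # Case-insensitive phrase followed by an upper-case ticker-like token.
        re.compile(r"(?i:\bshould i buy)\s+[A-Z]{2,5}\b"),
    ],
}

# Referral texts copied from the fallbacks documented above.
REFERRAL_FALLBACKS = {
    "legal_advice_request": (
        "I can't offer legal advice. A lawyer in your jurisdiction will give "
        "you a much better read on this than I can."
    ),
    "financial_advice_request": (
        "I can't recommend specific investments. A licensed financial advisor "
        "can help you weigh your situation properly."
    ),
}


def input_side_check(message: str) -> Optional[dict]:
    """Return the first matching category and its fallback, or None to proceed."""
    for category, patterns in CATEGORY_PATTERNS.items():
        for pattern in patterns:
            match = pattern.search(message)
            if match:
                return {
                    "category": category,
                    "matched": match.group(0),
                    "fallback": REFERRAL_FALLBACKS[category],
                }
    return None  # nothing fired, so the LLM call can go ahead
```

A caller would skip the LLM whenever the function returns a dict and send the fallback text instead.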

The Gate Stack (output-side)

Six gates run in order against the proposed AI response. The first to return a non-PROCEED verdict short-circuits the stack (except FLAG, which accumulates).

| Gate | Default verdict | Triggers |
| --- | --- | --- |
| ValuesBoundary | BLOCK | therapeutic_claim, medical_prescription, legal_advice, financial_advice, self_harm_encouragement, third_party_pii (SSN + 16-digit card) |
| SocialRisk | HOLD | inappropriate-reply patterns: tone mismatch, frequency pressure, power dynamics |
| InactionGuard | HOLD | suspiciously short responses, excessive hedging, refusal-to-engage shape |
| OverclaimGate (VB-103) | FLAG | guarantee ("I guarantee", "100% guaranteed", "zero risk"), certainty ("definitely will", "no doubt"), outcome_prediction ("this will solve all your problems", "you will get the job") |
| EmotionalDependenceGate (VB-104) | FLAG | permanence_promise ("I'll always be here"), exclusive_availability ("count on me for anything", "anytime day or night"), identity_merging ("I understand you better than anyone") |
| DissentGate | noise | V2 opt-in (dissent_enabled=False default). Injects bounded scoring noise to prevent over-deterministic gate behavior. |
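
The ordering rule (first non-PROCEED verdict wins, FLAG accumulates) can be sketched roughly as follows. Everything here is illustrative: the Verdict and StackResult types, the phrase_gate helper, and the trigger phrase used for ValuesBoundary are assumptions, not the product's real classes or rules.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Decision(Enum):
    PROCEED = "PROCEED"
    HOLD = "HOLD"
    BLOCK = "BLOCK"
    FLAG = "FLAG"


@dataclass
class Verdict:
    gate: str
    decision: Decision
    matched: List[str] = field(default_factory=list)


@dataclass
class StackResult:
    decision: Decision
    gate: Optional[str]   # gate that stopped the stack, if any
    flags: List[Verdict]  # FLAG verdicts collected before stopping


Gate = Callable[[str], Verdict]


def phrase_gate(name: str, decision: Decision, phrases: List[str]) -> Gate:
    """Hypothetical helper: a gate that fires when any phrase appears."""
    def gate(response: str) -> Verdict:
        lowered = response.lower()
        hits = [p for p in phrases if p in lowered]
        return Verdict(name, decision if hits else Decision.PROCEED, hits)
    return gate


def run_gate_stack(response: str, gates: List[Gate]) -> StackResult:
    flags: List[Verdict] = []
    for gate in gates:
        verdict = gate(response)
        if verdict.decision is Decision.FLAG:
            flags.append(verdict)  # FLAG accumulates; the decision stays PROCEED
            continue
        if verdict.decision is not Decision.PROCEED:
            # HOLD or BLOCK short-circuits: later gates never run.
            return StackResult(verdict.decision, verdict.gate, flags)
    return StackResult(Decision.PROCEED, None, flags)


# Demo stack in the documented order; "i prescribe" is a made-up trigger.
DEMO_GATES = [
    phrase_gate("ValuesBoundary", Decision.BLOCK, ["i prescribe"]),
    phrase_gate("OverclaimGate", Decision.FLAG, ["i guarantee", "zero risk"]),
    phrase_gate("EmotionalDependenceGate", Decision.FLAG, ["i'll always be here"]),
]
```

With this demo stack, a response that trips only the two FLAG gates still resolves to PROCEED with two recorded flags, while a response that trips ValuesBoundary stops at BLOCK before the later gates are consulted.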

Dependence cooldown

When the same end-user trips EmotionalDependenceGate three or more times within a 60-minute window, a cooldown engages:

  • TimingEngine adds a +30s delay component.
  • context.tone_dependence_modifier flips to "boundary_respectful".
  • dependence_cooldown on the response surfaces {active: true, recent_count: N}.
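
One way to picture the cooldown is a per-user sliding window over gate trips. The sketch below is a minimal in-memory version under stated assumptions: the class name, the method names, and the returned dict layout are hypothetical, and only the thresholds (three trips, 60 minutes, +30s, "boundary_respectful") come from the description above.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60 * 60     # 60-minute window
TRIP_THRESHOLD = 3           # three or more trips engage the cooldown
COOLDOWN_DELAY_SECONDS = 30  # delay component added while active


class DependenceCooldownTracker:
    """Hypothetical tracker; a real deployment would likely persist this state."""

    def __init__(self) -> None:
        self._trips = defaultdict(deque)  # user_id -> timestamps of recent trips

    def record_trip(self, user_id: str, now: float) -> dict:
        self._trips[user_id].append(now)
        return self.status(user_id, now)

    def status(self, user_id: str, now: float) -> dict:
        trips = self._trips[user_id]
        # Drop trips that have aged out of the window.
        while trips and now - trips[0] > WINDOW_SECONDS:
            trips.popleft()
        active = len(trips) >= TRIP_THRESHOLD
        return {
            "dependence_cooldown": {"active": active, "recent_count": len(trips)},
            "extra_delay_seconds": COOLDOWN_DELAY_SECONDS if active else 0,
            "tone_dependence_modifier": "boundary_respectful" if active else None,
        }
```

Timestamps are passed in explicitly (seconds) so the behavior is easy to test; a caller could supply time.time().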

VB-006 PII scrub (output-side)

Two-tier policy applied to the proposed response before it leaves the gate stack:

  • Tier 1 — SSN, credit card → hard-block via ValuesBoundary.
  • Tier 2 — email, phone, address, DOB, passport, driver's license → masked in place with typed markers ([REDACTED-EMAIL], [REDACTED-PHONE], etc.) so the conversational shape of the reply is preserved.

Each scrub is reported on the response under pii_scrub:

json
"pii_scrub": [
  { "category": "email", "marker": "[REDACTED-EMAIL]" },
  { "category": "phone", "marker": "[REDACTED-PHONE]" }
]
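
The Tier 2 masking step could look roughly like the sketch below, which covers only email and phone. The regular expressions are deliberately simple stand-ins (the phone pattern only handles a US-style 3-3-4 grouping), the function name scrub_tier2 is hypothetical, and reporting one entry per masked match is an assumption about the pii_scrub format.

```python
import re
from typing import Dict, List, Tuple

# Simplified, assumed patterns; production detection would need to be broader.
TIER2_PATTERNS = {
    "email": (
        re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
        "[REDACTED-EMAIL]",
    ),
    "phone": (
        re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
        "[REDACTED-PHONE]",
    ),
}


def scrub_tier2(text: str) -> Tuple[str, List[Dict[str, str]]]:
    """Mask Tier 2 PII in place and return the scrubbed text plus a report."""
    report: List[Dict[str, str]] = []
    for category, (pattern, marker) in TIER2_PATTERNS.items():
        text, count = pattern.subn(marker, text)
        # Assumption: one report entry per masked occurrence.
        report.extend({"category": category, "marker": marker} for _ in range(count))
    return text, report
```

Because the markers replace only the matched span, the rest of the sentence is left untouched, which is what keeps the reply's conversational shape intact.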

Explainability

Every gate decision returns the matched phrases / patterns. You can render them in your own UI:

json
"safety": {
  "action": "BLOCK",
  "risk_score": 0.92,
  "flags": ["self_harm"],
  "gate": "ValuesBoundary"
},
"analysis": {
  "safety_matched": [
    { "phrase": "kill myself", "category": "self_harm" }
  ],
  "safety_override": null,
  "safety_confidence": "heuristic"
}

Explainability is the point

Raw LLM moderation APIs often return little more than a flag. Humane returns the gate that fired, the exact phrases that triggered it, and the category hierarchy they matched. Compliance officers love this.
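
To show how these fields might be rendered in your own UI or logs, here is a small sketch that turns the safety and analysis objects from the example above into a readable summary. The function name summarize_safety and the output layout are assumptions; only the field names come from the sample payload.

```python
def summarize_safety(payload: dict) -> str:
    """Build a short, human-readable explanation from a response payload."""
    safety = payload.get("safety", {})
    matches = payload.get("analysis", {}).get("safety_matched", [])

    action = safety.get("action", "PROCEED")
    gate = safety.get("gate", "unknown gate")
    risk = safety.get("risk_score", 0.0)

    lines = [f"{action} by {gate} (risk {risk:.2f})"]
    for match in matches:
        phrase = match["phrase"]
        category = match["category"]
        lines.append(f'  - "{phrase}" matched category {category}')
    return "\n".join(lines)


example = {
    "safety": {
        "action": "BLOCK",
        "risk_score": 0.92,
        "flags": ["self_harm"],
        "gate": "ValuesBoundary",
    },
    "analysis": {
        "safety_matched": [{"phrase": "kill myself", "category": "self_harm"}],
        "safety_override": None,
        "safety_confidence": "heuristic",
    },
}
print(summarize_safety(example))
```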

Policy-level overrides

Your own YAML policies can override gate decisions. For example, you can force HOLD on messages that touch PII even if the input check said PROCEED:

yaml
version: 1
rules:
  - name: pii_guard
    priority: 98
    when:
      message_contains: ["my ssn", "social security", "credit card"]
    actions:
      - type: override_safety
        action: HOLD
      - type: fire_event
        event: proactive.pii_attempt
        cooldown: 1m
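
To make the rule's effect concrete, the sketch below evaluates an equivalent rule in plain Python. It is not the policy engine: the rule is written as a dict mirroring the YAML above so no YAML parser is needed, the function apply_policies is hypothetical, and two behaviors are assumptions (a higher priority number is evaluated first, and only the first matching rule applies). The cooldown field is ignored here.

```python
from typing import List, Tuple

# Dict form of the pii_guard rule shown above.
RULES = [
    {
        "name": "pii_guard",
        "priority": 98,
        "when": {"message_contains": ["my ssn", "social security", "credit card"]},
        "actions": [
            {"type": "override_safety", "action": "HOLD"},
            {"type": "fire_event", "event": "proactive.pii_attempt", "cooldown": "1m"},
        ],
    }
]


def apply_policies(message: str, decision: str, rules: List[dict]) -> Tuple[str, List[str]]:
    """Return the (possibly overridden) decision and the events to fire."""
    events: List[str] = []
    lowered = message.lower()
    # Assumption: higher priority numbers are evaluated first.
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        needles = rule["when"].get("message_contains", [])
        if not any(needle in lowered for needle in needles):
            continue
        for action in rule["actions"]:
            if action["type"] == "override_safety":
                decision = action["action"]
            elif action["type"] == "fire_event":
                events.append(action["event"])
        break  # assumption: the first matching rule wins
    return decision, events
```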

Webhooks on block

Every BLOCK fires safety.blocked with the gate name, matched flags, and a preview of the message. Connect a Slack / Discord integration and the on-call clinician sees it in the channel within ~2 seconds.
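
If you relay the webhook yourself rather than using a built-in integration, a receiver might reshape the event into a Slack message along these lines. The event field names (gate, flags, preview) are assumptions based on the description above, not a documented schema, and the helper names are hypothetical. The only external fact relied on is that Slack incoming webhooks accept a JSON body with a text field.

```python
import json
import urllib.request

MAX_PREVIEW_CHARS = 200  # assumed cap so long messages do not flood the channel


def slack_payload_for_block(event: dict) -> dict:
    """Convert an assumed safety.blocked event into a Slack message body."""
    gate = event.get("gate", "unknown gate")
    flags = ", ".join(event.get("flags", [])) or "none"
    preview = event.get("preview", "")[:MAX_PREVIEW_CHARS]
    text = (
        f"safety.blocked fired by {gate}\n"
        f"Flags: {flags}\n"
        f"Preview: {preview}"
    )
    return {"text": text}


def post_to_slack(webhook_url: str, payload: dict) -> None:
    """Send the payload to a Slack incoming-webhook URL (performs a network call)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()
```

Keeping the formatting separate from the network call makes the message shape easy to test without a live webhook URL.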