Safety Gate
The Safety Gate is a two-stage pipeline. An input-side heuristic runs before the LLM is ever called. A Gate Stack of six post-generation gates then evaluates the proposed AI response. Every decision is explainable and every block fires a webhook to a human, not into a void.
Decisions
- PROCEED: normal flow. The response is delivered.
- HOLD: surfaces the exchange to a clinician / operator but does not block. Used for ambiguous language.
- BLOCK: refuses the interaction, fires safety.blocked, and returns a policy fallback. The LLM output never reaches the end-user.
- FLAG: soft annotation. Decision stays PROCEED; the flag is recorded for analytics and downstream policy use.
Input-side check (pre-LLM)
heuristic_safety_check in behavioral_analyzer.py runs on the incoming message. If a category fires, the LLM call is skipped and a category-specific referral fallback is emitted directly.
- self_harm: explicit + euphemistic patterns
- violence: credible threats and plans
- illegal: hacking, money laundering, counterfeiting
- pii_extraction: attempts to elicit SSN, cards, passwords
- sexual_minors: zero tolerance
- legal_advice_request (VB-002): "am I going to be sued", "is this contract enforceable", etc.
- financial_advice_request (VB-003): "should I buy TSLA", specific investment advice
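The skip-the-LLM behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in behavioral_analyzer.py: the names CATEGORY_PATTERNS, FALLBACKS, sketch_safety_check, and handle_message are hypothetical, only two categories are shown, and the patterns are just the example phrases quoted on this page.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern table: only the example phrases quoted on this page.
# The real pattern lists are not shown in this document.
CATEGORY_PATTERNS = {
    "legal_advice_request": [
        r"\bam i going to be sued\b",
        r"\bis this contract enforceable\b",
    ],
    "financial_advice_request": [r"\bshould i buy tsla\b"],
}

# Referral fallback wording as documented on this page.
FALLBACKS = {
    "legal_advice_request": (
        "I can't offer legal advice. A lawyer in your jurisdiction will give "
        "you a much better read on this than I can."
    ),
    "financial_advice_request": (
        "I can't recommend specific investments. A licensed financial advisor "
        "can help you weigh your situation properly."
    ),
}


def sketch_safety_check(message: str) -> Optional[dict]:
    """Return the first category that fires, or None if nothing matches."""
    for category, patterns in CATEGORY_PATTERNS.items():
        for pattern in patterns:
            match = re.search(pattern, message, flags=re.IGNORECASE)
            if match:
                return {"category": category, "matched": match.group(0)}
    return None


def handle_message(message: str, call_llm: Callable[[str], str]) -> str:
    """Emit the referral fallback directly when a category fires."""
    hit = sketch_safety_check(message)
    if hit is not None:
        return FALLBACKS[hit["category"]]  # the LLM is never called
    return call_llm(message)
```

Passing the LLM call in as a parameter keeps the sketch testable without a model: any callable that takes the message and returns a string will do.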
The two new categories (legal_advice_request and financial_advice_request) fall back to a referral, not silence:
legal_advice_request →
"I can't offer legal advice. A lawyer in your jurisdiction will give
you a much better read on this than I can."
financial_advice_request →
"I can't recommend specific investments. A licensed financial advisor
can help you weigh your situation properly."
The Gate Stack (output-side)
Six gates run in order against the proposed AI response. The first to return a non-PROCEED verdict short-circuits the stack (except FLAG, which accumulates).
| Gate | Default verdict | Triggers |
|---|---|---|
| ValuesBoundary | BLOCK | therapeutic_claim, medical_prescription, legal_advice, financial_advice, self_harm_encouragement, third_party_pii (SSN + 16-digit card) |
| SocialRisk | HOLD | inappropriate-reply patterns: tone mismatch, frequency pressure, power dynamics |
| InactionGuard | HOLD | suspiciously short responses, excessive hedging, refusal-to-engage shape |
| OverclaimGate (VB-103) | FLAG | guarantee ("I guarantee", "100% guaranteed", "zero risk"), certainty ("definitely will", "no doubt"), outcome_prediction ("this will solve all your problems", "you will get the job") |
| EmotionalDependenceGate (VB-104) | FLAG | permanence_promise ("I'll always be here"), exclusive_availability ("count on me for anything", "anytime day or night"), identity_merging ("I understand you better than anyone") |
| DissentGate | noise | V2 opt-in (dissent_enabled=False default). Injects bounded scoring noise to prevent over-deterministic gate behavior. |
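The ordering rule (first non-PROCEED verdict wins, FLAG accumulates) might look like the sketch below. Everything here is an assumption for illustration: GateResult, StackDecision, run_gate_stack, and the three toy gates are hypothetical names, the toy gates match only a few phrases from the table, and DissentGate is left out because its noise behavior is not specified here.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class GateResult:
    verdict: str  # "PROCEED", "HOLD", "BLOCK", or "FLAG"
    flags: List[str] = field(default_factory=list)


@dataclass
class StackDecision:
    action: str
    gate: Optional[str]  # gate that short-circuited, if any
    flags: List[str]


Gate = Tuple[str, Callable[[str], GateResult]]


def run_gate_stack(gates: List[Gate], response: str) -> StackDecision:
    accumulated: List[str] = []
    for name, check in gates:
        result = check(response)
        if result.verdict == "FLAG":
            accumulated.extend(result.flags)  # soft annotation, keep going
            continue
        if result.verdict != "PROCEED":
            # HOLD or BLOCK: stop here, later gates never run.
            return StackDecision(result.verdict, name, accumulated + result.flags)
    return StackDecision("PROCEED", None, accumulated)


# Toy gates (hypothetical): each checks one trigger named in the table.
def values_boundary(text: str) -> GateResult:
    if re.search(r"\b\d{16}\b", text):
        return GateResult("BLOCK", ["third_party_pii"])
    return GateResult("PROCEED")


def overclaim(text: str) -> GateResult:
    if "i guarantee" in text.lower():
        return GateResult("FLAG", ["guarantee"])
    return GateResult("PROCEED")


def dependence(text: str) -> GateResult:
    if "i'll always be here" in text.lower():
        return GateResult("FLAG", ["permanence_promise"])
    return GateResult("PROCEED")


gates = [
    ("ValuesBoundary", values_boundary),
    ("OverclaimGate", overclaim),
    ("EmotionalDependenceGate", dependence),
]
decision = run_gate_stack(gates, "I guarantee it. I'll always be here.")
blocked = run_gate_stack(gates, "The card is 4111111111111111, I guarantee it.")
```

In this sketch `decision` ends up as PROCEED carrying both flags, while `blocked` stops at ValuesBoundary before the FLAG gates are consulted.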
Dependence cooldown
When the same end-user trips EmotionalDependenceGate three or more times within a 60-minute window, a cooldown engages:
- TimingEngine adds a +30s delay component.
- context.tone_dependence_modifier flips to "boundary_respectful".
- dependence_cooldown on the response surfaces {active: true, recent_count: N}.
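One way to track "three or more times within a 60-minute window" is a per-user sliding window of trip timestamps. The sketch below is an assumption about how such a counter could work, not the actual implementation; the class name DependenceCooldown and its methods are hypothetical, and whether a trip exactly 60 minutes old still counts is a choice made here for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

WINDOW_SECONDS = 60 * 60  # 60-minute window, from the text above
THRESHOLD = 3             # three or more trips engage the cooldown


class DependenceCooldown:
    """Sliding-window counter of EmotionalDependenceGate trips per end-user."""

    def __init__(self) -> None:
        self._trips: Dict[str, Deque[float]] = defaultdict(deque)

    def record_trip(self, user_id: str, now: float) -> None:
        self._trips[user_id].append(now)

    def status(self, user_id: str, now: float) -> dict:
        trips = self._trips[user_id]
        # Drop trips that have aged out of the window.
        while trips and now - trips[0] > WINDOW_SECONDS:
            trips.popleft()
        return {"active": len(trips) >= THRESHOLD, "recent_count": len(trips)}


cooldown = DependenceCooldown()
for t in (0.0, 600.0, 1200.0):  # three trips, ten minutes apart
    cooldown.record_trip("user-1", t)
during = cooldown.status("user-1", now=1200.0)
later = cooldown.status("user-1", now=4000.0)  # the first trip has expired
```

The returned dict mirrors the shape of the dependence_cooldown field described above; timestamps are passed in explicitly so the behavior can be checked without waiting on a real clock.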
VB-006 PII scrub (output-side)
Two-tier policy applied to the proposed response before it leaves the gate stack:
- Tier 1: SSN, credit card → hard-block via ValuesBoundary.
- Tier 2: email, phone, address, DOB, passport, driver's license → masked in place with typed markers ([REDACTED-EMAIL], [REDACTED-PHONE], etc.) so the conversational shape of the reply is preserved.
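Tier 2 masking can be illustrated with a small regex pass. This is a rough sketch under stated assumptions: scrub_tier2 and TIER2_PATTERNS are hypothetical names, the two regexes are deliberately simple stand-ins rather than the product's detectors, only email and phone are covered, and emitting one report entry per masked occurrence is a guess at the reporting granularity.

```python
import re
from typing import Dict, List, Tuple

# (category, marker, pattern). Simplified patterns for illustration only.
TIER2_PATTERNS = [
    ("email", "[REDACTED-EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("phone", "[REDACTED-PHONE]", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def scrub_tier2(text: str) -> Tuple[str, List[Dict[str, str]]]:
    """Mask Tier 2 PII in place and report what was masked."""
    report: List[Dict[str, str]] = []
    for category, marker, pattern in TIER2_PATTERNS:
        text, count = pattern.subn(marker, text)
        # Assumption: one report entry per masked occurrence.
        report.extend({"category": category, "marker": marker} for _ in range(count))
    return text, report


scrubbed, pii_scrub = scrub_tier2("Reach me at jane@example.com or 555-123-4567.")
```

Because the match is replaced in place rather than the sentence being dropped, the surrounding words and punctuation survive, which is what keeps the reply readable.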
Each scrub is reported on the response under pii_scrub:
"pii_scrub": [
  { "category": "email", "marker": "[REDACTED-EMAIL]" },
  { "category": "phone", "marker": "[REDACTED-PHONE]" }
]
Explainability
Every gate decision returns the matched phrases / patterns. You can render them in your own UI:
"safety": {
  "action": "BLOCK",
  "risk_score": 0.92,
  "flags": ["self_harm"],
  "gate": "ValuesBoundary"
},
"analysis": {
  "safety_matched": [
    { "phrase": "kill myself", "category": "self_harm" }
  ],
  "safety_override": null,
  "safety_confidence": "heuristic"
}
Policy-level overrides
Your own YAML policies can override gate decisions, e.g. force HOLD on messages that touch PII even if the input check said PROCEED:
version: 1
rules:
  - name: pii_guard
    priority: 98
    when:
      message_contains: ["my ssn", "social security", "credit card"]
    actions:
      - type: override_safety
        action: HOLD
      - type: fire_event
        event: proactive.pii_attempt
    cooldown: 1m
Webhooks on block
Every BLOCK fires safety.blocked with the gate name, matched flags, and a preview of the message. Connect a Slack / Discord integration and the on-call clinician sees it in the channel within ~2 seconds.
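On the receiving side, a handler could turn the event into a one-line channel message. The sketch below is hypothetical throughout: the payload schema is not documented on this page, so the keys "gate", "flags", and "preview" are assumed names for the three fields mentioned above, and format_block_alert is an illustrative helper rather than part of any SDK.

```python
from typing import Any, Dict


def format_block_alert(payload: Dict[str, Any], max_preview: int = 120) -> str:
    """Build a short alert line from a safety.blocked payload.

    Assumes the payload carries "gate", "flags", and "preview" keys;
    the real field names may differ.
    """
    gate = payload.get("gate", "unknown gate")
    flags = ", ".join(payload.get("flags", [])) or "none"
    preview = str(payload.get("preview", ""))
    if len(preview) > max_preview:
        preview = preview[:max_preview] + "..."  # keep the alert compact
    return f"safety.blocked | gate={gate} | flags={flags} | preview: {preview}"


alert = format_block_alert(
    {"gate": "ValuesBoundary", "flags": ["self_harm"], "preview": "hi"}
)
```

Truncating the preview keeps sensitive message text in the channel to a minimum while still giving the on-call reader enough context to triage.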