1. Input Validation for Crisis Content

Any system that accepts user input in a mental health context must validate that input for crisis signals before processing.

Multi-Layer Detection Architecture

🔍 Layer 1: Pattern Matching (Fast, Reliable)

Keyword and pattern matching for explicit crisis content. Should execute in <10ms before any other processing.

EXPLICIT_PATTERNS = [
    # Suicidal ideation - direct
    r'\b(kill|end)\s*(my)?self\b',
    r'\b(want(ing)?|going)\s*to\s*die\b',
    r'\bsuicid(e|al)\b',
    r'\b(don\'?t|do\s*not)\s*want\s*to\s*(live|be\s*alive)\b',
    
    # Self-harm - direct
    r'\bcut(ting)?\s*(my)?self\b',
    r'\bself[\s-]?harm\b',
    r'\bhurt(ing)?\s*myself\b',
    
    # Methods
    r'\b(pills?|overdose|hang(ing)?|jump(ing)?)\b',  # Context required
    
    # Crisis indicators
    r'\b(crisis|emergency|911)\b',
    r'\b(no\s*(one|body)\s*(cares?|would\s*miss))\b'
]
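
The Layer 1 check above can be sketched as a compiled-pattern scan. The pattern list here is a repeated subset of EXPLICIT_PATTERNS so the sketch is self-contained, and the 0/1 scoring convention for `check_explicit_patterns` is an illustrative assumption, not a prescribed API:

```python
import re

# Subset of EXPLICIT_PATTERNS above, repeated so this sketch is self-contained.
_COMPILED = [re.compile(p, re.IGNORECASE) for p in [
    r'\b(kill|end)\s*(my)?self\b',
    r'\bsuicid(e|al)\b',
    r'\bself[\s-]?harm\b',
]]

def check_explicit_patterns(message: str) -> float:
    """Return 1.0 on any explicit match, else 0.0.

    Compiling once at import keeps the check well under the <10ms budget.
    """
    return 1.0 if any(p.search(message) for p in _COMPILED) else 0.0
```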
🧠 Layer 2: Semantic Analysis (Implicit Detection)

ML-based semantic analysis for implicit crisis signals. Many users express distress indirectly.

| Implicit Signal | Examples | Detection Approach |
|---|---|---|
| Hopelessness | "Nothing will ever change," "What's the point" | Sentiment + temporal markers |
| Farewell language | "Just want to say goodbye," "You've been a good friend" | Farewell pattern classifier |
| Giving away possessions | "I want you to have my..." | Context-specific patterns |
| Feeling like a burden | "Everyone would be better off without me" | Burden + absence classifier |
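
A production Layer 2 would use a trained classifier; as a crude illustration of the signal categories above, a cue-phrase heuristic might look like this (the cue lists, function name, and 0.5-per-hit scoring are all illustrative assumptions):

```python
# Illustrative heuristic only; a real semantic layer uses trained ML classifiers.
FAREWELL_CUES = ["goodbye", "you've been a good friend", "want you to have my"]
HOPELESS_CUES = ["nothing will ever change", "what's the point", "better off without me"]

def implicit_signal_score(message: str) -> float:
    """Crude stand-in for the semantic layer: count cue-phrase hits,
    saturating at 1.0 once two or more cues appear."""
    text = message.lower()
    hits = sum(cue in text for cue in FAREWELL_CUES + HOPELESS_CUES)
    return min(1.0, 0.5 * hits)
```
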
📊 Layer 3: Contextual Assessment

Analysis of conversation history and patterns over time:

  • Escalating distress across messages
  • Sudden mood changes
  • Time-of-day patterns (late night distress)
  • Repeated themes of worthlessness/hopelessness
  • Disengagement from previously engaging topics
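
One way to operationalize the escalation signal above: compare recent per-message distress scores against the earlier baseline. The function name, window size, and the assumption that each message already carries a distress score in [0, 1] are illustrative:

```python
from statistics import mean

def escalation_score(history_scores: list[float], window: int = 3) -> float:
    """Escalation as (recent mean distress) minus (baseline mean distress),
    clamped to [0, 1]. history_scores: per-message distress, oldest first."""
    if len(history_scores) <= window:
        return 0.0  # not enough history to establish a baseline
    recent = mean(history_scores[-window:])
    baseline = mean(history_scores[:-window])
    return max(0.0, min(1.0, recent - baseline))
```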

Confidence Scoring and Thresholds

Conservative Thresholds

Crisis detection should favor false positives over false negatives. A threshold of 0.4-0.5 (rather than typical 0.5+) is recommended for escalation triggers. The cost of missing a crisis far exceeds the cost of unnecessary resource provision.

def crisis_score(message, context):
    """
    Returns crisis score 0-1 and recommended action
    """
    scores = {
        'explicit_match': check_explicit_patterns(message),
        'semantic_risk': semantic_classifier(message),
        'context_escalation': context_analysis(context),
        'temporal_risk': time_pattern_analysis(context)
    }
    
    # Explicit match always triggers
    if scores['explicit_match'] > 0.8:
        return 1.0, 'IMMEDIATE_ESCALATION'
    
    # Weighted combination for implicit
    combined = (
        scores['semantic_risk'] * 0.4 +
        scores['context_escalation'] * 0.3 +
        scores['temporal_risk'] * 0.3
    )
    
    if combined > 0.4:  # Conservative threshold
        return combined, 'ESCALATION_REQUIRED'
    elif combined > 0.2:
        return combined, 'ELEVATED_MONITORING'
    else:
        return combined, 'NORMAL'

2. Output Filtering Requirements

All AI-generated output must be filtered before delivery to users.

Prohibited Output Categories

| Category | Examples | Implementation |
|---|---|---|
| Method information | Specific self-harm methods, lethal doses, etc. | Hard block; never generate |
| Diagnostic statements | "You have depression," "This sounds like BPD" | Pattern detection; redirect to professional |
| Treatment advice | "You should take medication," "Stop taking your meds" | Hard block; defer to prescriber |
| Delusional validation | Agreeing with paranoid or delusional content | Reality-testing classifier; neutral response |
| Relationship claims | "I love you," "I'll always be here for you" | Intimacy pattern detection; constrained language |

Crisis Response Templates

When crisis is detected, output should follow tested, clinically reviewed templates rather than relying on free-form generative responses.
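
A minimal sketch of a template registry, keyed by the actions returned from crisis_score. The template wording and names here are placeholders, not clinically reviewed text, and real resource strings must come from the verified database rather than being hardcoded:

```python
# Placeholder templates; real wording requires clinical review and testing.
CRISIS_TEMPLATES = {
    'IMMEDIATE_ESCALATION': (
        "It sounds like you're going through something really difficult. "
        "You deserve support right now: {resource}"
    ),
    'ESCALATION_REQUIRED': (
        "I'm concerned about what you've shared. "
        "Here are some support options: {resource}"
    ),
}

def render_crisis_response(action: str, verified_resource: str) -> str:
    """Fill a reviewed template; never free-generate crisis text."""
    return CRISIS_TEMPLATES[action].format(resource=verified_resource)
```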

Hallucination Prevention

  • Resource verification: All crisis resources (numbers, links) must be verified from a maintained database, never generated
  • Citation requirements: Claims about evidence must reference verified sources
  • Factual grounding: Use retrieval-augmented generation (RAG) with verified content
  • Uncertainty marking: When confidence is low, output should explicitly acknowledge uncertainty
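
The resource-verification rule above can be enforced with a strict lookup: resources exist only in a maintained store, and a missing key raises rather than falling back to generated text. The store contents and function name here are illustrative:

```python
# Illustrative maintained database; real entries require a verification process.
VERIFIED_RESOURCES = {
    'us_crisis_line': {'label': '988 Suicide & Crisis Lifeline', 'contact': '988'},
}

def get_resource(key: str) -> dict:
    """Return a resource only if present in the maintained database.
    Raising (instead of generating a fallback) fails toward safety."""
    if key not in VERIFIED_RESOURCES:
        raise KeyError(f"Unverified resource requested: {key}")
    return VERIFIED_RESOURCES[key]
```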

3. Confidence Thresholds for Intervention

Different actions require different confidence levels. Low-cost actions (such as showing resources) should trigger at low thresholds, while more disruptive or clinical actions warrant somewhat higher confidence; all thresholds remain deliberately conservative.

| Action | Confidence Threshold | Rationale |
|---|---|---|
| Show crisis resources | 0.3 | Low cost of false positive; resources always helpful |
| Escalate to human review | 0.4 | Human can disambiguate; better safe than sorry |
| Interrupt AI conversation | 0.5 | More disruptive; but crisis takes priority |
| Alert care team | 0.5 | Clinical action requires reasonable confidence |
| Suggest specific treatment | N/A | AI should never suggest specific treatment |
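
The thresholds above can be encoded as data so that a single confidence score triggers every action whose threshold it meets (the constant and function names are illustrative):

```python
# (threshold, action) pairs from the table above; a score triggers every
# action whose threshold it meets or exceeds.
THRESHOLDS = [
    (0.3, 'SHOW_CRISIS_RESOURCES'),
    (0.4, 'ESCALATE_TO_HUMAN_REVIEW'),
    (0.5, 'INTERRUPT_CONVERSATION'),
    (0.5, 'ALERT_CARE_TEAM'),
]

def actions_for(confidence: float) -> list[str]:
    """Return all actions warranted at the given confidence level."""
    return [action for t, action in THRESHOLDS if confidence >= t]
```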

4. Graceful Degradation Patterns

Systems must handle failure modes safely. When components fail, the system should fail toward safety.

Failure Mode Handling

| Failure | Degraded Behavior | User Communication |
|---|---|---|
| LLM API unavailable | Fall back to rule-based responses | "I'm having technical difficulties. Here are some resources..." |
| Crisis detection uncertain | Assume elevated risk; show resources | Provide crisis resources proactively |
| Human oversight unavailable | Limit AI capability; direct to crisis line | "For the support you need right now, please reach out to..." |
| Output filter fails | Block all generative output | Display static safety content only |

Design Principle

When uncertain, the system should fail toward providing more resources, more human involvement, and less AI autonomy—not the reverse.
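
A minimal sketch of this fail-toward-safety wrapper: any component failure, in generation or filtering, yields static pre-approved content instead of unfiltered output. The function names and the catch-all exception handling are illustrative assumptions:

```python
# Placeholder static content; real text would be clinically reviewed.
STATIC_SAFETY_CONTENT = "I'm having technical difficulties. Here are some resources: ..."

def safe_respond(message: str, llm_call, output_filter) -> str:
    """Fail toward safety: if the LLM or the output filter raises,
    return static, pre-approved content rather than anything unfiltered."""
    try:
        draft = llm_call(message)
        return output_filter(draft)
    except Exception:
        return STATIC_SAFETY_CONTENT
```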

5. Audit Trail Requirements

Comprehensive logging is essential for safety monitoring, incident investigation, and quality improvement.

Required Logging

| Event Type | Data to Log | Retention |
|---|---|---|
| All interactions | Timestamp, session ID, input hash, output hash | 90 days minimum |
| Crisis detection events | Full input/output, confidence scores, action taken | 7 years (medical record) |
| Escalation events | Escalation path, response time, resolution | 7 years |
| Filter triggers | What was blocked, why, what was shown instead | 90 days |
| System failures | Component, failure mode, degraded behavior activated | 1 year |

Privacy-Preserving Logging

  • Log hashes of content when full content not required
  • Separate PII from interaction logs
  • Implement access controls on sensitive logs
  • Enable audit trail for who accessed logs
  • Support data deletion requests while maintaining safety records

6. Human-in-the-Loop Architecture

Required Human Oversight Points

Real-Time Review Queue

  • All Tier 1/2 crisis detections
  • Ambiguous content flagged by classifiers
  • User-reported concerns

SLA: Tier 1 <15 min, Tier 2 <4 hours
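
One way to operationalize these SLAs is a queue ordered by review deadline, so the item closest to breaching its SLA surfaces first. The class and method names are illustrative:

```python
import heapq
import time

# SLA windows from above: Tier 1 within 15 minutes, Tier 2 within 4 hours.
SLA_SECONDS = {1: 15 * 60, 2: 4 * 60 * 60}

class ReviewQueue:
    """Priority queue ordering flagged items by SLA deadline (illustrative)."""

    def __init__(self):
        self._heap = []

    def add(self, tier: int, item_id: str, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now + SLA_SECONDS[tier], item_id))

    def next_due(self):
        """Return the (deadline, item_id) pair due soonest, or None."""
        return self._heap[0] if self._heap else None
```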

Periodic Sampling

  • Random sample of all interactions
  • Stratified by risk level, user characteristics
  • Clinical review for appropriateness

Target: 5% of interactions reviewed weekly

Algorithm Change Review

  • All changes to crisis detection
  • All changes to output filtering
  • All changes to escalation protocols

Approval: Clinical lead sign-off required

Incident Investigation

  • Root cause analysis for adverse events
  • Review of related interactions
  • System improvements tracking

Requirement: Documented process, improvement loop

7. Testing Requirements

Required Test Suites

Crisis Detection Tests

  • Explicit suicidal ideation (various phrasings)
  • Implicit suicidal ideation (hopelessness, farewell)
  • Self-harm disclosure
  • Harm to others
  • Psychotic content
  • False positive edge cases (discussing movies, research, etc.)
  • Cross-language testing
  • Dialectal variation testing
  • Adversarial bypass attempts

Output Safety Tests

  • Requests for harmful information
  • Diagnostic probing
  • Treatment advice requests
  • Relationship boundary probing
  • Attempts to elicit "therapy" behavior
  • Psychotic content validation attempts

Equity Tests

  • Performance across demographic groups
  • Dialectal and linguistic variation
  • Cultural expression of distress
  • Differential false positive/negative rates
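
Test cases like those above are typically expressed as assertion-based unit tests. In this sketch, detect_crisis is an inline stub standing in for the full detection pipeline, so the tests illustrate shape rather than real coverage:

```python
import re

def detect_crisis(message: str) -> bool:
    """Stub standing in for the full multi-layer pipeline (illustrative only)."""
    return bool(re.search(r'\b(kill|hurt)\s+myself\b|\bsuicidal\b', message, re.I))

def test_explicit_ideation_detected():
    assert detect_crisis("I want to hurt myself")

def test_movie_discussion_not_flagged():
    # False-positive edge case: neutral fiction talk should not trigger alone.
    assert not detect_crisis("The movie's ending was dark")
```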

Ongoing Monitoring

  • Daily: Crisis detection metrics (sensitivity, specificity)
  • Weekly: Sampled interaction review
  • Monthly: Full safety audit, bias analysis
  • Quarterly: Third-party security review
  • On model update: Full regression testing
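
The daily sensitivity/specificity metrics can be computed directly from confusion counts. This sketch assumes labeled outcomes are available for the day's detections; sensitivity is the number to watch, since false negatives are the costly error here:

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Guards against empty denominators for days with no positives/negatives."""
    return {
        'sensitivity': tp / (tp + fn) if tp + fn else 0.0,
        'specificity': tn / (tn + fp) if tn + fp else 0.0,
    }
```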