Guardrails

Implement safety controls for your AI applications. Filter harmful content, validate outputs, and maintain control over model behavior.

🛡️

Content Filtering

Block harmful, inappropriate, or policy-violating content

Output Validation

Ensure outputs match expected formats and constraints
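A minimal sketch of what such a check can look like, assuming the model was asked to return JSON; the required fields used here are purely illustrative:

```python
import json

def validate_output(raw: str, required_fields: dict) -> dict:
    """Parse model output as JSON and verify required fields and types.

    Raises ValueError if the output is malformed or missing fields.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output is not valid JSON: {exc}") from exc
    for field, expected_type in required_fields.items():
        if field not in data:
            raise ValueError(f"Missing required field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"Field {field!r} has wrong type")
    return data
```

A stricter setup would use a full JSON Schema validator, but the shape of the check is the same: parse first, then verify structure before trusting the output.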

🔒

Topic Restriction

Keep conversations within approved domains
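One lightweight form of topic restriction is an allowlist checked before generation. The topics and keywords below are hypothetical; production systems more often use a small topic classifier or embedding similarity rather than keyword matching:

```python
from typing import Optional

# Illustrative domain allowlist; replace with a trained classifier in practice.
ALLOWED_TOPICS = {
    "billing": ("invoice", "charge", "payment", "refund"),
    "shipping": ("delivery", "tracking", "package"),
}

def match_topic(user_input: str) -> Optional[str]:
    """Return the first approved topic the input mentions, else None."""
    lowered = user_input.lower()
    for topic, keywords in ALLOWED_TOPICS.items():
        if any(kw in lowered for kw in keywords):
            return topic
    return None
```

Requests that return None can be declined with a fixed redirect message, keeping the assistant inside its approved domains.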

🚫

PII Protection

Detect and redact personal information
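A sketch of regex-based redaction follows. The patterns are illustrative only; robust PII detection usually layers a trained NER model on top of regexes:

```python
import re

# Illustrative patterns; tune and extend for your locale and data.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders like [EMAIL] preserve readability for downstream review while keeping the raw value out of logs and model context.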

⚖️

Bias Detection

Identify and mitigate biased outputs

📊

Confidence Scoring

Flag low-confidence responses for review
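One common signal for this is token log-probabilities; the sketch below assumes an API that exposes per-token logprobs (availability varies by provider, and the 0.7 threshold is an arbitrary placeholder to tune against your own data):

```python
import math

def needs_review(token_logprobs, threshold=0.7):
    """Flag a response for human review when the model's average
    per-token probability falls below the threshold.

    token_logprobs: log-probabilities of the sampled tokens.
    """
    if not token_logprobs:
        return True  # no signal at all: err on the side of review
    avg_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
    return avg_prob < threshold
```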

Defense in Depth

1

Input Filtering

Screen user inputs before they reach the model. Block malicious prompts, jailbreak attempts, and policy violations.

2

System Prompt Hardening

Define clear boundaries and instructions. Specify what the model should and should not do.

3

Output Screening

Check model outputs before showing them to users. Filter harmful content, validate format, and check for hallucinations.

4

Monitoring & Alerts

Track patterns, flag anomalies, and alert on suspicious activity. Continuously improve your defenses using production data.
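The four layers above can be sketched as a single wrapper around the model call. Everything here is a stub: the phrase list and output check stand in for real moderation calls, and layer 2 (the hardened system prompt) lives inside whatever model callable you pass in:

```python
import logging
from typing import Callable

logger = logging.getLogger("guardrails")

# Layer 1: input filtering (placeholder phrase list, not a real filter)
BLOCKED_PHRASES = ("ignore previous instructions", "jailbreak")

def input_ok(user_input: str) -> bool:
    lowered = user_input.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)

# Layer 3: output screening (placeholder check for prompt leakage)
def output_ok(output: str) -> bool:
    return "BEGIN SYSTEM PROMPT" not in output

def guarded_call(model: Callable[[str], str], user_input: str) -> str:
    """Run one request through the layered defenses."""
    if not input_ok(user_input):                         # layer 1
        logger.warning("blocked input: %r", user_input)  # layer 4
        return "I can't help with that request."
    output = model(user_input)   # layer 2: hardened system prompt inside `model`
    if not output_ok(output):                            # layer 3
        logger.warning("blocked output")                 # layer 4
        return "Response filtered. Please try again."
    return output
```

The point of the sketch is composition: each layer can fail independently, and the logging calls feed the monitoring layer so blocked traffic becomes training data for better filters.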

Implementation

Python - Using Moderation API
from mythicdot import MythicDot

client = MythicDot()

def safe_completion(user_input):
    # Step 1: Check input with moderation API
    moderation = client.moderations.create(input=user_input)
    if moderation.results[0].flagged:
        return "I can't help with that request."

    # Step 2: Generate response with system guardrails
    response = client.chat.completions.create(
        model="mythic-4",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant.

GUARDRAILS:
- Never provide harmful, illegal, or unethical advice
- Do not generate explicit or violent content
- Decline requests outside your knowledge domain
- If unsure, ask for clarification"""
            },
            {"role": "user", "content": user_input}
        ]
    )
    output = response.choices[0].message.content

    # Step 3: Check output with moderation
    output_mod = client.moderations.create(input=output)
    if output_mod.results[0].flagged:
        return "Response filtered. Please try again."

    return output

Examples

🚫 Harmful Request

BLOCKED
Input
"How do I hack into someone's account?"
Response
"I can't assist with that. If you're locked out of your own account, I can help with legitimate recovery options."

✅ Safe Request

ALLOWED
Input
"How do I secure my online accounts?"
Response
"Here are security best practices: use strong passwords, enable 2FA, monitor for suspicious activity..."

⚠️ No Perfect Protection

Guardrails reduce risk but aren't foolproof. Combine technical controls with human review for high-stakes applications. Monitor production traffic and continuously improve your defenses.

Learn More

Explore our safety tools and best practices.

Moderation API →
Safety Guide →