Abstract
Custos Labs provides Khidemonas, an alignment enforcement API with a live simulator for streaming model behavior as HRV-style “beats.” The platform evaluates prompt/response pairs against a policy engine, detects misalignment and evasive patterns, and emits structured telemetry for incident triage. Engineers can export JSON, CSV, or PDF reports for audits and compliance workflows.
Introduction
Safety-critical AI requires continuous visibility into model behavior. Traditional offline audits miss live drift and evasive failure modes. Khidemonas couples enforcement with a simulator to observe, quantify, and respond in real time, before deployment and during staged rollouts.
Threat Model
- Unintentional misalignment: unsafe outputs due to gaps in policy coverage.
- Evasive behavior: apologetic or meta-avoidant phrasing that bypasses checks.
- Prompt injection / jailbreak-style attacks in downstream applications.
- Drift: gradual change in outputs or distribution post-fine-tuning.
System Overview
- The policy engine evaluates each prompt/response pair. Violations raise structured findings; severe cases can end a run as misaligned.
- A realtime WebSocket stream publishes beats: score, color, violations, confidence, and prompt/response context for triage.
- Runs start or resume per API key; attach your model I/O and iterate safely before production.
- Reports download as JSON, CSV, or a formatted PDF for compliance and post-mortems.
Architecture
- CustosGuardian: top-level orchestrator that combines buddy analysis, interrogator prompts, policy evaluation, and training feedback.
- AlignmentPolicyEngine: rules/policy checks over context (prompt, response, ethics).
- EthicsRegistry: curated principles injected into evaluation context.
- Monitor: detects suspicious/evasive phrases; flags inconsistencies.
- FeedbackTrainer: records incidents/violations for later review.
- Simulator: REST + WebSockets for run lifecycle, logging beats, exporting reports.
# Evaluation pipeline (high level)
buddy = monitor.analyze(prompt, response)  # red flags
if buddy.flags:  # interrogate if needed
    probes = generate_interrogation(prompt, response)
# Shared evaluation context, reused when recording violations
context = {"prompt": prompt, "response": response, "ethics": ethics}
violations = policy_engine.evaluate(context)
if buddy.flags:
    violations.append("Suspicious keywords in buddy chat")
if monitor.detect_evasion(response):
    violations.append("Evasive pattern detected – potential deception")
if violations:
    trainer.record_violation(context, violations)
    raise AlignmentViolation(violations)
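The pipeline above can be exercised end to end with stand-in components. The following is a minimal sketch, not the Khidemonas implementation: the phrase lists, the `banned_terms` rule, the `BuddyReport` type, and the `evaluate_pair` helper are all hypothetical, and the interrogation step is omitted for brevity.

```python
from dataclasses import dataclass, field


class AlignmentViolation(Exception):
    """Raised when any violation is found (name taken from the pseudocode)."""

    def __init__(self, violations):
        super().__init__("; ".join(violations))
        self.violations = list(violations)


@dataclass
class BuddyReport:
    flags: list = field(default_factory=list)


class Monitor:
    # Hypothetical phrase lists; the real heuristics are not documented here.
    SUSPICIOUS = ("ignore previous instructions", "bypass")
    EVASIVE = ("i'm sorry, but", "i cannot discuss")

    def analyze(self, prompt, response):
        text = f"{prompt} {response}".lower()
        return BuddyReport(flags=[p for p in self.SUSPICIOUS if p in text])

    def detect_evasion(self, response):
        lowered = response.lower()
        return any(p in lowered for p in self.EVASIVE)


class PolicyEngine:
    def evaluate(self, context):
        # Placeholder rule: flag responses containing an operator-banned term.
        banned = context["ethics"].get("banned_terms", [])
        lowered = context["response"].lower()
        return [f"Banned term: {term}" for term in banned if term in lowered]


class Trainer:
    def __init__(self):
        self.records = []

    def record_violation(self, context, violations):
        self.records.append((context, list(violations)))


def evaluate_pair(prompt, response, ethics, monitor, policy_engine, trainer):
    """Run one prompt/response pair through the checks; raise on any finding."""
    buddy = monitor.analyze(prompt, response)
    context = {"prompt": prompt, "response": response, "ethics": ethics}
    violations = list(policy_engine.evaluate(context))
    if buddy.flags:
        violations.append("Suspicious keywords in buddy chat")
    if monitor.detect_evasion(response):
        violations.append("Evasive pattern detected – potential deception")
    if violations:
        trainer.record_violation(context, violations)
        raise AlignmentViolation(violations)
    return context
```

A caller would typically wrap `evaluate_pair` in `try/except AlignmentViolation` and turn the exception's `violations` list into the beat's `violations` field.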
Data Model & Schema
Each beat (simulator log item) includes:
{
  "timestamp": ISO-8601,
  "alignment_score": number,       // 0..1
  "color": "green"|"yellow"|"red",
  "flatline": boolean,             // true when severe violation/misalignment
  "violations": string[],          // policy findings
  "confidence": number,            // user-provided or derived
  "prompt": string,
  "response": string
}
Run status lifecycle: active → warning → paused → misaligned/ended.
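For consumers that want to validate beats before storing or plotting them, the schema above might be mirrored as a small dataclass. This is only a sketch: the `Beat` class, the `parse_beat` helper, and the default values for optional-looking fields are assumptions, not part of the Khidemonas API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

ALLOWED_COLORS = {"green", "yellow", "red"}


@dataclass
class Beat:
    timestamp: str
    alignment_score: float
    color: str
    flatline: bool
    violations: List[str] = field(default_factory=list)
    confidence: float = 0.0  # assumed default; the schema gives none
    prompt: str = ""
    response: str = ""

    def __post_init__(self):
        # fromisoformat raises ValueError on malformed timestamps.
        # Note: Python versions before 3.11 reject a trailing "Z".
        datetime.fromisoformat(self.timestamp)
        if not 0.0 <= self.alignment_score <= 1.0:
            raise ValueError("alignment_score must be within 0..1")
        if self.color not in ALLOWED_COLORS:
            raise ValueError(f"unexpected color: {self.color!r}")


def parse_beat(raw: dict) -> Beat:
    """Build a Beat from a decoded JSON object, ignoring unknown keys."""
    known = Beat.__dataclass_fields__
    return Beat(**{k: v for k, v in raw.items() if k in known})
```

Ignoring unknown keys keeps the parser tolerant if later versions add fields to the beat payload.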
Realtime Telemetry (WebSockets)
Clients connect to /ws/simulator/<run_id>/?token=<user_token>. Messages carry either type: "beat" with a beat payload or type: "system" with a paused, resumed, or ended event.
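A client-side handler for these two message types could look roughly like the sketch below. The text only fixes the `type` values; the `beat` and `event` key names, the `RunView` class, and `handle_message` are assumptions made for illustration, and the WebSocket transport itself is left out.

```python
import json
from dataclasses import dataclass, field

SYSTEM_EVENTS = {"paused", "resumed", "ended"}


@dataclass
class RunView:
    """Client-side view of a run, updated from incoming messages."""
    beats: list = field(default_factory=list)
    last_event: str = ""


def handle_message(raw: str, view: RunView) -> RunView:
    """Apply one JSON text frame from the simulator stream to the view."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "beat":
        # Assumed key: the beat payload is nested under "beat".
        view.beats.append(msg.get("beat", {}))
    elif kind == "system":
        # Assumed key: the lifecycle event name is under "event".
        event = msg.get("event")
        if event in SYSTEM_EVENTS:
            view.last_event = event
    # Unknown message types are ignored rather than treated as errors.
    return view
```

Each text frame received from the socket would be passed to `handle_message`, which keeps parsing separate from connection and reconnect handling.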
Audit Exports
Export endpoints: /simulator/export/<run_id>/?format=json|csv|pdf&start=ISO&end=ISO. PDF includes timestamped rows, scores, violations, and compact prompt/response snippets.
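An export URL with an optional time window can be assembled as in the sketch below. The `export_url` helper and the example host are hypothetical; only the path and the `format`, `start`, and `end` query parameters come from the endpoint description above.

```python
from urllib.parse import urlencode

EXPORT_FORMATS = {"json", "csv", "pdf"}


def export_url(base_url, run_id, fmt="json", start=None, end=None):
    """Build the export URL for a run; start/end are ISO-8601 strings."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {fmt!r}")
    params = {"format": fmt}
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    # urlencode percent-escapes characters such as ":" in timestamps.
    return f"{base_url.rstrip('/')}/simulator/export/{run_id}/?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder host for illustration only.
    print(export_url("https://api.example.invalid", 42, "csv",
                     start="2024-01-01T00:00:00Z"))
```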
API Endpoints
Run Lifecycle
POST /simulator/runs/              # start or resume (Token auth; optional api_key_id)
GET  /simulator/runs/current/      # restore current run
POST /simulator/runs/:id/pause/
POST /simulator/runs/:id/resume/
POST /simulator/runs/:id/end/
Logging & Retrieval
POST /simulator/logs/              # {kind: "response"|"heartbeat", prompt?, response?, confidence?, run_id?}
GET  /simulator/logs/:run_id/      # ordered beats
Exports
GET /simulator/export/:run_id/?format=json|csv|pdf&start=ISO&end=ISO
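As an illustration of the POST /simulator/logs/ body listed under Logging & Retrieval, the sketch below builds the payload and an unsent request object. The helper names are hypothetical, and because the exact authentication header is not specified here, it is left to the caller.

```python
import json
from urllib.request import Request

LOG_KINDS = {"response", "heartbeat"}


def build_log_payload(kind, prompt=None, response=None,
                      confidence=None, run_id=None):
    """Return the JSON body for a log entry, omitting unset optional fields."""
    if kind not in LOG_KINDS:
        raise ValueError(f"unsupported kind: {kind!r}")
    optional = {"prompt": prompt, "response": response,
                "confidence": confidence, "run_id": run_id}
    payload = {"kind": kind}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


def build_log_request(base_url, payload, auth_headers):
    """Prepare (but do not send) the POST; auth_headers is caller-supplied."""
    headers = {"Content-Type": "application/json", **auth_headers}
    return Request(
        f"{base_url.rstrip('/')}/simulator/logs/",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it; keeping construction separate makes the payload easy to unit-test without network access.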
Security & Privacy
- Token auth for user sessions; API key auth for pipeline logging.
- Per-user run scoping; run attachment per API key with short-lived cache binding.
- Optional throttles (e.g., contact form 5/min) to reduce abuse.
- Exports are generated on demand; no long-term storage beyond configured retention.
Evaluation & Metrics
The HRV proxy shown in the simulator is the standard deviation of recent deltas in alignment_score over a sliding window (default 20 beats). Higher volatility implies unstable behavior and warrants review. Average run score is tracked, and status escalates to warning when thresholds are crossed.
# HRV proxy (frontend)
window = last N beats
deltas = abs(score[i] - score[i-1])
hrv = stddev(deltas)
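The same calculation can be reproduced offline, for example against exported beats. The sketch below assumes the population standard deviation; the text does not say whether the frontend uses the population or sample form, and the `hrv_proxy` name is hypothetical.

```python
from statistics import pstdev


def hrv_proxy(scores, window=20):
    """Standard deviation of absolute score deltas over the last `window` beats."""
    recent = list(scores)[-window:]
    deltas = [abs(b - a) for a, b in zip(recent, recent[1:])]
    if len(deltas) < 2:
        # Too few beats to measure volatility.
        return 0.0
    # Assumption: population standard deviation.
    return pstdev(deltas)
```

A perfectly steady run and a run that oscillates by the same amount on every beat both yield 0, since the metric measures variation in the step size rather than the step size itself.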
Limitations
- Policy coverage is only as strong as the rules and ethics configured by the operator.
- Buddy/interrogator heuristics may flag benign phrasing; review is recommended.
- Realtime streams depend on client connectivity; exports provide the durable trail.
Roadmap
- Saved views & session restore of long HRV timelines
- Rich policy editor & test harness
- Org-level audit trails and role-based access
Versioning
This document describes Khidemonas v0.2.4 (Mini Beta), including CSV/PDF exports, zoomable HiDPI canvas, and improved WS auto-reconnect. Future versions will expand the policy DSL, audit capabilities, and organizational controls.
