Engineering a Realistic AI Patient: Prompt Architecture for Voice-to-Voice Simulation

We build EmpatientPX, a training platform where clinicians practice difficult conversations with AI-simulated patients — by voice, in real time. The patient runs on Hume's Empathic Voice Interface (EVI), a speech-to-speech model that also measures the emotional expression in the speaker's voice.

Most prompt-engineering writing assumes a text chatbot. Voice-to-voice is a different discipline: your prompt budget is tighter, your output constraints are physical (everything is spoken aloud), the model receives prosody signals you have to teach it to use, and an entire class of bugs lives below the prompt — in the WebSocket session itself.

This post is the field guide we wish we'd had: how we structure the system prompt for a believable simulated patient, the failure modes we hit in real testing, and the transport-level traps that have nothing to do with prompting but will eat your week anyway.

The prompt architecture

Hume's prompting guide covers the fundamentals; what follows is what production testing taught us beyond it. Each patient's system prompt is assembled at session start from structured scenario data (a chart, a personality, an emotional arc) into a fixed block order. Truncation eats from the bottom, so that order is a survival strategy.

Two structural decisions matter more than any individual block:

1. Block order is a truncation strategy. EVI's fast small-model path caps the system prompt at roughly 8,000 characters and truncates from the bottom. Anything essential to how the model speaks must come before the long character payload. We learned to put all voice-behavior blocks at the top — if truncation ever bites, we'd rather lose the tail of a rules list than the instruction that says “never output markdown, everything you produce is spoken aloud.”

2. Voice and text get different prompts, not one prompt with a mode flag. Our app supports voice sessions and text sessions over the same scenario data. Early on we used a single prompt with one appended “you're in text mode” line. That left text sessions carrying voice baggage — filler-word guidance, markdown bans, backchannel instructions — that made text responses weirdly terse. Now every voice block is conditionally excluded in text mode and replaced by a small <text_mode> block that permits slightly longer responses and plain lists. Conditional blocks, not conditional sentences.
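Both decisions can be sketched in a few lines. This is a minimal illustration, not our production assembler: the `Block` class, the `role` and `character` tag names, and the block bodies are hypothetical, and the 8,000-character figure is the approximate cap mentioned above rather than a documented constant.

```python
from dataclasses import dataclass

PROMPT_CHAR_LIMIT = 8000  # rough cap quoted above; check your own platform


@dataclass(frozen=True)
class Block:
    tag: str
    body: str
    modes: tuple[str, ...] = ("voice", "text")  # modalities that include this block

    def render(self) -> str:
        return f"<{self.tag}>\n{self.body}\n</{self.tag}>"


def assemble_prompt(blocks: list[Block], mode: str) -> str:
    """Join the blocks that apply to this modality, in list order."""
    if mode not in ("voice", "text"):
        raise ValueError(f"unknown mode: {mode}")
    return "\n\n".join(b.render() for b in blocks if mode in b.modes)


def surviving_tags(blocks: list[Block], mode: str, limit: int = PROMPT_CHAR_LIMIT) -> list[str]:
    """Tags of blocks that stay whole if the prompt is cut from the bottom at `limit`."""
    truncated = assemble_prompt(blocks, mode)[:limit]
    return [b.tag for b in blocks if mode in b.modes and f"</{b.tag}>" in truncated]


# Voice-behavior blocks first, long character payload last.
BLOCKS = [
    Block("voice_communication_style",
          "Everything you produce is spoken aloud. Never output markdown.",
          modes=("voice",)),
    Block("backchannel", "Keep listening signals to one or two words.", modes=("voice",)),
    Block("text_mode", "Slightly longer responses and plain lists are fine.", modes=("text",)),
    Block("role", "You are the patient, not the clinician."),
    Block("character", "x" * 9000),  # stand-in for an oversized character payload
]
```

With these sample blocks, `surviving_tags(BLOCKS, "voice")` keeps the voice-behavior and role blocks and drops only the oversized character block, which is the failure order we would choose. A check like this is cheap to run per scenario before a session starts.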

Failure catalog: what real testing surfaced

We run structured testing rounds with subject-matter experts and log every break in character. Each failure below maps to a specific prompt mechanism — and each one showed up as a line of dialogue you could hear going wrong.

The patient becomes the clinician

The most stubborn failure. Here's how a real test session opened:

AI patient

“What can I do for you today?”

✗ That's the clinician's line.

AI patient — later, mid-session

“You seem unsettled — what's on your mind?”

✗ Now it's interviewing the interviewer.

LLMs are trained overwhelmingly to be helpful assistants, and “helpful interviewer” is the gravity well every roleplay prompt fights against. A general instruction (“you are the patient, not the clinician”) was not enough. What worked was an explicit banned-phrase list inside the ROLE rule:

Never say lines that belong to a provider: “What can I do for you?”, “What brings you in?”, “How can I help?”, “You seem [emotional observation]”, “Have you tried…?”, “I'd recommend…”, or any phrasing that initiates a clinical assessment.

…paired with a positive allowance so the model doesn't overcorrect into silence: the patient may ask plain patient questions (“What does that mean?”, “Is that serious?”). We also learned to put this ban in the always-on ROLE rule rather than the first-response rule — role inversion happens mid-conversation, not just at the opening.
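A banned-phrase list is also easy to turn into a regression check over test transcripts. The sketch below is one possible shape, assuming patient lines are available as plain strings; the function name and the regex patterns are illustrative, and the matching is deliberately crude (a pattern like "you seem" will occasionally flag an innocent sentence).

```python
import re

# Phrases from the ROLE rule quoted above; the patterns themselves are illustrative.
PROVIDER_PHRASES = {
    "What can I do for you?": r"\bwhat can i do for you\b",
    "What brings you in?": r"\bwhat brings you in\b",
    "How can I help?": r"\bhow can i help\b",
    "You seem ...": r"\byou seem\b",
    "Have you tried ...?": r"\bhave you tried\b",
    "I'd recommend ...": r"\bi'd recommend\b",
}
_COMPILED = {label: re.compile(p, re.IGNORECASE) for label, p in PROVIDER_PHRASES.items()}


def provider_phrases_in(patient_line: str) -> list[str]:
    """Return the labels of provider-style phrases found in one patient line."""
    line = patient_line.replace("\u2019", "'")  # normalize curly apostrophes
    return [label for label, pattern in _COMPILED.items() if pattern.search(line)]
```

Run against the two failures above, the check flags both lines, while the allowed patient questions ("What does that mean?", "Is that serious?") pass clean.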

The patient hallucinates its own life

AI patient — chart says 48

“Well, I actually just turned 35 last month.”

✗ The chart says 48.

AI patient

“It's this game-changing smoothie I've been making every morning.”

✗ There is no smoothie anywhere in his chart.

The facts were in the prompt — inside <facts_you_know_for_certain> — but nothing said they were inviolable. The fix is a named rule, FACTS ARE GROUND TRUTH:

Every fact in <facts_you_know_for_certain> is absolute. Never contradict, modify, or invent alternatives to these facts — especially your name, age, and reason for visit. If the clinician states something that conflicts with your known facts, express gentle confusion (“I thought I was 48?”) but do not accept the incorrect version.

The last clause matters for a training product: clinicians sometimes misremember chart details, and a sycophantic patient who adopts the clinician's error would silently corrupt the exercise.
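One way to make sure the data block never ships without its enforcement rule is to render them together. A minimal sketch, assuming the chart arrives as a flat dictionary; the key names, the required-key list, and the helper names are hypothetical, and the rule text is abridged from the wording above.

```python
FACTS_RULE = (
    "FACTS ARE GROUND TRUTH: Every fact in <facts_you_know_for_certain> is absolute. "
    "Never contradict, modify, or invent alternatives to these facts. "
    "If the clinician states something that conflicts with them, express gentle "
    "confusion but do not accept the incorrect version."
)

REQUIRED_KEYS = ("name", "age", "reason_for_visit")  # hypothetical chart keys


def render_facts(chart: dict[str, str]) -> str:
    """Render chart entries as one fact per line inside the facts block."""
    lines = [f"{key}: {value}" for key, value in chart.items()]
    return "<facts_you_know_for_certain>\n" + "\n".join(lines) + "\n</facts_you_know_for_certain>"


def facts_with_rule(chart: dict[str, str]) -> str:
    """Return the facts block followed by its rule; refuse incomplete charts."""
    missing = [key for key in REQUIRED_KEYS if key not in chart]
    if missing:
        raise ValueError(f"chart is missing required facts: {missing}")
    return render_facts(chart) + "\n\n" + FACTS_RULE
```

For example, `facts_with_rule({"name": "Sam", "age": "48", "reason_for_visit": "knee pain"})` (sample values, not a real scenario) yields the block and the rule as a single unit, so neither can be dropped independently during prompt edits.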

The conversation never ends

Clinician

“Alright, you're all set. The nurse will take it from here.”

AI patient

“Okay. So anyway, about my knee — it started back in March…”

✗ “Just keep going,” as one tester put it.

Our ending rule was reactive-only and keyed to a narrow set of closing phrases. Real visits end with dozens of soft signals, and real patients also initiate the wind-down. Two changes: broaden the trigger list (“that's all,” “you're all set,” “the nurse will take it from here,” “we'll be in touch,” “take care”…), and add a proactive wrap-signal:

AI patient

“I think that covers everything I needed.”

✓ The proactive wrap-signal — what makes endings feel natural rather than mechanical.

Tangents, monologues, and theater

Three smaller failures, three small mechanisms:

Derailing — a COPD patient drifted into an unrelated story about his wife and his paperwork. Fix: personal details are only raised when the clinician asks, and even then connected back to the visit reason.
Runaway responses — answers ran long until the audio cut off mid-sentence. Fix: a hard 1–3 sentence cap. In voice, brevity is not style — it's a buffer constraint.
Theatrical delivery — every patient performed like a soap opera. Fix: an explicit <voice_communication_style> block: everyday language, natural fillers (“um,” “I mean,” “you know”), no rush to fill pauses, and an absolute ban on markdown or list formatting since every token is spoken aloud.
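The length cap and the formatting ban are both checkable after the fact. Here is a rough transcript lint under stated assumptions: the sentence splitter is naive (abbreviations such as "Dr." will be miscounted), the markdown pattern only covers common markers, and the function names are hypothetical.

```python
import re

MAX_SENTENCES = 3  # the hard cap described above

# Common markdown markers: bold, inline code, headings, bullets, numbered lists.
_MARKDOWN = re.compile(
    r"(\*\*|__|`|^\s*#{1,6}\s|^\s*[-*+]\s|^\s*\d+\.\s)", re.MULTILINE
)


def count_sentences(text: str) -> int:
    """Count sentences by splitting after terminal punctuation (naive)."""
    parts = re.split(r"(?<=[.!?\u2026])\s+", text.strip())
    return len([p for p in parts if p])


def voice_violations(text: str) -> list[str]:
    """Return which voice constraints a single patient response breaks."""
    problems = []
    if count_sentences(text) > MAX_SENTENCES:
        problems.append("too_long")
    if _MARKDOWN.search(text):
        problems.append("markdown")
    return problems
```

A lint like this does not replace listening to sessions, but it catches regressions when a prompt edit quietly loosens the cap.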

Voice-native blocks: using what EVI gives you

EVI attaches the speaker's top emotional expressions (from vocal prosody) to each user message. If your prompt ignores them, you're running a text bot over a voice channel. Three blocks make the patient voice-native:

<respond_to_expressions> — teaches the model to read clinician tone and react in character. We shipped this in two stages, deliberately. The first version was vague (“let the clinician's emotional delivery subtly shape how forthcoming you are”) while subject-matter experts validated which behaviors the simulation should reward. Only then did we harden it into explicit mappings:

Rushed or curt (high Excitement/Anxiety/Determination) → shorter answers, hesitation, withheld detail
Dismissive (high Contempt/Boredom/Disgust) → guarded, minimal, agrees without buy-in
Warm (high Sympathy/Joy/Contentment) → opens up, asks questions back, more likely to reveal withheld feelings

…with one crucial closing line: these are tendencies, not hard rules, and intensity always scales with the patient's personality. Without that, every patient reacts identically and the simulation collapses into a tone-detection minigame.
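Keeping the mappings as data makes them easier for subject-matter experts to review before they are hardened into the prompt. The sketch below renders the block from a small table; the `ToneMapping` class and the exact output wording are hypothetical, while the tones, expression names, and tendencies are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToneMapping:
    tone: str
    expressions: tuple[str, ...]
    tendency: str


MAPPINGS = (
    ToneMapping("Rushed or curt", ("Excitement", "Anxiety", "Determination"),
                "shorter answers, hesitation, withheld detail"),
    ToneMapping("Dismissive", ("Contempt", "Boredom", "Disgust"),
                "guarded, minimal, agrees without buy-in"),
    ToneMapping("Warm", ("Sympathy", "Joy", "Contentment"),
                "opens up, asks questions back, more likely to reveal withheld feelings"),
)

CLOSING_LINE = (
    "These are tendencies, not hard rules. "
    "Intensity always scales with your personality."
)


def render_respond_to_expressions(mappings: tuple[ToneMapping, ...] = MAPPINGS) -> str:
    """Render the mapping table plus the closing line as one prompt block."""
    lines = [f"{m.tone} (high {'/'.join(m.expressions)}): {m.tendency}." for m in mappings]
    body = "\n".join(lines + [CLOSING_LINE])
    return f"<respond_to_expressions>\n{body}\n</respond_to_expressions>"
```

Appending the closing line inside the renderer means the "tendencies, not hard rules" caveat cannot be lost when someone edits the table.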

This block is also where the product lives. Combined with <feelings_you_only_share_if_trust_is_built> — which releases withheld information one item at a time, least vulnerable first, only after warmth has been sustained across multiple turns — the simulation rewards exactly the clinical communication behavior the training is meant to teach. Time doesn't unlock disclosure; warmth does.
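In our system this behavior is prompted, not coded: the model decides when warmth has been sustained. Still, writing the rule out as logic helps when reviewing the prompt wording or scoring transcripts. The sketch below is only an illustration of that rule; the `TrustGate` class, the three-turn threshold, and the choice to reset the streak after each disclosure are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrustGate:
    withheld: list[str]            # ordered least vulnerable first
    turns_required: int = 3        # hypothetical threshold for "sustained" warmth
    warm_streak: int = 0
    shared: list[str] = field(default_factory=list)

    def observe_turn(self, clinician_was_warm: bool) -> Optional[str]:
        """Return the next item to disclose after this turn, or None."""
        if not clinician_was_warm:
            self.warm_streak = 0   # a cold turn breaks the streak
            return None
        self.warm_streak += 1
        nothing_left = len(self.shared) == len(self.withheld)
        if self.warm_streak < self.turns_required or nothing_left:
            return None
        self.warm_streak = 0       # warmth must be sustained again for the next item
        item = self.withheld[len(self.shared)]
        self.shared.append(item)
        return item
```

Note what is absent: there is no turn counter or timer. Ten cold turns release nothing, which is the property the prompt block is meant to enforce.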

<backchannel> supplies short listening signals (“mmhm,” “go on,” “I see”) for moments when the clinician pauses partway through an explanation, capped at 1–2 words with an anti-repetition instruction. One honest correction from our own debugging: we initially instructed the patient to backchannel during the clinician's explanation, which is architecturally impossible. In a turn-based voice pipeline the model only speaks when it holds the turn. Prompt instructions must respect the turn-taking model of the transport; the model can't do what the pipeline doesn't allow.

<recover_from_mistakes> — speech recognition garbles things. The patient is told to treat nonsense as a transcription error and either ask a plain clarifying question or make a reasonable in-character guess — never to point out the error:

AI patient — 67-year-old retiree

“Your message appears to be garbled.”

✗ Nothing breaks immersion faster.

AI patient

“Sorry — say that again? My hearing's not what it used to be.”

✓ In-character recovery.

Depth without homogenization

Two later additions are worth stealing:

Medical literacy as a character field. Each patient profile carries medicalLiteracy: none | basic | professional, which injects one line into the character block — a “none” patient gets confused by unexplained jargon and asks what it means; a “professional” patient uses terminology back at you. One field, large believability gain, and it directly exercises a scored clinician skill (explaining clearly).
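The field-to-line injection might look like the following sketch. The wording of each line is illustrative, and the "basic" line in particular is an assumption, since only the "none" and "professional" behaviors are described above.

```python
from typing import Literal

MedicalLiteracy = Literal["none", "basic", "professional"]

LITERACY_LINES: dict[str, str] = {
    "none": (
        "You have no medical background. Unexplained jargon confuses you, "
        "and you ask what it means."
    ),
    # Assumed wording: the intermediate level is not spelled out in this post.
    "basic": "You know common medical terms but need anything technical explained.",
    "professional": (
        "You have a clinical background and use medical terminology "
        "back at the clinician."
    ),
}


def literacy_line(level: str) -> str:
    """Return the single character-block line for a patient's literacy level."""
    if level not in LITERACY_LINES:
        raise ValueError(f"unknown medicalLiteracy value: {level!r}")
    return LITERACY_LINES[level]
```

Failing loudly on an unknown value matters more than it looks: a silently skipped line produces a patient with no literacy guidance at all, which is hard to spot by ear.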

Per-scenario examples, never generic ones. We considered adding a generic few-shot <examples> block of good patient responses and rejected it: generic examples homogenize character voice — every patient starts sounding like the examples. Instead, examples is an optional per-scenario field, omitted from the prompt entirely when absent. If you can't afford bespoke examples per character, no examples beat shared ones.

We also inject a small randomized <current_situation> per session (a long wait, a stressful morning) to vary patient mood — explicitly capped with “this is background flavor only; do not punish the clinician for circumstances outside their control.” Any uncapped mood modifier will eventually leak into behavior the clinician gets scored on.
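Both ideas, the optional examples field and the capped mood modifier, reduce to small rendering helpers. A sketch under assumptions: the situation strings are paraphrased from the two examples above, the scenario is a plain dictionary, and the helper names are hypothetical.

```python
import random
from typing import Optional

SITUATIONS = (
    "You waited a long time before being seen.",
    "You had a stressful morning.",
)
SITUATION_CAP = (
    "This is background flavor only; do not punish the clinician "
    "for circumstances outside their control."
)


def current_situation_block(rng: random.Random) -> str:
    """Pick one situation and always attach the cap to it."""
    return f"<current_situation>\n{rng.choice(SITUATIONS)}\n{SITUATION_CAP}\n</current_situation>"


def examples_block(examples: Optional[list[str]]) -> Optional[str]:
    """Render per-scenario examples, or None so the block is omitted entirely."""
    if not examples:
        return None
    body = "\n".join(examples)
    return f"<examples>\n{body}\n</examples>"


def scenario_blocks(scenario: dict, rng: random.Random) -> list[str]:
    """Collect optional blocks, dropping any that rendered to None."""
    blocks = [current_situation_block(rng), examples_block(scenario.get("examples"))]
    return [b for b in blocks if b is not None]
```

Passing in a seeded `random.Random` keeps sessions reproducible during testing, so a tester's report of an odd mood can be replayed with the same situation.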

Below the prompt: the WebSocket will betray you

Three bugs that no prompt change could fix. If you only bookmark one section, make it this one — all three live in the chat WebSocket, not the prompt.

Why won't my session settings fit in connect()?

We initially passed the full system prompt via the SDK's connect() session settings. As the prompt grew — and especially on reconnects, where conversation history is appended — connections started failing: the handshake path has much tighter size limits than a normal WebSocket frame. The fix: connect bare, then deliver settings via sendSessionSettings() as a post-open WebSocket message, which bypasses handshake size constraints entirely. If your prompt is dynamic and growing, never put it in the connection handshake.
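The shape of the fix, independent of any SDK, is sketched below with a stand-in socket so it runs offline. The `FakeSocket` class, the URL, and the helper names are hypothetical, and the `session_settings` message with a `system_prompt` field reflects our understanding of EVI's message format; confirm the exact fields against the session settings reference.

```python
import json


class FakeSocket:
    """Stand-in for a real WebSocket client so the sketch runs offline."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.sent: list[str] = []

    def send(self, payload: str) -> None:
        self.sent.append(payload)


def session_settings_message(system_prompt: str) -> str:
    """Build the post-open settings message (field names are assumptions)."""
    return json.dumps({"type": "session_settings", "system_prompt": system_prompt})


def open_session(base_url: str, system_prompt: str, socket_factory=FakeSocket):
    """Connect bare, then deliver the prompt as the first message."""
    socket = socket_factory(base_url)  # nothing prompt-sized in the URL or headers
    socket.send(session_settings_message(system_prompt))
    return socket
```

The point is where the bytes travel: the connection URL stays the same size no matter how large the prompt or appended history grows.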

Why does my Hume EVI voice switch after the first response?

Our EVI config had a default voice; we set the patient's actual voice post-connect. Result: every patient's first line played in the wrong voice, then switched. Session-settings overrides take effect on the next model turn, not retroactively. Resolve the correct voice before connecting, or your first impression is an audible glitch.

Why does my EVI session silently restart when the conversation ends?

EVI's built-in hang_up tool closes the WebSocket with a normal close code when the model decides the conversation is over. Our reconnect logic treated any post-open close as a recoverable drop and silently opened a fresh session two seconds later — so whenever the AI detected a farewell, the training session quietly restarted. Worse, the tool fired on polite acknowledgments (“Okay, five minutes. That works.”), not just genuine goodbyes. We disabled the built-in tool and kept visit-ending behavior in the prompt, where we control the trigger conditions.
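Independently of disabling the tool, the reconnect logic itself deserves a guard. A minimal sketch, assuming your client surfaces the WebSocket close code and your app tracks whether the session was ended on purpose; the function and flag names are hypothetical, while the close codes are the standard ones from RFC 6455.

```python
NORMAL_CLOSURE = 1000    # RFC 6455: the purpose of the connection was fulfilled
ABNORMAL_CLOSURE = 1006  # RFC 6455: connection dropped without a close frame


def should_reconnect(close_code: int, session_ended: bool) -> bool:
    """Reconnect only after an unexpected drop, never after a deliberate end."""
    if session_ended:
        return False  # the app or user ended the session on purpose
    if close_code == NORMAL_CLOSURE:
        return False  # the other side closed cleanly; treat it as final
    return True
```

Had our handler distinguished a normal close from a drop, the farewell-triggered restart would have surfaced as an ended session instead of a silent new one.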

Takeaways

Order prompt blocks by truncation risk — voice behavior before character payload when the platform truncates from the bottom.
Branch the prompt per modality — conditional blocks for voice vs. text, not a mode-flag sentence.
Fight the assistant prior with surface forms — banned-phrase lists beat abstract role descriptions.
Pair every data block with an enforcement rule — facts need a ground-truth rule to be facts.
Teach endings, including proactive ones — models don't know conversations are supposed to end.
Use the prosody channel, scaled by personality — and validate behavioral mappings with domain experts before hardcoding them.
Respect the turn-taking model — don't prompt behavior the transport can't express.
Treat the WebSocket as part of the prompt surface — settings delivery, voice timing, and built-in tools all shape what the user actually hears.

Voice-to-voice simulation is unforgiving in a way text never is: every flaw is audible, immediate, and in your ear. But that same immediacy is why it works as training — and why getting the prompt architecture right is worth the effort.

Building on EVI yourself? Start with Hume's prompting guide, keep the session settings reference open, and steal liberally from hume-api-examples.