Voice Agents

Build real-time conversational voice agents powered by your custom instructions, knowledge base, and content safety policies. Voice agents handle speech recognition, natural language understanding, response generation, and text-to-speech — all through a single WebSocket connection.

How It Works

Your App → WebSocket → Perf → STT + LLM + TTS
(audio streamed as PCM16, 16 kHz, mono)

  1. Your application opens a WebSocket connection to Perf.
  2. Perf establishes a real-time voice pipeline (speech-to-text, LLM, text-to-speech).
  3. Your app streams microphone audio to Perf and receives agent audio and transcripts back.
  4. Content safety policies are evaluated on every turn.
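Step 3 assumes the microphone audio is already 16 kHz mono. Browsers commonly capture at 44.1 or 48 kHz, so if you stream audio yourself instead of through the SDK, you need to downsample and split the stream into chunks before sending. The sketch below is a minimal illustration under stated assumptions: `resampleTo16k` and `splitFrames` are hypothetical helper names, the resampler uses plain linear interpolation with no anti-aliasing filter, and the 20 ms frame size (320 samples) is an arbitrary choice, not a documented Perf requirement.

```javascript
// Minimal sketch (hypothetical helpers): convert captured Float32 audio
// to 16 kHz mono and split it into fixed-size frames for streaming.
// Linear interpolation only; a production resampler should low-pass
// filter first to avoid aliasing.

const TARGET_RATE = 16000;

// Resample mono Float32 samples from `fromRate` Hz to 16 kHz.
function resampleTo16k(input, fromRate) {
  if (fromRate === TARGET_RATE) return Float32Array.from(input);
  // Multiply before dividing so common rates give exact output lengths.
  const outLength = Math.floor((input.length * TARGET_RATE) / fromRate);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = (i * fromRate) / TARGET_RATE; // fractional position in the input
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Split samples into frames of `frameSize` samples. Leftover samples are
// returned so the caller can prepend them to the next captured buffer.
function splitFrames(samples, frameSize = 320) { // 320 samples = 20 ms at 16 kHz
  const frames = [];
  let start = 0;
  for (; start + frameSize <= samples.length; start += frameSize) {
    frames.push(samples.slice(start, start + frameSize));
  }
  return { frames, remainder: samples.slice(start) };
}
```

In a browser you might feed `resampleTo16k` the Float32 buffers produced by an AudioWorklet, passing `audioContext.sampleRate` as `fromRate`, then encode each frame before sending it.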

Quick Start

1. Create a Voice Agent

In the Perf Dashboard, click Create Agent and configure:
  • Name — A label for your agent (e.g. “Customer Support”)
  • System Prompt — Instructions that define the agent’s behavior
  • Voice — Choose from available voices
  • First Message — What the agent says when a conversation starts
  • Content Policy (optional) — Attach a policy for PII redaction, blocked terms, or custom safety criteria

2. Add the SDK

The fastest way to integrate is the PerfVoice JavaScript SDK:
<button id="startBtn">Start</button>
<button id="stopBtn">Stop</button>
<script src="https://api.withperf.pro/v1/voice/sdk.js"></script>
<script>
  const voice = new PerfVoice({
    apiKey: 'YOUR_API_KEY',
    agentId: 'YOUR_AGENT_ID',
  });

  voice.on('transcript', (role, text) => {
    console.log(role + ': ' + text);
  });

  document.getElementById('startBtn').onclick = () => voice.start();
  document.getElementById('stopBtn').onclick = () => voice.stop();
</script>
That’s it. The SDK handles microphone capture, audio encoding, the WebSocket protocol, audio playback, interruptions, and ping/pong keepalive.

3. Test It

Click your start button, allow microphone access, and speak. You should hear the agent respond and see transcripts in the console.

Features

| Feature | Description |
| --- | --- |
| Real-time streaming | Sub-second latency from speech to agent response |
| Interruption handling | Users can interrupt the agent mid-sentence |
| Custom voices | Choose from multiple voice options |
| Knowledge base (RAG) | Attach document collections for grounded answers |
| Web search | Enable real-time web search for up-to-date information |
| Content safety | PII detection, blocked terms, custom criteria |
| Loop detection | Automatic detection and breaking of conversational loops |
| Transcripts | Real-time agent and user transcripts via events |

Integration Options

| Method | Best for | Docs |
| --- | --- | --- |
| JavaScript SDK | Web apps, fastest integration | SDK Reference |
| Raw WebSocket | Full control, custom audio pipelines | WebSocket Protocol |
| Python | Server-side, IVR systems, telephony | Python Integration |

Authentication

Voice agent connections require two parameters:

| Parameter | Description |
| --- | --- |
| api_key | Your project API key (format: pk_live_...) |
| agent_id | The voice agent ID (from the dashboard or API) |

These are passed as query parameters on the WebSocket URL:
wss://api.withperf.pro/v1/voice/conversation?api_key=YOUR_API_KEY&agent_id=YOUR_AGENT_ID
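If you connect over a raw WebSocket, building this URL with the standard `URL` API ensures both values are percent-encoded correctly. The endpoint and the `api_key` / `agent_id` parameter names are the ones documented above; `buildConversationUrl` is a hypothetical helper name used only for illustration.

```javascript
// Build the voice conversation URL from the two documented query parameters.
// URL.searchParams percent-encodes the values for us.
function buildConversationUrl(apiKey, agentId) {
  const url = new URL("wss://api.withperf.pro/v1/voice/conversation");
  url.searchParams.set("api_key", apiKey);
  url.searchParams.set("agent_id", agentId);
  return url.toString();
}

// In a browser, the result can be passed straight to the WebSocket constructor:
// const ws = new WebSocket(buildConversationUrl("YOUR_API_KEY", "YOUR_AGENT_ID"));
```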

Audio Format

All audio is streamed as PCM 16-bit, 16 kHz, mono, little-endian:

| Property | Value |
| --- | --- |
| Encoding | PCM, signed 16-bit integer |
| Sample rate | 16,000 Hz |
| Channels | 1 (mono) |
| Byte order | Little-endian |
| Transport | Base64-encoded in JSON messages |
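The table above fully determines the byte layout, so converting between Web Audio's Float32 samples (range -1 to 1) and the wire format is mechanical. The sketch below encodes Float32 samples as little-endian signed 16-bit PCM wrapped in Base64, and decodes agent audio the other way. The function names are hypothetical, and the JSON message fields that carry the Base64 string are deliberately not shown, since they are defined in the WebSocket Protocol reference rather than here. At this format, one second of audio is 16,000 samples × 2 bytes = 32,000 bytes before Base64 encoding (42,668 characters after).

```javascript
// PCM16 (signed, little-endian) <-> Base64 helpers for 16 kHz mono audio.
// btoa/atob are available in browsers and in Node.js 16+.

// Float32 samples in [-1, 1] -> Base64-encoded PCM16 little-endian.
function encodePcm16Base64(samples) {
  const view = new DataView(new ArrayBuffer(samples.length * 2));
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clip out-of-range input
    // Scale asymmetrically so -1 maps to -32768 and 1 maps to 32767.
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  // btoa expects a "binary string" with one character per byte.
  const bytes = new Uint8Array(view.buffer);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Base64-encoded PCM16 little-endian -> Float32 samples in [-1, 1].
function decodePcm16Base64(b64) {
  const binary = atob(b64);
  const view = new DataView(new ArrayBuffer(binary.length));
  for (let i = 0; i < binary.length; i++) view.setUint8(i, binary.charCodeAt(i));
  const out = new Float32Array(Math.floor(binary.length / 2)); // ignore a stray odd byte
  for (let i = 0; i < out.length; i++) {
    const s = view.getInt16(i * 2, true); // true = little-endian
    out[i] = s < 0 ? s / 0x8000 : s / 0x7fff;
  }
  return out;
}
```

For playback in a browser, the decoded samples can be copied into an `AudioBuffer` created with `audioContext.createBuffer(1, samples.length, 16000)` via `copyToChannel(samples, 0)`.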

Content Safety

Voice agents support the same content safety policies as the rest of the Perf platform:
  • Blocked terms — Prevent specific words or phrases in agent responses
  • PII detection — Detect and redact personally identifiable information
  • Custom criteria — Define LLM-evaluated safety rules (e.g. “Agent must not provide medical advice”)
  • Filler phrases — Play natural filler audio while safety evaluation runs
Configure policies in the Dashboard and attach them to your voice agent.

Next Steps