WebSocket Protocol

This page documents the raw WebSocket protocol for Perf Voice Agents. Use this if you need full control over the audio pipeline or are integrating from a platform where the JavaScript SDK isn’t available.
For most web applications, the JavaScript SDK is the recommended approach — it handles all of the protocol details described here.

Connection

Endpoint

wss://api.withperf.pro/v1/voice/conversation

Query Parameters

Parameter   Type     Required   Description
api_key     string   Yes        Your project API key (pk_live_...)
agent_id    string   Yes        Voice agent ID

Example

const ws = new WebSocket(
  'wss://api.withperf.pro/v1/voice/conversation?api_key=YOUR_API_KEY&agent_id=YOUR_AGENT_ID'
);

Protocol Flow

Client                          Perf
  |                               |
  |------- WebSocket OPEN ------->|
  |                               |--- connects to voice pipeline
  |<-- conversation_initiation ---|  (conversation_initiation_metadata)
  |                               |
  |--- user_audio_chunk --------->|  (repeat: stream mic audio)
  |<-- audio ---------------------|  (agent speaks back)
  |<-- agent_response ------------|  (agent transcript)
  |<-- user_transcript -----------|  (user transcript)
  |                               |
  |<-- ping ----------------------|  (keepalive)
  |--- pong --------------------->|
  |                               |
  |<-- interruption --------------|  (user spoke over agent)
  |                               |
  |------- WebSocket CLOSE ------>|
Important: Do not send audio until you receive the conversation_initiation_metadata message. Sending audio before initialization will cause the connection to close with code 1008.
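This gating can be factored into a small helper. The sketch below is illustrative (the helper name and shape are not part of the protocol): chunks captured before initialization are dropped rather than queued, since the server rejects early audio.

```javascript
// Illustrative helper (not part of the protocol): gate outgoing audio on the
// init message so chunks captured before the pipeline is ready are dropped.
function createAudioGate(send) {
  let ready = false;
  return {
    // Feed every parsed server message through this
    onServerMessage(msg) {
      if (msg.type === 'conversation_initiation_metadata') ready = true;
    },
    // Call from the audio-capture callback; returns whether the chunk was sent
    sendChunk(base64Chunk) {
      if (!ready) return false; // sending now would trigger close code 1008
      send(JSON.stringify({ user_audio_chunk: base64Chunk }));
      return true;
    },
  };
}
```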

Messages: Client → Server

Send Audio

Stream microphone audio as base64-encoded PCM16 chunks:
{
  "user_audio_chunk": "<base64-encoded PCM16 audio>"
}
Audio format: PCM 16-bit signed integer, 16kHz, mono, little-endian. Base64-encode the raw bytes. Recommended chunk size: 2048 samples (128ms at 16kHz).

JavaScript Example: Capture and Send Microphone Audio

const stream = await navigator.mediaDevices.getUserMedia({
  audio: { sampleRate: 16000, channelCount: 1, echoCancellation: true, noiseSuppression: true }
});

const audioCtx = new AudioContext({ sampleRate: 16000 });
const source = audioCtx.createMediaStreamSource(stream);
// ScriptProcessorNode is deprecated in favor of AudioWorklet, but remains
// widely supported and keeps this example short.
const processor = audioCtx.createScriptProcessor(2048, 1, 1);

processor.onaudioprocess = (e) => {
  // `ws` is the open WebSocket; `ready` is set once
  // conversation_initiation_metadata has been received.
  if (ws.readyState !== WebSocket.OPEN || !ready) return;

  const input = e.inputBuffer.getChannelData(0);
  const pcm16 = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }

  const bytes = new Uint8Array(pcm16.buffer);
  let binary = '';
  for (let j = 0; j < bytes.length; j++) {
    binary += String.fromCharCode(bytes[j]);
  }

  ws.send(JSON.stringify({ user_audio_chunk: btoa(binary) }));
};

source.connect(processor);
processor.connect(audioCtx.destination);

Pong (Keepalive Response)

Reply to ping messages to keep the connection alive:
{
  "type": "pong",
  "event_id": "<event_id from the ping>"
}

Messages: Server → Client

conversation_initiation_metadata

Sent once after connection is established. Signals that the voice pipeline is ready.
{
  "type": "conversation_initiation_metadata",
  "conversation_initiation_metadata_event": {
    "conversation_id": "conv_abc123",
    "agent_output_audio_format": "pcm_16000"
  }
}
Field                       Description
conversation_id             Unique session identifier
agent_output_audio_format   Output audio format (usually pcm_16000)
Start sending audio only after receiving this message.

audio

Agent speech audio. Base64-encoded PCM16, same format as input.
{
  "type": "audio",
  "audio_event": {
    "audio_base_64": "<base64-encoded PCM16 audio>"
  }
}

JavaScript Example: Play Agent Audio

// `audioCtx` is the 16kHz AudioContext created during setup.
let nextPlayTime = 0;
const sources = [];

function playAudio(base64) {
  const bin = atob(base64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);

  const pcm16 = new Int16Array(bytes.buffer);
  const float32 = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) float32[i] = pcm16[i] / 32768;

  const buffer = audioCtx.createBuffer(1, float32.length, 16000);
  buffer.getChannelData(0).set(float32);

  const src = audioCtx.createBufferSource();
  src.buffer = buffer;
  src.connect(audioCtx.destination);

  // Schedule sequentially to avoid gaps
  const now = audioCtx.currentTime;
  if (nextPlayTime < now) nextPlayTime = now;
  src.start(nextPlayTime);
  nextPlayTime += buffer.duration;

  // Track for interruption cleanup
  sources.push(src);
  src.onended = () => {
    const idx = sources.indexOf(src);
    if (idx !== -1) sources.splice(idx, 1);
  };
}

agent_response

The agent’s text response (transcript of what the agent is saying).
{
  "type": "agent_response",
  "agent_response_event": {
    "agent_response": "Hello! How can I help you today?"
  }
}

user_transcript

Transcript of what the user said.
{
  "type": "user_transcript",
  "user_transcription_event": {
    "user_transcript": "I'd like to check on my order status."
  }
}

interruption

Sent when the user speaks while the agent is talking. You must stop all currently playing agent audio immediately to avoid the agent’s voice overlapping with the new response.
{
  "type": "interruption"
}
// Handle interruption — stop all playing audio
sources.forEach(s => { try { s.stop(); } catch (e) {} });
sources.length = 0;
nextPlayTime = 0;

ping

Keepalive ping. You must respond with a pong to keep the connection alive.
{
  "type": "ping",
  "ping_event": {
    "event_id": 12345
  }
}
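The reply can be built by a small pure function; this is a sketch (the function name is illustrative), which echoes the ping's event_id back in a pong.

```javascript
// Illustrative helper: build the pong reply for a ping message, echoing its
// event_id. Returns null for anything that is not a ping.
function makePongReply(msg) {
  if (msg.type !== 'ping' || !msg.ping_event) return null;
  return JSON.stringify({ type: 'pong', event_id: msg.ping_event.event_id });
}

// Usage: in ws.onmessage, after JSON.parse:
//   const pong = makePongReply(data);
//   if (pong) ws.send(pong);
```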

Complete JavaScript Example

A full working implementation using raw WebSocket (no SDK):
const WS_URL = 'wss://api.withperf.pro/v1/voice/conversation';
const API_KEY = 'YOUR_API_KEY';
const AGENT_ID = 'YOUR_AGENT_ID';

let ws, audioCtx, micStream, ready = false, nextPlay = 0, sources = [];

async function startVoice() {
  // 1. Get microphone
  micStream = await navigator.mediaDevices.getUserMedia({
    audio: { sampleRate: 16000, channelCount: 1, echoCancellation: true, noiseSuppression: true }
  });
  audioCtx = new AudioContext({ sampleRate: 16000 });
  const mic = audioCtx.createMediaStreamSource(micStream);
  const proc = audioCtx.createScriptProcessor(2048, 1, 1);

  // 2. Stream mic audio as base64 PCM16
  proc.onaudioprocess = (e) => {
    if (!ws || ws.readyState !== WebSocket.OPEN || !ready) return;
    const input = e.inputBuffer.getChannelData(0);
    const pcm = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      const s = Math.max(-1, Math.min(1, input[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    const bytes = new Uint8Array(pcm.buffer);
    let bin = '';
    for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
    ws.send(JSON.stringify({ user_audio_chunk: btoa(bin) }));
  };
  mic.connect(proc);
  proc.connect(audioCtx.destination);

  // 3. Connect WebSocket
  ws = new WebSocket(WS_URL + '?api_key=' + API_KEY + '&agent_id=' + AGENT_ID);

  ws.onmessage = (event) => {
    if (typeof event.data !== 'string') return;
    const data = JSON.parse(event.data);

    switch (data.type) {
      case 'conversation_initiation_metadata':
        ready = true;
        console.log('Session:', data.conversation_initiation_metadata_event?.conversation_id);
        break;
      case 'audio':
        if (data.audio_event?.audio_base_64) playAudio(data.audio_event.audio_base_64);
        break;
      case 'agent_response':
        console.log('Agent:', data.agent_response_event?.agent_response);
        break;
      case 'user_transcript':
        console.log('You:', data.user_transcription_event?.user_transcript);
        break;
      case 'interruption':
        sources.forEach(s => { try { s.stop(); } catch (e) {} });
        sources = []; nextPlay = 0;
        break;
      case 'ping':
        ws.send(JSON.stringify({ type: 'pong', event_id: data.ping_event?.event_id }));
        break;
    }
  };

  ws.onclose = () => stopVoice();
}

// 4. Play agent audio (base64 PCM16 → AudioBuffer)
function playAudio(base64) {
  if (!audioCtx) return;
  const bin = atob(base64), bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  const pcm = new Int16Array(bytes.buffer);
  const f32 = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) f32[i] = pcm[i] / 32768;
  const buf = audioCtx.createBuffer(1, f32.length, 16000);
  buf.getChannelData(0).set(f32);
  const src = audioCtx.createBufferSource();
  src.buffer = buf;
  src.connect(audioCtx.destination);
  const now = audioCtx.currentTime;
  if (nextPlay < now) nextPlay = now;
  src.start(nextPlay);
  nextPlay += buf.duration;
  sources.push(src);
  src.onended = () => { const i = sources.indexOf(src); if (i !== -1) sources.splice(i, 1); };
}

// 5. Cleanup
function stopVoice() {
  ready = false;
  sources.forEach(s => { try { s.stop(); } catch (e) {} });
  sources = []; nextPlay = 0;
  if (micStream) { micStream.getTracks().forEach(t => t.stop()); micStream = null; }
  if (audioCtx) { audioCtx.close().catch(() => {}); audioCtx = null; }
  if (ws) { ws.onclose = null; ws.close(); ws = null; }  // clear onclose to avoid re-entering stopVoice
}

WebSocket Close Codes

Code   Meaning
1000   Normal closure (client or server initiated)
1008   Policy violation (e.g., sending audio before initialization)
1011   Server error (internal pipeline failure)
4001   Authentication failed (invalid API key)
4004   Agent not found (invalid agent_id)
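A sketch of a close handler that maps these codes to readable log messages (the helper name is illustrative; only 1000 indicates a clean shutdown):

```javascript
// Illustrative helper: translate a close code from the table above into a
// human-readable reason for logging or retry decisions.
function describeClose(code) {
  const reasons = {
    1000: 'Normal closure',
    1008: 'Policy violation (audio sent before initialization?)',
    1011: 'Server error (internal pipeline failure)',
    4001: 'Authentication failed (check api_key)',
    4004: 'Agent not found (check agent_id)',
  };
  return reasons[code] || ('Unknown close code ' + code);
}

// Usage:
//   ws.onclose = (e) => console.warn('Closed', e.code, describeClose(e.code));
```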

Troubleshooting

Symptom: Immediate disconnect with code 1008
Cause:   Sending audio before conversation_initiation_metadata
Fix:     Wait for the init message before streaming audio

Symptom: No audio from agent
Cause:   Playing audio as binary instead of decoding base64 PCM16
Fix:     Decode base64 → Int16Array → Float32Array → AudioBuffer

Symptom: Agent speaks over itself
Cause:   Not handling interruption events
Fix:     Stop all scheduled AudioBufferSourceNodes on interruption

Symptom: Connection drops after ~30s
Cause:   Not responding to ping messages
Fix:     Send a pong with the event_id from each ping

Symptom: Audio is garbled
Cause:   Wrong sample rate or encoding
Fix:     Ensure PCM16, 16kHz, mono, little-endian
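For the garbled-audio case, a quick sanity check is to round-trip a buffer through the same Float32 → PCM16 → base64 path the examples use; any error beyond PCM16 quantization points at a conversion bug. This sketch uses the global atob/btoa, available in browsers and Node 16+.

```javascript
// Encode a Float32Array as a base64 PCM16 chunk (same path as the mic example).
function encodeChunk(float32) {
  const pcm16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  const bytes = new Uint8Array(pcm16.buffer);
  let bin = '';
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
  return btoa(bin);
}

// Decode a base64 PCM16 chunk back to Float32 (same path as the playback example).
function decodeChunk(base64) {
  const bin = atob(base64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  const pcm16 = new Int16Array(bytes.buffer);
  const out = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) out[i] = pcm16[i] / 32768;
  return out;
}
```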