WebSocket Protocol

This page documents the raw WebSocket protocol for Perf Voice Agents. Use this if you need full control over the audio pipeline or are integrating from a platform where the JavaScript SDK isn’t available.
For most web applications, the JavaScript SDK is the recommended approach — it handles all of the protocol details described here.

Connection

Endpoint

wss://api.withperf.pro/v1/voice/conversation

Query Parameters

Parameter   Type     Required   Description
api_key     string   Yes        Your project API key (pk_live_...)
agent_id    string   Yes        Voice agent ID

Example

const ws = new WebSocket(
  'wss://api.withperf.pro/v1/voice/conversation?api_key=YOUR_API_KEY&agent_id=YOUR_AGENT_ID'
);

Protocol Flow

Client                          Perf
  |                               |
  |------- WebSocket OPEN ------->|
  |                               |--- connects to voice pipeline
  |<-- conversation_initiation ---|  (conversation_initiation_metadata)
  |                               |
  |--- user_audio_chunk --------->|  (repeat: stream mic audio)
  |<-- audio ---------------------|  (agent speaks back)
  |<-- agent_response ------------|  (agent transcript)
  |<-- user_transcript -----------|  (user transcript)
  |                               |
  |<-- ping ----------------------|  (keepalive)
  |--- pong --------------------->|
  |                               |
  |<-- interruption --------------|  (user spoke over agent)
  |                               |
  |------- WebSocket CLOSE ------>|
Important: Do not send audio until you receive the conversation_initiation_metadata message. Sending audio before initialization will cause the connection to close with code 1008.
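This gating can be factored into a small helper. The sketch below is illustrative (the helper name and shape are not part of the protocol): chunks captured before initialization are dropped rather than queued, since the server rejects early audio.

```javascript
// Illustrative helper (not part of the protocol): gate outgoing audio on the
// init message so chunks captured before the pipeline is ready are dropped.
function createAudioGate(send) {
  let ready = false;
  return {
    // Feed every parsed server message through this
    onServerMessage(msg) {
      if (msg.type === 'conversation_initiation_metadata') ready = true;
    },
    // Call from the audio-capture callback; returns whether the chunk was sent
    sendChunk(base64Chunk) {
      if (!ready) return false; // sending now would trigger close code 1008
      send(JSON.stringify({ user_audio_chunk: base64Chunk }));
      return true;
    },
  };
}
```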

Messages: Client → Server

Send Audio

Stream microphone audio as base64-encoded PCM16 chunks:
{
  "user_audio_chunk": "<base64-encoded PCM16 audio>"
}
Audio format: PCM 16-bit signed integer, 16kHz, mono, little-endian. Base64-encode the raw bytes. Recommended chunk size: 2048 samples (128ms at 16kHz).

JavaScript Example: Capture and Send Microphone Audio

const stream = await navigator.mediaDevices.getUserMedia({
  audio: { sampleRate: 16000, channelCount: 1, echoCancellation: true, noiseSuppression: true }
});

const audioCtx = new AudioContext({ sampleRate: 16000 });
const source = audioCtx.createMediaStreamSource(stream);
// ScriptProcessorNode is deprecated in favor of AudioWorklet, but remains
// widely supported and keeps this example short.
const processor = audioCtx.createScriptProcessor(2048, 1, 1);

processor.onaudioprocess = (e) => {
  // `ws` is the open WebSocket; `ready` is set once
  // conversation_initiation_metadata has been received.
  if (ws.readyState !== WebSocket.OPEN || !ready) return;

  const input = e.inputBuffer.getChannelData(0);
  const pcm16 = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }

  const bytes = new Uint8Array(pcm16.buffer);
  let binary = '';
  for (let j = 0; j < bytes.length; j++) {
    binary += String.fromCharCode(bytes[j]);
  }

  ws.send(JSON.stringify({ user_audio_chunk: btoa(binary) }));
};

source.connect(processor);
processor.connect(audioCtx.destination);

Pong (Keepalive Response)

Reply to ping messages to keep the connection alive:
{
  "type": "pong",
  "event_id": "<event_id from the ping>"
}

Messages: Server → Client

conversation_initiation_metadata

Sent once after connection is established. Signals that the voice pipeline is ready.
{
  "type": "conversation_initiation_metadata",
  "conversation_initiation_metadata_event": {
    "conversation_id": "conv_abc123",
    "agent_output_audio_format": "pcm_16000"
  }
}
Field                       Description
conversation_id             Unique session identifier
agent_output_audio_format   Output audio format (usually pcm_16000)
Start sending audio only after receiving this message.

audio

Agent speech audio. Base64-encoded PCM16, same format as input.
{
  "type": "audio",
  "audio_event": {
    "audio_base_64": "<base64-encoded PCM16 audio>"
  }
}

JavaScript Example: Play Agent Audio

// `audioCtx` is the 16kHz AudioContext created during setup.
let nextPlayTime = 0;
const sources = [];

function playAudio(base64) {
  const bin = atob(base64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);

  const pcm16 = new Int16Array(bytes.buffer);
  const float32 = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) float32[i] = pcm16[i] / 32768;

  const buffer = audioCtx.createBuffer(1, float32.length, 16000);
  buffer.getChannelData(0).set(float32);

  const src = audioCtx.createBufferSource();
  src.buffer = buffer;
  src.connect(audioCtx.destination);

  // Schedule sequentially to avoid gaps
  const now = audioCtx.currentTime;
  if (nextPlayTime < now) nextPlayTime = now;
  src.start(nextPlayTime);
  nextPlayTime += buffer.duration;

  // Track for interruption cleanup
  sources.push(src);
  src.onended = () => {
    const idx = sources.indexOf(src);
    if (idx !== -1) sources.splice(idx, 1);
  };
}

agent_response

The agent’s text response (transcript of what the agent is saying).
{
  "type": "agent_response",
  "agent_response_event": {
    "agent_response": "Hello! How can I help you today?"
  }
}

user_transcript

Transcript of what the user said.
{
  "type": "user_transcript",
  "user_transcription_event": {
    "user_transcript": "I'd like to check on my order status."
  }
}

interruption

Sent when the user speaks while the agent is talking. You must stop all currently playing agent audio immediately to avoid the agent’s voice overlapping with the new response.
{
  "type": "interruption"
}
// Handle interruption — stop all playing audio
sources.forEach(s => { try { s.stop(); } catch (e) {} });
sources.length = 0;
nextPlayTime = 0;

ping

Keepalive ping. You must respond with a pong to keep the connection alive.
{
  "type": "ping",
  "ping_event": {
    "event_id": 12345
  }
}
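The reply can be built by a small pure function; this is a sketch (the function name is illustrative), which echoes the ping's event_id back in a pong.

```javascript
// Illustrative helper: build the pong reply for a ping message, echoing its
// event_id. Returns null for anything that is not a ping.
function makePongReply(msg) {
  if (msg.type !== 'ping' || !msg.ping_event) return null;
  return JSON.stringify({ type: 'pong', event_id: msg.ping_event.event_id });
}

// Usage: in ws.onmessage, after JSON.parse:
//   const pong = makePongReply(data);
//   if (pong) ws.send(pong);
```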

Complete JavaScript Example

A full working implementation using raw WebSocket (no SDK):
const WS_URL = 'wss://api.withperf.pro/v1/voice/conversation';
const API_KEY = 'YOUR_API_KEY';
const AGENT_ID = 'YOUR_AGENT_ID';

let ws, audioCtx, micStream, ready = false, nextPlay = 0, sources = [];

async function startVoice() {
  // 1. Get microphone
  micStream = await navigator.mediaDevices.getUserMedia({
    audio: { sampleRate: 16000, channelCount: 1, echoCancellation: true, noiseSuppression: true }
  });
  audioCtx = new AudioContext({ sampleRate: 16000 });
  const mic = audioCtx.createMediaStreamSource(micStream);
  const proc = audioCtx.createScriptProcessor(2048, 1, 1);

  // 2. Stream mic audio as base64 PCM16
  proc.onaudioprocess = (e) => {
    if (!ws || ws.readyState !== WebSocket.OPEN || !ready) return;
    const input = e.inputBuffer.getChannelData(0);
    const pcm = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      const s = Math.max(-1, Math.min(1, input[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    const bytes = new Uint8Array(pcm.buffer);
    let bin = '';
    for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
    ws.send(JSON.stringify({ user_audio_chunk: btoa(bin) }));
  };
  mic.connect(proc);
  proc.connect(audioCtx.destination);

  // 3. Connect WebSocket
  ws = new WebSocket(WS_URL + '?api_key=' + API_KEY + '&agent_id=' + AGENT_ID);

  ws.onmessage = (event) => {
    if (typeof event.data !== 'string') return;
    const data = JSON.parse(event.data);

    switch (data.type) {
      case 'conversation_initiation_metadata':
        ready = true;
        console.log('Session:', data.conversation_initiation_metadata_event?.conversation_id);
        break;
      case 'audio':
        if (data.audio_event?.audio_base_64) playAudio(data.audio_event.audio_base_64);
        break;
      case 'agent_response':
        console.log('Agent:', data.agent_response_event?.agent_response);
        break;
      case 'user_transcript':
        console.log('You:', data.user_transcription_event?.user_transcript);
        break;
      case 'interruption':
        sources.forEach(s => { try { s.stop(); } catch (e) {} });
        sources = []; nextPlay = 0;
        break;
      case 'ping':
        ws.send(JSON.stringify({ type: 'pong', event_id: data.ping_event?.event_id }));
        break;
    }
  };

  ws.onclose = () => stopVoice();
}

// 4. Play agent audio (base64 PCM16 → AudioBuffer)
function playAudio(base64) {
  if (!audioCtx) return;
  const bin = atob(base64), bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  const pcm = new Int16Array(bytes.buffer);
  const f32 = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) f32[i] = pcm[i] / 32768;
  const buf = audioCtx.createBuffer(1, f32.length, 16000);
  buf.getChannelData(0).set(f32);
  const src = audioCtx.createBufferSource();
  src.buffer = buf;
  src.connect(audioCtx.destination);
  const now = audioCtx.currentTime;
  if (nextPlay < now) nextPlay = now;
  src.start(nextPlay);
  nextPlay += buf.duration;
  sources.push(src);
  src.onended = () => { const i = sources.indexOf(src); if (i !== -1) sources.splice(i, 1); };
}

// 5. Cleanup
function stopVoice() {
  ready = false;
  sources.forEach(s => { try { s.stop(); } catch (e) {} });
  sources = []; nextPlay = 0;
  if (micStream) { micStream.getTracks().forEach(t => t.stop()); micStream = null; }
  if (audioCtx) { audioCtx.close().catch(() => {}); audioCtx = null; }
  if (ws) { ws.onclose = null; ws.close(); ws = null; }  // clear onclose to avoid re-entering stopVoice
}

WebSocket Close Codes

Code   Meaning
1000   Normal closure (client or server initiated)
1008   Policy violation (e.g., sending audio before initialization)
1011   Server error (internal pipeline failure)
4001   Authentication failed (invalid API key)
4004   Agent not found (invalid agent_id)
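A sketch of a close handler that maps these codes to readable log messages (the helper name is illustrative; only 1000 indicates a clean shutdown):

```javascript
// Illustrative helper: translate a close code from the table above into a
// human-readable reason for logging or retry decisions.
function describeClose(code) {
  const reasons = {
    1000: 'Normal closure',
    1008: 'Policy violation (audio sent before initialization?)',
    1011: 'Server error (internal pipeline failure)',
    4001: 'Authentication failed (check api_key)',
    4004: 'Agent not found (check agent_id)',
  };
  return reasons[code] || ('Unknown close code ' + code);
}

// Usage:
//   ws.onclose = (e) => console.warn('Closed', e.code, describeClose(e.code));
```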

Troubleshooting

Symptom: Immediate disconnect with code 1008
Cause:   Sending audio before conversation_initiation_metadata
Fix:     Wait for the init message before streaming audio

Symptom: No audio from agent
Cause:   Playing audio as binary instead of decoding base64 PCM16
Fix:     Decode base64 → Int16Array → Float32Array → AudioBuffer

Symptom: Agent speaks over itself
Cause:   Not handling interruption events
Fix:     Stop all scheduled AudioBufferSourceNodes on interruption

Symptom: Connection drops after ~30s
Cause:   Not responding to ping messages
Fix:     Send a pong with the event_id from each ping

Symptom: Audio is garbled
Cause:   Wrong sample rate or encoding
Fix:     Ensure PCM16, 16kHz, mono, little-endian
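For the garbled-audio case, a quick sanity check is to round-trip a buffer through the same Float32 → PCM16 → base64 path the examples use; any error beyond PCM16 quantization points at a conversion bug. This sketch uses the global atob/btoa, available in browsers and Node 16+.

```javascript
// Encode a Float32Array as a base64 PCM16 chunk (same path as the mic example).
function encodeChunk(float32) {
  const pcm16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  const bytes = new Uint8Array(pcm16.buffer);
  let bin = '';
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
  return btoa(bin);
}

// Decode a base64 PCM16 chunk back to Float32 (same path as the playback example).
function decodeChunk(base64) {
  const bin = atob(base64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  const pcm16 = new Int16Array(bytes.buffer);
  const out = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) out[i] = pcm16[i] / 32768;
  return out;
}
```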