
Node.js WebSocket Audio Streaming: Exotel to Gemini Integration

Implement real-time bidirectional audio streaming between Exotel Voicebot and Google Gemini using Node.js WebSockets. Learn audio resampling and production best practices.


How to Implement Real-Time Bidirectional Audio Streaming Between Exotel Voicebot and Google Gemini Live API Using WebSockets in Node.js?

Project Setup

Building a real-time AI voicebot in JavaScript with:

  • Exotel for outbound calling and live audio streaming via WebSocket API
  • Google Gemini Live API for speech-to-text and text-to-speech
  • Full-duplex streaming for simultaneous caller input and Gemini responses

Tech Stack

  • Backend: Node.js (Express + ws)
  • Exotel audio: 8 kHz PCM
  • Gemini audio output: 24 kHz PCM
  • Transport: WebSockets (Exotel ↔ Backend ↔ Gemini)

Key Challenges and Questions

1. Exotel Workflow WebSocket URL Role

  • What is the exact purpose of the WSS URL provided in the Exotel workflow?
  • Does it handle only call control, or does all audio flow through it?
  • If Exotel streams start, media, and stop events separately, is the WSS URL just a trigger?

2. Recommended Streaming Architecture

  • Is the standard flow always Exotel → Backend → Gemini → Backend → Exotel?
  • Is direct streaming from Exotel to Gemini possible without a backend relay?

3. Audio Resampling from 24 kHz to 8 kHz

  • Current method: Simple decimation (every 3rd sample)
  • Is this sufficient for production voicebots, or should a library such as SOX, FFmpeg, or Speex be used instead?

4. Real-Time Playback Issues (Delays/Choppiness)

  • Suspected causes: chunk size, buffering, timestamps, sequence_number
  • Best practices for smooth, low-latency playback with Exotel?

Node.js Backend Code (Simplified)

javascript
import express from "express";
import dotenv from "dotenv";
import { WebSocketServer } from "ws";
import { GoogleGenAI, Modality } from "@google/genai";

dotenv.config();
const app = express();
const PORT = process.env.PORT || 3000;

const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
});

const EXOTEL_SAMPLE_RATE = 8000;
const GEMINI_OUTPUT_SAMPLE_RATE = 24000;
const DOWNSAMPLE_RATIO = 3;
const PCM_CHUNK_SIZE = 320 * 5;

const server = app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});

const wss = new WebSocketServer({
  server,
  path: "/exotel/voicebot",
});

wss.on("connection", async (ws) => {
  let streamSid = null;
  let sequenceNumber = 1;
  let chunkNumber = 1;
  let audioBuffer = Buffer.alloc(0);
  let session = null;

  function processGeminiAudio(audioBase64) {
    const geminiBuffer = Buffer.from(audioBase64, "base64");
    const downsampledBuffer = Buffer.alloc(
      Math.floor(geminiBuffer.length / DOWNSAMPLE_RATIO)
    );

    for (let i = 0; i < downsampledBuffer.length / 2; i++) {
      const originalIndex = i * DOWNSAMPLE_RATIO * 2;
      if (originalIndex + 1 < geminiBuffer.length) {
        const sample = geminiBuffer.readInt16LE(originalIndex);
        downsampledBuffer.writeInt16LE(sample, i * 2);
      }
    }

    audioBuffer = Buffer.concat([audioBuffer, downsampledBuffer]);

    while (audioBuffer.length >= PCM_CHUNK_SIZE) {
      const sendChunk = audioBuffer.subarray(0, PCM_CHUNK_SIZE);
      audioBuffer = audioBuffer.subarray(PCM_CHUNK_SIZE);

      ws.send(
        JSON.stringify({
          event: "media",
          sequence_number: sequenceNumber++,
          stream_sid: streamSid,
          media: {
            chunk: chunkNumber++,
            timestamp: String(sequenceNumber * 100),
            payload: sendChunk.toString("base64"),
          },
        })
      );
    }
  }

  session = await ai.live.connect({
    model: "gemini-2.5-flash-native-audio-preview-12-2025",
    config: {
      responseModalities: [Modality.AUDIO],
    },
    callbacks: {
      onmessage: (message) => {
        const audioBase64 =
          message.serverContent?.modelTurn?.parts?.find(
            (p) => p.inlineData
          )?.inlineData?.data;

        if (audioBase64) processGeminiAudio(audioBase64);
      },
    },
  });

  ws.on("message", (raw) => {
    const data = JSON.parse(raw.toString());

    if (data.event === "start") {
      streamSid = data.stream_sid;
    }

    if (data.event === "media" && session) {
      session.sendRealtimeInput({
        media: {
          mimeType: "audio/pcm;rate=8000",
          data: data.media.payload,
        },
      });
    }

    if (data.event === "stop" && session) {
      session.close();
    }
  });
});

Seeking

  • Correct Exotel streaming architecture confirmation
  • Workflow WSS URL clarification
  • Low-latency bidirectional audio best practices
  • Common Exotel voicebot pitfalls and real-world examples

Implementing real-time bidirectional audio streaming between Exotel Voicebot and Google Gemini Live API requires a backend WebSocket relay in Node.js that handles audio format conversion between Exotel’s 8 kHz PCM and Gemini’s 24 kHz PCM output. The Exotel WebSocket URL carries all live audio payloads, not just call control events, making the backend relay architecture essential for proper bidirectional streaming. Audio resampling from 24 kHz to 8 kHz requires production-quality techniques beyond simple decimation to maintain voice clarity and avoid choppiness in real-time applications.


Understanding Exotel and Google Gemini Live API Integration Architecture

When implementing a real-time AI voicebot that bridges Exotel and Google Gemini Live API, understanding the integration architecture is crucial. The Exotel WebSocket URL provided in the Voicebot Applet serves as the primary bidirectional channel for all audio data, not just a trigger for call control events. This means that once a call is established, all audio streams from the caller to the voicebot and from the voicebot back to the caller flow through this WebSocket connection.

According to Exotel’s documentation, the WebSocket connection handles several event types, including start, media, stop, dtmf, mark, and clear events. The start event initializes the stream with a unique stream_sid, while media events contain the actual audio payload in base64-encoded PCM format. The stop event terminates the connection, and clear events are particularly important for managing audio buffers and preventing gaps in the conversation flow.
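
To make the event flow concrete, the messages exchanged over the Exotel WebSocket look roughly like the sketch below. Field names follow Exotel’s documentation for the events described above; treat the exact payload shapes as indicative rather than authoritative, and check the Stream/Voicebot Applet docs for the full schema:

```javascript
// Indicative shapes of Exotel stream events (base64 payload shortened).
const startEvent = {
  event: "start",
  sequence_number: 1,
  stream_sid: "stream-abc123", // unique identifier for this call's stream
};

const mediaEvent = {
  event: "media",
  sequence_number: 2,
  stream_sid: "stream-abc123",
  media: {
    chunk: 1,
    timestamp: "100", // ms offset, sent as a string
    payload: "AAAA...", // base64-encoded 8 kHz, 16-bit LE mono PCM
  },
};

const stopEvent = {
  event: "stop",
  sequence_number: 3,
  stream_sid: "stream-abc123",
};

// A relay typically dispatches on the `event` field.
for (const msg of [startEvent, mediaEvent, stopEvent]) {
  console.log(msg.event);
}
```

All three messages carry the same `stream_sid`, which is how a relay correlates media chunks with the session opened by `start`.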

Regarding the streaming architecture, direct streaming from Exotel to Google Gemini Live API is not possible due to several technical constraints. Exotel’s audio format requirements (8 kHz PCM, 16-bit, mono, little-endian) differ significantly from Gemini’s optimal input/output (24 kHz PCM). Additionally, authentication, event handling, and audio processing require a backend relay to properly manage the bidirectional flow.

The standard production architecture follows this pattern: Exotel → Backend → Gemini → Backend → Exotel. This backend relay serves several critical functions:

  1. Audio format conversion between Exotel’s 8 kHz and Gemini’s 24 kHz PCM
  2. Proper authentication and session management
  3. Event handling and sequence number management
  4. Audio buffering and chunk size optimization
  5. Error recovery and reconnection logic

Authentication methods for Exotel WebSocket connections typically include IP whitelisting or Basic Auth, as detailed in Exotel’s advanced streaming documentation. This ensures that only authorized backend services can establish connections to the streaming endpoint.
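
If you opt for Basic Auth, a minimal credential check during the WebSocket upgrade might look like the sketch below. This is a generic HTTP Basic Auth decode, not an Exotel-specific API; the credential names are placeholders for your own configuration:

```javascript
// Minimal Basic Auth check for the WebSocket upgrade request.
// expectedUser/expectedPass are your own configured credentials.
function checkBasicAuth(authHeader, expectedUser, expectedPass) {
  if (!authHeader || !authHeader.startsWith("Basic ")) return false;
  const decoded = Buffer.from(authHeader.slice(6), "base64").toString("utf8");
  const sep = decoded.indexOf(":");
  if (sep === -1) return false;
  const user = decoded.slice(0, sep);
  const pass = decoded.slice(sep + 1);
  return user === expectedUser && pass === expectedPass;
}

// Usage with the `ws` server: verify during the HTTP upgrade, e.g.
// const wss = new WebSocketServer({
//   server,
//   path: "/exotel/voicebot",
//   verifyClient: ({ req }, done) =>
//     done(checkBasicAuth(req.headers.authorization,
//                         process.env.EXOTEL_USER, process.env.EXOTEL_PASS)),
// });

const header = "Basic " + Buffer.from("exotel:s3cret").toString("base64");
console.log(checkBasicAuth(header, "exotel", "s3cret")); // true
```

Rejecting the connection at the upgrade stage (rather than after `connection`) keeps unauthorized clients from ever opening a socket.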


Setting Up Node.js WebSocket Server for Bidirectional Audio Streaming

To implement a production-ready WebSocket server for bidirectional audio streaming between Exotel and Google Gemini, you’ll need to properly configure your Node.js environment with the necessary dependencies and implement robust connection handling. The implementation should focus on three core areas: connection management, audio processing, and event handling.

First, let’s establish the necessary dependencies and server configuration:

javascript
import express from "express";
import dotenv from "dotenv";
import { WebSocketServer } from "ws";
import { GoogleGenAI, Modality } from "@google/genai";

dotenv.config();
const app = express();
const PORT = process.env.PORT || 3000;

// Initialize Google Gemini client
const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
});

const server = app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});

// Configure WebSocket server for Exotel connections
const wss = new WebSocketServer({
  server,
  path: "/exotel/voicebot",
});

The WebSocket server must be configured to handle multiple simultaneous connections, each representing an active call session. Each connection requires proper session management to track the stream_sid, sequence numbers, and audio buffers.

javascript
wss.on("connection", async (ws, req) => {
  let streamSid = null;
  let sequenceNumber = 1;
  let chunkNumber = 1;
  let audioBuffer = Buffer.alloc(0);
  let session = null;

  // Connection authentication (IP whitelisting or Basic Auth).
  // Note: the client IP lives on the HTTP upgrade request, not on the ws object.
  const clientIp = req.socket.remoteAddress;
  if (!isAuthorized(clientIp)) {
    ws.send(JSON.stringify({ error: "Unauthorized" }));
    ws.close();
    return;
  }

  // Handle incoming messages from Exotel
  ws.on("message", (raw) => {
    const data = JSON.parse(raw.toString());

    switch (data.event) {
      case "start":
        handleStartEvent(data, ws);
        break;
      case "media":
        handleMediaEvent(data, session);
        break;
      case "stop":
        handleStopEvent(data, session, ws);
        break;
      case "clear":
        handleClearEvent(data);
        break;
      case "dtmf":
        handleDtmfEvent(data, session);
        break;
      case "mark":
        handleMarkEvent(data, ws);
        break;
    }
  });

  // Error handling
  ws.on("error", (error) => {
    console.error(`WebSocket error: ${error.message}`);
    if (session) {
      session.close();
    }
  });

  // Connection close handling
  ws.on("close", () => {
    console.log(`Connection closed for streamSid: ${streamSid}`);
    if (session) {
      session.close();
    }
  });
});

The event handlers must be carefully implemented to properly manage the bidirectional audio flow. The start handler initializes the Gemini session, while the media handler processes incoming audio and forwards it to Gemini.

javascript
// These handlers are assumed to close over the per-connection state
// (streamSid, session, sequenceNumber, chunkNumber, audioBuffer)
// declared in the connection handler above.
async function handleStartEvent(data, ws) {
  streamSid = data.stream_sid;
  sequenceNumber = 1;
  chunkNumber = 1;
  audioBuffer = Buffer.alloc(0);

  // Initialize Gemini Live API session
  try {
    session = await ai.live.connect({
      model: "gemini-2.5-flash-native-audio-preview-12-2025",
      config: {
        responseModalities: [Modality.AUDIO],
      },
      callbacks: {
        onmessage: (message) => {
          // Extract the base64 audio part before handing it off
          const audioBase64 = message.serverContent?.modelTurn?.parts?.find(
            (p) => p.inlineData
          )?.inlineData?.data;
          if (audioBase64) processGeminiAudio(audioBase64, ws);
        },
        onerror: (error) => console.error("Gemini error:", error),
        onclose: () => console.log("Gemini session closed"),
      },
    });

    // Send acknowledgment to Exotel
    ws.send(JSON.stringify({
      event: "start_ack",
      stream_sid: streamSid,
    }));
  } catch (error) {
    console.error("Failed to initialize Gemini session:", error);
    ws.send(JSON.stringify({
      event: "error",
      error: "Failed to initialize AI session",
    }));
  }
}

function handleMediaEvent(data, session) {
  if (!session || !data.media?.payload) return;

  // Forward caller audio to Gemini as 8 kHz PCM
  session.sendRealtimeInput({
    media: {
      mimeType: "audio/pcm;rate=8000",
      data: data.media.payload,
    },
  });
}

This implementation establishes the foundation for bidirectional audio streaming. The key is ensuring that all audio events are properly routed between Exotel and Gemini while maintaining connection state and managing sequence numbers correctly.


Implementing Production-Grade Audio Resampling Between 24 kHz and 8 kHz PCM

Audio resampling represents one of the most critical components in the bidirectional streaming architecture between Exotel and Google Gemini. Exotel requires 8 kHz PCM audio, while Gemini produces 24 kHz PCM output. The simple decimation method (taking every 3rd sample) in the provided code is insufficient for production-quality voicebots, as it introduces aliasing artifacts and degrades voice clarity.
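
The aliasing problem is easy to demonstrate numerically: after dropping to 8 kHz, any energy above the new 4 kHz Nyquist limit folds back into the audible band. A tone at 10 kHz in Gemini’s 24 kHz output, decimated by keeping every 3rd sample, produces exactly the same samples as a genuine 2 kHz tone (|10 kHz − 8 kHz| = 2 kHz):

```javascript
// Demonstration: naive decimation aliases out-of-band energy.
// A 10 kHz tone sampled at 24 kHz, kept every 3rd sample (-> 8 kHz),
// is sample-for-sample identical to a real 2 kHz tone at 8 kHz.
const SRC_RATE = 24000;
const DST_RATE = 8000;

const tone = (freq, rate, n) => Math.sin((2 * Math.PI * freq * n) / rate);

let maxDiff = 0;
for (let k = 0; k < 80; k++) {
  const decimated = tone(10000, SRC_RATE, 3 * k); // every 3rd source sample
  const aliased = tone(2000, DST_RATE, k);        // |10000 - 8000| = 2000 Hz
  maxDiff = Math.max(maxDiff, Math.abs(decimated - aliased));
}
console.log(maxDiff < 1e-9); // true: 10 kHz is indistinguishable from 2 kHz
```

A proper resampler low-pass filters below 4 kHz before decimating, which is exactly what the libraries discussed next do for you.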

For production environments, you should implement professional audio resampling using specialized libraries. Three primary approaches are available: SOX, FFmpeg, and Speex. Each has different performance characteristics and quality levels suitable for different use cases.

SOX-Based Resampling

SOX (Sound eXchange) is a comprehensive command-line audio processing tool that provides high-quality resampling with configurable algorithms:

javascript
const { exec } = require('child_process');
const fs = require('fs');

function resampleWithSox(inputBuffer, inputRate, outputRate) {
  return new Promise((resolve, reject) => {
    // The buffers are headerless PCM, so describe the raw format explicitly
    const tempInput = `/tmp/input_${Date.now()}.raw`;
    const tempOutput = `/tmp/output_${Date.now()}.raw`;

    // Write input buffer to a temporary file
    fs.writeFileSync(tempInput, inputBuffer);

    // Execute SOX with the high-quality `rate -h` algorithm
    const command = `sox -t raw -r ${inputRate} -e signed-integer -b 16 -c 1 ${tempInput} -t raw -r ${outputRate} -e signed-integer -b 16 -c 1 ${tempOutput} rate -h`;

    exec(command, (error) => {
      if (error) {
        reject(error);
        return;
      }

      // Read the resampled output
      const outputBuffer = fs.readFileSync(tempOutput);

      // Clean up temporary files
      fs.unlinkSync(tempInput);
      fs.unlinkSync(tempOutput);

      resolve(outputBuffer);
    });
  });
}

The rate -h flag in SOX uses a high-quality algorithm suitable for voice applications. While SOX provides excellent quality, it introduces additional latency due to file I/O operations, making it less suitable for ultra-low-latency applications.

FFmpeg-Based Resampling

FFmpeg offers a more efficient approach with lower latency while maintaining high quality:

javascript
const { spawn } = require('child_process');

function resampleWithFFmpeg(inputBuffer, inputRate, outputRate) {
  return new Promise((resolve, reject) => {
    const ffmpeg = spawn('ffmpeg', [
      // Describe the raw PCM input before -i
      '-f', 's16le',
      '-ar', inputRate.toString(),
      '-ac', '1',
      '-i', 'pipe:0',
      // Output: resampled raw PCM on stdout
      '-ar', outputRate.toString(),
      '-ac', '1',
      '-f', 's16le',
      '-c:a', 'pcm_s16le',
      'pipe:1',
    ]);

    const chunks = [];
    ffmpeg.stdout.on('data', (chunk) => chunks.push(chunk));

    ffmpeg.on('error', (error) => reject(error));
    ffmpeg.on('close', () => {
      resolve(Buffer.concat(chunks));
    });

    // Feed the input buffer and close stdin so ffmpeg can finish
    ffmpeg.stdin.write(inputBuffer);
    ffmpeg.stdin.end();
  });
}

FFmpeg typically provides better performance than SOX for streaming applications, with lower overhead and more efficient processing of audio streams.

Speex-Based Resampling

For the lowest latency applications, Speex (a specialized voice codec library) offers optimized resampling specifically designed for voice:

javascript
// Note: the exact API depends on which Speex binding you install
// (native or WASM resampler packages differ); the calls below are
// illustrative of the typical flow, not a specific package's API.
const speex = require('speex');

function resampleWithSpeex(inputBuffer, inputRate, outputRate) {
  // Initialize the resampler: 1 channel, quality 0-10 (higher = better, slower)
  const resampler = new speex.Resampler(1, inputRate, outputRate, 5);

  // Convert the 16-bit PCM input to Float32
  const floatInput = new Float32Array(inputBuffer.length / 2);
  for (let i = 0; i < floatInput.length; i++) {
    floatInput[i] = inputBuffer.readInt16LE(i * 2) / 32768;
  }

  // Calculate the output size
  const outputSize = Math.floor(floatInput.length * outputRate / inputRate);
  const floatOutput = new Float32Array(outputSize);

  // Perform the resampling
  resampler.process(floatInput, floatOutput);

  // Convert back to 16-bit PCM, rounding and clamping to the Int16 range
  const outputBuffer = Buffer.alloc(outputSize * 2);
  for (let i = 0; i < outputSize; i++) {
    const sample = Math.round(floatOutput[i] * 32768);
    outputBuffer.writeInt16LE(Math.max(-32768, Math.min(32767, sample)), i * 2);
  }

  return outputBuffer;
}

Speex provides the best performance for real-time voice applications with minimal latency, making it ideal for voicebot scenarios where responsiveness is critical.

Integrated Resampling Function

For production use, you should integrate these resampling methods with proper error handling and fallback mechanisms:

javascript
async function resampleAudio(geminiBuffer, fromRate = 24000, toRate = 8000) {
  try {
    // Try Speex first for the lowest latency
    return resampleWithSpeex(geminiBuffer, fromRate, toRate);
  } catch (speexError) {
    console.warn("Speex resampling failed, falling back to FFmpeg:", speexError.message);

    try {
      // Fall back to FFmpeg (async, so await it inside the try block)
      return await resampleWithFFmpeg(geminiBuffer, fromRate, toRate);
    } catch (ffmpegError) {
      console.error("FFmpeg resampling failed:", ffmpegError.message);

      // Last resort: simple decimation (with anti-aliasing filter)
      return simpleDecimation(geminiBuffer, fromRate, toRate);
    }
  }
}

function simpleDecimation(inputBuffer, fromRate, toRate) {
  const ratio = fromRate / toRate;
  // Count in 16-bit samples, not bytes, so buffer offsets stay aligned
  const inputSamples = Math.floor(inputBuffer.length / 2);
  const outputSamples = Math.floor(inputSamples / ratio);
  const outputBuffer = Buffer.alloc(outputSamples * 2);

  // Apply a simple anti-aliasing filter before decimation
  const filteredBuffer = applyAntiAliasingFilter(inputBuffer);

  for (let i = 0; i < outputSamples; i++) {
    const sourceIndex = Math.floor(i * ratio);
    outputBuffer.writeInt16LE(filteredBuffer.readInt16LE(sourceIndex * 2), i * 2);
  }

  return outputBuffer;
}

function applyAntiAliasingFilter(inputBuffer) {
  // Simple moving-average filter to reduce aliasing
  const filterSize = 3;
  const sampleCount = Math.floor(inputBuffer.length / 2);
  const filteredBuffer = Buffer.alloc(sampleCount * 2);

  for (let i = 0; i < sampleCount; i++) {
    let sum = 0;
    let count = 0;

    for (let j = Math.max(0, i - Math.floor(filterSize / 2));
         j <= Math.min(sampleCount - 1, i + Math.floor(filterSize / 2));
         j++) {
      sum += inputBuffer.readInt16LE(j * 2);
      count++;
    }

    filteredBuffer.writeInt16LE(Math.round(sum / count), i * 2);
  }

  return filteredBuffer;
}

This integrated approach provides production-quality audio resampling with fallback mechanisms, ensuring your voicebot maintains voice clarity even if the primary resampling method fails.


Optimizing Real-Time Playback with Exotel WebSocket Events

Achieving smooth, low-latency playback in an Exotel-Gemini integration requires careful optimization of several factors: chunk size alignment, proper buffering strategies, timestamp synchronization, and robust event handling. The primary challenges include minimizing latency while preventing audio gaps, choppiness, or delays that degrade the user experience.

Chunk Size Optimization

Exotel recommends chunk sizes between 3.2 kB and 100 kB for WebSocket audio streaming, with optimal performance typically achieved in the 3.2-5 kB range. The current implementation uses a fixed chunk size of 1.6 kB (320 bytes × 5), which may be too small for efficient streaming. Increasing the chunk size while staying within the optimal range can significantly improve performance:

javascript
const EXOTEL_MIN_CHUNK_SIZE = 320 * 10; // 3.2 kB minimum
const EXOTEL_OPTIMAL_CHUNK_SIZE = 320 * 16; // ~5 kB optimal
const EXOTEL_MAX_CHUNK_SIZE = 320 * 50; // 16 kB (well below the 100 kB limit)

// In your audio processing function:
async function processGeminiAudio(audioBase64, ws) {
  const geminiBuffer = Buffer.from(audioBase64, "base64");
  const downsampledBuffer = await resampleAudio(geminiBuffer, 24000, 8000);

  audioBuffer = Buffer.concat([audioBuffer, downsampledBuffer]);

  while (audioBuffer.length >= EXOTEL_OPTIMAL_CHUNK_SIZE) {
    const sendChunk = audioBuffer.subarray(0, EXOTEL_OPTIMAL_CHUNK_SIZE);
    audioBuffer = audioBuffer.subarray(EXOTEL_OPTIMAL_CHUNK_SIZE);

    ws.send(
      JSON.stringify({
        event: "media",
        sequence_number: sequenceNumber++,
        stream_sid: streamSid,
        media: {
          chunk: chunkNumber++,
          timestamp: String(Date.now()), // Use an actual timestamp
          payload: sendChunk.toString("base64"),
        },
      })
    );
  }
}

Buffering and Timing Strategies

Proper buffering is essential to prevent audio gaps while maintaining low latency. A 200-300ms buffer provides a good balance, allowing for network fluctuations while maintaining responsiveness:

javascript
class AudioBuffer {
  // 8 kHz, 16-bit mono PCM = 16 bytes per millisecond
  static BYTES_PER_MS = 16;

  constructor(targetBufferMs = 250) { // ~250 ms buffer
    this.buffer = Buffer.alloc(0);
    this.targetBufferBytes = targetBufferMs * AudioBuffer.BYTES_PER_MS;
    this.lastSendTime = Date.now();
  }

  // Returns a chunk ready to send, or null if we should keep buffering
  addAudio(audioData) {
    this.buffer = Buffer.concat([this.buffer, audioData]);
    return this.processBuffer();
  }

  processBuffer() {
    const now = Date.now();
    const timeSinceLastSend = now - this.lastSendTime;

    // Send if we have enough data or if too much time has passed
    if (this.buffer.length >= this.targetBufferBytes || timeSinceLastSend > 100) {
      const chunkSize = Math.min(this.buffer.length, EXOTEL_OPTIMAL_CHUNK_SIZE);
      const sendChunk = this.buffer.subarray(0, chunkSize);
      this.buffer = this.buffer.subarray(chunkSize);
      this.lastSendTime = now;

      return sendChunk;
    }

    return null;
  }
}

Event Handling and Sequence Management

Proper handling of Exotel’s event types, particularly the clear event, is critical for maintaining audio continuity:

javascript
function handleClearEvent(data) {
  // Reset sequence numbers when Exotel requests a clear
  sequenceNumber = 1;
  chunkNumber = 1;

  // Drop any pending audio so stale chunks are not replayed
  audioBuffer = Buffer.alloc(0);

  console.log(`Audio buffer cleared for streamSid: ${streamSid}`);
}

function handleMarkEvent(data, ws) {
  // Handle mark events for synchronization points
  console.log(`Mark event received: ${data.mark.mark_name}`);

  // Optional: send an acknowledgment back to Exotel
  if (ws && ws.readyState === ws.OPEN) {
    ws.send(JSON.stringify({
      event: "mark_ack",
      mark_name: data.mark.mark_name,
      stream_sid: streamSid,
    }));
  }
}

Latency Measurement and Optimization

Implementing latency measurement allows you to identify and address performance bottlenecks:

javascript
class LatencyMonitor {
  constructor() {
    this.measurements = [];
    this.maxMeasurements = 100;
  }

  measure(inputTime, outputTime) {
    const latency = outputTime - inputTime;
    this.measurements.push(latency);

    if (this.measurements.length > this.maxMeasurements) {
      this.measurements.shift();
    }

    return this.getAverageLatency();
  }

  getAverageLatency() {
    if (this.measurements.length === 0) return 0;

    const sum = this.measurements.reduce((acc, val) => acc + val, 0);
    return sum / this.measurements.length;
  }

  getLatencyStatus() {
    const avgLatency = this.getAverageLatency();

    if (avgLatency < 200) return "excellent";
    if (avgLatency < 400) return "good";
    if (avgLatency < 600) return "acceptable";
    return "poor";
  }
}

This optimized implementation addresses the common issues of latency, choppiness, and audio gaps by implementing proper chunk sizing, buffering strategies, timestamp synchronization, and latency monitoring.
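
One concrete alternative to the suspect `sequenceNumber * 100` timestamps in the original code is to derive each chunk’s timestamp from the number of PCM bytes already sent: at 8 kHz, 16-bit mono, one millisecond of audio is exactly 16 bytes. A minimal sketch (the `TimestampTracker` name is illustrative, not from any library):

```javascript
// Derive media timestamps from bytes sent instead of sequence numbers.
// At 8000 samples/s * 2 bytes/sample, 1 ms of audio = 16 bytes.
const BYTES_PER_MS = (8000 * 2) / 1000; // 16

class TimestampTracker {
  constructor() {
    this.bytesSent = 0;
  }

  // Returns the timestamp in ms (as a string, matching the media payload
  // format used above) for the chunk about to be sent, then advances.
  nextTimestamp(chunkByteLength) {
    const ts = String(Math.round(this.bytesSent / BYTES_PER_MS));
    this.bytesSent += chunkByteLength;
    return ts;
  }
}

const tracker = new TimestampTracker();
console.log(tracker.nextTimestamp(3200)); // "0"   (first chunk starts at 0 ms)
console.log(tracker.nextTimestamp(3200)); // "200" (3200 bytes = 200 ms of audio)
```

Because the timestamps track the audio clock rather than the wall clock or a counter, they stay correct even when chunk sizes vary or sends are briefly delayed.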


Production-Ready Implementation and Common Pitfalls

When deploying a production voicebot system that bridges Exotel and Google Gemini, several critical considerations must be addressed to ensure reliability, performance, and user experience. This section covers the essential production-ready components and highlights common pitfalls that can degrade performance or cause system failures.

Production Architecture Components

A production-ready implementation should include several key components beyond the basic streaming functionality:

javascript
// Production-ready server setup
class ExotelGeminiServer {
  constructor() {
    this.activeStreams = new Map(); // Track active streams by streamSid
    this.sessionPool = new SessionPool(); // Manage Gemini sessions
    this.errorHandler = new ErrorHandler(); // Centralized error handling
    this.metricsCollector = new MetricsCollector(); // Performance monitoring
  }

  initialize() {
    // Setup Express server
    this.setupExpressServer();

    // Setup WebSocket server
    this.setupWebSocketServer();

    // Setup monitoring and logging
    this.setupMonitoring();

    // Setup error recovery
    this.setupErrorRecovery();
  }

  setupWebSocketServer() {
    const wss = new WebSocketServer({
      server: this.server,
      path: "/exotel/voicebot",
      maxPayload: 10 * 1024 * 1024, // 10 MB max payload
    });

    wss.on("connection", (ws, req) => {
      this.handleNewConnection(ws, req);
    });

    // Handle server shutdown
    process.on('SIGTERM', () => this.gracefulShutdown(wss));
    process.on('SIGINT', () => this.gracefulShutdown(wss));
  }

  handleNewConnection(ws, req) {
    // Authentication check (the client IP lives on the upgrade request)
    if (!this.authenticateConnection(req)) {
      ws.close(1008, "Unauthorized");
      return;
    }

    // Setup connection handlers
    ws.on('message', (data) => this.handleMessage(ws, data));
    ws.on('error', (error) => this.handleError(ws, error));
    ws.on('close', () => this.handleConnectionClose(ws));
  }

  authenticateConnection(req) {
    // Implement your authentication logic:
    // IP whitelisting, JWT tokens, or Basic Auth
    const clientIp = req.socket.remoteAddress;
    return this.isAuthorized(clientIp);
  }
}

Error Handling and Recovery

Robust error handling is essential for production systems:

javascript
class ErrorHandler {
  constructor() {
    this.errorCounts = new Map();
    this.maxErrorRetries = 3;
  }

  handleWebSocketError(ws, error) {
    const errorType = error.constructor.name;
    this.incrementErrorCount(errorType);

    console.error(`WebSocket error (${errorType}):`, error.message);

    // Attempt recovery for certain error types
    if (this.isRecoverable(error)) {
      return this.attemptRecovery(ws, error);
    }

    // For non-recoverable errors, close the connection
    ws.close(1011, "Internal error");
  }

  handleGeminiError(session, error) {
    console.error("Gemini API error:", error.message);

    // Attempt to reinitialize the session
    if (session) {
      this.reinitializeSession(session);
    }
  }

  isRecoverable(error) {
    // Define which errors are recoverable
    const recoverableErrors = [
      "WebSocketConnectionError",
      "TemporaryNetworkError",
      "RateLimitError",
    ];

    return recoverableErrors.includes(error.constructor.name);
  }

  async attemptRecovery(ws, error) {
    // Implement your recovery logic:
    // reconnecting, reinitializing sessions, etc.
    console.log("Attempting recovery for error:", error.message);

    // Your recovery implementation here
  }
}

Common Pitfalls and Solutions

1. Chunk Size Misalignment

Problem: Using chunk sizes that don’t align with Exotel’s expectations can cause streaming issues.

Solution: Implement chunk size validation and adjustment:

javascript
function validateChunkSize(chunkSize) {
  const MIN_CHUNK_SIZE = 320 * 10; // 3.2 kB
  const MAX_CHUNK_SIZE = 320 * 50; // 16 kB
  const OPTIMAL_CHUNK_SIZE = 320 * 16; // ~5 kB

  if (chunkSize < MIN_CHUNK_SIZE) {
    console.warn(`Chunk size ${chunkSize} is below minimum ${MIN_CHUNK_SIZE}`);
    return MIN_CHUNK_SIZE;
  }

  if (chunkSize > MAX_CHUNK_SIZE) {
    console.warn(`Chunk size ${chunkSize} exceeds maximum ${MAX_CHUNK_SIZE}`);
    return MAX_CHUNK_SIZE;
  }

  // Snap to the optimal size when the difference is small
  if (chunkSize !== OPTIMAL_CHUNK_SIZE) {
    const diff = Math.abs(chunkSize - OPTIMAL_CHUNK_SIZE);
    if (diff < 320) { // Less than one 20 ms frame (320 bytes) apart
      return OPTIMAL_CHUNK_SIZE;
    }
  }

  return chunkSize;
}

2. Insufficient Audio Resampling Quality

Problem: Simple decimation methods produce poor-quality audio that sounds robotic or has artifacts.

Solution: Implement proper resampling with quality fallbacks:

javascript
async function highQualityResampling(inputBuffer, fromRate, toRate) {
  try {
    // Try high-quality resampling first
    return await resampleWithSpeex(inputBuffer, fromRate, toRate);
  } catch (error) {
    console.warn("High-quality resampling failed, falling back to FFmpeg");
    try {
      return await resampleWithFFmpeg(inputBuffer, fromRate, toRate);
    } catch (fallbackError) {
      console.error("Fallback resampling also failed");
      // Return the original buffer with a warning. Note: unresampled 24 kHz
      // audio will play back slowed and garbled on an 8 kHz stream, so treat
      // this as a last-ditch diagnostic path, not a real fallback.
      console.warn("Using original buffer without resampling");
      return inputBuffer;
    }
  }
}

3. Missing Clear Events

Problem: Failing to properly handle clear events can cause audio buffer overflow and gaps in conversation.

Solution: Implement proper clear event handling:

javascript
// Implemented as a method on the server class, so that `this.activeStreams`
// and `this.sendToExotel` resolve correctly.
handleClearEvent(data, streamSid) {
  const stream = this.activeStreams.get(streamSid);

  if (!stream) {
    console.warn(`Clear event received for unknown stream: ${streamSid}`);
    return;
  }

  // Reset stream state
  stream.sequenceNumber = 1;
  stream.chunkNumber = 1;
  stream.audioBuffer = Buffer.alloc(0);
  stream.lastTimestamp = null;

  // Send acknowledgment
  this.sendToExotel(streamSid, {
    event: "clear_ack",
    stream_sid: streamSid,
  });

  console.log(`Audio buffer cleared for stream: ${streamSid}`);
}

4. Wrong Sample Rate Configuration

Problem: Incorrectly configured sample rates can cause audio playback issues or complete silence.

Solution: Implement sample rate validation:

javascript
// Also a method on the server class, reading rates from `this.config`.
validateSampleConfiguration() {
  const expectedInputRate = 8000; // Exotel input
  const expectedOutputRate = 24000; // Gemini output

  if (this.config.inputSampleRate !== expectedInputRate) {
    throw new Error(`Invalid input sample rate: expected ${expectedInputRate}, got ${this.config.inputSampleRate}`);
  }

  if (this.config.outputSampleRate !== expectedOutputRate) {
    throw new Error(`Invalid output sample rate: expected ${expectedOutputRate}, got ${this.config.outputSampleRate}`);
  }

  console.log("Sample rate configuration is valid");
}

By addressing these production considerations and avoiding common pitfalls, you can build a robust, high-performance voicebot system that delivers reliable real-time audio streaming between Exotel and Google Gemini.


Sources

  1. Exotel Stream and VoiceBot Applet Documentation — Technical specifications about WebSocket audio format and event handling: https://support.exotel.com/support/solutions/articles/3000108630-working-with-the-stream-and-voicebot-applet
  2. Google Gemini Live API Documentation — Implementation details for connecting and handling real-time audio: https://ai.google.dev/gemini-api/docs/live
  3. Exotel Pipecat AgentStream Guide — Production best practices for audio streaming and resampling: https://exotel.com/blog/exotel-pipecat-agentstream-guide/
  4. Exotel Quick Guide to Streaming Services — Minimum requirements and basic setup instructions for WebSocket connections: https://support.exotel.com/support/solutions/articles/3000132268-quick-guide-to-get-started-with-exotel-streaming-services
  5. Exotel AgentStream VoiceBot Applet Configuration — Detailed audio format specifications and bidirectional streaming requirements: https://docs.exotel.com/exotel-agentstream/voicebot-applet
  6. Exotel AgentStream Advanced Features — Advanced event handling and authentication methods beyond basic media events: https://docs.exotel.com/exotel-agentstream/advanced
  7. Exotel Voice Streaming Product Overview — High-level description of bidirectional streaming capabilities between Exotel and external platforms: https://exotel.com/products/voice-streaming/

Conclusion

Implementing real-time bidirectional audio streaming between Exotel Voicebot and Google Gemini Live API using WebSockets in Node.js requires careful attention to several critical factors. The Node.js WebSocket implementation must serve as a robust backend relay that handles audio format conversion between Exotel’s 8 kHz PCM and Gemini’s 24 kHz PCM output. The Exotel WebSocket URL carries all live audio payloads, making the backend architecture essential for proper bidirectional streaming.

Production-quality implementations must address audio resampling with specialized libraries like SOX, FFmpeg, or Speex rather than simple decimation methods. Proper chunk size optimization (3.2-5 kB range), buffering strategies (200-300 ms), and timestamp synchronization are crucial for achieving low-latency, smooth playback. Common pitfalls include chunk size misalignment, insufficient resampling quality, missing clear events, wrong sample rate configuration, and authentication problems.
