Node.js WebSocket Audio Streaming: Exotel to Gemini Integration
Implement real-time bidirectional audio streaming between Exotel Voicebot and Google Gemini using Node.js WebSockets. Learn audio resampling and production best practices.
How to Implement Real-Time Bidirectional Audio Streaming Between Exotel Voicebot and Google Gemini Live API Using WebSockets in Node.js?
Project Setup
Building a real-time AI voicebot in JavaScript with:
- Exotel for outbound calling and live audio streaming via WebSocket API
- Google Gemini Live API for speech-to-text and text-to-speech
- Full-duplex streaming for simultaneous caller input and Gemini responses
Tech Stack
- Backend: Node.js (Express + ws)
- Exotel audio: 8 kHz PCM
- Gemini audio output: 24 kHz PCM
- Transport: WebSockets (Exotel ↔ Backend ↔ Gemini)
Key Challenges and Questions
1. Exotel Workflow WebSocket URL Role
- What is the exact purpose of the WSS URL provided in the Exotel workflow?
- Does it handle only call control, or does all audio flow through it?
- If Exotel streams start, media, and stop events separately, is the WSS URL just a trigger?
2. Recommended Streaming Architecture
- Is the standard flow always Exotel → Backend → Gemini → Backend → Exotel?
- Is direct streaming from Exotel to Gemini possible without a backend relay?
3. Audio Resampling from 24 kHz to 8 kHz
- Current method: Simple decimation (every 3rd sample)
- Is this sufficient for production voicebots, or use libraries like SOX, FFmpeg, or Speex?
4. Real-Time Playback Issues (Delays/Choppiness)
- Suspected causes: chunk size, buffering, timestamps, sequence_number
- Best practices for smooth, low-latency playback with Exotel?
Node.js Backend Code (Simplified)
import express from "express";
import dotenv from "dotenv";
import { WebSocketServer } from "ws";
import { GoogleGenAI, Modality } from "@google/genai";
dotenv.config();
const app = express();
const PORT = process.env.PORT || 3000;
const ai = new GoogleGenAI({
apiKey: process.env.GEMINI_API_KEY,
});
const EXOTEL_SAMPLE_RATE = 8000;
const GEMINI_OUTPUT_SAMPLE_RATE = 24000;
const DOWNSAMPLE_RATIO = 3;
const PCM_CHUNK_SIZE = 320 * 5;
const server = app.listen(PORT, () => {
console.log(`Server running on port ${PORT}`);
});
const wss = new WebSocketServer({
server,
path: "/exotel/voicebot",
});
wss.on("connection", async (ws) => {
let streamSid = null;
let sequenceNumber = 1;
let chunkNumber = 1;
let audioBuffer = Buffer.alloc(0);
let session = null;
function processGeminiAudio(audioBase64) {
const geminiBuffer = Buffer.from(audioBase64, "base64");
const downsampledBuffer = Buffer.alloc(
Math.floor(geminiBuffer.length / DOWNSAMPLE_RATIO)
);
for (let i = 0; i < downsampledBuffer.length / 2; i++) {
const originalIndex = i * DOWNSAMPLE_RATIO * 2;
if (originalIndex + 1 < geminiBuffer.length) {
const sample = geminiBuffer.readInt16LE(originalIndex);
downsampledBuffer.writeInt16LE(sample, i * 2);
}
}
audioBuffer = Buffer.concat([audioBuffer, downsampledBuffer]);
while (audioBuffer.length >= PCM_CHUNK_SIZE) {
const sendChunk = audioBuffer.subarray(0, PCM_CHUNK_SIZE);
audioBuffer = audioBuffer.subarray(PCM_CHUNK_SIZE);
ws.send(
JSON.stringify({
event: "media",
sequence_number: sequenceNumber++,
stream_sid: streamSid,
media: {
chunk: chunkNumber++,
timestamp: String(sequenceNumber * 100),
payload: sendChunk.toString("base64"),
},
})
);
}
}
session = await ai.live.connect({
model: "gemini-2.5-flash-native-audio-preview-12-2025",
config: {
responseModalities: [Modality.AUDIO],
},
callbacks: {
onmessage: (message) => {
const audioBase64 =
message.serverContent?.modelTurn?.parts?.find(
(p) => p.inlineData
)?.inlineData?.data;
if (audioBase64) processGeminiAudio(audioBase64);
},
},
});
ws.on("message", (raw) => {
const data = JSON.parse(raw.toString());
if (data.event === "start") {
streamSid = data.stream_sid;
}
if (data.event === "media" && session) {
session.sendRealtimeInput({
media: {
mimeType: "audio/pcm;rate=8000",
data: data.media.payload,
},
});
}
if (data.event === "stop" && session) {
session.close();
}
});
});
Seeking
- Correct Exotel streaming architecture confirmation
- Workflow WSS URL clarification
- Low-latency bidirectional audio best practices
- Common Exotel voicebot pitfalls and real-world examples
Implementing real-time bidirectional audio streaming between Exotel Voicebot and Google Gemini Live API requires a backend WebSocket relay in Node.js that handles audio format conversion between Exotel’s 8 kHz PCM and Gemini’s 24 kHz PCM output. The Exotel WebSocket URL carries all live audio payload, not just call control events, making the backend relay architecture essential for proper bidirectional streaming. Audio resampling from 24 kHz to 8 kHz requires production-quality techniques beyond simple decimation to maintain voice clarity and avoid choppiness in real-time applications.
Contents
- Understanding Exotel and Google Gemini Live API Integration Architecture
- Setting Up Node.js WebSocket Server for Bidirectional Audio Streaming
- Implementing Production-Grade Audio Resampling Between 24 kHz and 8 kHz PCM
- Optimizing Real-Time Playback with Exotel WebSocket Events
- Production-Ready Implementation and Common Pitfalls
Understanding Exotel and Google Gemini Live API Integration Architecture
When implementing a real-time AI voicebot that bridges Exotel and Google Gemini Live API, understanding the integration architecture is crucial. The Exotel WebSocket URL provided in the Voicebot Applet serves as the primary bidirectional channel for all audio data, not just a trigger for call control events. This means that once a call is established, all audio streams from the caller to the voicebot and from the voicebot back to the caller flow through this WebSocket connection.
According to Exotel’s documentation, the WebSocket connection handles several event types, including start, media, stop, dtmf, mark, and clear events. The start event initializes the stream with a unique stream_sid, while media events contain the actual audio payload in base64-encoded PCM format. The stop event terminates the connection, and clear events are particularly important for managing audio buffers and preventing gaps in the conversation flow.
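These event names map to small JSON frames on the WebSocket. As a rough sketch of how a relay might parse and route them (the field names follow the start/media/stop shape described above; the exact payloads and any extra fields are illustrative, not an authoritative schema):

```javascript
// Illustrative Exotel-style event frames. Shapes follow the documented
// start/media/stop events; treat exact fields as examples, not a schema.
const startEvent = {
  event: "start",
  stream_sid: "abc123",
  start: { media_format: { encoding: "raw", sample_rate: "8000" } },
};
const mediaEvent = {
  event: "media",
  stream_sid: "abc123",
  media: { chunk: 1, timestamp: "0", payload: Buffer.from("pcm").toString("base64") },
};

// Minimal dispatcher: route each frame by its `event` field.
function dispatch(frame, handlers) {
  const handler = handlers[frame.event];
  if (!handler) return `unhandled:${frame.event}`;
  return handler(frame);
}

const handled = dispatch(mediaEvent, {
  start: (f) => `start:${f.stream_sid}`,
  media: (f) => `media:${f.media.chunk}`,
  stop: (f) => `stop:${f.stream_sid}`,
});
console.log(handled); // "media:1"
```

A table of handlers like this keeps the relay's message loop flat, and unknown events (future additions such as mark or clear variants) fall through harmlessly instead of crashing the connection.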
Regarding the streaming architecture, direct streaming from Exotel to Google Gemini Live API is not possible due to several technical constraints. Exotel’s audio format requirements (8 kHz PCM, 16-bit, mono, little-endian) differ significantly from Gemini’s optimal input/output (24 kHz PCM). Additionally, authentication, event handling, and audio processing require a backend relay to properly manage the bidirectional flow.
The standard production architecture follows this pattern: Exotel → Backend → Gemini → Backend → Exotel. This backend relay serves several critical functions:
- Audio format conversion between Exotel’s 8 kHz and Gemini’s 24 kHz PCM
- Proper authentication and session management
- Event handling and sequence number management
- Audio buffering and chunk size optimization
- Error recovery and reconnection logic
Authentication methods for Exotel WebSocket connections typically include IP whitelisting or Basic Auth, as detailed in Exotel’s advanced streaming documentation. This ensures that only authorized backend services can establish connections to the streaming endpoint.
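For the Basic Auth variant, the credential check can run against the Authorization header of the HTTP upgrade request before the WebSocket is accepted. A minimal sketch (the expected username/password here are placeholders for your own configuration):

```javascript
// Sketch: verify an HTTP Basic Auth header during the WebSocket upgrade.
// expectedUser / expectedPass are placeholders -- load them from your config.
function checkBasicAuth(authHeader, expectedUser, expectedPass) {
  if (!authHeader || !authHeader.startsWith("Basic ")) return false;
  const decoded = Buffer.from(authHeader.slice(6), "base64").toString("utf8");
  const sep = decoded.indexOf(":");
  if (sep === -1) return false;
  const user = decoded.slice(0, sep);
  const pass = decoded.slice(sep + 1);
  return user === expectedUser && pass === expectedPass;
}

// Example: "user:secret" encoded as a Basic header
const header = "Basic " + Buffer.from("user:secret").toString("base64");
console.log(checkBasicAuth(header, "user", "secret")); // true
```

With the ws library this check fits naturally into a manual `server.on('upgrade', ...)` handler (or the `verifyClient` option), so unauthorized sockets are rejected before any audio flows.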
Setting Up Node.js WebSocket Server for Bidirectional Audio Streaming
To implement a production-ready WebSocket server for bidirectional audio streaming between Exotel and Google Gemini, you’ll need to properly configure your Node.js environment with the necessary dependencies and implement robust connection handling. The implementation should focus on three core areas: connection management, audio processing, and event handling.
First, let’s establish the necessary dependencies and server configuration:
import express from "express";
import dotenv from "dotenv";
import { WebSocketServer } from "ws";
import { GoogleGenAI, Modality } from "@google/genai";
dotenv.config();
const app = express();
const PORT = process.env.PORT || 3000;
// Initialize Google Gemini client
const ai = new GoogleGenAI({
apiKey: process.env.GEMINI_API_KEY,
});
const server = app.listen(PORT, () => {
console.log(`Server running on port ${PORT}`);
});
// Configure WebSocket server for Exotel connections
const wss = new WebSocketServer({
server,
path: "/exotel/voicebot",
});
The WebSocket server must be configured to handle multiple simultaneous connections, each representing an active call session. Each connection requires proper session management to track the stream_sid, sequence numbers, and audio buffers.
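One way to sketch that per-call bookkeeping is a registry keyed by stream_sid, created on the start event and torn down on stop. The field names here mirror the state this article tracks per connection; the registry itself is one possible design, not the only one:

```javascript
// Sketch: per-call session registry keyed by stream_sid.
const activeStreams = new Map();

function openStream(streamSid) {
  activeStreams.set(streamSid, {
    sequenceNumber: 1,
    chunkNumber: 1,
    audioBuffer: Buffer.alloc(0),
    geminiSession: null, // attached once the Gemini connect resolves
  });
  return activeStreams.get(streamSid);
}

function closeStream(streamSid) {
  const state = activeStreams.get(streamSid);
  if (state?.geminiSession) state.geminiSession.close?.();
  activeStreams.delete(streamSid);
}

openStream("sid-1");
openStream("sid-2");
closeStream("sid-1");
console.log(activeStreams.size); // 1
```

Keeping all mutable state behind the stream_sid key (rather than in per-connection closures) makes it straightforward to inspect live calls, enforce per-call limits, and clean up on abnormal disconnects.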
wss.on("connection", async (ws, req) => {
let streamSid = null;
let sequenceNumber = 1;
let chunkNumber = 1;
let audioBuffer = Buffer.alloc(0);
let session = null;
// Connection authentication (IP whitelisting or Basic Auth);
// `req` is the underlying HTTP upgrade request, which exposes the peer address
const clientIp = req.socket.remoteAddress;
if (!isAuthorized(clientIp)) {
ws.send(JSON.stringify({ error: "Unauthorized" }));
ws.close();
return;
}
// Handle incoming messages from Exotel
ws.on("message", (raw) => {
const data = JSON.parse(raw.toString());
switch(data.event) {
case "start":
handleStartEvent(data, ws);
break;
case "media":
handleMediaEvent(data, session);
break;
case "stop":
handleStopEvent(data, session, ws);
break;
case "clear":
handleClearEvent(data);
break;
case "dtmf":
handleDtmfEvent(data, session);
break;
case "mark":
handleMarkEvent(data);
break;
}
});
// Error handling
ws.on("error", (error) => {
console.error(`WebSocket error: ${error.message}`);
if (session) {
session.close();
}
});
// Connection close handling
ws.on("close", () => {
console.log(`Connection closed for streamSid: ${streamSid}`);
if (session) {
session.close();
}
});
});
The event handlers must be carefully implemented to properly manage the bidirectional audio flow. The start handler initializes the Gemini session, while the media handler processes incoming audio and forwards it to Gemini.
async function handleStartEvent(data, ws) {
streamSid = data.stream_sid;
sequenceNumber = 1;
chunkNumber = 1;
audioBuffer = Buffer.alloc(0);
// Initialize Gemini Live API session
try {
session = await ai.live.connect({
model: "gemini-2.5-flash-native-audio-preview-12-2025",
config: {
responseModalities: [Modality.AUDIO],
},
callbacks: {
onmessage: (message) => processGeminiAudio(message, ws),
onerror: (error) => console.error("Gemini error:", error),
onclose: () => console.log("Gemini session closed"),
},
});
// Send acknowledgment to Exotel
ws.send(JSON.stringify({
event: "start_ack",
stream_sid: streamSid
}));
} catch (error) {
console.error("Failed to initialize Gemini session:", error);
ws.send(JSON.stringify({
event: "error",
error: "Failed to initialize AI session"
}));
}
}
function handleMediaEvent(data, session) {
if (!session || !data.media?.payload) return;
// Forward audio to Gemini
session.sendRealtimeInput({
media: {
mimeType: "audio/pcm;rate=8000",
data: data.media.payload,
},
});
}
This implementation establishes the foundation for bidirectional audio streaming. The key is ensuring that all audio events are properly routed between Exotel and Gemini while maintaining connection state and managing sequence numbers correctly.
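On the return path, the media timestamp should advance by the real duration of each chunk rather than by an arbitrary step (the simplified code's `sequenceNumber * 100` is one suspect for choppy playback). At 8 kHz, 16-bit mono, each millisecond of audio is 16 bytes, so a chunk of N bytes covers N / 16 ms. A small sketch of that bookkeeping:

```javascript
// Sketch: advance the outbound media timestamp by each chunk's real duration.
// 8 kHz * 16-bit mono PCM = 16 bytes of audio per millisecond.
const BYTES_PER_MS = (8000 * 2) / 1000; // = 16

function makeChunkMetadata(state, chunkBytes) {
  const meta = {
    sequence_number: state.sequenceNumber++,
    chunk: state.chunkNumber++,
    timestamp: String(state.timestampMs),
  };
  state.timestampMs += chunkBytes / BYTES_PER_MS; // duration of this chunk
  return meta;
}

const state = { sequenceNumber: 1, chunkNumber: 1, timestampMs: 0 };
const first = makeChunkMetadata(state, 3200);  // 3200 bytes = 200 ms of audio
const second = makeChunkMetadata(state, 3200);
console.log(first.timestamp, second.timestamp); // "0" "200"
```

Timestamps that track actual audio duration let the receiving side detect gaps and reorder late frames; timestamps decoupled from duration make both impossible.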
Implementing Production-Grade Audio Resampling Between 24 kHz and 8 kHz PCM
Audio resampling represents one of the most critical components in the bidirectional streaming architecture between Exotel and Google Gemini. Exotel requires 8 kHz PCM audio, while Gemini produces 24 kHz PCM output. The simple decimation method (taking every 3rd sample) in the provided code is insufficient for production-quality voicebots, as it introduces aliasing artifacts and degrades voice clarity.
For production environments, you should implement professional audio resampling using specialized libraries. Three primary approaches are available: SOX, FFmpeg, and Speex. Each has different performance characteristics and quality levels suitable for different use cases.
SOX-Based Resampling
SOX (Sound eXchange) is a comprehensive command-line audio processing tool that provides high-quality resampling with configurable algorithms:
const { exec } = require('child_process');
function resampleWithSox(inputBuffer, inputRate, outputRate) {
return new Promise((resolve, reject) => {
const tempInput = `/tmp/input_${Date.now()}.raw`;
const tempOutput = `/tmp/output_${Date.now()}.raw`;
// Write the raw (headerless) PCM buffer to a temporary file
require('fs').writeFileSync(tempInput, inputBuffer);
// Execute SOX with explicit raw-PCM input and output formats
const command = `sox -t raw -r ${inputRate} -e signed-integer -b 16 -c 1 ${tempInput} -t raw -r ${outputRate} -e signed-integer -b 16 -c 1 ${tempOutput} rate -h`;
exec(command, (error) => {
if (error) {
reject(error);
return;
}
// Read resampled output
const outputBuffer = require('fs').readFileSync(tempOutput);
// Clean up temporary files
require('fs').unlinkSync(tempInput);
require('fs').unlinkSync(tempOutput);
resolve(outputBuffer);
});
});
}
The rate -h flag in SOX uses a high-quality algorithm suitable for voice applications. While SOX provides excellent quality, it introduces additional latency due to file I/O operations, making it less suitable for ultra-low-latency applications.
FFmpeg-Based Resampling
FFmpeg offers a more efficient approach with lower latency while maintaining high quality:
const { spawn } = require('child_process');
function resampleWithFFmpeg(inputBuffer, inputRate, outputRate) {
return new Promise((resolve, reject) => {
const ffmpeg = spawn('ffmpeg', [
// Describe the raw PCM input explicitly -- a pipe carries no header
'-f', 's16le',
'-ar', inputRate.toString(),
'-ac', '1',
'-i', 'pipe:0',
'-ar', outputRate.toString(),
'-ac', '1',
'-f', 's16le',
'-c:a', 'pcm_s16le',
'pipe:1'
]);
const chunks = [];
ffmpeg.stdout.on('data', (chunk) => chunks.push(chunk));
ffmpeg.stderr.on('data', () => {}); // drain stderr so ffmpeg never blocks on it
ffmpeg.on('error', (error) => reject(error));
ffmpeg.on('close', () => {
resolve(Buffer.concat(chunks));
});
// Feed the input buffer and close stdin so ffmpeg can flush and exit
ffmpeg.stdin.write(inputBuffer);
ffmpeg.stdin.end();
});
}
FFmpeg typically provides better performance than SOX for streaming applications, with lower overhead and more efficient processing of audio streams.
Speex-Based Resampling
For the lowest latency applications, Speex (a specialized voice codec library) offers optimized resampling specifically designed for voice:
// Assumes a Speex resampler binding for Node (e.g. an npm package such as
// speex-resampler); the exact constructor and method names vary by binding,
// so adapt this sketch to the API of the one you install.
const speex = require('speex');
function resampleWithSpeex(inputBuffer, inputRate, outputRate) {
// Initialize resampler: 1 channel, quality 0-10 (higher = better, slower)
const resampler = new speex.Resampler(1, inputRate, outputRate, 7);
// Convert input buffer to Float32Array
const floatInput = new Float32Array(inputBuffer.length / 2);
for (let i = 0; i < floatInput.length; i++) {
floatInput[i] = inputBuffer.readInt16LE(i * 2) / 32768;
}
// Calculate output size
const outputSize = Math.floor(floatInput.length * outputRate / inputRate);
const floatOutput = new Float32Array(outputSize);
// Perform resampling
resampler.process(floatInput, floatOutput);
// Convert back to Int16
const outputBuffer = Buffer.alloc(outputSize * 2);
for (let i = 0; i < outputSize; i++) {
outputBuffer.writeInt16LE(Math.max(-32768, Math.min(32767, floatOutput[i] * 32768)), i * 2);
}
return outputBuffer;
}
Speex provides the best performance for real-time voice applications with minimal latency, making it ideal for voicebot scenarios where responsiveness is critical.
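If none of these native options are available in your deployment environment, a dependency-free middle ground between raw decimation and a real resampler library is linear interpolation. It is still inferior to a proper polyphase filter, but it avoids the worst decimation artifacts and runs entirely in JavaScript. A sketch:

```javascript
// Sketch: linear-interpolation resampler for 16-bit LE mono PCM.
// Better than bare sample-dropping, worse than a real polyphase filter.
function resampleLinear(inputBuffer, fromRate, toRate) {
  const inSamples = Math.floor(inputBuffer.length / 2);
  const outSamples = Math.floor((inSamples * toRate) / fromRate);
  const out = Buffer.alloc(outSamples * 2);
  for (let i = 0; i < outSamples; i++) {
    const pos = (i * fromRate) / toRate;        // fractional source index
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, inSamples - 1); // clamp at the buffer end
    const frac = pos - i0;
    const s0 = inputBuffer.readInt16LE(i0 * 2);
    const s1 = inputBuffer.readInt16LE(i1 * 2);
    out.writeInt16LE(Math.round(s0 + (s1 - s0) * frac), i * 2);
  }
  return out;
}

// 24 kHz -> 8 kHz: the output carries one third as many samples.
const input = Buffer.alloc(24000 * 2); // 1 second of silence at 24 kHz
const output = resampleLinear(input, 24000, 8000);
console.log(output.length / 2); // 8000
```

For a 3:1 downsample this should strictly be preceded by a low-pass filter (as in the decimation fallback above); in practice, interpolation plus a simple smoothing pass is audibly better than either alone.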
Integrated Resampling Function
For production use, you should integrate these resampling methods with proper error handling and fallback mechanisms:
async function resampleAudio(geminiBuffer, fromRate = 24000, toRate = 8000) {
try {
// Try Speex first for lowest latency
return resampleWithSpeex(geminiBuffer, fromRate, toRate);
} catch (speexError) {
console.warn("Speex resampling failed, falling back to FFmpeg:", speexError.message);
try {
// Fall back to FFmpeg
return await resampleWithFFmpeg(geminiBuffer, fromRate, toRate); // await so rejections hit this catch
} catch (ffmpegError) {
console.error("FFmpeg resampling failed:", ffmpegError.message);
// Last resort: simple decimation (with anti-aliasing filter)
return simpleDecimation(geminiBuffer, fromRate, toRate);
}
}
}
function simpleDecimation(inputBuffer, fromRate, toRate) {
const ratio = fromRate / toRate;
// Work in samples (2 bytes each), not raw byte counts, to size the output
const inputSamples = Math.floor(inputBuffer.length / 2);
const outputSamples = Math.floor(inputSamples / ratio);
const outputBuffer = Buffer.alloc(outputSamples * 2);
// Apply simple anti-aliasing filter before decimation
const filteredBuffer = applyAntiAliasingFilter(inputBuffer);
for (let i = 0; i < outputSamples; i++) {
const sourceIndex = Math.floor(i * ratio);
outputBuffer.writeInt16LE(filteredBuffer.readInt16LE(sourceIndex * 2), i * 2);
}
return outputBuffer;
}
function applyAntiAliasingFilter(inputBuffer) {
// Simple moving average filter to reduce aliasing
const filterSize = 3;
const filteredBuffer = Buffer.alloc(inputBuffer.length);
for (let i = 0; i < inputBuffer.length / 2; i++) {
let sum = 0;
let count = 0;
for (let j = Math.max(0, i - Math.floor(filterSize / 2));
j <= Math.min(inputBuffer.length / 2 - 1, i + Math.floor(filterSize / 2));
j++) {
sum += inputBuffer.readInt16LE(j * 2);
count++;
}
filteredBuffer.writeInt16LE(Math.round(sum / count), i * 2);
}
return filteredBuffer;
}
This integrated approach provides production-quality audio resampling with fallback mechanisms, ensuring your voicebot maintains voice clarity even if the primary resampling method fails.
Optimizing Real-Time Playback with Exotel WebSocket Events
Achieving smooth, low-latency playback in an Exotel-Gemini integration requires careful optimization of several factors: chunk size alignment, proper buffering strategies, timestamp synchronization, and robust event handling. The primary challenges include minimizing latency while preventing audio gaps, choppiness, or delays that degrade the user experience.
Chunk Size Optimization
Exotel recommends chunk sizes between 3.2 kB and 100 kB for WebSocket audio streaming, with optimal performance typically achieved in the 3.2-5 kB range. The current implementation uses a fixed chunk size of 1.6 kB (320 bytes × 5), which may be too small for efficient streaming. Increasing the chunk size while staying within the optimal range can significantly improve performance:
const EXOTEL_MIN_CHUNK_SIZE = 320 * 10; // 3.2 kB minimum
const EXOTEL_OPTIMAL_CHUNK_SIZE = 320 * 16; // ~5 kB optimal
const EXOTEL_MAX_CHUNK_SIZE = 320 * 50; // 16 kB (well below the 100 kB limit)
// In your audio processing function:
async function processGeminiAudio(audioBase64, ws) { // async: resampling below is awaited
const geminiBuffer = Buffer.from(audioBase64, "base64");
const downsampledBuffer = await resampleAudio(geminiBuffer, 24000, 8000);
audioBuffer = Buffer.concat([audioBuffer, downsampledBuffer]);
while (audioBuffer.length >= EXOTEL_OPTIMAL_CHUNK_SIZE) {
const sendChunk = audioBuffer.subarray(0, EXOTEL_OPTIMAL_CHUNK_SIZE);
audioBuffer = audioBuffer.subarray(EXOTEL_OPTIMAL_CHUNK_SIZE);
ws.send(
JSON.stringify({
event: "media",
sequence_number: sequenceNumber++,
stream_sid: streamSid,
media: {
chunk: chunkNumber++,
timestamp: String(Date.now()), // Use actual timestamp
payload: sendChunk.toString("base64"),
},
})
);
}
}
Buffering and Timing Strategies
Proper buffering is essential to prevent audio gaps while maintaining low latency. A 200-300ms buffer provides a good balance, allowing for network fluctuations while maintaining responsiveness:
class AudioBuffer {
constructor(targetBufferMs = 250) { // 250 ms target buffer
this.buffer = Buffer.alloc(0);
// 8 kHz * 16-bit mono = 16 bytes of audio per millisecond
this.targetBufferBytes = targetBufferMs * 16;
this.lastSendTime = Date.now();
}
addAudio(audioData) {
this.buffer = Buffer.concat([this.buffer, audioData]);
this.processBuffer();
}
processBuffer() {
const now = Date.now();
const timeSinceLastSend = now - this.lastSendTime;
// Send if we have enough buffered audio or if too much time has passed
if (this.buffer.length >= this.targetBufferBytes || timeSinceLastSend > 100) {
const chunkSize = Math.min(this.buffer.length, EXOTEL_OPTIMAL_CHUNK_SIZE);
const sendChunk = this.buffer.subarray(0, chunkSize);
this.buffer = this.buffer.subarray(chunkSize);
this.lastSendTime = now;
return sendChunk;
}
return null;
}
}
Event Handling and Sequence Management
Proper handling of Exotel’s event types, particularly the clear event, is critical for maintaining audio continuity:
function handleClearEvent(data) {
// Reset sequence numbers when Exotel requests a clear
sequenceNumber = 1;
chunkNumber = 1;
// Clear any pending audio buffer to prevent gaps
audioBuffer = Buffer.alloc(0);
console.log(`Audio buffer cleared for streamSid: ${streamSid}`);
}
function handleMarkEvent(data, ws) { // ws passed in so the handler can reply
// Handle mark events for synchronization points
console.log(`Mark event received: ${data.mark.mark_name}`);
// Optional: Send acknowledgment back to Exotel (readyState 1 === OPEN)
if (ws && ws.readyState === 1) {
ws.send(JSON.stringify({
event: "mark_ack",
mark_name: data.mark.mark_name,
stream_sid: streamSid
}));
}
}
Latency Measurement and Optimization
Implementing latency measurement allows you to identify and address performance bottlenecks:
class LatencyMonitor {
constructor() {
this.measurements = [];
this.maxMeasurements = 100;
}
measure(inputTime, outputTime) {
const latency = outputTime - inputTime;
this.measurements.push(latency);
if (this.measurements.length > this.maxMeasurements) {
this.measurements.shift();
}
return this.getAverageLatency();
}
getAverageLatency() {
if (this.measurements.length === 0) return 0;
const sum = this.measurements.reduce((acc, val) => acc + val, 0);
return sum / this.measurements.length;
}
getLatencyStatus() {
const avgLatency = this.getAverageLatency();
if (avgLatency < 200) return "excellent";
if (avgLatency < 400) return "good";
if (avgLatency < 600) return "acceptable";
return "poor";
}
}
This optimized implementation addresses the common issues of latency, choppiness, and audio gaps by implementing proper chunk sizing, buffering strategies, timestamp synchronization, and latency monitoring.
Production-Ready Implementation and Common Pitfalls
When deploying a production voicebot system that bridges Exotel and Google Gemini, several critical considerations must be addressed to ensure reliability, performance, and user experience. This section covers the essential production-ready components and highlights common pitfalls that can degrade performance or cause system failures.
Production Architecture Components
A production-ready implementation should include several key components beyond the basic streaming functionality:
// Production-ready server setup
class ExotelGeminiServer {
constructor() {
this.activeStreams = new Map(); // Track active streams by streamSid
this.sessionPool = new SessionPool(); // Manage Gemini sessions
this.errorHandler = new ErrorHandler(); // Centralized error handling
this.metricsCollector = new MetricsCollector(); // Performance monitoring
}
initialize() {
// Setup Express server
this.setupExpressServer();
// Setup WebSocket server
this.setupWebSocketServer();
// Setup monitoring and logging
this.setupMonitoring();
// Setup error recovery
this.setupErrorRecovery();
}
setupWebSocketServer() {
const wss = new WebSocketServer({
server: this.server,
path: "/exotel/voicebot",
maxPayload: 10 * 1024 * 1024, // 10MB max payload
});
wss.on("connection", (ws, req) => {
this.handleNewConnection(ws, req);
});
// Handle server shutdown
process.on('SIGTERM', () => this.gracefulShutdown(wss));
process.on('SIGINT', () => this.gracefulShutdown(wss));
}
handleNewConnection(ws, req) {
// Authentication check against the HTTP upgrade request
if (!this.authenticateConnection(req)) {
ws.close(1008, "Unauthorized");
return;
}
// Setup connection handlers
ws.on('message', (data) => this.handleMessage(ws, data));
ws.on('error', (error) => this.handleError(ws, error));
ws.on('close', () => this.handleConnectionClose(ws));
}
authenticateConnection(req) {
// Implement your authentication logic
// This could be IP whitelisting, JWT tokens, or Basic Auth
const clientIp = req.socket.remoteAddress;
return this.isAuthorized(clientIp);
}
}
Error Handling and Recovery
Robust error handling is essential for production systems:
class ErrorHandler {
constructor() {
this.errorCounts = new Map();
this.maxErrorRetries = 3;
}
handleWebSocketError(ws, error) {
const errorType = error.constructor.name;
this.incrementErrorCount(errorType);
console.error(`WebSocket error (${errorType}):`, error.message);
// Attempt recovery for certain error types
if (this.isRecoverable(error)) {
return this.attemptRecovery(ws, error);
}
// For non-recoverable errors, close the connection
ws.close(1011, "Internal error");
}
handleGeminiError(session, error) {
console.error("Gemini API error:", error.message);
// Attempt to reinitialize the session
if (session) {
this.reinitializeSession(session);
}
}
isRecoverable(error) {
// Define which errors are recoverable
const recoverableErrors = [
"WebSocketConnectionError",
"TemporaryNetworkError",
"RateLimitError"
];
return recoverableErrors.includes(error.constructor.name);
}
async attemptRecovery(ws, error) {
// Implement your recovery logic
// This might involve reconnecting, reinitializing sessions, etc.
console.log("Attempting recovery for error:", error.message);
// Your recovery implementation here
}
}
Common Pitfalls and Solutions
1. Chunk Size Misalignment
Problem: Using chunk sizes that don’t align with Exotel’s expectations can cause streaming issues.
Solution: Implement chunk size validation and adjustment:
function validateChunkSize(chunkSize) {
const MIN_CHUNK_SIZE = 320 * 10; // 3.2 kB
const MAX_CHUNK_SIZE = 320 * 50; // 16 kB
const OPTIMAL_CHUNK_SIZE = 320 * 16; // ~5 kB
if (chunkSize < MIN_CHUNK_SIZE) {
console.warn(`Chunk size ${chunkSize} is below minimum ${MIN_CHUNK_SIZE}`);
return MIN_CHUNK_SIZE;
}
if (chunkSize > MAX_CHUNK_SIZE) {
console.warn(`Chunk size ${chunkSize} exceeds maximum ${MAX_CHUNK_SIZE}`);
return MAX_CHUNK_SIZE;
}
// Align to optimal size if possible
if (chunkSize !== OPTIMAL_CHUNK_SIZE) {
const diff = Math.abs(chunkSize - OPTIMAL_CHUNK_SIZE);
if (diff < 320) { // Within one 20 ms frame (320 bytes)
return OPTIMAL_CHUNK_SIZE;
}
}
return chunkSize;
}
2. Insufficient Audio Resampling Quality
Problem: Simple decimation methods produce poor-quality audio that sounds robotic or has artifacts.
Solution: Implement proper resampling with quality fallbacks:
async function highQualityResampling(inputBuffer, fromRate, toRate) {
try {
// Try high-quality resampling first
return await resampleWithSpeex(inputBuffer, fromRate, toRate);
} catch (error) {
console.warn("High-quality resampling failed, falling back to FFmpeg");
try {
return await resampleWithFFmpeg(inputBuffer, fromRate, toRate);
} catch (fallbackError) {
console.error("Fallback resampling also failed");
// Return original buffer with warning
console.warn("Using original buffer without resampling");
return inputBuffer;
}
}
}
3. Missing Clear Events
Problem: Failing to properly handle clear events can cause audio buffer overflow and gaps in conversation.
Solution: Implement proper clear event handling:
function handleClearEvent(data, streamSid) {
const stream = this.activeStreams.get(streamSid);
if (!stream) {
console.warn(`Clear event received for unknown stream: ${streamSid}`);
return;
}
// Reset stream state
stream.sequenceNumber = 1;
stream.chunkNumber = 1;
stream.audioBuffer = Buffer.alloc(0);
stream.lastTimestamp = null;
// Send acknowledgment
this.sendToExotel(streamSid, {
event: "clear_ack",
stream_sid: streamSid
});
console.log(`Audio buffer cleared for stream: ${streamSid}`);
}
4. Wrong Sample Rate Configuration
Problem: Incorrectly configured sample rates can cause audio playback issues or complete silence.
Solution: Implement sample rate validation:
function validateSampleConfiguration() {
const expectedInputRate = 8000; // Exotel input
const expectedOutputRate = 24000; // Gemini output
if (this.config.inputSampleRate !== expectedInputRate) {
throw new Error(`Invalid input sample rate: expected ${expectedInputRate}, got ${this.config.inputSampleRate}`);
}
if (this.config.outputSampleRate !== expectedOutputRate) {
throw new Error(`Invalid output sample rate: expected ${expectedOutputRate}, got ${this.config.outputSampleRate}`);
}
console.log("Sample rate configuration is valid");
}
By addressing these production considerations and avoiding common pitfalls, you can build a robust, high-performance voicebot system that delivers reliable real-time audio streaming between Exotel and Google Gemini.
Sources
- Exotel Stream and VoiceBot Applet Documentation — Technical specifications about WebSocket audio format and event handling: https://support.exotel.com/support/solutions/articles/3000108630-working-with-the-stream-and-voicebot-applet
- Google Gemini Live API Documentation — Implementation details for connecting and handling real-time audio: https://ai.google.dev/gemini-api/docs/live
- Exotel Pipecat AgentStream Guide — Production best practices for audio streaming and resampling: https://exotel.com/blog/exotel-pipecat-agentstream-guide/
- Exotel Quick Guide to Streaming Services — Minimum requirements and basic setup instructions for WebSocket connections: https://support.exotel.com/support/solutions/articles/3000132268-quick-guide-to-get-started-with-exotel-streaming-services
- Exotel AgentStream VoiceBot Applet Configuration — Detailed audio format specifications and bidirectional streaming requirements: https://docs.exotel.com/exotel-agentstream/voicebot-applet
- Exotel AgentStream Advanced Features — Advanced event handling and authentication methods beyond basic media events: https://docs.exotel.com/exotel-agentstream/advanced
- Exotel Voice Streaming Product Overview — High-level description of bidirectional streaming capabilities between Exotel and external platforms: https://exotel.com/products/voice-streaming/
Conclusion
Implementing real-time bidirectional audio streaming between Exotel Voicebot and Google Gemini Live API using WebSockets in Node.js requires careful attention to several critical factors. The Node.js WebSocket implementation must serve as a robust backend relay that handles audio format conversion between Exotel’s 8 kHz PCM and Gemini’s 24 kHz PCM output. The Exotel WebSocket URL carries all live audio payload, making the backend relay architecture essential for proper bidirectional streaming.
Production-quality implementations must address audio resampling with specialized libraries like SOX, FFmpeg, or Speex rather than simple decimation methods. Proper chunk size optimization (3.2-5 kB range), buffering strategies (200-300 ms), and timestamp synchronization are crucial for achieving low-latency, smooth playback. Common pitfalls include chunk size misalignment, insufficient resampling quality, missing clear events, wrong sample rate configuration, and authentication problems.