
Node.js WebSocket Audio Streaming: Exotel to Gemini Integration

Implement real-time bidirectional audio streaming between Exotel Voicebot and Google Gemini using Node.js WebSockets. Learn audio resampling and production best practices.


How to Implement Real-Time Bidirectional Audio Streaming Between Exotel Voicebot and Google Gemini Live API Using WebSockets in Node.js?

Project Setup

Building a real-time AI voicebot in JavaScript with:

  • Exotel for outbound calling and live audio streaming via WebSocket API
  • Google Gemini Live API for speech-to-text and text-to-speech
  • Full-duplex streaming for simultaneous caller input and Gemini responses

Tech Stack

  • Backend: Node.js (Express + ws)
  • Exotel audio: 8 kHz PCM
  • Gemini audio output: 24 kHz PCM
  • Transport: WebSockets (Exotel ↔ Backend ↔ Gemini)

Key Challenges and Questions

1. Exotel Workflow WebSocket URL Role

  • What is the exact purpose of the WSS URL provided in the Exotel workflow?
  • Does it handle only call control, or does all audio flow through it?
  • If Exotel streams start, media, and stop events separately, is the WSS URL just a trigger?

2. Recommended Streaming Architecture

  • Is the standard flow always Exotel → Backend → Gemini → Backend → Exotel?
  • Is direct streaming from Exotel to Gemini possible without a backend relay?

3. Audio Resampling from 24 kHz to 8 kHz

  • Current method: Simple decimation (every 3rd sample)
  • Is this sufficient for production voicebots, or should a library such as SOX, FFmpeg, or Speex be used instead?

4. Real-Time Playback Issues (Delays/Choppiness)

  • Suspected causes: chunk size, buffering, timestamps, sequence_number
  • Best practices for smooth, low-latency playback with Exotel?

Node.js Backend Code (Simplified)

javascript
import express from "express";
import dotenv from "dotenv";
import { WebSocketServer } from "ws";
import { GoogleGenAI, Modality } from "@google/genai";

dotenv.config();
const app = express();
const PORT = process.env.PORT || 3000;

const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
});

const EXOTEL_SAMPLE_RATE = 8000;
const GEMINI_OUTPUT_SAMPLE_RATE = 24000;
const DOWNSAMPLE_RATIO = 3;
const PCM_CHUNK_SIZE = 320 * 5;

const server = app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});

const wss = new WebSocketServer({
  server,
  path: "/exotel/voicebot",
});

wss.on("connection", async (ws) => {
  let streamSid = null;
  let sequenceNumber = 1;
  let chunkNumber = 1;
  let audioBuffer = Buffer.alloc(0);
  let session = null;

  function processGeminiAudio(audioBase64) {
    const geminiBuffer = Buffer.from(audioBase64, "base64");
    const downsampledBuffer = Buffer.alloc(
      Math.floor(geminiBuffer.length / DOWNSAMPLE_RATIO)
    );

    for (let i = 0; i < downsampledBuffer.length / 2; i++) {
      const originalIndex = i * DOWNSAMPLE_RATIO * 2;
      if (originalIndex + 1 < geminiBuffer.length) {
        const sample = geminiBuffer.readInt16LE(originalIndex);
        downsampledBuffer.writeInt16LE(sample, i * 2);
      }
    }

    audioBuffer = Buffer.concat([audioBuffer, downsampledBuffer]);

    while (audioBuffer.length >= PCM_CHUNK_SIZE) {
      const sendChunk = audioBuffer.subarray(0, PCM_CHUNK_SIZE);
      audioBuffer = audioBuffer.subarray(PCM_CHUNK_SIZE);

      ws.send(
        JSON.stringify({
          event: "media",
          sequence_number: sequenceNumber++,
          stream_sid: streamSid,
          media: {
            chunk: chunkNumber++,
            timestamp: String(sequenceNumber * 100),
            payload: sendChunk.toString("base64"),
          },
        })
      );
    }
  }

  session = await ai.live.connect({
    model: "gemini-2.5-flash-native-audio-preview-12-2025",
    config: {
      responseModalities: [Modality.AUDIO],
    },
    callbacks: {
      onmessage: (message) => {
        const audioBase64 =
          message.serverContent?.modelTurn?.parts?.find(
            (p) => p.inlineData
          )?.inlineData?.data;

        if (audioBase64) processGeminiAudio(audioBase64);
      },
    },
  });

  ws.on("message", (raw) => {
    const data = JSON.parse(raw.toString());

    if (data.event === "start") {
      streamSid = data.stream_sid;
    }

    if (data.event === "media" && session) {
      session.sendRealtimeInput({
        media: {
          mimeType: "audio/pcm;rate=8000",
          data: data.media.payload,
        },
      });
    }

    if (data.event === "stop" && session) {
      session.close();
    }
  });
});

Seeking

  • Correct Exotel streaming architecture confirmation
  • Workflow WSS URL clarification
  • Low-latency bidirectional audio best practices
  • Common Exotel voicebot pitfalls and real-world examples

Implementing real-time bidirectional audio streaming between Exotel Voicebot and Google Gemini Live API requires a backend WebSocket relay in Node.js that handles audio format conversion between Exotel’s 8 kHz PCM and Gemini’s 24 kHz PCM output. The Exotel WebSocket URL carries all live audio payloads, not just call control events, making the backend relay architecture essential for proper bidirectional streaming. Audio resampling from 24 kHz to 8 kHz requires production-quality techniques beyond simple decimation to maintain voice clarity and avoid choppiness in real-time applications.


Understanding Exotel and Google Gemini Live API Integration Architecture

When implementing a real-time AI voicebot that bridges Exotel and Google Gemini Live API, understanding the integration architecture is crucial. The Exotel WebSocket URL provided in the Voicebot Applet serves as the primary bidirectional channel for all audio data, not just a trigger for call control events. This means that once a call is established, all audio streams from the caller to the voicebot and from the voicebot back to the caller flow through this WebSocket connection.

According to Exotel’s documentation, the WebSocket connection handles several event types, including start, media, stop, dtmf, mark, and clear events. The start event initializes the stream with a unique stream_sid, while media events contain the actual audio payload in base64-encoded PCM format. The stop event terminates the connection, and clear events are particularly important for managing audio buffers and preventing gaps in the conversation flow.
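
To make the event flow concrete, the messages exchanged over the Exotel WebSocket look roughly like the sketch below. Field names follow Exotel’s documentation for the events described above; treat the exact payload shapes as indicative rather than authoritative, and check the Stream/Voicebot Applet docs for the full schema:

```javascript
// Indicative shapes of Exotel stream events (base64 payload shortened).
const startEvent = {
  event: "start",
  sequence_number: 1,
  stream_sid: "stream-abc123", // unique identifier for this call's stream
};

const mediaEvent = {
  event: "media",
  sequence_number: 2,
  stream_sid: "stream-abc123",
  media: {
    chunk: 1,
    timestamp: "100", // ms offset, sent as a string
    payload: "AAAA...", // base64-encoded 8 kHz, 16-bit LE mono PCM
  },
};

const stopEvent = {
  event: "stop",
  sequence_number: 3,
  stream_sid: "stream-abc123",
};

// A relay typically dispatches on the `event` field.
for (const msg of [startEvent, mediaEvent, stopEvent]) {
  console.log(msg.event);
}
```

All three messages carry the same `stream_sid`, which is how a relay correlates media chunks with the session opened by `start`.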

Regarding the streaming architecture, direct streaming from Exotel to Google Gemini Live API is not possible due to several technical constraints. Exotel’s audio format requirements (8 kHz PCM, 16-bit, mono, little-endian) differ significantly from Gemini’s optimal input/output (24 kHz PCM). Additionally, authentication, event handling, and audio processing require a backend relay to properly manage the bidirectional flow.

The standard production architecture follows this pattern: Exotel → Backend → Gemini → Backend → Exotel. This backend relay serves several critical functions:

  1. Audio format conversion between Exotel’s 8 kHz and Gemini’s 24 kHz PCM
  2. Proper authentication and session management
  3. Event handling and sequence number management
  4. Audio buffering and chunk size optimization
  5. Error recovery and reconnection logic

Authentication methods for Exotel WebSocket connections typically include IP whitelisting or Basic Auth, as detailed in Exotel’s advanced streaming documentation. This ensures that only authorized backend services can establish connections to the streaming endpoint.
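
If you opt for Basic Auth, a minimal credential check during the WebSocket upgrade might look like the sketch below. This is a generic HTTP Basic Auth decode, not an Exotel-specific API; the credential names are placeholders for your own configuration:

```javascript
// Minimal Basic Auth check for the WebSocket upgrade request.
// expectedUser/expectedPass are your own configured credentials.
function checkBasicAuth(authHeader, expectedUser, expectedPass) {
  if (!authHeader || !authHeader.startsWith("Basic ")) return false;
  const decoded = Buffer.from(authHeader.slice(6), "base64").toString("utf8");
  const sep = decoded.indexOf(":");
  if (sep === -1) return false;
  const user = decoded.slice(0, sep);
  const pass = decoded.slice(sep + 1);
  return user === expectedUser && pass === expectedPass;
}

// Usage with the `ws` server: verify during the HTTP upgrade, e.g.
// const wss = new WebSocketServer({
//   server,
//   path: "/exotel/voicebot",
//   verifyClient: ({ req }, done) =>
//     done(checkBasicAuth(req.headers.authorization,
//                         process.env.EXOTEL_USER, process.env.EXOTEL_PASS)),
// });

const header = "Basic " + Buffer.from("exotel:s3cret").toString("base64");
console.log(checkBasicAuth(header, "exotel", "s3cret")); // true
```

Rejecting the connection at the upgrade stage (rather than after `connection`) keeps unauthorized clients from ever opening a socket.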


Setting Up Node.js WebSocket Server for Bidirectional Audio Streaming

To implement a production-ready WebSocket server for bidirectional audio streaming between Exotel and Google Gemini, you’ll need to properly configure your Node.js environment with the necessary dependencies and implement robust connection handling. The implementation should focus on three core areas: connection management, audio processing, and event handling.

First, let’s establish the necessary dependencies and server configuration:

javascript
import express from "express";
import dotenv from "dotenv";
import { WebSocketServer } from "ws";
import { GoogleGenAI, Modality } from "@google/genai";

dotenv.config();
const app = express();
const PORT = process.env.PORT || 3000;

// Initialize Google Gemini client
const ai = new GoogleGenAI({
  apiKey: process.env.GEMINI_API_KEY,
});

const server = app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});

// Configure WebSocket server for Exotel connections
const wss = new WebSocketServer({
  server,
  path: "/exotel/voicebot",
});

The WebSocket server must be configured to handle multiple simultaneous connections, each representing an active call session. Each connection requires proper session management to track the stream_sid, sequence numbers, and audio buffers.

javascript
wss.on("connection", async (ws, req) => {
  let streamSid = null;
  let sequenceNumber = 1;
  let chunkNumber = 1;
  let audioBuffer = Buffer.alloc(0);
  let session = null;

  // Connection authentication (IP whitelisting or Basic Auth).
  // Note: the client IP lives on the HTTP upgrade request, not on the ws object.
  const clientIp = req.socket.remoteAddress;
  if (!isAuthorized(clientIp)) {
    ws.send(JSON.stringify({ error: "Unauthorized" }));
    ws.close();
    return;
  }

  // Handle incoming messages from Exotel
  ws.on("message", (raw) => {
    const data = JSON.parse(raw.toString());

    switch (data.event) {
      case "start":
        handleStartEvent(data, ws);
        break;
      case "media":
        handleMediaEvent(data, session);
        break;
      case "stop":
        handleStopEvent(data, session, ws);
        break;
      case "clear":
        handleClearEvent(data);
        break;
      case "dtmf":
        handleDtmfEvent(data, session);
        break;
      case "mark":
        handleMarkEvent(data, ws);
        break;
    }
  });

  // Error handling
  ws.on("error", (error) => {
    console.error(`WebSocket error: ${error.message}`);
    if (session) {
      session.close();
    }
  });

  // Connection close handling
  ws.on("close", () => {
    console.log(`Connection closed for streamSid: ${streamSid}`);
    if (session) {
      session.close();
    }
  });
});

The event handlers must be carefully implemented to properly manage the bidirectional audio flow. The start handler initializes the Gemini session, while the media handler processes incoming audio and forwards it to Gemini.

javascript
// These handlers are assumed to close over the per-connection state
// (streamSid, session, sequenceNumber, chunkNumber, audioBuffer)
// declared in the connection handler above.
async function handleStartEvent(data, ws) {
  streamSid = data.stream_sid;
  sequenceNumber = 1;
  chunkNumber = 1;
  audioBuffer = Buffer.alloc(0);

  // Initialize Gemini Live API session
  try {
    session = await ai.live.connect({
      model: "gemini-2.5-flash-native-audio-preview-12-2025",
      config: {
        responseModalities: [Modality.AUDIO],
      },
      callbacks: {
        onmessage: (message) => {
          // Extract the base64 audio part before handing it off
          const audioBase64 = message.serverContent?.modelTurn?.parts?.find(
            (p) => p.inlineData
          )?.inlineData?.data;
          if (audioBase64) processGeminiAudio(audioBase64, ws);
        },
        onerror: (error) => console.error("Gemini error:", error),
        onclose: () => console.log("Gemini session closed"),
      },
    });

    // Send acknowledgment to Exotel
    ws.send(JSON.stringify({
      event: "start_ack",
      stream_sid: streamSid,
    }));
  } catch (error) {
    console.error("Failed to initialize Gemini session:", error);
    ws.send(JSON.stringify({
      event: "error",
      error: "Failed to initialize AI session",
    }));
  }
}

function handleMediaEvent(data, session) {
  if (!session || !data.media?.payload) return;

  // Forward caller audio to Gemini as 8 kHz PCM
  session.sendRealtimeInput({
    media: {
      mimeType: "audio/pcm;rate=8000",
      data: data.media.payload,
    },
  });
}

This implementation establishes the foundation for bidirectional audio streaming. The key is ensuring that all audio events are properly routed between Exotel and Gemini while maintaining connection state and managing sequence numbers correctly.


Implementing Production-Grade Audio Resampling Between 24 kHz and 8 kHz PCM

Audio resampling represents one of the most critical components in the bidirectional streaming architecture between Exotel and Google Gemini. Exotel requires 8 kHz PCM audio, while Gemini produces 24 kHz PCM output. The simple decimation method (taking every 3rd sample) in the provided code is insufficient for production-quality voicebots, as it introduces aliasing artifacts and degrades voice clarity.
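
The aliasing problem is easy to demonstrate numerically: after dropping to 8 kHz, any energy above the new 4 kHz Nyquist limit folds back into the audible band. A tone at 10 kHz in Gemini’s 24 kHz output, decimated by keeping every 3rd sample, produces exactly the same samples as a genuine 2 kHz tone (|10 kHz − 8 kHz| = 2 kHz):

```javascript
// Demonstration: naive decimation aliases out-of-band energy.
// A 10 kHz tone sampled at 24 kHz, kept every 3rd sample (-> 8 kHz),
// is sample-for-sample identical to a real 2 kHz tone at 8 kHz.
const SRC_RATE = 24000;
const DST_RATE = 8000;

const tone = (freq, rate, n) => Math.sin((2 * Math.PI * freq * n) / rate);

let maxDiff = 0;
for (let k = 0; k < 80; k++) {
  const decimated = tone(10000, SRC_RATE, 3 * k); // every 3rd source sample
  const aliased = tone(2000, DST_RATE, k);        // |10000 - 8000| = 2000 Hz
  maxDiff = Math.max(maxDiff, Math.abs(decimated - aliased));
}
console.log(maxDiff < 1e-9); // true: 10 kHz is indistinguishable from 2 kHz
```

A proper resampler low-pass filters below 4 kHz before decimating, which is exactly what the libraries discussed next do for you.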

For production environments, you should implement professional audio resampling using specialized libraries. Three primary approaches are available: SOX, FFmpeg, and Speex. Each has different performance characteristics and quality levels suitable for different use cases.

SOX-Based Resampling

SOX (Sound eXchange) is a comprehensive command-line audio processing tool that provides high-quality resampling with configurable algorithms:

javascript
const { exec } = require('child_process');
const fs = require('fs');

function resampleWithSox(inputBuffer, inputRate, outputRate) {
  return new Promise((resolve, reject) => {
    // The buffers are headerless PCM, so describe the raw format explicitly
    const tempInput = `/tmp/input_${Date.now()}.raw`;
    const tempOutput = `/tmp/output_${Date.now()}.raw`;

    // Write input buffer to a temporary file
    fs.writeFileSync(tempInput, inputBuffer);

    // Execute SOX with the high-quality `rate -h` algorithm
    const command = `sox -t raw -r ${inputRate} -e signed-integer -b 16 -c 1 ${tempInput} -t raw -r ${outputRate} -e signed-integer -b 16 -c 1 ${tempOutput} rate -h`;

    exec(command, (error) => {
      if (error) {
        reject(error);
        return;
      }

      // Read the resampled output
      const outputBuffer = fs.readFileSync(tempOutput);

      // Clean up temporary files
      fs.unlinkSync(tempInput);
      fs.unlinkSync(tempOutput);

      resolve(outputBuffer);
    });
  });
}

The rate -h flag in SOX uses a high-quality algorithm suitable for voice applications. While SOX provides excellent quality, it introduces additional latency due to file I/O operations, making it less suitable for ultra-low-latency applications.

FFmpeg-Based Resampling

FFmpeg offers a more efficient approach with lower latency while maintaining high quality:

javascript
const { spawn } = require('child_process');

function resampleWithFFmpeg(inputBuffer, inputRate, outputRate) {
  return new Promise((resolve, reject) => {
    const ffmpeg = spawn('ffmpeg', [
      // Describe the raw PCM input before -i
      '-f', 's16le',
      '-ar', inputRate.toString(),
      '-ac', '1',
      '-i', 'pipe:0',
      // Output: resampled raw PCM on stdout
      '-ar', outputRate.toString(),
      '-ac', '1',
      '-f', 's16le',
      '-c:a', 'pcm_s16le',
      'pipe:1',
    ]);

    const chunks = [];
    ffmpeg.stdout.on('data', (chunk) => chunks.push(chunk));

    ffmpeg.on('error', (error) => reject(error));
    ffmpeg.on('close', () => {
      resolve(Buffer.concat(chunks));
    });

    // Feed the input buffer and close stdin so ffmpeg can finish
    ffmpeg.stdin.write(inputBuffer);
    ffmpeg.stdin.end();
  });
}

FFmpeg typically provides better performance than SOX for streaming applications, with lower overhead and more efficient processing of audio streams.

Speex-Based Resampling

For the lowest latency applications, Speex (a specialized voice codec library) offers optimized resampling specifically designed for voice:

javascript
// Note: the exact API depends on which Speex binding you install
// (native or WASM resampler packages differ); the calls below are
// illustrative of the typical flow, not a specific package's API.
const speex = require('speex');

function resampleWithSpeex(inputBuffer, inputRate, outputRate) {
  // Initialize the resampler: 1 channel, quality 0-10 (higher = better, slower)
  const resampler = new speex.Resampler(1, inputRate, outputRate, 5);

  // Convert the 16-bit PCM input to Float32
  const floatInput = new Float32Array(inputBuffer.length / 2);
  for (let i = 0; i < floatInput.length; i++) {
    floatInput[i] = inputBuffer.readInt16LE(i * 2) / 32768;
  }

  // Calculate the output size
  const outputSize = Math.floor(floatInput.length * outputRate / inputRate);
  const floatOutput = new Float32Array(outputSize);

  // Perform the resampling
  resampler.process(floatInput, floatOutput);

  // Convert back to 16-bit PCM, rounding and clamping to the Int16 range
  const outputBuffer = Buffer.alloc(outputSize * 2);
  for (let i = 0; i < outputSize; i++) {
    const sample = Math.round(floatOutput[i] * 32768);
    outputBuffer.writeInt16LE(Math.max(-32768, Math.min(32767, sample)), i * 2);
  }

  return outputBuffer;
}

Speex provides the best performance for real-time voice applications with minimal latency, making it ideal for voicebot scenarios where responsiveness is critical.

Integrated Resampling Function

For production use, you should integrate these resampling methods with proper error handling and fallback mechanisms:

javascript
async function resampleAudio(geminiBuffer, fromRate = 24000, toRate = 8000) {
  try {
    // Try Speex first for the lowest latency
    return resampleWithSpeex(geminiBuffer, fromRate, toRate);
  } catch (speexError) {
    console.warn("Speex resampling failed, falling back to FFmpeg:", speexError.message);

    try {
      // Fall back to FFmpeg (async, so await it inside the try block)
      return await resampleWithFFmpeg(geminiBuffer, fromRate, toRate);
    } catch (ffmpegError) {
      console.error("FFmpeg resampling failed:", ffmpegError.message);

      // Last resort: simple decimation (with anti-aliasing filter)
      return simpleDecimation(geminiBuffer, fromRate, toRate);
    }
  }
}

function simpleDecimation(inputBuffer, fromRate, toRate) {
  const ratio = fromRate / toRate;
  // Count in 16-bit samples, not bytes, so buffer offsets stay aligned
  const inputSamples = Math.floor(inputBuffer.length / 2);
  const outputSamples = Math.floor(inputSamples / ratio);
  const outputBuffer = Buffer.alloc(outputSamples * 2);

  // Apply a simple anti-aliasing filter before decimation
  const filteredBuffer = applyAntiAliasingFilter(inputBuffer);

  for (let i = 0; i < outputSamples; i++) {
    const sourceIndex = Math.floor(i * ratio);
    outputBuffer.writeInt16LE(filteredBuffer.readInt16LE(sourceIndex * 2), i * 2);
  }

  return outputBuffer;
}

function applyAntiAliasingFilter(inputBuffer) {
  // Simple moving-average filter to reduce aliasing
  const filterSize = 3;
  const sampleCount = Math.floor(inputBuffer.length / 2);
  const filteredBuffer = Buffer.alloc(sampleCount * 2);

  for (let i = 0; i < sampleCount; i++) {
    let sum = 0;
    let count = 0;

    for (let j = Math.max(0, i - Math.floor(filterSize / 2));
         j <= Math.min(sampleCount - 1, i + Math.floor(filterSize / 2));
         j++) {
      sum += inputBuffer.readInt16LE(j * 2);
      count++;
    }

    filteredBuffer.writeInt16LE(Math.round(sum / count), i * 2);
  }

  return filteredBuffer;
}

This integrated approach provides production-quality audio resampling with fallback mechanisms, ensuring your voicebot maintains voice clarity even if the primary resampling method fails.


Optimizing Real-Time Playback with Exotel WebSocket Events

Achieving smooth, low-latency playback in an Exotel-Gemini integration requires careful optimization of several factors: chunk size alignment, proper buffering strategies, timestamp synchronization, and robust event handling. The primary challenges include minimizing latency while preventing audio gaps, choppiness, or delays that degrade the user experience.

Chunk Size Optimization

Exotel recommends chunk sizes between 3.2 kB and 100 kB for WebSocket audio streaming, with optimal performance typically achieved in the 3.2-5 kB range. The current implementation uses a fixed chunk size of 1.6 kB (320 bytes × 5), which may be too small for efficient streaming. Increasing the chunk size while staying within the optimal range can significantly improve performance:

javascript
const EXOTEL_MIN_CHUNK_SIZE = 320 * 10; // 3.2 kB minimum
const EXOTEL_OPTIMAL_CHUNK_SIZE = 320 * 16; // ~5 kB optimal
const EXOTEL_MAX_CHUNK_SIZE = 320 * 50; // 16 kB (well below the 100 kB limit)

// In your audio processing function:
async function processGeminiAudio(audioBase64, ws) {
  const geminiBuffer = Buffer.from(audioBase64, "base64");
  const downsampledBuffer = await resampleAudio(geminiBuffer, 24000, 8000);

  audioBuffer = Buffer.concat([audioBuffer, downsampledBuffer]);

  while (audioBuffer.length >= EXOTEL_OPTIMAL_CHUNK_SIZE) {
    const sendChunk = audioBuffer.subarray(0, EXOTEL_OPTIMAL_CHUNK_SIZE);
    audioBuffer = audioBuffer.subarray(EXOTEL_OPTIMAL_CHUNK_SIZE);

    ws.send(
      JSON.stringify({
        event: "media",
        sequence_number: sequenceNumber++,
        stream_sid: streamSid,
        media: {
          chunk: chunkNumber++,
          timestamp: String(Date.now()), // Use an actual timestamp
          payload: sendChunk.toString("base64"),
        },
      })
    );
  }
}

Buffering and Timing Strategies

Proper buffering is essential to prevent audio gaps while maintaining low latency. A 200-300ms buffer provides a good balance, allowing for network fluctuations while maintaining responsiveness:

javascript
class AudioBuffer {
  // 8 kHz, 16-bit mono PCM = 16 bytes per millisecond
  static BYTES_PER_MS = 16;

  constructor(targetBufferMs = 250) { // ~250 ms buffer
    this.buffer = Buffer.alloc(0);
    this.targetBufferBytes = targetBufferMs * AudioBuffer.BYTES_PER_MS;
    this.lastSendTime = Date.now();
  }

  // Returns a chunk ready to send, or null if we should keep buffering
  addAudio(audioData) {
    this.buffer = Buffer.concat([this.buffer, audioData]);
    return this.processBuffer();
  }

  processBuffer() {
    const now = Date.now();
    const timeSinceLastSend = now - this.lastSendTime;

    // Send if we have enough data or if too much time has passed
    if (this.buffer.length >= this.targetBufferBytes || timeSinceLastSend > 100) {
      const chunkSize = Math.min(this.buffer.length, EXOTEL_OPTIMAL_CHUNK_SIZE);
      const sendChunk = this.buffer.subarray(0, chunkSize);
      this.buffer = this.buffer.subarray(chunkSize);
      this.lastSendTime = now;

      return sendChunk;
    }

    return null;
  }
}

Event Handling and Sequence Management

Proper handling of Exotel’s event types, particularly the clear event, is critical for maintaining audio continuity:

javascript
function handleClearEvent(data) {
  // Reset sequence numbers when Exotel requests a clear
  sequenceNumber = 1;
  chunkNumber = 1;

  // Drop any pending audio so stale chunks are not replayed
  audioBuffer = Buffer.alloc(0);

  console.log(`Audio buffer cleared for streamSid: ${streamSid}`);
}

function handleMarkEvent(data, ws) {
  // Handle mark events for synchronization points
  console.log(`Mark event received: ${data.mark.mark_name}`);

  // Optional: send an acknowledgment back to Exotel
  if (ws && ws.readyState === ws.OPEN) {
    ws.send(JSON.stringify({
      event: "mark_ack",
      mark_name: data.mark.mark_name,
      stream_sid: streamSid,
    }));
  }
}

Latency Measurement and Optimization

Implementing latency measurement allows you to identify and address performance bottlenecks:

javascript
class LatencyMonitor {
  constructor() {
    this.measurements = [];
    this.maxMeasurements = 100;
  }

  measure(inputTime, outputTime) {
    const latency = outputTime - inputTime;
    this.measurements.push(latency);

    if (this.measurements.length > this.maxMeasurements) {
      this.measurements.shift();
    }

    return this.getAverageLatency();
  }

  getAverageLatency() {
    if (this.measurements.length === 0) return 0;

    const sum = this.measurements.reduce((acc, val) => acc + val, 0);
    return sum / this.measurements.length;
  }

  getLatencyStatus() {
    const avgLatency = this.getAverageLatency();

    if (avgLatency < 200) return "excellent";
    if (avgLatency < 400) return "good";
    if (avgLatency < 600) return "acceptable";
    return "poor";
  }
}

This optimized implementation addresses the common issues of latency, choppiness, and audio gaps by implementing proper chunk sizing, buffering strategies, timestamp synchronization, and latency monitoring.
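
One concrete alternative to the suspect `sequenceNumber * 100` timestamps in the original code is to derive each chunk’s timestamp from the number of PCM bytes already sent: at 8 kHz, 16-bit mono, one millisecond of audio is exactly 16 bytes. A minimal sketch (the `TimestampTracker` name is illustrative, not from any library):

```javascript
// Derive media timestamps from bytes sent instead of sequence numbers.
// At 8000 samples/s * 2 bytes/sample, 1 ms of audio = 16 bytes.
const BYTES_PER_MS = (8000 * 2) / 1000; // 16

class TimestampTracker {
  constructor() {
    this.bytesSent = 0;
  }

  // Returns the timestamp in ms (as a string, matching the media payload
  // format used above) for the chunk about to be sent, then advances.
  nextTimestamp(chunkByteLength) {
    const ts = String(Math.round(this.bytesSent / BYTES_PER_MS));
    this.bytesSent += chunkByteLength;
    return ts;
  }
}

const tracker = new TimestampTracker();
console.log(tracker.nextTimestamp(3200)); // "0"   (first chunk starts at 0 ms)
console.log(tracker.nextTimestamp(3200)); // "200" (3200 bytes = 200 ms of audio)
```

Because the timestamps track the audio clock rather than the wall clock or a counter, they stay correct even when chunk sizes vary or sends are briefly delayed.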


Production-Ready Implementation and Common Pitfalls

When deploying a production voicebot system that bridges Exotel and Google Gemini, several critical considerations must be addressed to ensure reliability, performance, and user experience. This section covers the essential production-ready components and highlights common pitfalls that can degrade performance or cause system failures.

Production Architecture Components

A production-ready implementation should include several key components beyond the basic streaming functionality:

javascript
// Production-ready server setup
class ExotelGeminiServer {
  constructor() {
    this.activeStreams = new Map(); // Track active streams by streamSid
    this.sessionPool = new SessionPool(); // Manage Gemini sessions
    this.errorHandler = new ErrorHandler(); // Centralized error handling
    this.metricsCollector = new MetricsCollector(); // Performance monitoring
  }

  initialize() {
    // Setup Express server
    this.setupExpressServer();

    // Setup WebSocket server
    this.setupWebSocketServer();

    // Setup monitoring and logging
    this.setupMonitoring();

    // Setup error recovery
    this.setupErrorRecovery();
  }

  setupWebSocketServer() {
    const wss = new WebSocketServer({
      server: this.server,
      path: "/exotel/voicebot",
      maxPayload: 10 * 1024 * 1024, // 10 MB max payload
    });

    wss.on("connection", (ws, req) => {
      this.handleNewConnection(ws, req);
    });

    // Handle server shutdown
    process.on('SIGTERM', () => this.gracefulShutdown(wss));
    process.on('SIGINT', () => this.gracefulShutdown(wss));
  }

  handleNewConnection(ws, req) {
    // Authentication check (the client IP lives on the upgrade request)
    if (!this.authenticateConnection(req)) {
      ws.close(1008, "Unauthorized");
      return;
    }

    // Setup connection handlers
    ws.on('message', (data) => this.handleMessage(ws, data));
    ws.on('error', (error) => this.handleError(ws, error));
    ws.on('close', () => this.handleConnectionClose(ws));
  }

  authenticateConnection(req) {
    // Implement your authentication logic:
    // IP whitelisting, JWT tokens, or Basic Auth
    const clientIp = req.socket.remoteAddress;
    return this.isAuthorized(clientIp);
  }
}

Error Handling and Recovery

Robust error handling is essential for production systems:

javascript
class ErrorHandler {
  constructor() {
    this.errorCounts = new Map();
    this.maxErrorRetries = 3;
  }

  handleWebSocketError(ws, error) {
    const errorType = error.constructor.name;
    this.incrementErrorCount(errorType);

    console.error(`WebSocket error (${errorType}):`, error.message);

    // Attempt recovery for certain error types
    if (this.isRecoverable(error)) {
      return this.attemptRecovery(ws, error);
    }

    // For non-recoverable errors, close the connection
    ws.close(1011, "Internal error");
  }

  handleGeminiError(session, error) {
    console.error("Gemini API error:", error.message);

    // Attempt to reinitialize the session
    if (session) {
      this.reinitializeSession(session);
    }
  }

  isRecoverable(error) {
    // Define which errors are recoverable
    const recoverableErrors = [
      "WebSocketConnectionError",
      "TemporaryNetworkError",
      "RateLimitError",
    ];

    return recoverableErrors.includes(error.constructor.name);
  }

  async attemptRecovery(ws, error) {
    // Implement your recovery logic:
    // reconnecting, reinitializing sessions, etc.
    console.log("Attempting recovery for error:", error.message);

    // Your recovery implementation here
  }
}

Common Pitfalls and Solutions

1. Chunk Size Misalignment

Problem: Using chunk sizes that don’t align with Exotel’s expectations can cause streaming issues.

Solution: Implement chunk size validation and adjustment:

javascript
function validateChunkSize(chunkSize) {
  const MIN_CHUNK_SIZE = 320 * 10; // 3.2 kB
  const MAX_CHUNK_SIZE = 320 * 50; // 16 kB
  const OPTIMAL_CHUNK_SIZE = 320 * 16; // ~5 kB

  if (chunkSize < MIN_CHUNK_SIZE) {
    console.warn(`Chunk size ${chunkSize} is below minimum ${MIN_CHUNK_SIZE}`);
    return MIN_CHUNK_SIZE;
  }

  if (chunkSize > MAX_CHUNK_SIZE) {
    console.warn(`Chunk size ${chunkSize} exceeds maximum ${MAX_CHUNK_SIZE}`);
    return MAX_CHUNK_SIZE;
  }

  // Snap to the optimal size when the difference is small
  if (chunkSize !== OPTIMAL_CHUNK_SIZE) {
    const diff = Math.abs(chunkSize - OPTIMAL_CHUNK_SIZE);
    if (diff < 320) { // Less than one 20 ms frame (320 bytes) apart
      return OPTIMAL_CHUNK_SIZE;
    }
  }

  return chunkSize;
}

2. Insufficient Audio Resampling Quality

Problem: Simple decimation methods produce poor-quality audio that sounds robotic or has artifacts.

Solution: Implement proper resampling with quality fallbacks:

javascript
async function highQualityResampling(inputBuffer, fromRate, toRate) {
  try {
    // Try high-quality resampling first
    return await resampleWithSpeex(inputBuffer, fromRate, toRate);
  } catch (error) {
    console.warn("High-quality resampling failed, falling back to FFmpeg");
    try {
      return await resampleWithFFmpeg(inputBuffer, fromRate, toRate);
    } catch (fallbackError) {
      console.error("Fallback resampling also failed");
      // Return the original buffer with a warning. Note: unresampled 24 kHz
      // audio will play back slowed and garbled on an 8 kHz stream, so treat
      // this as a last-ditch diagnostic path, not a real fallback.
      console.warn("Using original buffer without resampling");
      return inputBuffer;
    }
  }
}

3. Missing Clear Events

Problem: Failing to properly handle clear events can cause audio buffer overflow and gaps in conversation.

Solution: Implement proper clear event handling:

javascript
// Implemented as a method on the server class, so that `this.activeStreams`
// and `this.sendToExotel` resolve correctly.
handleClearEvent(data, streamSid) {
  const stream = this.activeStreams.get(streamSid);

  if (!stream) {
    console.warn(`Clear event received for unknown stream: ${streamSid}`);
    return;
  }

  // Reset stream state
  stream.sequenceNumber = 1;
  stream.chunkNumber = 1;
  stream.audioBuffer = Buffer.alloc(0);
  stream.lastTimestamp = null;

  // Send acknowledgment
  this.sendToExotel(streamSid, {
    event: "clear_ack",
    stream_sid: streamSid,
  });

  console.log(`Audio buffer cleared for stream: ${streamSid}`);
}

4. Wrong Sample Rate Configuration

Problem: Incorrectly configured sample rates can cause audio playback issues or complete silence.

Solution: Implement sample rate validation:

javascript
// Also a method on the server class, reading rates from `this.config`.
validateSampleConfiguration() {
  const expectedInputRate = 8000; // Exotel input
  const expectedOutputRate = 24000; // Gemini output

  if (this.config.inputSampleRate !== expectedInputRate) {
    throw new Error(`Invalid input sample rate: expected ${expectedInputRate}, got ${this.config.inputSampleRate}`);
  }

  if (this.config.outputSampleRate !== expectedOutputRate) {
    throw new Error(`Invalid output sample rate: expected ${expectedOutputRate}, got ${this.config.outputSampleRate}`);
  }

  console.log("Sample rate configuration is valid");
}

By addressing these production considerations and avoiding common pitfalls, you can build a robust, high-performance voicebot system that delivers reliable real-time audio streaming between Exotel and Google Gemini.


Sources

  1. Exotel Stream and VoiceBot Applet Documentation — Technical specifications about WebSocket audio format and event handling: https://support.exotel.com/support/solutions/articles/3000108630-working-with-the-stream-and-voicebot-applet
  2. Google Gemini Live API Documentation — Implementation details for connecting and handling real-time audio: https://ai.google.dev/gemini-api/docs/live
  3. Exotel Pipecat AgentStream Guide — Production best practices for audio streaming and resampling: https://exotel.com/blog/exotel-pipecat-agentstream-guide/
  4. Exotel Quick Guide to Streaming Services — Minimum requirements and basic setup instructions for WebSocket connections: https://support.exotel.com/support/solutions/articles/3000132268-quick-guide-to-get-started-with-exotel-streaming-services
  5. Exotel AgentStream VoiceBot Applet Configuration — Detailed audio format specifications and bidirectional streaming requirements: https://docs.exotel.com/exotel-agentstream/voicebot-applet
  6. Exotel AgentStream Advanced Features — Advanced event handling and authentication methods beyond basic media events: https://docs.exotel.com/exotel-agentstream/advanced
  7. Exotel Voice Streaming Product Overview — High-level description of bidirectional streaming capabilities between Exotel and external platforms: https://exotel.com/products/voice-streaming/

Conclusion

Implementing real-time bidirectional audio streaming between Exotel Voicebot and Google Gemini Live API using WebSockets in Node.js requires careful attention to several critical factors. The Node.js WebSocket implementation must serve as a robust backend relay that handles audio format conversion between Exotel’s 8 kHz PCM and Gemini’s 24 kHz PCM output. The Exotel WebSocket URL carries all live audio payloads, making the backend architecture essential for proper bidirectional streaming.

Production-quality implementations must address audio resampling with specialized libraries like SOX, FFmpeg, or Speex rather than simple decimation methods. Proper chunk size optimization (3.2-5 kB range), buffering strategies (200-300 ms), and timestamp synchronization are crucial for achieving low-latency, smooth playback. Common pitfalls include chunk size misalignment, insufficient resampling quality, missing clear events, wrong sample rate configuration, and authentication problems.
