Crafting a Real-Time Voice Agent: A Complete Guide
Introduction
In this guide, we’ll build a real-time voice agent that mimics the behavior of contemporary low-latency conversational systems. We’ll walk through the entire pipeline: processing segmented audio input, working with streaming speech recognition, implementing incremental language model generation, and delivering real-time text-to-speech output. A key focus will be managing and tracking latency throughout the system to ensure a responsive user experience.
Understanding Latency in Voice Agents
Latency is critical when designing voice-based interfaces. It is the delay between receiving audio input and producing audible output, and it can be broken down into several components, including:
- Time to first token
- Time to first audio
- Overall response time
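To make these components concrete, here is a toy breakdown (all numbers are hypothetical) showing how they add up along the pipeline:

```python
# Hypothetical per-stage delays, in seconds
asr_finalization = 0.30   # end of speech -> final transcript
llm_first_token = 0.50    # final transcript -> first generated token
tts_first_chunk = 0.20    # first token -> first synthesized audio chunk

# Time to first token: how long until the model starts "thinking out loud"
time_to_first_token = asr_finalization + llm_first_token
# Time to first audio: how long until the user hears anything at all
time_to_first_audio = time_to_first_token + tts_first_chunk

print(round(time_to_first_audio, 2))  # → 1.0
```

Users judge responsiveness almost entirely by time to first audio, which is why the later stages overlap their work rather than running strictly in sequence.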
By working within strict latency budgets, we’ll navigate the engineering trade-offs that shape how users perceive voice interactions.
Setting Up the Environment
Before we jump into the coding, let’s make sure you have everything set up. You’ll need:
- Python installed on your machine.
- Libraries: numpy for audio buffers and matplotlib for visualizations (asyncio ships with Python’s standard library).
- Familiarity with asynchronous programming in Python.
You can install the third-party libraries with pip install numpy matplotlib.
Designing Latency Metrics
We start by defining the core data structures that will help us track latency throughout the voice processing pipeline. Our LatencyMetrics class will capture various timing signals, while LatencyBudgets will specify acceptable latency thresholds for different stages.
import asyncio
import time
from collections.abc import AsyncIterator
from dataclasses import dataclass

import numpy as np

@dataclass
class LatencyMetrics:
    """Timestamps (seconds, relative to the start of a turn) for each pipeline stage."""
    audio_chunk_received: float = 0.0
    asr_started: float = 0.0
    asr_partial: float = 0.0
    asr_complete: float = 0.0
    llm_started: float = 0.0
    llm_first_token: float = 0.0
    llm_complete: float = 0.0
    tts_started: float = 0.0
    tts_first_chunk: float = 0.0
    tts_complete: float = 0.0

@dataclass
class LatencyBudgets:
    """Maximum acceptable latency (seconds) for each stage."""
    asr_processing: float = 0.1
    asr_finalization: float = 0.3
    llm_first_token: float = 0.5
    llm_token_generation: float = 0.02
    tts_first_chunk: float = 0.2
    tts_chunk_generation: float = 0.05
    time_to_first_audio: float = 1.0
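As a quick illustration of how such budgets get used in practice, the sketch below compares made-up measurements against budget values (plain dicts stand in for the dataclasses, and the measured numbers are hypothetical):

```python
# Budget values mirror the LatencyBudgets defaults; measurements are made up
budgets = {"llm_first_token": 0.5, "tts_first_chunk": 0.2, "time_to_first_audio": 1.0}
measured = {"llm_first_token": 0.42, "tts_first_chunk": 0.25, "time_to_first_audio": 0.9}

# Flag every stage that exceeded its budget
violations = {k: v for k, v in measured.items() if v > budgets[k]}
print(violations)  # → {'tts_first_chunk': 0.25}
```

A check like this, run per turn, is often the difference between noticing a regression in staging and hearing about it from users.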
Streaming Audio Input
Next, we need to simulate audio input. Our AudioInputStream class will handle streaming audio in fixed chunks, allowing us to mimic how a real microphone input behaves. This simulation is necessary for testing our latency-sensitive components.
class AudioInputStream:
    """Simulates a microphone by emitting fixed-size audio chunks in real time."""

    def __init__(self, sample_rate: int = 16000, chunk_duration_ms: int = 100):
        self.sample_rate = sample_rate
        self.chunk_duration_ms = chunk_duration_ms
        self.chunk_size = int(sample_rate * chunk_duration_ms / 1000)

    async def stream_audio(self, text: str) -> AsyncIterator[np.ndarray]:
        # Estimate speech duration: ~150 words/min x ~5 chars/word / 60 = 12.5 chars/sec
        chars_per_second = (150 * 5) / 60
        duration_seconds = len(text) / chars_per_second
        num_chunks = int(duration_seconds * 1000 / self.chunk_duration_ms)
        for _ in range(num_chunks):
            # Low-amplitude noise stands in for real microphone samples
            chunk = np.random.randn(self.chunk_size).astype(np.float32) * 0.1
            await asyncio.sleep(self.chunk_duration_ms / 1000)
            yield chunk
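A pattern worth isolating here is timing an async chunk stream, since every latency number in this pipeline comes from instrumenting iteration exactly like this. A minimal, self-contained sketch with a fake stream (the names and delays are illustrative, not part of the classes above):

```python
import asyncio
import time
from collections.abc import AsyncIterator

async def fake_chunks(n: int = 3, delay: float = 0.05) -> AsyncIterator[bytes]:
    # Stand-in for a microphone stream: n chunks, one every `delay` seconds
    for _ in range(n):
        await asyncio.sleep(delay)
        yield b"\x00" * 320

async def time_first_chunk(stream):
    # Record how long until the first item arrives, while draining the stream
    start = time.monotonic()
    first_at, count = None, 0
    async for _chunk in stream:
        if first_at is None:
            first_at = time.monotonic() - start
        count += 1
    return first_at, count

first_at, total = asyncio.run(time_first_chunk(fake_chunks()))
print(f"first chunk after {first_at:.3f}s, {total} chunks total")
```

Note the use of time.monotonic() rather than time.time(): wall-clock time can jump (NTP adjustments), while the monotonic clock is safe for measuring intervals.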
Implementing Streaming ASR
With the audio input in place, we can create our StreamingASR class for automatic speech recognition. This class will provide partial transcriptions while also detecting silence to signify the end of user input.
class StreamingASR:
    """Simulated streaming speech recognition with partial results and endpointing."""

    def __init__(self, latency_budget: float = 0.1):
        self.latency_budget = latency_budget
        self.silence_threshold = 0.5  # seconds of silence before finalizing

    async def transcribe_stream(
        self,
        audio_stream: AsyncIterator[np.ndarray],
        ground_truth: str,
    ) -> AsyncIterator[tuple[str, bool]]:
        words = ground_truth.split()
        words_transcribed = 0
        silence_duration = 0.0
        chunk_count = 0
        async for chunk in audio_stream:
            chunk_count += 1
            await asyncio.sleep(self.latency_budget)  # per-chunk processing cost
            # Emit a partial transcript roughly every third chunk
            if chunk_count % 3 == 0 and words_transcribed < len(words):
                words_transcribed += 1
                yield " ".join(words[:words_transcribed]), False
            # Accumulate silence; any loud chunk resets the counter
            audio_power = np.mean(np.abs(chunk))
            silence_duration = silence_duration + 0.1 if audio_power < 0.05 else 0.0
            if silence_duration >= self.silence_threshold:
                await asyncio.sleep(0.2)  # finalization cost
                yield ground_truth, True
                return
        yield ground_truth, True  # stream ended without an explicit endpoint
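The silence-based endpointing logic deserves a closer look on its own. The helper below (a hypothetical standalone function, operating on plain NumPy chunks rather than the class) accumulates silence per chunk the same way the loop above does, and reports where end-of-speech would be declared:

```python
import numpy as np

def detect_endpoint(chunks, chunk_ms=100, power_threshold=0.05, silence_ms=500):
    """Return the index of the chunk where end-of-speech is declared, or None."""
    silence = 0.0
    for i, chunk in enumerate(chunks):
        power = float(np.mean(np.abs(chunk)))
        # Quiet chunk: extend the silence run; loud chunk: reset it
        silence = silence + chunk_ms if power < power_threshold else 0.0
        if silence >= silence_ms:
            return i
    return None

# Three loud chunks followed by five silent ones (100 ms each)
speech = [np.full(1600, 0.2, dtype=np.float32)] * 3
silence_chunks = [np.zeros(1600, dtype=np.float32)] * 5
print(detect_endpoint(speech + silence_chunks))  # → 7
```

With a 500 ms threshold and 100 ms chunks, the endpoint fires on the fifth consecutive silent chunk; tightening the threshold makes the agent feel snappier but risks cutting off users mid-pause.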
Creating the Language Model
Next up is our StreamingLLM class that generates responses based on user input. This will add a layer of interaction, allowing our voice agent to sound more conversational.
class StreamingLLM:
    """Simulated language model that streams a canned response token by token."""

    def __init__(self, time_to_first_token: float = 0.3, tokens_per_second: float = 50):
        self.time_to_first_token = time_to_first_token
        self.tokens_per_second = tokens_per_second

    async def generate_response(self, prompt: str) -> AsyncIterator[str]:
        responses = {
            "hello": "Hello! How can I help you today?",
            "weather": "The weather is sunny with a temperature of 72°F.",
            "time": "The current time is 2:30 PM.",
            "default": "I understand. Let me help you with that.",
        }
        # Pick the first canned response whose keyword appears in the prompt
        response = responses["default"]
        for key in responses:
            if key in prompt.lower():
                response = responses[key]
                break
        await asyncio.sleep(self.time_to_first_token)  # prefill / first-token delay
        for word in response.split():
            yield word + " "
            await asyncio.sleep(1.0 / self.tokens_per_second)  # decoding pace
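Given this sleep pattern, the total generation time is simply the first-token delay plus one pacing delay per token, which makes the stage easy to budget. A quick sanity check (the function is an illustrative helper, not part of the class):

```python
def llm_response_time(ttft: float, n_tokens: int, tokens_per_second: float) -> float:
    # Mirrors the generator's sleeps: one first-token delay, then one pacing sleep per token
    return ttft + n_tokens / tokens_per_second

# 8-token response at 50 tokens/sec with a 300 ms first-token delay
print(round(llm_response_time(0.3, 8, 50), 3))  # → 0.46
```

The takeaway for latency budgeting: time-to-first-token dominates short responses, while tokens-per-second dominates long ones, so the two knobs trade off differently depending on your typical response length.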
Implementing Text-to-Speech
Finally, we’ll integrate a StreamingTTS class to convert text responses into audio output. This allows our voice agent to speak back to the user, completing the conversational loop.
class StreamingTTS:
    """Simulated text-to-speech that buffers text and emits audio chunks."""

    def __init__(self, time_to_first_chunk: float = 0.2, chars_per_second: float = 15):
        self.time_to_first_chunk = time_to_first_chunk
        self.chars_per_second = chars_per_second

    async def synthesize_stream(
        self, text_stream: AsyncIterator[str]
    ) -> AsyncIterator[np.ndarray]:
        first_chunk = True
        buffer = ""
        async for text in text_stream:
            buffer += text
            # Synthesize once enough text accumulates (or immediately for the first chunk)
            if len(buffer) >= 20 or first_chunk:
                if first_chunk:
                    await asyncio.sleep(self.time_to_first_chunk)
                    first_chunk = False
                duration = len(buffer) / self.chars_per_second
                yield np.random.randn(int(16000 * duration)).astype(np.float32) * 0.1
                buffer = ""
                await asyncio.sleep(duration * 0.5)
        if buffer:
            # Flush any trailing text so the end of the response is still spoken
            duration = len(buffer) / self.chars_per_second
            yield np.random.randn(int(16000 * duration)).astype(np.float32) * 0.1
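The buffering policy is worth examining in isolation. A synchronous sketch (a hypothetical helper, not part of the class) groups streamed tokens into buffers of at least 20 characters, with a final flush so trailing text is never dropped:

```python
def chunk_for_tts(tokens, min_chars: int = 20):
    """Group streamed tokens into buffers of at least min_chars characters."""
    buffers, buf = [], ""
    for tok in tokens:
        buf += tok
        if len(buf) >= min_chars:
            buffers.append(buf)
            buf = ""
    if buf:
        buffers.append(buf)  # flush whatever remains so no trailing text is lost
    return buffers

# Tokens as the LLM would stream them: words with trailing spaces
tokens = [w + " " for w in "I understand. Let me help you with that.".split()]
print(chunk_for_tts(tokens))
```

The threshold is a latency/quality trade-off: smaller buffers start audio sooner but give the synthesizer less context per chunk, which in a real TTS engine tends to hurt prosody.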
Putting It All Together
Now that we’ve built our individual components, let’s create the StreamingVoiceAgent class that combines all of them. This class will manage the states of the voice agent and process user input through each stage of the pipeline.
from enum import Enum

class AgentState(Enum):
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"

class StreamingVoiceAgent:
    def __init__(self, latency_budgets: LatencyBudgets):
        self.budgets = latency_budgets
        self.audio_stream = AudioInputStream()
        self.asr = StreamingASR(latency_budgets.asr_processing)
        self.llm = StreamingLLM(
            latency_budgets.llm_first_token,
            1.0 / latency_budgets.llm_token_generation
        )
        self.tts = StreamingTTS(
            latency_budgets.tts_first_chunk,
            1.0 / latency_budgets.tts_chunk_generation
        )
        self.state = AgentState.LISTENING

    async def process_turn(self, user_input: str) -> LatencyMetrics:
        metrics = LatencyMetrics()
        start_time = time.time()
        metrics.audio_chunk_received = time.time() - start_time

        # Stage 1: stream audio through the ASR, collecting partial transcripts
        audio_gen = self.audio_stream.stream_audio(user_input)
        metrics.asr_started = time.time() - start_time
        transcript = ""
        async for text, final in self.asr.transcribe_stream(audio_gen, user_input):
            if final:
                transcript = text
                metrics.asr_complete = time.time() - start_time
            else:
                metrics.asr_partial = time.time() - start_time

        # Stage 2: generate a response, timestamping the first token as it streams
        self.state = AgentState.THINKING
        metrics.llm_started = time.time() - start_time

        async def tracked_tokens() -> AsyncIterator[str]:
            async for token in self.llm.generate_response(transcript):
                if metrics.llm_first_token == 0.0:
                    metrics.llm_first_token = time.time() - start_time
                yield token
            metrics.llm_complete = time.time() - start_time

        # Stage 3: feed tokens straight into TTS so audio starts before the LLM finishes
        self.state = AgentState.SPEAKING
        metrics.tts_started = time.time() - start_time
        async for _audio in self.tts.synthesize_stream(tracked_tokens()):
            if metrics.tts_first_chunk == 0.0:
                metrics.tts_first_chunk = time.time() - start_time
        metrics.tts_complete = time.time() - start_time

        self.state = AgentState.LISTENING
        return metrics
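With the pipeline assembled, the collected metrics can be reduced to the headline numbers from earlier. A small, self-contained sketch (TurnTimings is a hypothetical stand-in for the relevant LatencyMetrics fields, and the timing values are made up):

```python
from dataclasses import dataclass

@dataclass
class TurnTimings:
    # Seconds relative to the start of the turn, mirroring LatencyMetrics fields
    asr_complete: float
    llm_first_token: float
    tts_first_chunk: float

def report(t: TurnTimings, budget_ttfa: float = 1.0) -> dict:
    return {
        "time_to_first_token": t.llm_first_token - t.asr_complete,
        "time_to_first_audio": t.tts_first_chunk,
        "within_budget": t.tts_first_chunk <= budget_ttfa,
    }

summary = report(TurnTimings(asr_complete=0.4, llm_first_token=0.75, tts_first_chunk=0.95))
print(summary)
```

Logging a summary like this for every turn gives you the raw material for latency histograms, which is where tools like matplotlib come back into the picture.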
Conclusion
We’ve just scratched the surface of building a voice agent that operates in real-time while carefully managing latency. The structure we’ve established can be expanded upon with further enhancements and features over time. For more in-depth information and the complete code, refer to the GitHub repository.
FAQs
What’s a voice agent?
A voice agent is a software application that uses speech recognition and speech synthesis technologies to interact with users through spoken language.
Why is latency important in voice interactions?
Latency affects user experience: high latency makes interactions feel slow and unresponsive, while low latency leads to smoother, more natural conversations.
What are the main components of a voice agent?
The primary components include audio input processing, automatic speech recognition (ASR), language model response generation, and text-to-speech (TTS) synthesis.
How can I improve the performance of my voice agent?
Optimizing algorithms, reducing processing time for each stage, and managing resource allocation effectively can enhance the overall performance of your voice agent.
Where can I find more information on voice agents?
You can read more about voice technologies and implementations on authoritative sites like MIT Technology Review and Towards Data Science.