Crafting a Real-Time Voice Agent: A Complete Guide

Introduction

In this guide, we’ll build a real-time voice agent that mimics the behavior of contemporary low-latency conversational systems. We’ll walk through the entire pipeline, from handling segmented audio input and streaming speech recognition to incremental language model generation and real-time text-to-speech output. A key focus is measuring and managing latency throughout the system so the interaction stays responsive.

Understanding Latency in Voice Agents

Latency is critical when designing voice-based interfaces: it is the delay between the moment audio input is received and the moment audible output is produced. This delay breaks down into several measurable components, including:

  • Time to first token
  • Time to first audio
  • Overall response time

Time to first audio is the sum of everything between the end of the user’s speech and the first synthesized sample; with the budgets we adopt below (roughly 300 ms for ASR finalization, 500 ms to the LLM’s first token, and 200 ms to the first TTS chunk), that comes to about one second. By working to strict latency budgets like these, we’ll navigate the engineering trade-offs that shape how users perceive voice interactions.

Setting Up the Environment

Before we jump into the code, let’s make sure you have everything set up. You’ll need:

  1. Python installed on your machine.
  2. Libraries: numpy and matplotlib for visualizations (asyncio ships with the Python standard library).
  3. Familiarity with asynchronous programming in Python.

You can find installation instructions and documentation for these libraries on their official project pages.
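
If you want to double-check the environment before moving on, numpy and matplotlib can be installed with pip (pip install numpy matplotlib), and a quick import check confirms everything resolves. This snippet is only a convenience sketch, not part of the agent itself:

import asyncio

import matplotlib
import numpy as np

# Print the versions we are working with and confirm asyncio.run is available.
print(f"numpy {np.__version__}, matplotlib {matplotlib.__version__}")
print(f"asyncio available: {hasattr(asyncio, 'run')}")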

Designing Latency Metrics

We start by defining the core data structures that will help us track latency throughout the voice processing pipeline. Our LatencyMetrics class will capture various timing signals, while LatencyBudgets will specify acceptable latency thresholds for different stages.

from dataclasses import dataclass

@dataclass
class LatencyMetrics:
    # Timestamps in seconds, measured relative to the start of the turn.
    audio_chunk_received: float = 0.0
    asr_started: float = 0.0
    asr_partial: float = 0.0
    asr_complete: float = 0.0
    llm_started: float = 0.0
    llm_first_token: float = 0.0
    llm_complete: float = 0.0
    tts_started: float = 0.0
    tts_first_chunk: float = 0.0
    tts_complete: float = 0.0

@dataclass
class LatencyBudgets:
    # Maximum acceptable duration, in seconds, for each stage.
    asr_processing: float = 0.1
    asr_finalization: float = 0.3
    llm_first_token: float = 0.5
    llm_token_generation: float = 0.02
    tts_first_chunk: float = 0.2
    tts_chunk_generation: float = 0.05
    time_to_first_audio: float = 1.0
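
These two classes become more useful with a small helper that turns raw timestamps into the composite numbers we care about and checks them against the budgets. The sketch below is our own addition (the summarize_turn name, and the choice to measure time to first audio from the final transcript, are assumptions rather than part of the pipeline itself):

def summarize_turn(metrics: LatencyMetrics, budgets: LatencyBudgets) -> dict[str, tuple[float, bool]]:
    # Derive composite latencies from the raw timestamps.
    derived = {
        "llm_first_token": metrics.llm_first_token - metrics.llm_started,
        "tts_first_chunk": metrics.tts_first_chunk - metrics.tts_started,
        # Measured here from the final transcript to the first audio sample.
        "time_to_first_audio": metrics.tts_first_chunk - metrics.asr_complete,
    }
    # Pair each value with whether it stays inside its budget.
    return {name: (value, value <= getattr(budgets, name)) for name, value in derived.items()}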

Streaming Audio Input

Next, we need to simulate audio input. Our AudioInputStream class will handle streaming audio in fixed chunks, allowing us to mimic how a real microphone input behaves. This simulation is necessary for testing our latency-sensitive components.

import asyncio

import numpy as np
from typing import AsyncIterator

class AudioInputStream:
    def __init__(self, sample_rate: int = 16000, chunk_duration_ms: int = 100):
        self.sample_rate = sample_rate
        self.chunk_duration_ms = chunk_duration_ms
        # Number of samples per chunk (e.g. 1,600 samples for 100 ms at 16 kHz).
        self.chunk_size = int(sample_rate * chunk_duration_ms / 1000)

    async def stream_audio(self, text: str) -> AsyncIterator[np.ndarray]:
        # Assume ~150 words per minute and ~5 characters per word.
        chars_per_second = (150 * 5) / 60
        duration_seconds = len(text) / chars_per_second
        num_chunks = int(duration_seconds * 1000 / self.chunk_duration_ms)
        for _ in range(num_chunks):
            # Low-amplitude noise stands in for real microphone samples.
            chunk = np.random.randn(self.chunk_size).astype(np.float32) * 0.1
            await asyncio.sleep(self.chunk_duration_ms / 1000)
            yield chunk
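
To confirm the simulator paces chunks in real time, we can consume a few of them and print when each arrives. This is a minimal sketch (the probe_audio_stream function and the sample sentence are only for illustration):

import asyncio
import time

async def probe_audio_stream() -> None:
    stream = AudioInputStream()
    start = time.time()
    count = 0
    # Each chunk should arrive roughly every 100 ms.
    async for chunk in stream.stream_audio("What is the weather like today?"):
        count += 1
        print(f"chunk {count}: {len(chunk)} samples at t={time.time() - start:.2f}s")

asyncio.run(probe_audio_stream())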

Implementing Streaming ASR

With the audio input in place, we can create our StreamingASR class for automatic speech recognition. This class will provide partial transcriptions while also detecting silence to signify the end of user input.

class StreamingASR:
    def __init__(self, latency_budget: float = 0.1):
        self.latency_budget = latency_budget
        # Seconds of continuous silence that mark the end of the utterance.
        self.silence_threshold = 0.5

    async def transcribe_stream(self, audio_stream: AsyncIterator[np.ndarray], ground_truth: str) -> AsyncIterator[tuple[str, bool]]:
        # The ground-truth text stands in for a real recognizer's output.
        words = ground_truth.split()
        words_transcribed = 0
        silence_duration = 0.0
        chunk_count = 0
        async for chunk in audio_stream:
            chunk_count += 1
            # Simulate per-chunk recognition latency.
            await asyncio.sleep(self.latency_budget)
            # Emit a growing partial transcript roughly every third chunk.
            if chunk_count % 3 == 0 and words_transcribed < len(words):
                words_transcribed += 1
                yield " ".join(words[:words_transcribed]), False
            # Track low-energy chunks as silence; enough silence ends the turn.
            audio_power = np.mean(np.abs(chunk))
            silence_duration = silence_duration + 0.1 if audio_power < 0.05 else 0.0
            if silence_duration >= self.silence_threshold:
                # Final pass before emitting the complete transcript.
                await asyncio.sleep(0.2)
                yield ground_truth, True
                return
        # The audio stream ended without a silence trigger; finalize anyway.
        yield ground_truth, True
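
Feeding the simulated microphone into the recognizer shows the partial transcripts growing until the final one lands. This probe is a minimal sketch for illustration only:

import asyncio

async def probe_asr() -> None:
    stream = AudioInputStream()
    asr = StreamingASR()
    utterance = "what is the weather like today"
    # Partial results arrive as the audio streams in; the last result is final.
    async for text, final in asr.transcribe_stream(stream.stream_audio(utterance), utterance):
        label = "final" if final else "partial"
        print(f"[{label}] {text}")

asyncio.run(probe_asr())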

Creating the Language Model

Next up is our StreamingLLM class that generates responses based on user input. This will add a layer of interaction, allowing our voice agent to sound more conversational.

class StreamingLLM:
    def __init__(self, time_to_first_token: float = 0.3, tokens_per_second: float = 50):
        self.time_to_first_token = time_to_first_token
        self.tokens_per_second = tokens_per_second

    async def generate_response(self, prompt: str) -> AsyncIterator[str]:
        # Canned responses keyed by topic stand in for a real language model.
        responses = {
            "hello": "Hello! How can I help you today?",
            "weather": "The weather is sunny with a temperature of 72°F.",
            "time": "The current time is 2:30 PM.",
            "default": "I understand. Let me help you with that."
        }
        response = responses["default"]
        for key in responses:
            if key in prompt.lower():
                response = responses[key]
                break
        # Simulate the time-to-first-token delay, then stream word by word.
        await asyncio.sleep(self.time_to_first_token)
        for word in response.split():
            yield word + " "
            await asyncio.sleep(1.0 / self.tokens_per_second)
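
A quick probe makes the streaming behavior visible: the first token arrives after the configured delay, and the rest trickle in at the token rate. The probe_llm function below is only an illustrative sketch:

import asyncio
import time

async def probe_llm() -> None:
    llm = StreamingLLM()
    start = time.time()
    first_token_at = None
    response = ""
    async for token in llm.generate_response("hello there"):
        # Record when the very first token shows up.
        if first_token_at is None:
            first_token_at = time.time() - start
        response += token
    print(f"first token after {first_token_at:.2f}s: {response.strip()}")

asyncio.run(probe_llm())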

Implementing Text-to-Speech

Finally, we’ll integrate a StreamingTTS class to convert text responses into audio output. This allows our voice agent to speak back to the user, completing the conversational loop.

class StreamingTTS:
    def __init__(self, time_to_first_chunk: float = 0.2, chars_per_second: float = 15):
        self.time_to_first_chunk = time_to_first_chunk
        self.chars_per_second = chars_per_second

    async def synthesize_stream(self, text_stream: AsyncIterator[str]) -> AsyncIterator[np.ndarray]:
        first_chunk = True
        buffer = ""
        async for text in text_stream:
            buffer += text
            # Synthesize once enough text has accumulated, or immediately for the first chunk.
            if len(buffer) >= 20 or first_chunk:
                if first_chunk:
                    await asyncio.sleep(self.time_to_first_chunk)
                    first_chunk = False
                duration = len(buffer) / self.chars_per_second
                # Synthetic noise stands in for generated 16 kHz speech samples.
                yield np.random.randn(int(16000 * duration)).astype(np.float32) * 0.1
                buffer = ""
                await asyncio.sleep(duration * 0.5)
        # Flush any text left in the buffer when the token stream ends.
        if buffer:
            duration = len(buffer) / self.chars_per_second
            yield np.random.randn(int(16000 * duration)).astype(np.float32) * 0.1
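
Wiring the LLM's token stream directly into the synthesizer shows the key property we are after: audio starts playing before the full response has been generated. This probe is a minimal sketch for illustration:

import asyncio
import time

async def probe_tts() -> None:
    llm = StreamingLLM()
    tts = StreamingTTS()
    start = time.time()
    total_samples = 0
    # Feed LLM tokens straight into TTS and time the first audio chunk.
    async for audio in tts.synthesize_stream(llm.generate_response("weather")):
        if total_samples == 0:
            print(f"first audio chunk after {time.time() - start:.2f}s")
        total_samples += len(audio)
    print(f"synthesized {total_samples / 16000:.2f}s of audio")

asyncio.run(probe_tts())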

Putting It All Together

Now that we’ve built our individual components, let’s create the StreamingVoiceAgent class that combines all of them. This class will manage the states of the voice agent and process user input through each stage of the pipeline.
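
The agent below tracks what it is currently doing through an AgentState enum that we have not defined yet. A minimal stand-in could look like this (the exact set of states is our own assumption):

from enum import Enum

class AgentState(Enum):
    # Minimal set of conversational states; a fuller agent may track more.
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"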

import time

class StreamingVoiceAgent:
    def __init__(self, latency_budgets: LatencyBudgets):
        self.budgets = latency_budgets
        self.audio_stream = AudioInputStream()
        self.asr = StreamingASR(latency_budgets.asr_processing)
        self.llm = StreamingLLM(
            latency_budgets.llm_first_token,
            1.0 / latency_budgets.llm_token_generation
        )
        self.tts = StreamingTTS(
            latency_budgets.tts_first_chunk,
            1.0 / latency_budgets.tts_chunk_generation
        )
        self.state = AgentState.LISTENING

    async def process_turn(self, user_input: str) -> LatencyMetrics:
        metrics = LatencyMetrics()
        start_time = time.time()
        metrics.audio_chunk_received = time.time() - start_time
        audio_gen = self.audio_stream.stream_audio(user_input)
        metrics.asr_started = time.time() - start_time
        async for text, final in self.asr.transcribe_stream(audio_gen, user_input):
            if final:
                metrics.asr_complete = time.time() - start_time
                # Handle response and TTS generation here
            else:
                metrics.asr_partial = time.time() - start_time
        return metrics
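
The placeholder inside process_turn is where response generation and speech synthesis would go; the complete version lives in the repository. As a rough sketch of what driving those remaining stages and recording their timestamps might look like, here is one way to run a turn end to end (run_demo_turn and its bookkeeping are our own illustration, not the pipeline's official API):

import asyncio
import time

async def run_demo_turn(user_input: str) -> LatencyMetrics:
    budgets = LatencyBudgets()
    agent = StreamingVoiceAgent(budgets)
    metrics = LatencyMetrics()
    start = time.time()

    # ASR: consume simulated audio until the final transcript arrives.
    metrics.asr_started = time.time() - start
    transcript = ""
    audio = agent.audio_stream.stream_audio(user_input)
    async for text, final in agent.asr.transcribe_stream(audio, user_input):
        if final:
            transcript = text
            metrics.asr_complete = time.time() - start
        else:
            metrics.asr_partial = time.time() - start

    # LLM: wrap the token stream so the first token can be timestamped.
    agent.state = AgentState.THINKING
    metrics.llm_started = time.time() - start

    async def timed_tokens():
        async for token in agent.llm.generate_response(transcript):
            if metrics.llm_first_token == 0.0:
                metrics.llm_first_token = time.time() - start
            yield token
        metrics.llm_complete = time.time() - start

    # TTS: synthesize speech from the token stream and timestamp the first chunk.
    agent.state = AgentState.SPEAKING
    metrics.tts_started = time.time() - start
    async for _chunk in agent.tts.synthesize_stream(timed_tokens()):
        if metrics.tts_first_chunk == 0.0:
            metrics.tts_first_chunk = time.time() - start
    metrics.tts_complete = time.time() - start
    agent.state = AgentState.LISTENING
    return metrics

metrics = asyncio.run(run_demo_turn("what is the weather like today"))
print(f"time to first audio: {metrics.tts_first_chunk - metrics.asr_complete:.2f}s")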

Conclusion

We’ve just scratched the surface of building a voice agent that operates in real-time while carefully managing latency. The structure we’ve established can be expanded upon with further enhancements and features over time. For more in-depth information and the complete code, refer to the GitHub repository.

FAQs

What is a voice agent?

A voice agent is a software application that uses voice recognition and speech synthesis technologies to interact with users through spoken language.

Why is latency important in voice interactions?

Latency affects user experience; high latency can make interactions feel slow and unresponsive, while low latency leads to smoother and more natural conversations.

What are the main components of a voice agent?

The primary components include audio input processing, automatic speech recognition (ASR), language model response generation, and text-to-speech (TTS) synthesis.

How can I improve the performance of my voice agent?

Optimizing algorithms, reducing processing time for each stage, and managing resource allocation effectively can enhance the overall performance of your voice agent.

Where can I find more information on voice agents?

You can read more about voice technologies and implementations on authoritative sites like MIT Technology Review and Towards Data Science.