Building a Real-Time Speech API for Industrial Voice Assistants
Published on June 11, 2025
Over the past few months, while working on our intelligent voice assistant Sabot, we found ourselves needing a solid solution for speech-to-text (STT) and text-to-speech (TTS). And not just any solution - we needed something that would run locally, work reliably in industrial environments, and integrate tightly with our assistant's natural language interface.
As a team, we decided to build it ourselves.
In this post, I’ll walk you through how we ended up creating our real-time Speech API, what the architecture looks like, and what we learned along the way.
We’ve been developing Sabot, our intelligent voice assistant designed specifically for industrial machines. With Sabot, machine operators can speak naturally to their machines - give commands, ask questions, receive updates, all via voice.
But for Sabot to be truly usable in a factory setting, we needed high-quality speech capabilities that could work offline and be hosted on-premise - for privacy, for latency, and to give our customers full control over the solution.
That's how the project started. What began as a need within Sabot eventually became a standalone component: a modular Speech API that handles both real-time transcription and voice synthesis, which we now use in Sabot and can reuse in future customer projects.
Here's a simple top-level view of the system:
We designed the API around a handful of requirements:
- transcribe live audio from a microphone in real time
- convert text responses into natural-sounding speech
- keep latency low while staying language-aware
- integrate easily with systems like Sabot
- run entirely on local infrastructure

It's now part of our speech pipeline in Sabot, but we've built it so it can be reused across multiple setups.
Under the hood, the architecture is more modular and scalable. Here’s the full system breakdown:
Our speech processing system operates across three distinct architectural layers:
Running on the user's device
The client layer handles all user-facing interactions through a coordinated set of components. The Audio Recorder captures and formats raw microphone input, converting it to PCM for processing. Real-time communication happens through our WebSocket Client, which streams audio chunks to the backend while the Recognition Result Receiver handles incoming transcribed text.
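To make that concrete, here is a rough sketch of what such a client-side path can look like in Python, using sounddevice for capture and websockets for streaming. The endpoint URL, sample rate, and chunk size are illustrative assumptions, not the actual Speech API contract.

```python
# Sketch of the client-side streaming path: capture PCM from the microphone
# and push it over a WebSocket while printing transcripts as they arrive.
import asyncio
import sounddevice as sd
import websockets

SAMPLE_RATE = 16_000          # assumed; Whisper-style models expect 16 kHz mono
CHUNK_MS = 100                # assumed chunk duration per WebSocket message
FRAMES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000

async def stream_microphone(uri: str = "ws://localhost:8000/stt") -> None:
    audio_queue: asyncio.Queue[bytes] = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def on_audio(indata, frames, time_info, status):
        # Called by sounddevice on its own thread; hand raw PCM to asyncio.
        loop.call_soon_threadsafe(audio_queue.put_nowait, bytes(indata))

    async with websockets.connect(uri) as ws:
        with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1,
                               dtype="int16", blocksize=FRAMES_PER_CHUNK,
                               callback=on_audio):
            async def sender():
                while True:
                    await ws.send(await audio_queue.get())

            async def receiver():
                async for message in ws:   # transcribed text from the backend
                    print("Transcript:", message)

            await asyncio.gather(sender(), receiver())

if __name__ == "__main__":
    asyncio.run(stream_microphone())
```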
For the reverse journey, the Request Processor and REST Client work together to send text for synthesis and manage the resulting audio playback through the Audio Player, which outputs speech directly to the device's speaker.
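The synthesis direction is a plain request/response call followed by local playback. A minimal sketch, assuming a /tts route that returns WAV bytes (the route name and payload fields are placeholders):

```python
# Sketch of the reverse path: ask the backend to synthesize a reply and
# play it on the device speaker.
import io
import requests
import sounddevice as sd
import soundfile as sf

def speak(text: str, base_url: str = "http://localhost:8000") -> None:
    response = requests.post(f"{base_url}/tts",
                             json={"text": text, "voice": "default"},
                             timeout=30)
    response.raise_for_status()

    # Decode the returned WAV bytes and play them back.
    audio, sample_rate = sf.read(io.BytesIO(response.content), dtype="float32")
    sd.play(audio, samplerate=sample_rate)
    sd.wait()  # block until playback finishes

if __name__ == "__main__":
    speak("Spindle speed set to 1200 RPM.")
```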
The processing powerhouse
This is where the main processing happens. Voice Activity Detection (VAD) acts as our intelligent gatekeeper, filtering out silence and background noise to ensure only meaningful speech segments reach our recognition engine. This optimization keeps the entire system both efficient and accurate.
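As an illustration, such a VAD gate can be quite small; the sketch below uses the webrtcvad package as one possible implementation (frames must be 10, 20, or 30 ms of 16-bit mono PCM at a supported sample rate):

```python
# Illustrative VAD gate: drop frames that contain only silence or noise.
import webrtcvad

SAMPLE_RATE = 16_000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit samples -> 2 bytes each

vad = webrtcvad.Vad(2)  # aggressiveness from 0 (lenient) to 3 (strict)

def speech_frames(pcm: bytes):
    """Yield only the 30 ms frames that contain speech."""
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[offset:offset + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```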
Our Speech Processor and Recognition Engine primarily rely on OpenAI's Whisper models, chosen for their exceptional speed, accuracy, and multilingual support. These models can be fine-tuned for specific vocabularies, making them perfect for specialized applications.
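For reference, transcription with the open-source whisper package looks roughly like this; the model size and language are placeholders, and a fine-tuned checkpoint would be loaded the same way:

```python
# Minimal recognition sketch with the open-source whisper package.
import whisper

model = whisper.load_model("small")  # e.g. tiny / base / small / medium / large

def transcribe(wav_path: str, language: str | None = None) -> str:
    result = model.transcribe(wav_path, language=language)
    return result["text"].strip()

print(transcribe("operator_command.wav"))
```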
The text-to-speech conversion relies on StyleTTS2 architecture, an open-source solution that supports multiple voice profiles. From various male and female tones to different speaking styles, this creates a genuinely human and pleasant listening experience. The Audio Streamer completes the cycle by streaming generated audio back to clients for immediate playback.
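One way to express the streaming part is a FastAPI endpoint that forwards audio chunks as they are produced. Note that synthesize_chunks below is a hypothetical stand-in for the TTS engine wrapper, not the actual StyleTTS2 interface:

```python
# Sketch of streaming generated audio back to the client as it is produced.
from typing import Iterator
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class TTSRequest(BaseModel):
    text: str
    voice: str = "default"

def synthesize_chunks(text: str, voice: str) -> Iterator[bytes]:
    # Placeholder: yield encoded audio chunks from the TTS engine here.
    raise NotImplementedError

@app.post("/tts")
def tts(request: TTSRequest) -> StreamingResponse:
    # Chunks are sent to the client as soon as the generator yields them.
    return StreamingResponse(synthesize_chunks(request.text, request.voice),
                             media_type="audio/wav")
```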
Modular model management
Our registry system maintains all models in an organized, accessible format that enables seamless switching and upgrades. This architectural choice keeps our system beautifully modular, whether we're adding support for new voices, switching between languages, or optimizing for specific hardware configurations. The registry manages both STT models for speech recognition tasks and TTS models for synthesis, each supporting various voices and styles to meet different application needs.
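In code, the registry idea boils down to mapping names to lazy loaders, so swapping an STT or TTS backend becomes a configuration change rather than an API change. A simplified sketch (the names and loaders are examples only):

```python
# Simplified model registry: register loaders by name, load on first use.
from typing import Any, Callable, Dict

import whisper  # used here only as an example STT backend

class ModelRegistry:
    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._cache: Dict[str, Any] = {}

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        self._loaders[name] = loader

    def get(self, name: str) -> Any:
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()  # lazy load and cache
        return self._cache[name]

registry = ModelRegistry()
registry.register("stt/whisper-small", lambda: whisper.load_model("small"))
# registry.register("tts/styletts2-default", load_styletts2)  # hypothetical loader
engine = registry.get("stt/whisper-small")
```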
The system works seamlessly from the moment an operator speaks until the response is heard. For speech-to-text processing, the operator speaks into a microphone and the audio is streamed to the backend. Voice Activity Detection (VAD) filters out silence and background noise, and the recognition engine transcribes the relevant parts. The recognized text is then returned to the client.
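Put together, the backend side of that flow can be sketched as a single WebSocket endpoint that buffers incoming PCM, keeps only the speech frames, and transcribes every few seconds of collected audio. The endpoint name, buffering policy, and model choice below are simplified assumptions rather than our production logic:

```python
# Backend-side sketch of the STT flow: receive PCM, gate with VAD, transcribe.
import numpy as np
import webrtcvad
import whisper
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
SAMPLE_RATE = 16_000
FRAME_BYTES = SAMPLE_RATE * 30 // 1000 * 2   # 30 ms of 16-bit mono PCM
vad = webrtcvad.Vad(2)
model = whisper.load_model("small")

def speech_only(pcm: bytes) -> bytes:
    """Keep only VAD-approved 30 ms frames from a PCM chunk."""
    kept = bytearray()
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[offset:offset + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            kept.extend(frame)
    return bytes(kept)

@app.websocket("/stt")
async def stt_endpoint(websocket: WebSocket) -> None:
    await websocket.accept()
    buffer = bytearray()
    try:
        while True:
            buffer.extend(speech_only(await websocket.receive_bytes()))
            if len(buffer) >= SAMPLE_RATE * 2 * 5:   # ~5 s of speech collected
                audio = np.frombuffer(bytes(buffer), dtype=np.int16).astype(np.float32) / 32768.0
                result = model.transcribe(audio)     # Whisper accepts float32 arrays at 16 kHz
                await websocket.send_text(result["text"].strip())
                buffer.clear()
    except WebSocketDisconnect:
        pass  # client hung up; nothing left to do
```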
For text-to-speech conversion, once the client sends text to the backend, it is processed and passed to the synthesizer engine. The engine generates audio, streams it back to the client, and the speaker plays the response in real time.
Building this system taught us several key lessons that shaped our approach to real-time voice processing. Voice Activity Detection (VAD) proved absolutely critical: without it, our speech-to-text engine wasted computational resources on noise and silence, while the right VAD implementation dramatically improved performance. Streaming via WebSockets delivered the lowest latency for real-time STT, which is essential for a smooth user experience. And our decision to separate models through a registry pattern made future upgrades seamless, allowing us to swap models without touching core API logic while keeping the system modular.
Surprisingly, voice style significantly affects user experience even in industrial contexts. A well-chosen voice profile makes the system feel natural rather than robotic, which users appreciate more than expected. Finally, we discovered that fine-tuning both STT and TTS models for specific vocabularies or accents is highly effective, opening exciting possibilities I plan to explore in future posts.
We’re continuing to refine and expand this system. On our roadmap:
What started as a subcomponent of Sabot turned into a real-time Speech API service that is fully reusable across our projects and can be adapted for various industrial applications.
Building this Speech API from scratch using open-source models was a great experience for our team. It gave us the flexibility and control we needed while also allowing us to create a solution that fits our specific requirements.
I hope this post gives you some insight into how we approached building our Speech API and the challenges we faced. If you’re working on similar projects or want to learn more about the system, feel free to reach out.
Thanks for reading!
– Richard