Most chatbots stick to one modality—either text or voice. But as someone who uses subtitles for everything, I wonder why voice bots don’t also include text for accessibility. Is it a limitation in the voice tech stack? Does text clutter the UI? To find out, I decided to build my own streaming-first chatbot interface with both text and voice.

Inspiration

I’m not the first to try this. 2024 is the year of multimodal AI. Amelia Wattenberger’s 2023 talk at AI Engineer still inspires me. So, let’s look at what’s already out there and draft some requirements for my project.

ChatGPT

I was speaking to ChatGPT, but you can’t tell from the picture.

Here’s my user journey:

  1. I clicked the microphone icon to record my voice.
  2. I spoke into my phone.
  3. I clicked the icon again to stop.
  4. I waited about a second.
  5. I checked the transcription for errors.
  6. I hit send.
  7. I read ChatGPT’s streamed response.

It’s not a true voice interface, but it’s an easy way to add voice to a text-based bot. Most devices offer on-device STT (speech-to-text) as an accessibility feature, and developers can leverage it to layer voice onto an existing text interface.
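
On the web, for example, the built-in SpeechRecognition API (still webkit-prefixed in Chrome) can bolt dictation onto any existing text box. A minimal sketch, assuming hypothetical element IDs for the input and mic button:

```typescript
// A sketch of layering voice onto an existing text input with the
// browser's built-in SpeechRecognition (element IDs are hypothetical).
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SpeechRecognitionImpl();
recognition.interimResults = true; // show partial transcripts while speaking

const input = document.querySelector<HTMLInputElement>("#chat-input")!;
recognition.onresult = (event: any) => {
  // Join everything recognized so far, so the user can review and edit
  // the transcription before hitting send.
  input.value = Array.from(event.results as ArrayLike<any>)
    .map((r: any) => r[0].transcript)
    .join("");
};

document.querySelector("#mic-button")!.addEventListener("click", () => {
  recognition.start(); // stops after a pause, or call recognition.stop()
});
```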

My question and ChatGPT’s answer

This chat feels one-sided and slow; there are too many steps inhibiting a natural flow. But for one-offs, or for questions where I expect a long, code-heavy answer, it works.

Notably, ChatGPT’s response streams in, so I don’t have to wait for the full reply before I start reading. Humans read about 20 tokens per second; as long as the LLM beats that rate, users will never get stuck waiting. Streaming cuts perceived latency and makes the bot feel faster. But streaming is tough and requires a unique, ground-up approach.
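
Consuming a streamed reply in the browser is short enough to sketch here. The /chat endpoint and the plain-text framing are stand-ins for whatever your backend actually exposes:

```typescript
// Sketch: render tokens as they arrive instead of waiting for the full
// reply. The /chat endpoint and plain-text chunks are assumptions.
async function streamReply(prompt: string, onToken: (t: string) => void) {
  const res = await fetch("/chat", { method: "POST", body: prompt });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    onToken(decoder.decode(value, { stream: true })); // one chunk at a time
  }
}

// Usage: append each chunk to the message bubble as it lands.
streamReply("Why don’t voice bots show subtitles?", (t) => {
  document.querySelector("#reply")!.textContent += t;
});
```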

ChatGPT (Again)

ChatGPT also has a proper voice mode. You can’t tell from the screenshot, but I’m speaking to ChatGPT, and it’s speaking back to me. Here’s how it works:

  1. Press the button to start a voice chat.
  2. Wait for the connection.
  3. Speak your question.
  4. Wait a second or two.
  5. Listen to ChatGPT’s response.

You can stop or pause playback, or record a new message by tapping anywhere.

I was also speaking to ChatGPT here, but you can’t really tell.

This is closer to true voice interaction. The bot takes turns, responds quickly, and gives shorter answers with voice than it does with text. It doesn’t read code aloud, which makes sense, but it still includes links to sources, which is useful in the transcript. After the conversation ends, I can view the chat transcript just as if it were a text chat. Any code blocks generated during the voice chat are available in this transcript. My ideal interface would show this transcript during the conversation itself.

ChatGPT’s audio seems to stream, starting to speak before the whole answer is ready. The user’s audio is likely streamed as well, since you can recover from pauses and interrupt the bot using nothing but your voice.

My question and ChatGPT’s answer

Perplexity AI

Screenshot from a search done with Perplexity

Perplexity AI focuses on research, and its UI reflects that with multimedia from sources and answers via text. It loads sources first, then streams the answer.

I love how it uses images—like the map in this example—to enhance its responses. It proves that multimodality adds real value to conversations.

Perplexity doesn’t have a voice mode, but I mention it because web artifacts are as much a communication modality as audio. Its interface is designed to display pictures, maps, and other multimedia elements alongside text, not audio. It’s a specialized chat interface, one that avoids the trap of using chatbots for everything.

LiveKit

Screenshot from my chat with Kitt, the LiveKit demo agent

LiveKit’s Kitt demo agent streams both text and audio for both users and LLMs. This is the exact functionality I want in my project, though I want to differ from Kitt in a few ways:

  1. Kitt interrupts users. In my project, the user will hold down a key while speaking, and the bot will only respond when the key is released.
  2. LiveKit uses WebRTC for peer-to-peer connections, but I’m using WebSockets, since they’re simpler for client-server apps (a client-side sketch follows this list).
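
Putting both choices together, here is a hedged sketch of that client side: hold the spacebar to record, stream MediaRecorder chunks over a plain WebSocket, and signal end-of-turn on release. The endpoint URL and the end_of_turn message are placeholders of mine, not LiveKit’s API:

```typescript
// Sketch: push-to-talk over a WebSocket. Hold the spacebar to stream
// mic audio in timeslices; releasing it tells the server the turn is over.
// The ws:// endpoint and the "end_of_turn" signal are hypothetical.
const ws = new WebSocket("ws://localhost:8080/audio");
let recorder: MediaRecorder | null = null;
let starting = false; // guard against repeated keydown events while held

document.addEventListener("keydown", async (e) => {
  if (e.code !== "Space" || recorder || starting) return;
  starting = true;
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (ev) => ws.send(ev.data); // one chunk per timeslice
  recorder.start(250); // fire ondataavailable every 250 ms
  starting = false;
});

document.addEventListener("keyup", (e) => {
  if (e.code !== "Space" || !recorder) return;
  recorder.stop(); // flushes the final chunk
  recorder.stream.getTracks().forEach((t) => t.stop()); // release the mic
  recorder = null;
  ws.send(JSON.stringify({ type: "end_of_turn" })); // bot may now respond
});
```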

Scoping the Project

After consolidating my research, I came up with these requirements (one possible wire format is sketched after the list):

  1. Stream user audio for server-side transcription.
  2. Stream transcription back in real time.
  3. Stream the bot’s text response and render it incrementally.
  4. Stream TTS audio alongside the text.
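
All four streams can share one socket: user audio goes up as raw binary chunks (requirement 1), and everything coming back carries a type tag. The message names below are a sketch of mine, not a fixed protocol:

```typescript
// Hypothetical wire format for the server-to-client side: live
// transcription, incremental bot text, and TTS audio alongside it.
type ServerMessage =
  | { type: "stt_partial"; text: string } // in-progress transcription (req. 2)
  | { type: "llm_token"; token: string }  // incremental bot text (req. 3)
  | { type: "tts_chunk"; audio: string }  // base64 audio frame (req. 4)
  | { type: "done" };                     // end of the bot's turn

function handleMessage(raw: string) {
  const msg = JSON.parse(raw) as ServerMessage;
  switch (msg.type) {
    case "stt_partial":
      console.log("transcript so far:", msg.text); // update transcript UI
      break;
    case "llm_token":
      console.log("bot text:", msg.token); // append to the message bubble
      break;
    case "tts_chunk":
      console.log("audio frame bytes:", msg.audio.length); // queue playback
      break;
    case "done":
      console.log("turn complete");
      break;
  }
}
```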

I chose React/TypeScript for the frontend and Go for the server. I’ve been learning Go since June, and this felt like a good first project.

Results

A GIF showing the UI in action

I built it, and it works! Deepgram STT is a bit wonky, and the TTS sounds terrible, but it meets every criterion I set.

What I Learned

  • WebSockets are like tunnels—you shove data in one end, and it pops out the other.
  • There’s no true “streamed” TTS, only batched. You buffer text and send it to TTS in sentences (see the sketch after this list).
  • STT works similarly: it records audio in chunks called timeslices. Pick a timeslice long enough to avoid cutting words, but short enough for quick reactions.
  • Once the whole audio clip is sent, you can re-transcribe it for accuracy, but that’s not true streaming.
  • Picking timeslices for STT and chunking for TTS is a balancing act.
  • Streaming matters less with fast models. I had to slow down Groq Llama responses with an animation because they arrived too quickly.
  • Decide early whether to stream or batch. Switching later is a nightmare.
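
The sentence-buffering trick from the second bullet looks roughly like this; synthesize is a placeholder for a real TTS call:

```typescript
// Sketch: batch streamed LLM tokens into sentences before handing them
// to TTS. `synthesize` is a stand-in for whatever TTS API you use.
async function synthesize(sentence: string): Promise<void> {
  console.log("TTS:", sentence); // swap in a real TTS request here
}

let buffer = "";

async function onLlmToken(token: string) {
  buffer += token;
  // Flush whenever the buffer ends a sentence: long enough to sound
  // natural, short enough that audio starts before the reply finishes.
  const match = buffer.match(/^[\s\S]*?[.!?](\s|$)/);
  if (match) {
    await synthesize(match[0].trim());
    buffer = buffer.slice(match[0].length);
  }
}

async function onLlmDone() {
  if (buffer.trim()) await synthesize(buffer.trim()); // flush the remainder
  buffer = "";
}
```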

Next Steps

Expect another post soon that dives into the application build process. I would also like to play around with the user interface a bit more and experiment with different layouts.

❤️

Gordy