Original title: How OpenAI delivers low-latency voice AI at scale
Article
OpenAI explains that real-time voice systems need low media latency, stable jitter and loss handling, and fast startup, because users expect conversational turn-taking, especially in ChatGPT voice and Realtime API workloads. The team uses WebRTC as the client-facing protocol to reuse existing browser and mobile interoperability rather than reinvent NAT traversal, encryption, codec negotiation, and adaptive transport. Because OpenAI's traffic is predominantly 1:1 and latency-sensitive, they favor a transceiver architecture over an SFU: a single service holds all WebRTC session state while the backend inference services stay stateless.

Their initial Kubernetes-based approach exposed too many public UDP ports under a one-port-per-session pattern, creating scaling, security, and operational problems. They replaced it with a stateless relay in front of the transceivers: the relay exposes a small, fixed public UDP surface, reads only initial STUN metadata such as the ICE ufrag, and forwards packets to the owning transceiver, leaving ICE, DTLS, SRTP, and session lifecycle entirely on that stateful endpoint (the routing step is sketched in the first example below). Route recovery combines in-memory flow state with Redis mappings so traffic can be re-homed when relays restart, and globally distributed relay ingress points with proximity steering shorten first-hop latency for signaling and the first media checks.

Implementation choices are deliberately conservative: Go-based services, a shared UDP binding via SO_REUSEPORT, thread pinning, and minimal-copy paths for throughput without the complexity of kernel bypass. The post frames this as the right choice for point-to-point voice AI at scale while preserving standard client expectations, though some commenters note that the article's 900-million-user framing may overstate how many users are actually on voice paths, and question whether that scale assumption is the right optimization target.
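To make the relay's routing step concrete, here is a minimal Go sketch of the idea as summarized above: the first packet of a flow is expected to be a STUN Binding request whose USERNAME attribute carries the ICE ufrag, which the relay uses to find the owning transceiver; later packets hit a per-flow cache and are forwarded without inspection. All names here (stunUfrag, ufragToBackend, flowTable) are hypothetical, and this is an illustration of the described technique, not OpenAI's code.

```go
// relay_sketch.go: hypothetical illustration, not OpenAI's implementation.
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"net"
	"strings"
	"sync"
)

const (
	stunMagicCookie = 0x2112A442 // RFC 5389 magic cookie
	attrUsername    = 0x0006     // STUN USERNAME attribute
)

// stunUfrag extracts the receiving side's ICE ufrag from a STUN Binding
// request. In ICE connectivity checks the USERNAME value is
// "<dst ufrag>:<src ufrag>", so the first component identifies which
// transceiver owns the session.
func stunUfrag(pkt []byte) (string, error) {
	if len(pkt) < 20 || binary.BigEndian.Uint32(pkt[4:8]) != stunMagicCookie {
		return "", errors.New("not a STUN packet")
	}
	msgLen := int(binary.BigEndian.Uint16(pkt[2:4]))
	if 20+msgLen > len(pkt) {
		return "", errors.New("truncated STUN message")
	}
	// Walk the attributes: 4-byte header, then a value padded to 4 bytes.
	for off := 20; off+4 <= 20+msgLen; {
		typ := binary.BigEndian.Uint16(pkt[off : off+2])
		alen := int(binary.BigEndian.Uint16(pkt[off+2 : off+4]))
		if off+4+alen > len(pkt) {
			return "", errors.New("truncated attribute")
		}
		if typ == attrUsername {
			user := string(pkt[off+4 : off+4+alen])
			if i := strings.IndexByte(user, ':'); i > 0 {
				return user[:i], nil
			}
			return "", errors.New("malformed USERNAME")
		}
		off += 4 + (alen+3)/4*4 // skip value plus padding
	}
	return "", errors.New("no USERNAME attribute")
}

// relay keeps the two lookup layers described in the article: a mapping
// from ufrag to owning transceiver (populated at signaling time) and a
// per-flow cache keyed by client address, so established flows (encrypted
// SRTP carrying no ufrag) skip STUN parsing entirely.
type relay struct {
	mu             sync.RWMutex
	ufragToBackend map[string]*net.UDPAddr
	flowTable      map[string]*net.UDPAddr
}

func (r *relay) route(src *net.UDPAddr, pkt []byte) (*net.UDPAddr, error) {
	r.mu.RLock()
	dst, ok := r.flowTable[src.String()]
	r.mu.RUnlock()
	if ok {
		return dst, nil // established flow: forward without inspecting payload
	}
	ufrag, err := stunUfrag(pkt) // first packet must be a STUN check
	if err != nil {
		return nil, err
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	dst, ok = r.ufragToBackend[ufrag]
	if !ok {
		return nil, fmt.Errorf("unknown ufrag %q", ufrag)
	}
	r.flowTable[src.String()] = dst
	return dst, nil
}

func main() {
	// Exercise the parser with a handcrafted Binding request whose
	// USERNAME is "srv:cli"; the relay would route on "srv".
	user := []byte("srv:cli")
	msg := make([]byte, 20+4+8) // header + attribute header + padded value
	binary.BigEndian.PutUint16(msg[0:2], 0x0001) // Binding request
	binary.BigEndian.PutUint16(msg[2:4], 12)     // message length
	binary.BigEndian.PutUint32(msg[4:8], stunMagicCookie)
	binary.BigEndian.PutUint16(msg[20:22], attrUsername)
	binary.BigEndian.PutUint16(msg[22:24], uint16(len(user)))
	copy(msg[24:], user)
	fmt.Println(stunUfrag(msg)) // srv <nil>
}
```

The design point this illustrates is that the relay stays stateless in the session sense: everything it caches is reconstructible, so losing a relay never loses a call.

The shared UDP binding the post mentions maps onto a well-known Go pattern: each worker opens its own socket on the same port with SO_REUSEPORT set, and the kernel hashes incoming flows across the sockets. Below is a minimal Linux-only sketch, assuming the golang.org/x/sys/unix package; the port number and worker count are arbitrary choices, not values from the post.

```go
package main

import (
	"context"
	"log"
	"net"
	"runtime"
	"sync"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReuse opens a UDP socket with SO_REUSEPORT so multiple workers
// can bind the same address and let the kernel spread flows among them.
func listenReuse(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			if err := c.Control(func(fd uintptr) {
				serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return serr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", addr)
}

func main() {
	const workers = 4 // arbitrary; a real deployment would size this to cores
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		conn, err := listenReuse(":3478") // hypothetical relay port
		if err != nil {
			log.Fatal(err)
		}
		wg.Add(1)
		go func(id int, conn net.PacketConn) {
			defer wg.Done()
			runtime.LockOSThread() // crude stand-in for the thread pinning the post mentions
			buf := make([]byte, 1500) // one MTU-sized buffer, reused per read
			for {
				n, src, err := conn.ReadFrom(buf)
				if err != nil {
					return
				}
				// A real relay would route buf[:n] to the owning
				// transceiver here (see the previous sketch).
				log.Printf("worker %d: %d bytes from %v", id, n, src)
			}
		}(i, conn)
	}
	wg.Wait()
}
```

Reusing one buffer per worker and routing on the address alone is what keeps the hot path close to the "minimal copy" goal the post describes, without resorting to kernel bypass.

For the route-recovery point, one plausible reading (an assumption here, since the post is only summarized above) is that the ufrag-to-transceiver mapping is also written to Redis at session setup, so a restarted relay can rebuild its in-memory table on a cache miss instead of dropping the session. A sketch using the github.com/redis/go-redis/v9 client; the key prefix and TTL are invented.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// routeStore persists ufrag -> transceiver-address mappings so that a
// relay that restarts (losing its in-memory flow table) can still find
// the owning transceiver for an in-flight session.
type routeStore struct {
	rdb *redis.Client
}

func (s *routeStore) save(ctx context.Context, ufrag, transceiverAddr string) error {
	// The TTL bounds stale entries to roughly a session lifetime (invented value).
	return s.rdb.Set(ctx, "route:"+ufrag, transceiverAddr, 2*time.Hour).Err()
}

func (s *routeStore) lookup(ctx context.Context, ufrag string) (string, error) {
	return s.rdb.Get(ctx, "route:"+ufrag).Result()
}

func main() {
	ctx := context.Background()
	s := &routeStore{rdb: redis.NewClient(&redis.Options{Addr: "localhost:6379"})}

	// At session setup, the signaling path records the owner...
	if err := s.save(ctx, "srv", "10.0.0.7:50000"); err != nil {
		panic(err)
	}
	// ...and after a relay restart, a flow-table miss falls back to Redis
	// before giving up on the session.
	addr, err := s.lookup(ctx, "srv")
	fmt.Println(addr, err)
}
```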
Readers broadly appreciated the technical transparency and the references to Pion, open-source tooling, and Go; several shared related resources such as Pipecat and pointed to ongoing protocol and API work like RFC 9297. A recurring theme was that lower latency alone can feel less natural if the model starts talking before users finish their thoughts, with users asking for better interruption handling and fewer premature responses. Several participants praised voice as a tool for shaping ideas but criticized current realtime models as less capable than frontier systems, repeatedly requesting stronger speech models beyond the 4o family. Some commenters were skeptical of the cited user scale, arguing that total platform MAU is not the same as active voice users and that the distinction matters for infrastructure planning. Others contested technical assumptions, suggesting that ICE candidate and feature-flag tuning in libwebrtc might solve many latency issues without major relay-level changes, and asking whether OpenAI still uses LiveKit and how sessions recover if a transceiver fails mid-stream. At least one experienced developer asserted that the relay and transceiver complexity is unnecessary for most end users when direct network conditions already permit running straightforwardly on Kubernetes. Sentiment toward the product itself was mixed: positive recognition of the engineering effort sat alongside strong frustration about output quality, verbosity, and conversational polish.