OpenAI's MRC protocol makes cluster networking the latest frontier in the training efficiency race

OpenAI has released Multipath Reliable Connection (MRC), an open networking protocol co-developed over two years with AMD, Broadcom, Intel, Microsoft, and Nvidia and published through the Open Compute Project. It targets a specific problem: GPU time wasted by network congestion and link failures during frontier-scale AI training runs.

The technical problem MRC solves is underappreciated outside of infrastructure teams. Traditional cluster networking protocols like InfiniBand and RoCEv2 route data along a fixed path or use simple static load balancing. In training runs of 100,000-plus GPUs, any congested link or failed switch can stall the entire job until the issue resolves or the system reroutes. Greg Brockman noted on X that at Stargate scale, a single slow link can idle thousands of GPUs simultaneously. A 0.5 percent daily node failure rate in a 100,000-GPU cluster means roughly 500 GPUs going dark every 24 hours. Whether training completes in three months or six often comes down to how quickly the network detects, isolates, and reroutes around those failures, not how many GPUs you started with.
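The failure arithmetic above can be checked with a minimal sketch. The function names are hypothetical, and the 90-day run length and the assumption that failed GPUs are replaced promptly (so the rate stays constant) are illustrative, not figures from OpenAI.

```python
def expected_daily_failures(cluster_size: int, daily_failure_rate: float) -> float:
    """Expected number of GPUs failing per day at a constant daily failure rate."""
    return cluster_size * daily_failure_rate


def expected_failures_over_run(cluster_size: int, daily_failure_rate: float, days: int) -> float:
    """Expected failure events over a run, assuming failed GPUs are replaced promptly."""
    return expected_daily_failures(cluster_size, daily_failure_rate) * days


if __name__ == "__main__":
    # Figures from the article: 100,000 GPUs, 0.5 percent daily failure rate.
    print(expected_daily_failures(100_000, 0.005))
    # Assumed 90-day run (about three months).
    print(expected_failures_over_run(100_000, 0.005, days=90))
```

At the 0.5 percent rate this gives about 500 failures a day, so under these assumptions a three-month run would have to absorb tens of thousands of failure events.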

MRC addresses this through multipath packet spraying and SRv6 source routing. Instead of assigning each data flow to a single path, MRC spreads individual packets across hundreds of network paths simultaneously. If one path becomes congested or a link flaps, packets on the remaining paths keep arriving, and only the share travelling over the affected path is disrupted. The protocol runs on a lossy fabric and provides reliable delivery at the connection layer, unlike InfiniBand's lossless, credit-based approach. That distinction matters for cost: MRC is optimised for commodity Ethernet hardware rather than proprietary InfiniBand interconnects, making it viable for hyperscalers and cloud providers building clusters at commercial scale. OpenAI has deployed it across all of its largest supercomputers, including the Nvidia GB200-based Stargate facility in Abilene, Texas, and Microsoft's Fairwater cluster.
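The difference between single-path routing and packet spraying can be illustrated with a toy model. This is a sketch under stated assumptions, not MRC's actual algorithm: the round-robin assignment, the function names, and the path and packet counts are all hypothetical, and real spraying policies, SRv6 path encoding, and retransmission logic are not modelled.

```python
from typing import Set


def packets_hit_single_path(num_packets: int, flow_path: int, failed: Set[int]) -> int:
    """All packets of a flow share one path, so a failure on it hits every packet."""
    return num_packets if flow_path in failed else 0


def packets_hit_sprayed(num_packets: int, num_paths: int, failed: Set[int]) -> int:
    """Packets are assigned round-robin (an assumption); only those on failed paths are hit."""
    return sum(1 for seq in range(num_packets) if seq % num_paths in failed)


if __name__ == "__main__":
    failed_paths = {7}  # hypothetical: one of 256 paths has a flapping link
    single = packets_hit_single_path(10_000, flow_path=7, failed=failed_paths)
    sprayed = packets_hit_sprayed(10_000, num_paths=256, failed=failed_paths)
    print(f"single-path flow pinned to the failed path: {single} packets affected")
    print(f"sprayed across 256 paths: {sprayed} packets affected")
```

In this toy setup a flow pinned to the failed path has all 10,000 packets affected, while the sprayed flow has 40 affected, which a reliable connection layer would then need to recover over the healthy paths.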

The performance improvement is real but difficult to quantify from outside. OpenAI states that in many cases training continues through link-flap events without measurable disruption, which is meaningfully better than the 30 percent slowdown that traditional single-path RDMA can suffer under network congestion. The company trained multiple frontier models on MRC-equipped infrastructure before publishing the protocol. That production deployment record is what separates MRC from a research paper: it has already run at the scale it claims to address.

For SF readers, training efficiency is now a capital strategy as much as a technical one. The economics are straightforward. Frontier model training at Stargate scale costs roughly $20 million to $30 million per week in compute. A 10 percent improvement in cluster utilisation from reduced idle GPU time is $2 million to $3 million per week in saved compute, or roughly $26 million to $78 million over a three-to-six-month training run. Even a 3 to 5 percent utilisation gain at that scale would likely outweigh the engineering cost of developing the protocol. Every percentage point of efficiency translates directly into faster model release cadence, lower marginal cost per model, and more training budget for capability improvements rather than restarting stalled jobs.
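The savings arithmetic can be checked with a short sketch. The function names are hypothetical, and the run lengths of 13 and 26 weeks are an assumption standing in for a three-to-six-month run. The weekly cost range and the utilisation gains are the article's figures.

```python
def weekly_savings_usd(weekly_cost_usd: int, gain_percent: float) -> float:
    """Compute saved per week if utilisation improves by gain_percent."""
    return weekly_cost_usd * gain_percent / 100


def run_savings_usd(weekly_cost_usd: int, gain_percent: float, weeks: int) -> float:
    """Compute saved over a whole run of the given length in weeks."""
    return weekly_savings_usd(weekly_cost_usd, gain_percent) * weeks


if __name__ == "__main__":
    # Low end: $20M per week over an assumed 13-week run.
    # High end: $30M per week over an assumed 26-week run.
    for gain in (3, 5, 10):
        low = run_savings_usd(20_000_000, gain, weeks=13)
        high = run_savings_usd(30_000_000, gain, weeks=26)
        print(f"{gain}% gain: ${low:,.0f} to ${high:,.0f}")
```

Under these assumptions a 10 percent gain yields roughly $26 million at the low end and $78 million at the high end.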

MRC's release through the Open Compute Project is strategically interesting because it raises the question of open ecosystem versus proprietary advantage. Publishing a networking protocol as an open standard benefits OpenAI in several ways at once. It encourages hardware vendors to implement MRC in their network interface cards and switches, which expands the infrastructure available to OpenAI. It establishes OpenAI as a contributor to industry standards rather than a closed system, which matters for government procurement and enterprise customer trust. It also makes competing frontier labs more efficient, which sounds like a concession but benefits the broader ecosystem OpenAI depends on for talent, research, and commercial partnerships.

Smaller AI labs and startups should watch MRC closely for a different reason. Training efficiency tools historically commoditise within 12 to 18 months of open release: hardware vendors build support, cloud providers enable it by default, and the advantage accrues to whoever implements it fastest rather than whoever invented it. The startups and mid-size labs training on AWS, Azure, or GCP will get MRC through their cloud provider's networking stack without needing to build anything themselves. The real differentiation shifts back upstream to model architecture, data quality, and post-training methodology, exactly the layers where OpenAI already leads. MRC raises the floor for everyone while reinforcing the ceiling that only a handful of organisations can reach.
