1775207288

How Microsoft Nearly Lost a Trillion Dollars From the Inside


*A senior Azure engineer exposes the behind-the-scenes story of one of the most silent and costly crises in recent cloud computing history.*

---

When Axel Rietschin arrived at Microsoft's headquarters in Redmond on the morning of May 1st, 2023, he was anything but a newcomer. He had spent years making direct contributions to the technologies underpinning Azure, with stints on the Windows team, SharePoint Online, and Core OS, where he helped invent the container platform that powers Docker, Kubernetes, and Windows Sandbox. What he did not expect was to find an entire organization planning the impossible as if it were routine.

---

## The First Day That Revealed Everything

Rietschin had barely arrived when he was invited to a monthly planning meeting. In the room were leads, architects, and senior engineers. On the screen, a slide packed with familiar acronyms like COM, WMI, VHDX, and ETW, all connected by arrows in a tangle that was difficult to parse.

What was being presented was a plan to port that entire stack of Windows components onto the Overlake chip, a tiny fanless ARM SoC the size of a fingernail, designed to consume as little power and memory as possible. A chip where the hardware engineers had reserved just 4 KB of dual-ported FPGA memory for communication protocols.

Rietschin knew the hardware inside out. He knew the idea was unworkable. But what surprised him most was not the proposal itself. It was the seriousness with which it was received. Nobody in the room questioned it. A Principal Engineering Manager suggested having "a couple of junior developers look into it."

---

## 173 Agents and No Explanation

In the days that followed, Rietschin deepened his understanding of the environment. One of the most unsettling discoveries came from a conversation with the head of Microsoft's Linux group: there were 173 software agents identified as candidates to run inside the Overlake chip.
For context, Azure at its core sells virtual machines, networking, and storage. With observability and servicing on top, that should require a small number of well-defined central processes. How they arrived at 173 is something that, according to Rietschin himself, will probably never be fully explained. Nobody at Microsoft could articulate what all those agents did, why they existed, or how they interacted with one another.

But the problem goes beyond organizational confusion. Those agents were what orchestrated the virtual machines running OpenAI's systems, SharePoint Online, United States government clouds, and other mission-critical infrastructure. A failure there is not just a bug. Depending on the context, it is a collapse with national security implications.

---

## The Real Cost of Technical Complacency

The software stack Rietschin encountered was hitting its limits at just a few dozen VMs per node, in an environment where the hypervisor was capable of supporting over a thousand. On top of that, it was consuming enough host server resources to cause noticeable instability in customer VMs, the so-called "noisy neighbor" problem.

All of this was happening while Microsoft was in the middle of a historic bet on OpenAI, providing the infrastructure for the most widely used language models in the world. The fragility was not just technical. It was strategic, financial, and at certain moments, a matter of institutional trust.

Rietschin says he tried to alert leadership, including the CEO, the Microsoft board, and senior executives in the Cloud and AI division. The silence he received in return is a central part of the story he is telling across a series of articles published on Substack.

---

## What This Means for Azure Users

The most important revelation for any company or developer relying on Azure is not Microsoft's internal drama.
It is the realization that critical infrastructure can be held together by systems nobody fully understands, planned by teams that had lost touch with the technical reality of what they were building.

Rietschin is not saying Azure is insecure today. He is saying that for a considerable period, decisions were made with an alarming distance from real engineering, and that the consequences of that disconnect are still unfolding.

The series continues. The near-loss of OpenAI as a customer, the letters sent to the CEO, the incidents involving the US government, and the features promised publicly before the work had even begun are all coming in the next chapters. Worth following.

---

**Source:** [How Microsoft Vaporized a Trillion Dollars](https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion)

(1) Comments
exterminator
1775207960

This is the kind of story that should be all over every tech news outlet right now, but somehow it's flying under the radar. A senior engineer walks in on day one and finds a 122-person org seriously planning to port half of Windows onto a chip the size of a fingernail, with 4KB of memory. And nobody in the room blinked.

The 173 agents detail is what really gets me. Not 10. Not 30. One hundred and seventy-three background processes running on every single Azure node, and apparently nobody could explain what most of them actually did. This is the infrastructure running OpenAI, US government clouds, and some of the most critical systems on the planet.

We talk a lot about AI safety and model reliability, but the real risk might be sitting a layer below all of that, in the plumbing nobody fully understands anymore.

Rietschin says he wrote to the CEO, to the board, to senior leadership. Silence. That part alone tells you everything about how these organizations handle uncomfortable truths from the inside.

If you use Azure for anything serious, this series is required reading. And if you don't use Azure, read it anyway, because the dynamics described here are not unique to Microsoft.

What part of this surprised you the most? For me it was the scaling limit. A hypervisor capable of 1,024 VMs per node, and the stack was choking at a few dozen. That gap is staggering.

