
Floating point from scratch, hard mode


Julia Desmazes spent half a decade afraid of floating point. When she finally returned to the problem, she didn't just reimplement the arithmetic from scratch: she taped out the chip twice on real silicon. The resulting article is one of the most complete and honest technical pieces to appear on the web recently.

**The problem with floating point**

Most programmers know how to use floats. Few understand what happens inside. The IEEE 754 standard hides a zoo of edge cases: two zeros (+0 and -0), quiet and signaling NaNs, positive and negative infinities, subnormal numbers, five rounding modes with distinct boundary behaviors, and the breakdown of the law of trichotomy in comparisons involving NaN. The article unpacks each of these cases with C++ code examples and explains why every detail matters when building hardware.

**From theory to transistor**

After covering the theory, Desmazes moves on to the ASIC implementation. The chosen format was bfloat16, Google's 16-bit float with 8 exponent bits and 7 mantissa bits: small enough to fit the chip's I/O constraints, easy to convert to and from float32, and with a mantissa short enough to dramatically reduce the cost of the hardware multiplier. For the adder she adopted the dual-path adder architecture, the high-performance standard in FPUs since the 1980s. By restricting rounding to round-toward-zero only, she eliminated all round-up logic, along with NaN, infinity, and subnormal support. The result was a leaner, faster, and easier-to-verify implementation.

**Exhaustive verification and a C++ betrayal**

Testing floating-point arithmetic is notoriously hard because edge cases stay invisible until you run into them. The solution was brute force: running all 2³² input pairs through Verilator as the simulator, with C++23's soft bfloat16_t as the golden model. The catch: the C++ standard library handles bfloat16 internally as a truncated float32, so intermediate results carry precision p = 24 instead of p = 8.
This produces legitimate rounding differences between the model and the hardware, bounded by at most 1 ULP. The fix was adjusting the tests to accept that margin.

**Two tapeouts**

The main design was submitted to Tiny Tapeout's ihp26a shuttle on IHP 130nm, running at 100 MHz with full addition and multiplication in a single cycle. A second tapeout, squeezed into a 24-hour window on the ihp0p4 shuttle, turned into an fmax race: a two-stage pipelined multiplier built around a custom Booth radix-4 encoder hit 454.5 MHz. One interesting technical detail: replacing the tree-based leading-zero counter (LZC) with a simple priority casez cut the logic to 3 levels deep, slightly faster and much easier to maintain. The synthesizer did the rest.

The full article, with schematics, code, and timing analysis, is here: https://essenceia.github.io/projects/floating_dragon/

