

Writing About AI
Uvation
Reen Singh is an engineer and a technologist with a diverse background spanning software, hardware, aerospace, defense, and cybersecurity. As CTO at Uvation, he leverages his extensive experience to lead the company’s technological innovation and development.

The NVIDIA Blackwell Ultra (B300) GPU sets a new standard for AI infrastructure, shifting the industry's focus from simply adding more GPUs to maximizing efficiency, measured in tokens-per-watt and cost-per-million-tokens. It addresses the reality that traditional scaling strategies no longer meet the compute demands of modern AI models that reason, plan, and execute multi-step tasks. The B300 delivers higher tokens-per-watt for energy-efficient inference and a lower cost-per-million-tokens for sustainable margins, making large-scale, high-efficiency AI deployments both practical and economically viable.
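Both headline metrics are straightforward to compute from measured throughput and power draw. The sketch below shows the arithmetic; the numbers in the usage note are illustrative, not B300 benchmarks:

```python
def tokens_per_watt(tokens_per_second: float, power_watts: float) -> float:
    """Energy efficiency: sustained token throughput per watt drawn."""
    return tokens_per_second / power_watts


def cost_per_million_tokens(power_watts: float, tokens_per_second: float,
                            price_per_kwh: float) -> float:
    """Electricity-only cost to generate one million tokens."""
    seconds = 1_000_000 / tokens_per_second
    kilowatt_hours = power_watts * seconds / 3_600_000  # watt-seconds -> kWh
    return kilowatt_hours * price_per_kwh
```

For example, a node sustaining 1,000 tokens/s at 1,000 W and $0.10/kWh lands at roughly $0.028 in electricity per million tokens; raising tokens-per-watt pushes that figure down proportionally.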
The Blackwell Ultra (B300) achieves dramatic improvements over Hopper by fundamentally changing its core architecture. Hopper utilized a monolithic GPU design (~80B transistors), whereas Blackwell Ultra moves to a dual-die unified GPU architecture, containing 208 billion transistors to break scaling limits. This dual-die setup functions as a single compute surface thanks to a 10 TB/s NV-HBI interconnect, a shared memory subsystem, and a synchronized execution model. Additionally, B300 introduces NVFP4, a new 4-bit floating-point precision format optimized specifically for large-scale transformer inference, resulting in a 7.5× increase in dense throughput compared to Hopper’s FP8 (15 PFLOPS vs. ~2 PFLOPS). The GPU also features 288 GB of HBM3e memory, up to 3.6× more than the H100/H200, and 2× faster attention computation to support extended context windows and real-time interactive AI performance.
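NVFP4's exact encoding is NVIDIA-specific and not detailed here; the sketch below illustrates the general mechanics of 4-bit floating-point quantization using a generic E2M1 value grid with a per-block scale. The grid and scaling scheme are assumptions for illustration, not the NVFP4 specification:

```python
# Representable magnitudes of a generic E2M1 (4-bit float) format: an
# assumption for illustration, not NVIDIA's published NVFP4 layout.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_fp4(values: list[float]) -> tuple[list[float], float]:
    """Quantize a block of values to the 4-bit grid with a shared scale."""
    # Per-block scale maps the largest magnitude onto the top of the grid.
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / FP4_GRID[-1]
    quantized = []
    for v in values:
        # Snap the scaled magnitude to the nearest representable value.
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        quantized.append(mag * scale * (1.0 if v >= 0 else -1.0))
    return quantized, scale
```

The key idea is that each small block of tensor values shares one scale factor, so 4-bit codes cover a wide dynamic range at a fraction of FP8's memory and bandwidth cost.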
Scaling in the Blackwell architecture begins with the Grace Blackwell Ultra Unit (GB300), which pairs 1 Grace CPU with 2 Blackwell Ultra GPUs, connected via NVLink-C2C at 900 GB/s to ensure high-bandwidth coherency without PCIe bottlenecks. This foundational module is scaled up to create the NVL72, where 72 Blackwell Ultra GPUs and 36 Grace CPUs operate as a single logical computer. The NVL72 utilizes the NVLink Switch Chip to achieve 130 TB/s of rack-scale bandwidth, keeping all 72 GPUs in a coherent domain. This design allows the entire rack to behave like an “AI fabric” purpose-built for trillion-parameter model training and real-time inference systems, offering 1.1 exaFLOPS of FP4 compute and automatic workload distribution as if it were a single, enormous GPU.
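The rack-scale math above can be checked with a small sketch. The class and function names are illustrative, not NVIDIA API types; the figures come from the article:

```python
from dataclasses import dataclass


@dataclass
class GB300Unit:
    """One Grace Blackwell Ultra unit: 1 Grace CPU + 2 Blackwell Ultra
    GPUs, linked by NVLink-C2C at 900 GB/s."""
    grace_cpus: int = 1
    blackwell_gpus: int = 2


def nvl72_totals(units: int = 36) -> tuple[int, int, float]:
    """Aggregate an NVL72 rack from its GB300 building blocks."""
    unit = GB300Unit()
    gpus = units * unit.blackwell_gpus
    cpus = units * unit.grace_cpus
    # 1.1 exaFLOPS (1,100 PFLOPS) of FP4 compute across the whole rack.
    per_gpu_fp4_pflops = 1_100 / gpus
    return gpus, cpus, per_gpu_fp4_pflops
```

Running the totals confirms the topology: 36 units yield 72 GPUs and 36 CPUs, and 1.1 exaFLOPS spread across 72 GPUs works out to roughly 15 PFLOPS of dense FP4 per GPU, consistent with the per-chip figure quoted earlier.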
Blackwell Ultra GPUs push power density to new levels, with up to 1,400W TGP, nearly double the thermal envelope of the H100 and beyond the limits of traditional air-cooled rack designs. Consequently, Direct Liquid Cooling (DLC) becomes the operational baseline for running Blackwell Ultra in high-density environments. Although integrating DLC adds upfront cost (e.g., ~$49,860 in dedicated hardware for an NVL72 rack), the operational economics quickly offset the investment: higher cooling efficiency and reduced reliance on traditional HVAC can cut electricity costs by as much as 40%, converting the thermal challenge into an efficiency gain. This efficiency contributes to significant economic returns; for example, a $5 million GB200 NVL72 system is projected to generate $75 million in token revenue, a 15× return on investment.
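The economics are simple enough to sanity-check. The sketch below uses the article's figures plus one assumed input, the rack's annual power bill, which is hypothetical:

```python
def roi_multiple(capex_usd: float, lifetime_revenue_usd: float) -> float:
    """Revenue generated per dollar of system cost."""
    return lifetime_revenue_usd / capex_usd


def dlc_payback_years(dlc_capex_usd: float, annual_power_bill_usd: float,
                      savings_fraction: float) -> float:
    """Years for cooling-driven electricity savings to repay DLC hardware."""
    annual_savings = annual_power_bill_usd * savings_fraction
    return dlc_capex_usd / annual_savings
```

With the article's numbers, `roi_multiple(5_000_000, 75_000_000)` gives 15.0; and under an assumed $500,000 annual power bill, `dlc_payback_years(49_860, 500_000, 0.40)` comes to about a quarter of a year, suggesting the DLC hardware pays for itself within months.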
Blackwell Ultra includes enterprise-strength features critical for moving GPUs from the lab to regulated, production environments. A key feature is Confidential Compute achieved by bringing TEE-I/O directly onto the GPU, which enables encrypted compute where sensitive data never leaves the trusted boundary. Importantly, this is achieved with near-zero performance loss. Furthermore, the platform accelerates data-heavy pipelines through a dedicated decompression engine that supports formats like LZ4, Snappy, and Deflate, removing the burden from CPUs. System reliability is strengthened by the RAS engine, which uses AI-driven telemetry for predictive fault detection, helping to reduce unplanned downtime in large GPU clusters.
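The GPU-side API for the decompression engine is not covered here; the standard-library sketch below only illustrates the CPU-side Deflate work that such an engine would offload from host cores:

```python
import zlib  # Deflate is one of the formats the engine accelerates


def cpu_deflate_roundtrip(payload: bytes) -> bytes:
    """Compress and decompress a payload entirely on the host CPU.

    On Hopper-class systems both halves of this pipeline consume CPU
    cycles; Blackwell Ultra's dedicated engine is designed to take over
    the decompress (inflate) step, freeing host cores for data loading.
    """
    compressed = zlib.compress(payload, level=6)  # standard Deflate stream
    return zlib.decompress(compressed)
```

In a data-loading pipeline, that decompress call is often the CPU hot spot; moving it onto the GPU keeps compressed data on the wire longer and cuts host-side overhead.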
Initial availability for Blackwell Ultra systems is expected in Q4 2025, with supply prioritized for hyperscale and early enterprise demand and constrained by components like HBM3e and CoWoS-L packaging. The long-term value of Blackwell Ultra is secured through tight integration with NVIDIA’s software stack, which maintains full CUDA compatibility so existing applications transition seamlessly. Performance continues to improve post-deployment through incremental software releases such as TensorRT-LLM updates and compiler improvements, which unlock additional throughput and reduce cost per inference over the hardware’s lifecycle. Looking ahead, Blackwell Ultra serves as the bridge architecture until the arrival of Rubin (R200) in Q2 2026, which is designed around HBM4.
