Explainer: Why storage and memory are the new "AI Database" for AGI

Venis Zhu, Analyst. Published: Jan 2026

NAND flash memory has traditionally been seen as a storage-focused technology, separate from the high-performance needs of AI. That may be changing as AI workloads grow in size and complexity. Inference, the stage where trained AI models process data to generate predictions or responses, demands both fast access to context and large memory capacity. 

At CES 2026, Nvidia officially launched the Rubin platform, addressing the “memory wall” that hinders long-context AI by introducing a dedicated solution for key-value (KV) cache offloading.[1] By enabling AI models to offload KV caches from GPU memory to high-speed SSDs, Rubin not only improves performance and efficiency for long-context inference but also creates a significant new source of demand for storage solutions.

To address the high cost of keeping context resident in GPU memory (HBM), the solution pairs BlueField-4 DPUs with Spectrum-X Ethernet switches to create a high-speed data path between GPUs and storage. This architecture bypasses traditional CPU bottlenecks, delivering 5x higher tokens per second and 5x better power efficiency for long-context inference compared with legacy storage methods.

Third Bridge spoke with industry experts to explore why storage and memory are the new "AI Database" for AGI, and the implications of Nvidia’s latest Rubin platform for the storage market.

Why is memory a bottleneck for AI now?

Third Bridge experts say that Agentic AI (multi-step reasoning) is pushing context windows from 100K toward 100M tokens. The KV cache grows linearly with context length, while the cost of recomputing attention grows quadratically, so retaining this massive context in GPU memory (HBM), or repeatedly recomputing it, quickly becomes prohibitive. That makes storage capacity, and specifically SSDs as an offload tier, the primary bottleneck.
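
As a rough illustration of that linear growth, the sketch below estimates the KV-cache footprint of a single request for a hypothetical dense transformer. The layer count, grouped-query head count, head dimension and FP16 precision are illustrative assumptions, not figures cited by the experts.

```python
# Back-of-envelope KV-cache sizing for a hypothetical dense transformer.
# All model parameters below are illustrative assumptions, not vendor figures.

def kv_cache_bytes(context_tokens: int,
                   num_layers: int = 80,       # assumed layer count
                   num_kv_heads: int = 8,      # assumed grouped-query KV heads
                   head_dim: int = 128,        # assumed per-head dimension
                   bytes_per_value: int = 2):  # FP16/BF16
    """Bytes needed to keep the keys and values for one request in memory."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token

for tokens in (100_000, 1_000_000, 100_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>11,} tokens -> {gib:,.1f} GiB of KV cache")
```

Under these assumptions, a 100M-token context amounts to roughly 30 TiB of KV state for a single request, far beyond the capacity of any GPU's HBM, which is why an SSD offload tier becomes attractive.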

Nvidia explicitly marketed its Inference Context Memory Storage (ICMSP) for "Agentic AI" and "long-context multi-turn inference," confirming our expert's prediction that 2026 would mark the "Agentic" paradigm shift. As CEO Jensen Huang said at CES 2026, “For storage, that is a completely unserved market today. This is a market that never existed, and this market will likely be the largest storage market in the world, basically holding the working memory of the world’s AIs.”

Why are standardized solutions key for efficient KV cache offloading?

Third Bridge experts highlighted that current "HBM → CPU DRAM → SSD" three-tier offloading architectures are inefficient due to CPU involvement and slow PCIe links. They noted that the industry lacked a standardized reference architecture for offloading KV cache from HBM to SSD, and specifically speculated about future "direct SSD to NVLink" solutions.
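
To make the tiering idea concrete, below is a minimal sketch of a three-tier KV-cache store (GPU HBM, host DRAM, SSD) with simple least-recently-used demotion. The class and method names are invented for illustration and do not correspond to any vendor's API; a real system moves data via DMA or GPUDirect Storage rather than Python dictionaries.

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative three-tier KV-cache store: HBM -> host DRAM -> SSD.

    Capacities are counted in abstract 'blocks'; real systems track bytes
    and move data over PCIe/NVLink rather than between Python dicts.
    """

    def __init__(self, hbm_blocks: int, dram_blocks: int):
        self.hbm = OrderedDict()   # hottest tier (GPU memory)
        self.dram = OrderedDict()  # middle tier (host memory)
        self.ssd = {}              # coldest tier, effectively unbounded here
        self.hbm_blocks = hbm_blocks
        self.dram_blocks = dram_blocks

    def put(self, key, block):
        """Insert a KV block into HBM, demoting older blocks as needed."""
        self.hbm[key] = block
        self.hbm.move_to_end(key)
        while len(self.hbm) > self.hbm_blocks:       # HBM full: demote to DRAM
            old_key, old_block = self.hbm.popitem(last=False)
            self.dram[old_key] = old_block
        while len(self.dram) > self.dram_blocks:     # DRAM full: demote to SSD
            old_key, old_block = self.dram.popitem(last=False)
            self.ssd[old_key] = old_block

    def get(self, key):
        """Fetch a block, promoting it back to HBM on a lower-tier hit."""
        for tier in (self.hbm, self.dram, self.ssd):
            if key in tier:
                block = tier.pop(key)
                self.put(key, block)   # promote on access
                return block
        return None                    # cache miss: attention must be recomputed
```

In today's deployments, every demotion and promotion in such a hierarchy bounces through the CPU and PCIe, which is exactly the hop the DPU-based data path described above is designed to remove.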

Nvidia is effectively productizing this "system-level optimization" by providing a standardized API for storage vendors (e.g., Dell, Pure Storage) to plug directly into the GPU cluster. This eliminates the need for custom, "overfitted" engineering, significantly increasing the probability that KV-cache tiering becomes a repeatable standard across hyperscaler deployments.
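
A standardized reference architecture essentially pins down the contract a storage backend must implement. The abstract interface below is a purely hypothetical sketch of what such a vendor-neutral contract could look like; it is not Nvidia's actual API, and the class and method names are assumptions made for illustration.

```python
from abc import ABC, abstractmethod

class KVOffloadBackend(ABC):
    """Hypothetical vendor-neutral contract for a KV-cache offload target.

    A storage vendor's appliance would implement these calls, while the
    inference runtime stays unchanged regardless of which backend is used.
    """

    @abstractmethod
    def write_blocks(self, session_id: str, blocks: dict[str, bytes]) -> None:
        """Persist KV blocks for a session outside GPU memory."""

    @abstractmethod
    def read_blocks(self, session_id: str, block_ids: list[str]) -> dict[str, bytes]:
        """Fetch previously offloaded KV blocks back toward GPU memory."""

    @abstractmethod
    def drop_session(self, session_id: str) -> None:
        """Free all context belonging to a finished or expired session."""
```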

Why could the evolution of inference KV cache unlock massive SSD demand?

Third Bridge experts calculated that retaining the full KV cache for a 100B-parameter model with 10 million daily active users (DAU) would require 250 PB of SSD storage per day. Without cheaper storage tiers, firms are forced to delete cache aggressively (e.g., every 0.5–1 hour), wasting compute on recomputation.
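
The sketch below shows how an estimate of that order of magnitude can be assembled, reusing the per-token sizing from the earlier sketch. The model shape and the tokens retained per user are illustrative assumptions chosen to land near the experts' figure, not their exact inputs.

```python
# Rough reproduction of the order of magnitude behind a "hundreds of PB per day"
# estimate. All inputs are illustrative assumptions.

NUM_LAYERS = 80           # assumed for a ~100B-parameter dense model
NUM_KV_HEADS = 8          # grouped-query attention assumed
HEAD_DIM = 128
BYTES_PER_VALUE = 2       # FP16/BF16

DAU = 10_000_000          # daily active users
TOKENS_PER_USER = 80_000  # assumed context retained per user per day

per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
per_user = per_token * TOKENS_PER_USER
total_pb = per_user * DAU / 1e15

print(f"KV bytes per token: {per_token / 1e3:.0f} KB")
print(f"KV cache per user:  {per_user / 1e9:.1f} GB")
print(f"Fleet-wide per day: {total_pb:.0f} PB")
```

With these inputs the fleet-wide total comes out at roughly 260 PB per day, the same ballpark as the experts' 250 PB figure, which is the scale of demand a persistent KV-cache tier would have to absorb.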

Nvidia's roadmap introduces a mechanism to handle this demand: persistent, shareable inference context stored outside HBM. By unlocking higher utilization and lower cost-per-token, this architecture implies a significantly increased SSD-per-GPU ratio in inference-optimized clusters (e.g., NVL72) to accommodate the massive data footprint our experts forecasted.

Looking ahead

Looking further ahead, Third Bridge experts predicted that High Bandwidth Flash (HBF), which stacks NAND directly on or close to the processor, would be a critical hardware trend for solving capacity constraints.

The announcement by SanDisk and SK Hynix to standardize HBF aligns perfectly with this view. Positioned to offer bandwidth comparable to HBM but with 8x–16x the capacity at a similar cost, HBF is targeted specifically at AI inference workloads, with sampling expected to begin in 2H 2026 and inference devices arriving in early 2027.


All insights in this article are based on information shared by Third Bridge experts. 

For media enquiries, please contact us at comms@thirdbridge.com.

References:

1. KV cache offloading is the process of moving key-value memory used by AI models from limited GPU memory to high-speed storage, enabling efficient long-context inference without overloading GPU resources.

Transcript references:

1. 2025/11/11 - AI Inference – Memory Demand, KV Cache Offloading & Multimodal Model Developments – Part 1 (Conducted in Mandarin) - Former Senior Manager at Shanghai Artificial Intelligence Laboratory

2. 2025/11/13 - AI Inference – Memory Demand, KV Cache Offloading & Multimodal Model Developments – Part 2 (Conducted in Mandarin) - Former Senior Manager at Shanghai Artificial Intelligence Laboratory