A recycled enterprise memory format has just made a trillion-parameter model run on a workstation with one consumer GPU. The speed is modest, but the signal for AI infrastructure builders is not.
The interesting part of the latest local AI experiment is not that Kimi K2.5 ran at roughly four tokens per second. It is that it ran at all on a machine built around secondhand Intel Optane Persistent Memory, a product Intel started winding down in 2022 after it failed to become a mainstream data center staple.
As Tom's Hardware reported on May 23, a LocalLLaMA user known as APFrisco used six 128GB Intel Optane DCPMM modules, 192GB of Samsung DDR4 ECC memory, an Intel Xeon Gold 6246 CPU, and an Asus Dual GeForce RTX 3060 with 12GB of VRAM to run Moonshot AI's Kimi K2.5 locally. The model is a roughly one-trillion-parameter mixture-of-experts system, which means the full model is enormous even though only a slice of its parameters is active for each token.
That distinction matters. Most people still think of local AI in terms of GPU memory. If the model does not fit in VRAM, the answer is usually to buy a bigger GPU, rent cloud hardware, or choose a smaller model. This build takes a different path. It treats memory capacity as the constraint to solve first, then lets the GPU handle the parts where it can actually help.
Optane was designed to sit between DRAM and storage. It is byte-addressable like memory and comes in higher-capacity modules than DRAM, often at a lower price per gigabyte in secondhand listings, but it is slower than conventional RAM. For databases and enterprise storage workloads, that middle position was difficult to sell at scale. For large local models, it suddenly looks more interesting.
In APFrisco's setup, the Optane modules ran in Memory Mode, where the DDR4 sticks act as a cache and the operating system sees the Optane capacity as main memory. With six 128GB modules, that comes to 768GB of addressable memory, enough to host a very large quantized Kimi K2.5 build, while llama.cpp handled hybrid GPU and CPU inference. Some attention weights, dense layers and routing components fit onto the 12GB RTX 3060, while the bulk of the sparse expert weights lived in system memory and were pulled in as needed.
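To make the capacity argument concrete, here is a minimal back-of-the-envelope sketch. The hardware figures come from the build described above. The bits-per-weight values are illustrative assumptions, not the quantization APFrisco actually used, and the helper names are hypothetical.

```python
# Back-of-the-envelope capacity check for the build described above.
# Hardware figures come from the article. The bits-per-weight values are
# ASSUMPTIONS for illustration, not the quantization actually used.

GB = 10**9  # decimal gigabytes, to keep the arithmetic simple

OPTANE_MODULES = 6
OPTANE_MODULE_GB = 128
DRAM_GB = 192   # acts as a cache in Memory Mode, adds no capacity
VRAM_GB = 12

# In Memory Mode the OS sees the Optane capacity as main memory.
ADDRESSABLE_GB = OPTANE_MODULES * OPTANE_MODULE_GB

TOTAL_PARAMS = 1.0e12  # "roughly one trillion" (approximate)


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, ignoring KV cache and runtime overhead."""
    return params * bits_per_weight / 8 / GB


def fits(params: float, bits_per_weight: float, capacity_gb: float) -> bool:
    """True if the weights alone fit in the given capacity."""
    return weights_gb(params, bits_per_weight) <= capacity_gb


if __name__ == "__main__":
    print(f"addressable: {ADDRESSABLE_GB} GB, DRAM cache: {DRAM_GB} GB, VRAM: {VRAM_GB} GB")
    for bits in (2.0, 4.0, 8.0, 16.0):  # hypothetical quantization levels
        size = weights_gb(TOTAL_PARAMS, bits)
        verdict = "fits" if fits(TOTAL_PARAMS, bits, ADDRESSABLE_GB) else "does not fit"
        print(f"{bits:>4} bits/weight: {size:6.0f} GB of weights, {verdict}")
```

Under these assumptions, the weights only fit once they are quantized well below 8 bits each, which is why the quantized build matters as much as the Optane capacity.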
This is not a clean replacement for an H100 cluster. Nobody should pretend four tokens per second is fast. A user waiting for a long answer will feel the delay, and commercial chatbot workloads need far better latency. But that is the wrong comparison. The more useful comparison is against not being able to run the model locally at all.
For founders, independent researchers and infrastructure teams working with sensitive data, the ability to test a frontier-scale open-weight model without paying cloud inference bills can change the economics of experimentation. You can run evaluations overnight. You can process documents in batches. You can test agent workflows where latency matters less than privacy, reproducibility or cost control.
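As a rough illustration of what overnight batch work looks like at this speed, the sketch below turns the reported figure of roughly four tokens per second into a token budget. The eight-hour window and the per-document output length are hypothetical, and the estimate ignores prompt processing time.

```python
# Rough token budget for an unattended batch run.
# The ~4 tokens/s figure is from the article. The run length and the
# per-document output size are ASSUMPTIONS chosen for illustration.

def tokens_generated(tokens_per_second: float, hours: float) -> float:
    """Output tokens produced at a steady rate (ignores prompt processing)."""
    return tokens_per_second * 3600 * hours


def documents_processed(total_tokens: float, tokens_per_document: float) -> int:
    """Whole documents that fit in the budget at a fixed output length."""
    return int(total_tokens // tokens_per_document)


if __name__ == "__main__":
    budget = tokens_generated(tokens_per_second=4, hours=8)  # hypothetical overnight run
    docs = documents_processed(budget, tokens_per_document=500)  # hypothetical summary length
    print(f"{budget:.0f} output tokens, about {docs} summaries of 500 tokens each")
```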
The real story is memory arbitrage
The Optane build is also a reminder that AI infrastructure does not move in a straight line. The market is obsessed with the newest accelerators, but useful capacity often appears when an older enterprise product falls out of favor. Intel's decision to wind down Optane left a pool of unusual hardware with no obvious mass-market future. Local AI builders have now found one.
There are limits. Optane DIMMs require compatible Xeon platforms, the used market is uneven, and performance depends heavily on model architecture, quantization, tensor placement and inference software. A mixture-of-experts model like Kimi K2.5 is especially suited to this kind of tiered memory approach because not all weights are used at once. A dense model of similar size would be much less forgiving.
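The difference between sparse and dense models can be sketched with a simple bandwidth-bound estimate: if reading weights from memory is the bottleneck, decode speed is at most the read bandwidth divided by the bytes touched per token. Every input below is an assumption for illustration. The active-parameter count, quantization level and bandwidth figure are not measurements from this build, and real systems also benefit from caching and from weights held in VRAM.

```python
# Upper bound on decode speed when memory reads dominate:
#     tokens/s <= bandwidth / bytes read per token
# All constants marked ASSUMPTION are illustrative, not measured.

TOTAL_PARAMS = 1.0e12    # article: roughly one trillion parameters
ACTIVE_PARAMS = 32.0e9   # ASSUMPTION: parameters active per token in the MoE
BITS_PER_WEIGHT = 4.0    # ASSUMPTION: hypothetical quantization level
BANDWIDTH_GB_S = 20.0    # ASSUMPTION: hypothetical effective read bandwidth


def bytes_read_per_token(params_touched: float, bits_per_weight: float) -> float:
    """Bytes of weights that must be read to produce one token."""
    return params_touched * bits_per_weight / 8


def max_tokens_per_second(params_touched: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling on decode speed."""
    return bandwidth_gb_s * 1e9 / bytes_read_per_token(params_touched, bits_per_weight)


if __name__ == "__main__":
    moe = max_tokens_per_second(ACTIVE_PARAMS, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
    dense = max_tokens_per_second(TOTAL_PARAMS, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
    print(f"MoE ceiling:   {moe:.2f} tokens/s")
    print(f"dense ceiling: {dense:.2f} tokens/s")
    print(f"ratio:         {moe / dense:.2f}x")
```

The absolute numbers depend entirely on the assumed inputs. The ratio is the point: it equals total parameters divided by active parameters, so a sparse model touches only a small share of the weights a dense model of the same size would read for each token.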
The more important question is whether this points toward a broader memory tiering future. SSD offloading, CPU inference, CXL memory expansion and hybrid GPU placement all attack the same problem: model weights are growing faster than affordable VRAM. If AI agents become more common in business workflows, many companies will care less about maximum tokens per second and more about whether they can run large models privately at predictable cost.
This is where the build becomes more than a curiosity. It shows that the hardware stack for AI may become more diverse, not less. Cloud GPUs will still dominate high-throughput production workloads, but local inference may increasingly be assembled from whatever gives builders the best balance of memory, latency, power and price.
The practical takeaway is simple. Cheap Optane DIMMs will not turn a workstation into a modern AI data center, but they do expose an opening in the market. If discontinued enterprise memory can make a trillion-parameter model usable at the edge, then the next wave of memory products, especially CXL-based systems, will have a much clearer sales pitch. Watch the memory layer. That is where some of the next AI infrastructure bargains may appear.