An Optane home server makes trillion-parameter AI feel almost practical

An r/LocalLLaMA builder has shown that old enterprise memory can run a trillion-parameter model at usable, if limited, speed. The bigger story is not raw performance but whether founders should start looking at forgotten data center hardware as part of their AI cost strategy.

A hobbyist build using Intel Optane Persistent Memory has managed to run Kimi K2.5, a 1-trillion-parameter model, locally at more than 4 tokens per second. That is not fast by cloud GPU standards, but it is fast enough to make people pay attention, because the system is built around secondhand server parts rather than a rack of premium Nvidia cards.

According to the r/LocalLLaMA post published on May 11, 2026, the machine uses an Intel Xeon Gold 6246, a TYAN S5630GMRE-CGN motherboard, an Asus RTX 3060 with 12GB of VRAM, 192GB of DDR4 ECC memory, and six 128GB Intel Optane DCPMM modules for 768GB of persistent memory. The builder said the Optane modules were used in Memory Mode, where the system sees the persistent memory as RAM and uses the DRAM as a cache.
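The capacity arithmetic behind Memory Mode can be sketched in a few lines. The module counts come from the post; the function name is hypothetical and exists only for illustration. The point it makes is that in Memory Mode the DRAM acts as a cache and does not add to the memory the operating system sees.

```python
# Memory Mode capacity sketch using the module counts quoted in the post.
OPTANE_MODULES = 6
OPTANE_MODULE_GB = 128
DRAM_GB = 192


def memory_mode_view(pmem_modules, pmem_module_gb, dram_gb):
    """Return (addressable_gb, dram_cache_ratio) for Memory Mode.

    In Memory Mode the DRAM is a cache in front of the persistent
    memory, so it is not added to the addressable total.
    """
    addressable_gb = pmem_modules * pmem_module_gb
    return addressable_gb, dram_gb / addressable_gb


addressable, ratio = memory_mode_view(OPTANE_MODULES, OPTANE_MODULE_GB, DRAM_GB)
print(f"OS-visible memory: {addressable} GB, DRAM cache covers {ratio:.0%} of it")
```

With these figures the DRAM cache covers a quarter of the Optane capacity, so how often the active weights hit that cache matters a great deal for real throughput.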

That detail matters. Most local AI builds are limited by how much of the model can sit in VRAM or system RAM before performance collapses into slow disk offload. Optane sits in an odd middle ground. It is slower than DRAM, but much faster and more memory-like than ordinary storage. For a giant mixture-of-experts model, that can be enough to keep the system moving.

The model in question was Kimi K2.5, running through llama.cpp with an Unsloth Q2_K_XL quantization. The builder used hybrid GPU and CPU inference, placing attention weights, dense layers, shared experts, and routing components on the 12GB GPU where possible, while the bulk of the sparse expert weights lived in Optane-backed memory. The claimed generation result was around 4 tokens per second, with a later benchmark comment showing prompt processing at 16.44 tokens per second on pp512 and token generation at 5.35 tokens per second on tg128.
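The placement rule described here can be illustrated with a small sketch. The tensor names below are hypothetical examples written in the style of GGUF tensor names, and the regular expression is an assumption, not the builder's actual configuration. llama.cpp offers a tensor-override option for this kind of split, but the post's exact flags are not reproduced here.

```python
import re

# Hypothetical GGUF-style tensor names, for illustration only.
TENSORS = [
    "blk.3.attn_q.weight",        # attention
    "blk.3.ffn_gate_inp.weight",  # router
    "blk.3.ffn_up_shexp.weight",  # shared expert
    "blk.3.ffn_up_exps.weight",   # routed (sparse) experts
    "blk.3.ffn_down_exps.weight",
]

# Assumed pattern: only the routed expert tensors are kept off the GPU.
ROUTED_EXPERTS = re.compile(r"\.ffn_(gate|up|down)_exps\.")


def place(name):
    """Send routed experts to CPU-side memory, everything else to the GPU."""
    return "CPU" if ROUTED_EXPERTS.search(name) else "GPU"


placement = {name: place(name) for name in TENSORS}
for name, device in placement.items():
    print(f"{device:3s}  {name}")
```

The design choice is that the small, always-used tensors get the scarce fast memory, while the large, sparsely used expert tensors live in the big slow pool.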

Those numbers should be read carefully. This is not a peer-reviewed benchmark, and it has not yet been reproduced across a set of comparable builds. It is a single detailed post, with enough configuration information to be interesting but not enough to treat as a new standard. Prompt processing also becomes painful at long context. A commenter quickly noted that a 10,000-token prompt could take roughly 10 minutes before generation even begins.
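The commenter's estimate can be checked against the pp512 and tg128 figures quoted above. This is a rough sketch that assumes throughput stays constant as the prompt grows, which is likely optimistic for long contexts; the function names are hypothetical.

```python
PP_TOKENS_PER_S = 16.44  # prompt processing rate from the pp512 result
TG_TOKENS_PER_S = 5.35   # generation rate from the tg128 result


def prefill_seconds(prompt_tokens, pp_rate=PP_TOKENS_PER_S):
    """Time spent processing the prompt before the first output token."""
    return prompt_tokens / pp_rate


def total_seconds(prompt_tokens, output_tokens,
                  pp_rate=PP_TOKENS_PER_S, tg_rate=TG_TOKENS_PER_S):
    """Prefill time plus generation time, assuming constant rates."""
    return prompt_tokens / pp_rate + output_tokens / tg_rate


wait = prefill_seconds(10_000)
print(f"10,000-token prompt: about {wait / 60:.1f} minutes before the first token")
print(f"plus 500 output tokens: about {total_seconds(10_000, 500) / 60:.1f} minutes in total")
```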

Still, the cost comparison is difficult to ignore. The builder said the parts were bought from late 2025 to early 2026 for about $1,900. Other commenters estimated a similar build closer to $2,000 to $2,500 depending on used market pricing. A current H100 rental often runs roughly $2 to $4.50 per GPU hour, while B200 rental can sit around $5 to $6 per hour on mainstream cloud providers, with some marketplaces advertising lower prices. At the H100 rates, $1,900 buys roughly 420 to 950 hours on a single GPU and far fewer on a multi-GPU node, so a founder whose experiments need several premium GPUs at once can burn through the price of this machine in a few weekends.
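A back-of-the-envelope break-even calculation makes the comparison concrete. The hardware price and hourly rates come from the figures above; the 8-GPU node is a hypothetical scenario, and the sketch ignores electricity, resale value, and the large speed gap between the two options.

```python
HARDWARE_USD = 1_900  # build cost reported by the builder


def breakeven_hours(hardware_usd, rate_usd_per_gpu_hour, gpus=1):
    """Rental hours after which cumulative cloud spend equals the hardware cost."""
    return hardware_usd / (rate_usd_per_gpu_hour * gpus)


scenarios = [
    ("1x H100 at $2.00/h", 2.00, 1),
    ("1x H100 at $4.50/h", 4.50, 1),
    ("8x H100 at $4.50/h (hypothetical node)", 4.50, 8),
]
for label, rate, gpus in scenarios:
    hours = breakeven_hours(HARDWARE_USD, rate, gpus)
    print(f"{label}: about {hours:.0f} rental hours")
```

The single-GPU cases leave hundreds of hours of headroom; it is the multi-GPU case where a few weekends of rental approaches the price of the machine.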

That does not mean the Optane build beats the cloud. It does not. A B200 can push large models far faster, offers 180GB to 192GB of high-bandwidth memory, and saves a team from hardware assembly, power draw, BIOS constraints, and troubleshooting. But the home server changes the question. For offline testing, model exploration, private data experiments, and low-volume internal tools, owning cheap memory capacity can be more useful than renting speed by the hour.

Optane is both the opportunity and the limit

The awkward part is that Intel discontinued Optane. Intel said alongside its second-quarter 2022 earnings that it would cease future Optane development, and its support pages now list Optane Persistent Memory 100 and 200 series products as discontinued or past key lifecycle dates. That makes today's secondhand pricing look less like a stable market and more like a temporary arbitrage window.

Founders should treat it that way. If enough local AI builders decide Optane is useful, cheap 128GB and 512GB modules will not stay cheap forever. Compatibility is also narrow. First-generation Optane DCPMM requires specific Xeon Scalable platforms, and the wrong CPU or motherboard can turn a bargain into a dead end. This is not like buying a consumer GPU and dropping it into any modern workstation.

The practical lesson is broader than Optane itself. AI infrastructure costs are increasingly shaped by memory placement, not just compute. A 1-trillion-parameter mixture-of-experts model does not activate every parameter for every token, so clever offload strategies can make seemingly impossible systems work. The same logic is showing up in SSD offload, CPU inference, CXL memory expansion, unified memory machines, and software that decides which tensors deserve the fastest hardware.
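A toy top-k router shows why sparse activation helps. The expert count and the value of k below are hypothetical and are not Kimi K2.5's actual configuration; the sketch only illustrates that each token touches a small fraction of the expert weights.

```python
import random

NUM_EXPERTS = 64  # hypothetical expert count
TOP_K = 4         # hypothetical number of experts used per token


def route(scores, k):
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


rng = random.Random(0)  # seeded stand-in for a learned router's scores
scores = [rng.random() for _ in range(NUM_EXPERTS)]
active = route(scores, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS

print(f"experts used for this token: {sorted(active)}")
print(f"share of routed expert weights read: {active_fraction:.1%}")
```

Because only the selected experts are read for a given token, the slow memory tier has to deliver a fraction of the model per step rather than all of it.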

For many startups, the better answer will still be smaller optimized models. A 30B, 70B, or 120B model running at comfortable speed on a workstation can beat a trillion parameter model that crawls through prompts. Customers do not care how large the model is if the product feels slow. The Optane build is impressive because it expands the menu of options, not because it replaces sensible deployment economics.

The next thing to watch is whether others reproduce the benchmark with different CPUs, App Direct mode, ktransformers, ik_llama, or newer sparse inference tricks. If they do, old server memory may become a real tool for AI prototyping. If they do not, this will still be a useful reminder: in the AI hardware race, yesterday's failed enterprise product can become tomorrow's founder experiment when the price falls far enough.
