SenseTime's SenseNova U1 ditches VAEs entirely to unify image generation and understanding

SenseNova U1's NEO-Unify architecture eliminates both the visual encoder and the VAE, reaching 31.56 dB PSNR on reconstruction while working natively on pixels, with a 2B preview open-sourced on Hugging Face.

SenseTime and Nanyang Technological University's S-Lab released SenseNova U1 on April 28, deploying the NEO-Unify architecture that strips out the visual encoder and variational autoencoder underpinning almost every major image generation system in use today. That is not a minor tweak. Mainstream latent diffusion pipelines, from Stable Diffusion to Flux, depend on a VAE to compress images into a latent space before generation. NEO-Unify works directly on pixels, treating visual and language data as deeply correlated from first principles rather than stitching them together via adapters. The Hugging Face model card confirms an 8B base and an open-source 2B preview, with the full release on Hugging Face posted April 26.

The 2B preview hits 31.56 dB PSNR on image reconstruction, about 1.1 dB short of the 32.65 dB scored by Flux's VAE, but without needing a separate encoder at all. On image editing, NEO-Unify scores 3.32 on the ImgEdit benchmark. Reddit's r/StableDiffusion community flagged the release immediately, noting the scaling improvements over previous architectures and reporting that the developers had directly confirmed their open-source plans.
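For readers who want to put the PSNR figures in context, the metric is 10 * log10(MAX^2 / MSE), so it can be converted back into an average per-pixel error. The sketch below is a minimal illustration, assuming 8-bit pixel values (MAX = 255); the article does not state which test images or value range the published scores use, and the function names `psnr` and `rmse_for_psnr` are chosen here purely for illustration.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    if not original:
        raise ValueError("inputs must not be empty")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_for_psnr(psnr_db, max_value=255.0):
    """Root-mean-square pixel error implied by a PSNR value (assumes 8-bit range by default)."""
    return max_value / (10.0 ** (psnr_db / 20.0))


if __name__ == "__main__":
    # Scores quoted in the article; the 8-bit range is an assumption.
    for label, score in [("SenseNova U1 2B preview", 31.56), ("Flux VAE", 32.65)]:
        print(f"{label}: {score} dB, RMS error of about {rmse_for_psnr(score):.1f} on a 0-255 scale")
```

Under that 8-bit assumption, the two scores correspond to RMS errors of roughly 6.7 and 5.9 intensity levels respectively, which gives a sense of how small the reported gap is in pixel terms.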

SenseTime has been building toward this. The company open-sourced the original NEO architecture in December 2025 following benchmark wins on SuperCLUE and OpenCompass, then shipped SenseNova-MARS in January 2026, an open-source multimodal autonomous reasoning model in 8B and 32B variants. Chairman Xu Li stated in March 2026 that the company intends to price future AI products on problem-solving outcomes rather than token consumption, a pricing philosophy that could undercut Western API vendors on pure cost per result.

The VAE was always a practical workaround, not a fundamental truth. It encodes images into compact latent representations to make diffusion computationally tractable, but that compression step discards fine detail and introduces artifacts. Developers building image pipelines have spent years tuning around VAE limitations, swapping checkpoints, adjusting decode settings, and working with imperfect reconstructions. NEO-Unify removes the problem at the source.
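The scale of that compression is easy to work out. The sketch below is illustrative arithmetic only: it assumes the 8x spatial downsampling and 4 latent channels used by the Stable Diffusion 1.x VAE, plus a 16-channel variant of the kind used by newer models, and the helper names `latent_shape` and `compression_ratio` are hypothetical. It says nothing about how NEO-Unify itself handles pixels.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent a VAE encoder would produce."""
    if height % downsample or width % downsample:
        raise ValueError("image dimensions must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=8, latent_channels=4, image_channels=3):
    """How many input pixel values are squeezed into each latent value."""
    channels, lat_h, lat_w = latent_shape(height, width, downsample, latent_channels)
    pixel_values = height * width * image_channels
    latent_values = channels * lat_h * lat_w
    return pixel_values / latent_values


if __name__ == "__main__":
    # 512x512 RGB image, 4-channel latent (assumed Stable Diffusion 1.x style settings).
    print(latent_shape(512, 512), compression_ratio(512, 512))
    # Same image with a 16-channel latent (assumed settings for newer VAEs).
    print(latent_shape(512, 512, latent_channels=16),
          compression_ratio(512, 512, latent_channels=16))
```

With those assumed settings, a 512x512 RGB image shrinks from 786,432 values to 16,384, a 48:1 reduction, or 12:1 with 16 latent channels. The decoder has to reconstruct everything else, which is where the lost detail and artifacts come from.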

The dual-stage fusion training strategy, Pre-Buffer and Post-LLM integration, lets NEO absorb language reasoning capabilities intact while building visual perception from scratch. That means the model does not sacrifice text performance to gain visual ability, a trade-off that has plagued multimodal training at scale. Understanding and generation now flow through the same pathway, which opens practical applications in image editing, conditional generation, and visual reasoning that are difficult to coordinate when separate models handle each task.

China's AI Iteration Pace

SenseNova U1 is not arriving in a vacuum. DeepSeek, Qwen, InternVL, and now SenseNova are all shipping competitive open-weight multimodal models at a pace that has compressed the credibility gap with Western labs considerably. The key pattern is rapid architectural experimentation combined with open or semi-open release, building developer ecosystems that Western closed-model vendors cannot easily replicate.

For enterprise buyers in markets where domestic infrastructure, local language support, and regulatory alignment matter, a capable open-weight model from a listed Chinese AI company represents a meaningful alternative to Midjourney, Adobe Firefly, or OpenAI's image APIs. SenseTime is already deeply integrated with domestic chip providers, and Longbridge reporting confirms the company expects its proportion of domestic compute to rise through 2026, insulating it from the export controls affecting Nvidia supply chains.

Implications for Builders

Open-weight multimodal models with competitive benchmarks and no VAE dependency simplify deployment stacks. Developers can strip out one component, cut inference overhead, and shrink the surface area for visual artifacts in production. For startups building creative tooling, that translates directly into lower cost and fewer edge cases to engineer around.

The broader fragmentation of the global AI model stack is accelerating. Enterprise buyers now choose between a US-centric closed model tier, a growing open-weight Western layer anchored by Meta's Llama series, and an increasingly capable Chinese open-weight ecosystem. SenseNova U1 strengthens the third option. Watch the full open-source release, adoption in image-generation communities, and whether SenseTime's outcome-based pricing finds traction in enterprise deals outside China.
