Jun 14, 2026 · 2:08 AM

Meituan puts avatar video startups under new pressure

Meituan has released LongCat-Video-Avatar 1.5 as an open avatar video model with local deployment potential. The release increases pressure on synthetic media startups whose value depends mainly on avatar generation rather than workflow, trust, and enterprise controls.

Elroy Fernandes

LongCat-Video-Avatar 1.5 makes locally run avatar video feel less like a research demo and more like a product threat. That is the part startups should pay attention to.

Meituan has released LongCat-Video-Avatar 1.5, and the timing matters because synthetic video is moving from expensive, hosted platforms into developer workflows that can be run, tested, and modified locally. For entrepreneurs building avatar tools, virtual presenters, AI educators, or synthetic influencer systems, that shift changes the economics of the category.

The model comes from the Meituan LongCat team and is now listed on Hugging Face with code and weights available under an MIT license. The repository shows a 74.9 GB model package with assets, base models, LoRA files, a scheduler, a vocal separator, and Whisper-large-v3 components uploaded this week. That is not a small download, and it is not a one-click consumer product. But it is a very different proposition from paying per minute through a closed avatar platform.

According to the Hugging Face model card and related project materials, LongCat-Video-Avatar 1.5 supports audio-text-to-video, audio-image-text-to-video, and audio-driven video continuation. The upgrade replaces the earlier Wav2Vec2 audio encoder with Whisper-large-v3, a move the team says improves lip synchronization, multilingual speech handling, facial dynamics, and long-form identity consistency. It also uses DMD2-based few-step distillation to reduce denoising to 8 steps, which is where the startup story begins.

Avatar video has always had an awkward cost structure. The user sees a talking face, but behind that simple output sits image generation, speech alignment, temporal consistency, video rendering, identity preservation, safety filters, and often a cloud GPU bill. That is why commercial tools such as HeyGen, Synthesia, D-ID, Kling Avatar, and newer avatar products have been able to sell convenience at a premium.

LongCat-Video-Avatar 1.5 does not erase that market overnight. It still requires technical setup, serious hardware, and workflow knowledge. Yet open releases like this push the baseline capability into the hands of developers who can build around ComfyUI-compatible pipelines, automate generation, and avoid API latency or per-use platform fees. The first-order effect is experimentation. The second-order effect is price pressure.

AIbase reported that the Meituan team claims roughly 15 times faster inference than the previous generation setup, with a 10-second video taking about one minute in its testing. KuCoin’s flash report, citing BlockBeats, noted an evaluation set of 508 image-audio pairs, 770 evaluators, and 13,240 judgments, with the team comparing results against HeyGen, Kling Avatar 2.0, and OmniHuman-1.5. Those numbers should be treated as vendor-side benchmarks, but they still show where the competition is headed: faster, cheaper, and closer to local production.
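To make the reported speed concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the vendor-side figure quoted above (a 10-second clip in about one minute). The GPU hourly rate is a placeholder assumption, not a number from Meituan or any provider, and the function and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalRunEstimate:
    compute_seconds_per_video_second: float
    video_minutes_per_gpu_hour: float
    gpu_cost_per_video_minute: float


def estimate_local_run(
    clip_seconds: float,
    wall_clock_seconds: float,
    gpu_hourly_rate: float,
) -> LocalRunEstimate:
    """Turn one timing measurement into throughput and a rough cost figure."""
    if clip_seconds <= 0 or wall_clock_seconds <= 0:
        raise ValueError("durations must be positive")
    # How many seconds of compute each second of finished video needs.
    ratio = wall_clock_seconds / clip_seconds
    # Minutes of finished video one GPU can produce in an hour at that ratio.
    video_minutes_per_gpu_hour = 60.0 / ratio
    # GPU rental cost attributed to each minute of finished video.
    cost_per_minute = gpu_hourly_rate / video_minutes_per_gpu_hour
    return LocalRunEstimate(ratio, video_minutes_per_gpu_hour, cost_per_minute)


# Reported, vendor-side figure: a 10-second clip in about 60 seconds.
# ASSUMPTION: 2.00 per GPU-hour is a placeholder rate chosen for illustration.
estimate = estimate_local_run(
    clip_seconds=10.0,
    wall_clock_seconds=60.0,
    gpu_hourly_rate=2.00,
)

print(f"Compute per second of video: {estimate.compute_seconds_per_video_second:.1f} s")
print(f"Video per GPU-hour: {estimate.video_minutes_per_gpu_hour:.1f} min")
print(f"GPU cost per video minute: {estimate.gpu_cost_per_video_minute:.2f}")
```

At the reported speed, one GPU would produce roughly ten minutes of finished video per hour. The estimate leaves out setup time, model loading, failed renders, storage, and engineering effort, so it is a floor on cost and not a price comparison with any hosted platform.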

This is not just about lip sync. Better mouth movement is useful, but buyers rarely purchase a model because a benchmark improved. They buy a workflow that saves time, produces content reliably, protects rights, handles approvals, and fits into the business process. When the foundation capability becomes more accessible, the startup moat has to move somewhere else.

Where Startups Still Have Room

The obvious mistake would be to assume open source kills the whole market. It usually does something more specific. It kills weak margins around undifferentiated capability. If a startup’s only claim is that it can animate a head from audio, that claim is getting less defensible. If it owns distribution, brand trust, vertical workflows, compliance, editing tools, or enterprise governance, the release may actually become useful infrastructure.

Think about training platforms that need avatars for internal courses, sales teams that generate localized product videos, education startups that turn lessons into presenter-led clips, or creator tools that need fast iteration before publishing. In each case, the model is only one part of the stack. Businesses still need voice rights, consent records, prompt controls, watermarking, audit logs, team permissions, asset management, and review workflows. Those are not glamorous features, but they are where budgets often get approved.

There is also an important caveat in Meituan’s own project page. The team says part of the showcased images and audio were derived from real videos for research demonstration, and that the generated content is for academic use only, with commercial use not permitted for those displayed materials. That does not cancel the broader open-source significance, but it reminds founders that legal and operational readiness is not the same thing as model availability.

ComfyUI matters here because it is already a practical development environment for builders working with image and video generation. A model that fits into that ecosystem can spread through workflows, forks, wrappers, optimizations, and community examples faster than a polished but closed product can respond. That kind of distribution is quiet at first, then suddenly visible when small teams start shipping features that looked expensive a quarter earlier.

The next test is not whether LongCat-Video-Avatar 1.5 beats every commercial system in every case. The test is whether it becomes good enough for enough use cases that startups have to justify their pricing with more than model quality. That is where the market usually gets healthier and harsher at the same time.

For entrepreneurs, the takeaway is simple. The avatar layer is becoming less scarce. The businesses that survive will be the ones that turn synthetic video into a trusted production system, not just a clever generation demo. Watch how quickly developers package LongCat into usable workflows, and watch how fast commercial avatar companies answer with lower friction, clearer rights management, and better enterprise controls.


Elroy is a digital marketer and developer from Goa, with over a decade of experience in web development and marketing. He has been associated with several startups and currently serves as an editor at the Asia Pacific Industrial magazine. He occasionally writes on Startup Fortune about technology and automation.