DeepSeek's first multimodal update is a quiet but significant milestone: the Hangzhou startup that reset AI cost expectations with text is now pointing its whale at images and video, and the rest of the market is already adjusting.
If you needed a sign that DeepSeek's ambitions are not slowing down, it came this week in the form of a new tab. On the company's chat interface, an image recognition mode appeared alongside the existing expert and flash options, initially offered to a select group of users for beta testing. That may sound modest. It is not. This is the first time multimodal capability has appeared in DeepSeek's flagship chatbot, and it lands just four days after the company released V4, which was already moving benchmarks across the industry. The announcement was made by multimodal team leader Chen Xiaokang on social media, and a senior researcher named Chen Deli followed shortly after on X with a line that captured the moment better than any marketing copy could: "the little whale can now see."
The whale reference matters because it signals that the team is confident this is not an incremental patch. DeepSeek's logo is a whale, and the framing of vision as a capability the whale has now unlocked is both playful and pointed. The model can process images and video in addition to text, bringing it in line with rivals at OpenAI, Anthropic, and Google that already offer the function. The difference is who DeepSeek is: a company that has consistently delivered capabilities at a fraction of the cost, with benchmark performance strong enough to make its Western counterparts explain themselves publicly.
Multimodal is not a new capability in the AI market. It has been available from GPT-4 and Gemini for some time. What changes when DeepSeek adds it is the pricing context and the substitution calculus. Since V4 launched on April 24, it has attracted serious attention because its benchmarks are either comparable or superior to the top closed-source models from competitors, and its pricing is aggressive. Adding vision to that package means users who were already migrating from OpenAI or Anthropic for text tasks now lose one more reason to keep the premium subscription. If the image mode works well, it removes a final functional gap.
DeepSeek V4 is built on a native multimodal architecture, meaning image and video understanding were integrated during pre-training rather than bolted on afterward. That distinction matters because a model trained to understand images from the ground up should handle cross-modal reasoning more naturally than one that treats vision as a plugin. When a user asks the model to look at a screenshot, find the bug, and generate a corrected version, the quality of the answer depends on how tightly the visual and linguistic layers are connected. Native architecture is the better foundation for that kind of task.
The limited rollout to select users for beta testing is also worth reading carefully. DeepSeek is not announcing a full launch with a press conference. It is releasing the feature gradually, and the Chinese term 灰度 ("grayscale," shorthand for a staged release) appears repeatedly in Chinese tech media covering the rollout. That kind of phased approach suggests the team wants to watch the behavior at scale before committing to general availability. Given that the model has faced enormous demand since V4 launched, a staged release is operationally sensible. It is also the kind of measured product management that differs from the hype-heavy rollout style more common at US companies.
The Chinese Market Is Already Feeling It
The pressure DeepSeek is putting on the Chinese AI landscape is already visible. Competing labs like Zhipu, whose GLM models have built a strong user base inside China, have started offering discounts to retain users who are tempted by DeepSeek's pricing and performance combination. That is a significant behavioral signal. When a competitor starts cutting prices defensively, it means they believe the threat is real and that users are actually switching. DeepSeek is not just pressuring Western giants. It is reshaping the incentive structure for every Chinese AI lab trying to charge for access in the same market.
That competitive pressure will intensify with multimodal. Many of the use cases that drive premium subscription behavior involve processing images and documents rather than pure text. Enterprise users analyzing financial reports, engineers reviewing schematics, and researchers working with datasets that include tables and graphs all benefit from vision capability. As long as DeepSeek was text-only, it was missing some of those segments. The image recognition mode changes that. And because the pricing remains far below what most Western models charge for equivalent capability, the enterprise cost argument gets sharper every time DeepSeek adds a function that competes with the premium tier elsewhere.
What Comes After The Beta
The most relevant question now is how quickly the image mode reaches general availability and what the model quality looks like under real workload conditions. Beta access for select users is a controlled signal. The more meaningful test will come when millions of people start uploading images to DeepSeek the way they currently send text prompts. That will stress the infrastructure, surface edge cases in the vision model, and tell the team which use cases are most popular in practice. If the rollout is smooth and the quality holds, DeepSeek will have closed the last obvious capability gap between its product and the full-featured Western alternatives.
There is also the question of what comes after vision. DeepSeek V4 was described before launch as having native video generation capabilities in addition to image understanding. The current beta appears to cover image recognition. If video follows, the company is not just matching the current frontier. It is building toward the kind of unified creative and reasoning workflow that would make it a default tool for a much broader range of tasks. That would represent a more serious challenge to existing platforms than any single benchmark comparison has managed to convey.
For now, the whale can see. That alone changes the calculus for every AI platform that thought its multimodal advantage was a durable moat. DeepSeek has a pattern of treating features that competitors charge heavily for as table stakes, then releasing them at prices close to cost that force the entire market to adjust. Vision is next in that pattern. The downstream effects on pricing, user migration, and competitor strategy will play out over the next few months, but the direction is already clear.