Andreessen Horowitz bets $9 million that AI reliability is a category, not a feature

Probably is making a blunt bet: enterprises won't put AI into serious work until someone can prove the answer, not just generate it.

The hallucination problem in enterprise AI isn't going away through better prompts. Probably, a startup founded by Peter Elias, has raised a $9 million seed round led by Andreessen Horowitz to build around a harder idea: if you want AI inside finance, healthcare, legal work, or data science, the model's answer has to come with proof.

That is the useful part of this story. Not another AI tool. Not another demo that looks impressive until someone asks where the number came from. Probably is trying to make AI behave more like software enterprises already understand, where a spreadsheet formula, a SQL query, or a validation rule can be checked after the fact. Language models don't naturally work that way. They produce likely answers. Sometimes those answers are right. Sometimes they're fluent nonsense.

For a consumer chatbot, that can be annoying. For a compliance officer, a CFO, or a hospital administrator, it's a real blocker.

According to TechCrunch's report on the round, Probably's first product is a data science tool that lets users query complex datasets and receive answers with citations and an audit trail showing how the result was derived. The important detail is the second step: the system uses a deterministic validator to check the LLM's first answer before it reaches the user. The model doesn't get the final word. A rules-based layer has to clear it.
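To make that two-step design concrete, here is a minimal sketch of the pattern, not Probably's code. Everything in it is a hypothetical stand-in: the `TABLES` data, the `ModelAnswer` and `Verdict` types, and the `validate` function are invented for illustration, and the only check shown is recomputing a filtered sum. The idea is that the model proposes a value plus the source it claims, and a rules-based layer recomputes the value from that source before anything is released.

```python
from dataclasses import dataclass, field
from math import isclose

# Hypothetical source data standing in for an enterprise dataset.
TABLES = {
    "invoices": [
        {"region": "EMEA", "amount": 1200.0},
        {"region": "EMEA", "amount": 800.0},
        {"region": "APAC", "amount": 500.0},
    ]
}


@dataclass
class ModelAnswer:
    """What a language model might return: a value plus the citation it claims."""
    value: float
    table: str
    column: str
    filter_column: str
    filter_value: str


@dataclass
class Verdict:
    """The validator's decision and the steps that led to it."""
    approved: bool
    audit_trail: list = field(default_factory=list)


def validate(answer: ModelAnswer, tables=TABLES) -> Verdict:
    """Recompute the claimed sum from the cited source and compare.

    Deterministic: the same answer and the same tables always give the same verdict.
    """
    trail = []
    rows = tables.get(answer.table)
    if rows is None:
        trail.append(f"rejected: unknown table {answer.table!r}")
        return Verdict(False, trail)
    trail.append(f"source: table {answer.table!r}, {len(rows)} rows")

    matched = [r for r in rows if r.get(answer.filter_column) == answer.filter_value]
    trail.append(
        f"filter: {answer.filter_column} == {answer.filter_value!r}, {len(matched)} rows"
    )
    # Refuse to approve a claim that rests on no rows or on a missing column.
    if not matched or any(answer.column not in r for r in matched):
        trail.append("rejected: cited rows or column not found")
        return Verdict(False, trail)

    recomputed = sum(r[answer.column] for r in matched)
    trail.append(f"recomputed: sum({answer.column}) = {recomputed}")

    ok = isclose(recomputed, answer.value, rel_tol=1e-9)
    trail.append("approved" if ok else f"rejected: model claimed {answer.value}")
    return Verdict(ok, trail)


# Usage: a plausible but wrong model answer is blocked before it reaches the user.
claim = ModelAnswer(2100.0, "invoices", "amount", "region", "EMEA")
verdict = validate(claim)
for step in verdict.audit_trail:
    print(step)
```

The audit trail is a by-product of the check itself, so the evidence a reviewer would ask for already exists by the time the answer is approved or rejected.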

That is a sharper business than it may sound at first. Most enterprise AI pilots already prove that people like fast answers. The harder test is whether anyone inside a large company is willing to sign their name to those answers when money, regulation, or patient safety is involved. If you can't show how the system got there, you don't have enterprise software. You have a guessing machine with a nice interface.

The 99.99% accuracy target Elias cites is ambitious, since it allows one wrong answer in every 10,000, but the direction is right. Enterprises aren't comparing AI with a clever intern. They're comparing it with deterministic systems that already support payroll, reporting, payments, inventory, and audit work. A system that's mostly right may still be useless in a workflow where the wrong answer creates liability.

The market is moving from demos to proof

Andreessen Horowitz's check matters because it puts venture money behind reliability as a standalone category, not a small feature buried inside every AI app. Don't dismiss that as investor theater. Security became its own market because companies couldn't simply trust every cloud product to protect itself. Observability became its own market because distributed software broke in ways people couldn't see from the outside. AI reliability has the same shape, with a stranger problem underneath it: the system can be wrong while sounding perfectly confident.

You can already see why that creates demand. A May 2026 arXiv paper by Zhenyue Zhao, Yihe Wang, Toby Stuart, Mathijs De Vaan, Paul Ginsparg and Yian Yin audited 111 million references across 2.5 million papers and estimated that 146,932 hallucinated citations appeared in 2025 alone. That isn't a tiny formatting problem. It is false information entering the record at scale, often in a place where the whole point is to be verifiable.

KPMG has just handed the market an even cleaner example. As the Financial Times reported this month, the firm withdrew an agentic AI report after GPTZero and the FT found false or misleading case studies involving organizations including UBS, the NHS, Swiss Federal Railways and Transport for London. KPMG is exactly the kind of institution that sells trust for a living. When its own AI report becomes a case study in AI hallucination, the sales pitch for verification gets easier.

That is the part every founder building enterprise AI should pay attention to. The market is past being impressed by a fluent answer. Buyers now want citations, audit logs, validation steps, error rates, and someone accountable when the system fails. If your product can't provide those things, you're asking a serious customer to take a leap it doesn't need to take.

Probably's approach is strongest where correctness can be defined clearly. Data science queries are a sensible starting point because many answers can be traced back to source tables, calculations, and documented assumptions. The harder cases will come when the output is more judgment than computation. A validator can check whether a number came from the right dataset. It can't always decide whether a business conclusion is sound.

That doesn't weaken the company's thesis. It makes the scope more honest. AI reliability won't be one magic wrapper that makes every model safe for every job. It will be a stack of checks, citations, policies, evaluations, and audit trails, each useful only where it maps to a real workflow. Probably is starting in the part of the market where that mapping is clearest.

The open question is whether it can turn that first product into a broader platform before the model providers, cloud companies, or data warehouse vendors build enough reliability features themselves. That is the normal danger for infrastructure startups. The best wedge becomes a feature if the platform catches up.

Still, the timing is good. Enterprises don't need another promise that the next model will hallucinate less. They need a way to know when today's model is wrong. Probably is selling that piece directly, and for now that is the more serious bet.
