Ontario's AI scribe audit puts clinical procurement on notice

Ontario's AI scribe problem is not just a medical transcription story. It is a warning about what happens when public buyers treat clinical accuracy as a small feature instead of the product.

Ontario wanted AI scribes to give doctors back time. The province now has an audit showing that some of the approved systems could write the wrong drug into a medical note, miss mental health details, or invent treatment steps that were never discussed.

That is a serious failure because AI scribes are not being sold as toys for early adopters. They are being moved into clinics, attached to patient visits and presented as a practical answer to one of health care's most persistent problems: too much paperwork and not enough time with patients. The idea is simple enough. With consent, software listens to a clinical conversation and produces a SOAP (subjective, objective, assessment, plan) note for the medical record. The risk is also simple. If the note is wrong, the error can travel.

Ontario Auditor General Shelley Spence released the special report on May 12, finding that AI scribe systems approved through Supply Ontario were not evaluated adequately before implementation. As Global News reported, the audit found that all 20 approved vendors had one or more inaccuracies during procurement testing, including hallucinations, incorrect information or missing details.

The most revealing part of the report is not that AI made mistakes. Anyone who has used generative AI around high-stakes information knows the technology can produce confident errors. The sharper point is how the province scored the systems.

Accuracy of generated medical notes counted for just 4% of the procurement score. Domestic presence in Ontario counted for 30%. Data privacy and legal controls were weighted at 23%, while system security controls accounted for 11%. The Trillium reported that bidders could score zero on system security, bias controls and medical-note accuracy and still meet the minimum score needed to become a vendor of record.
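The arithmetic behind that finding can be sketched in a few lines. This is an illustration, not Supply Ontario's actual scoring method: the four weights are the ones reported above, but the remaining 32 points are lumped into a placeholder "other_criteria" bucket (the weight for bias controls is not given here), and the pass mark, the gate floor and the sample bidder are all hypothetical.

```python
# Weights reported from the audit, as percentage points of the total score.
REPORTED_WEIGHTS = {
    "domestic_presence": 30,
    "privacy_legal": 23,
    "system_security": 11,
    "note_accuracy": 4,
}

# Assumption: everything not itemized above is folded into one placeholder bucket.
WEIGHTS = {**REPORTED_WEIGHTS, "other_criteria": 100 - sum(REPORTED_WEIGHTS.values())}

HYPOTHETICAL_PASS_MARK = 60.0   # assumption: the real minimum score is not stated here
HYPOTHETICAL_GATE_FLOOR = 0.8   # assumption: illustrative floor for pass-or-fail criteria
GATED_CRITERIA = ("note_accuracy", "system_security", "privacy_legal")


def weighted_score(ratings):
    """Sum of weight * rating, where each rating is a fraction between 0.0 and 1.0."""
    return sum(weight * ratings.get(name, 0.0) for name, weight in WEIGHTS.items())


def passes_weighted(ratings, pass_mark=HYPOTHETICAL_PASS_MARK):
    """Pure weighted-sum approval: strong areas can offset zeros elsewhere."""
    return weighted_score(ratings) >= pass_mark


def passes_gated(ratings, floor=HYPOTHETICAL_GATE_FLOOR, pass_mark=HYPOTHETICAL_PASS_MARK):
    """Each gated criterion must clear the floor before the weighted total counts."""
    if any(ratings.get(name, 0.0) < floor for name in GATED_CRITERIA):
        return False
    return passes_weighted(ratings, pass_mark)


# Hypothetical bidder: local and strong on privacy, zero on accuracy and security.
bidder = {
    "domestic_presence": 1.0,
    "privacy_legal": 1.0,
    "system_security": 0.0,
    "note_accuracy": 0.0,
    "other_criteria": 0.5,
}

print(weighted_score(bidder))   # total out of 100
print(passes_weighted(bidder))  # approved under the weighted sum
print(passes_gated(bidder))     # rejected once accuracy and security are gates
```

Under these assumed numbers, the same bidder clears a weighted-sum threshold and fails as soon as accuracy and security are treated as pass-or-fail. The point is structural: in a pure weighted sum, a low-weight criterion can be skipped entirely.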

That should make every clinical AI startup pay attention. Procurement criteria are not paperwork. They are market signals. If a government says accuracy is worth 4% and local presence is worth 30%, vendors will optimize accordingly. Not because they are uniquely reckless, but because procurement systems create incentives. In health care, those incentives need to be brutally clear.

The testing details make the issue harder to dismiss. Supply Ontario used two simulated conversations between health-care workers and patients. According to the audit, 60% of the approved AI scribes (12 of the 20) recorded a different drug than the one prescribed. Seventeen of the 20 missed key details about patients' mental health issues in at least one test. Nine fabricated information and suggested treatment steps, including therapy referrals or blood tests, even though those were not mentioned in the recordings.

These are not cosmetic errors. A typo in a note is annoying. A different drug, a missed mental health detail or an invented treatment plan is a clinical risk. The difference matters because many AI vendors still describe human review as the final safeguard. That is true in theory. It is weaker in practice.

Human review is not a business model

Stephen Crawford, Ontario's minister of public and business service delivery and procurement, has said medical practitioners will always review notes before decisions are made. He also said doctors using AI scribes are saving five to seven hours a week on average.

Both points can be true. The administrative burden on doctors is real, and AI transcription can help. But human review cannot become a liability shield that lets weak systems enter clinical settings. If a tool is sold on the promise of saving time, then every hidden verification step eats into that promise. If a doctor must hunt for hallucinated drugs, missing mental health context and invented treatment plans, the product has shifted work from writing to auditing.

That distinction matters for founders. The winning clinical AI companies will not be the ones that simply say a physician remains in the loop. They will be the ones that can prove where the loop works, how often it catches errors, what the system refuses to summarize, and how quickly performance is tested after updates. Health care buyers should ask for live demonstrations, adversarial test cases, audit logs, bias testing and privacy documentation before vendors are approved, not after they are already on the list.

The audit also found broader AI governance gaps inside the Ontario Public Service. Only 3% of staff had completed responsible AI training, and the approved secure generative AI tool accounted for only 6% of staff's generative AI use. The report said the ministry had not blocked access to unsafe AI websites or put guardrails in place to prevent sensitive information from being uploaded to unsecured tools.

For the AI market, this is the real lesson. Governments are eager buyers, especially when software promises lower costs and faster service. But public procurement is now becoming a stress test for AI companies moving from demos into regulated workflows. Health care does not just need clever models. It needs systems that can withstand boring, strict, repeated evaluation.

Ontario's next move will matter less for one vendor list than for the standard it sets. If clinical AI is going to sit inside patient care, accuracy, security and privacy have to become pass-or-fail requirements. Anything softer tells the market that the paperwork matters more than the patient record.
