
Why Voluntary AI Guidance Falls Short

Barry P. Chaiken, MD
Strategic Healthcare Advisor | Physician | AI & IT Expert | Keynote Speaker | Author of Future Healthcare 2050
February 12, 2026

On February 9, researchers at the University of Oxford published a sobering finding in Nature Medicine. In a randomized study of 1,298 participants, three commercially available large language models (LLMs), including OpenAI’s ChatGPT, failed to improve medical decision-making when real people used them.

The models, tested in isolation, correctly identified relevant medical conditions ninety-five percent of the time. Yet participants using those same models identified the correct conditions less than thirty-five percent of the time, no better than a control group using Google. Most participants chose the wrong course of action, and the models showed alarming inconsistency: two users describing identical symptoms of a subarachnoid hemorrhage in slightly different words received opposite advice. One was told to rest at home. The other was directed to call emergency services.

The study’s conclusion was direct: none of the tested models were ready for use in direct patient care. The researchers recommended systematic human user testing with diverse, real users before any public deployment in healthcare.

This is not an abstract risk. Over 230 million people ask ChatGPT health questions every week. In January 2026, OpenAI launched ChatGPT Health, a dedicated product that connects medical records, lab results, and wearable data directly to its chatbot. OpenAI states the tool is “not intended for use in the diagnosis or treatment of any health condition.” Yet it now ingests electronic health records for millions of users and shapes consequential decisions about their care.

If the Oxford study shows that general-purpose LLMs fail real users on medical scenarios, what assurance do we have that a branded health product built on the same underlying technology and unknown training data will perform differently? ChatGPT Health, like any artificial intelligence (AI) tool that touches patient welfare, should undergo transparent, independent evaluation before deployment, not after.

This concern extends beyond consumer applications. The same interaction failures that plague patient-facing AI will surface in clinician-facing tools. If a patient cannot communicate symptoms effectively to an LLM, a clinician overwhelmed by alert fatigue may similarly fail to extract reliable guidance from a decision-support system that presents too many options without clear priority. The human-AI interaction problem is universal.

An Approach Worth Acknowledging

Against this backdrop, the Joint Commission and the Coalition for Health AI (CHAI) released their joint Guidance on the Responsible Use of AI in Healthcare (RUAIH), a seven-element framework addressing governance structures, patient privacy, data security, ongoing quality monitoring, voluntary safety-event reporting, risk and bias assessment, and education and training.

For an industry that has largely deployed AI tools without coordinated standards, the existence of joint guidance from the nation’s primary healthcare accreditor and a leading AI coalition is meaningful. The categories are correct. The intent is sound. The document represents the beginning of a necessary conversation about how healthcare organizations should implement AI responsibly.

But a beginning is not a destination. And the structural limitations of the RUAIH framework raise serious questions about whether voluntary, organization-level self-governance can protect patients at the speed this technology demands.

When Incentives and Safety Collide

The RUAIH framework is voluntary and self-reported. It relies on organizations to govern themselves, monitor their own AI tools, and voluntarily report safety events. History teaches us what happens when high-stakes industries self-regulate without external verification, particularly when economic incentives conflict with public safety.

Social media offers a cautionary precedent. In 2018, Meta began studying how Instagram’s beauty filters affected young users. Internal research found that the filters harmed teenage girls’ mental health and body image. The company banned them in 2019. But in 2020, Mark Zuckerberg overruled his own employees, including an executive whose daughter suffered from body dysmorphia. He reinstated the filters because they drove engagement, the metric that generates advertising revenue. When profit and safety collided, profit won. Lawsuits now proceeding in federal and state courts allege that these design choices constituted personal injury through addictive product design.

Healthcare AI faces the same structural tension. Vendors racing to market, health systems under financial pressure to demonstrate return on investment, and investors demanding growth all create incentives to deploy faster than governance can mature. For-profit and even nonprofit organizations face economic pressures that may conflict with patient interests. Voluntary systems that ask these same organizations to police themselves assume that economic self-interest will align with patient safety. The social media experience suggests otherwise.

Beyond incentives, the RUAIH framework contains a critical structural gap. It calls for “ongoing quality monitoring” but places the burden entirely on deploying organizations. There is no independent body performing cross-institutional benchmarking, no public registry of performance data for clinicians and patients to consult, and no mechanism for recognizing patterns of failure across organizations.

The Oxford study exposes exactly this gap: its central finding is that the failure mode in healthcare AI is the human-AI interaction itself, a failure mode that no single organization can evaluate in isolation and that no current framework requires anyone to evaluate at all.

The reporting mechanism proposed in the RUAIH guidance illustrates this problem precisely. Element five calls for voluntary, blinded reporting of AI safety-related events and alludes to a centralized repository, but remains vague about who would manage it, how it would function, and what authority it would carry. Patient Safety Organizations (PSOs) already exist in healthcare for this purpose, yet medical errors remain a leading cause of harm, and reporting remains limited. The current system has not produced the culture of transparency that patient safety demands.

Aviation offers a telling contrast. NASA’s Aviation Safety Reporting System (ASRS) succeeds because it is managed independently from the regulator, the FAA, and offers genuine liability protection to those who report. That independence and legal shield built a culture of voluntary disclosure over decades. Without comparable independence, robust confidentiality protections, and concrete steps to address the malpractice liability concerns that discourage candid reporting, the RUAIH approach to AI safety events risks replicating the underreporting that has limited PSO effectiveness. The guidance acknowledges the need for a centralized mechanism but does not build one.

The Meaningful Use Warning

We have traveled this road before. Almost two decades ago, the federal government brought together stakeholders with the best of intentions to digitize healthcare through the adoption of electronic health records. Meaningful Use was designed to improve quality, safety, and efficiency. What it produced was a compliance-driven process that prioritized checkbox adoption over outcome measurement.

Nearly twenty years later, we still do not have robust interoperability. The lesson is not that collaboration fails. The lesson is that committee-driven consensus, when implementation favors expediency over outcomes, produces systems that burden clinicians without delivering promised safety gains.

Healthcare AI is evolving faster than any technology in history. Algorithms update weekly. Models retrain on new data continuously. Leaving governance to a bureaucratic committee or allowing individual organizations to self-police repeats the Meaningful Use mistake at a faster pace and with higher stakes. The resources required to properly evaluate healthcare AI, including continuous performance monitoring, cross-population bias detection, human user testing, and real-world outcome tracking, are too great for any single organization to bear. Efforts must be combined to be effective.

This is why I have proposed the Healthcare AI Review and Transparency (HART) initiative: a national public-private partnership that would serve as the centralized clearinghouse the RUAIH guidance envisions but does not build. HART would provide independent, continuous evaluation of AI tools before and after deployment. It would maintain a public registry of performance data accessible to clinicians, developers, and patients. It would enable cross-institutional pattern recognition that no single hospital can achieve on its own.

Critically, HART would house the independent safety-event reporting system that healthcare AI requires, modeled on the ASRS principle that the body receiving reports must be independent from the body with enforcement authority, with genuine confidentiality protections that encourage candid disclosure. And it would carry corrective action authority that voluntary frameworks cannot provide. Where RUAIH asks organizations to watch themselves, HART provides the external accountability that the trust covenant between clinicians and patients demands.

The Oxford researchers’ call for “systematic human user testing with diverse, real users” aligns directly with what HART would mandate, not as a voluntary aspiration but as a structural requirement. The fear is straightforward: without centralized, independent evaluation, the focus on profits will lead to shortcuts rather than safe, affordable, accessible, high-quality healthcare.

The Way Forward

The disparate, committee-driven approach to healthcare AI governance will be too slow and too unfocused to safely manage a technology that updates weekly and touches every patient. Healthcare leaders must move beyond the illusion that voluntary self-governance, however well-intentioned, will protect patients. The industry needs a centralized, transparent, independent, and accountable evaluation mechanism that unites regulators, clinicians, developers, and patients under shared standards. Not more committees. Not more frameworks. Shared accountability with independent verification.

In Future Healthcare 2050, I explore these governance challenges in depth and provide strategies for building trust between technology and the clinicians who wield it. Order your copy today as an eBook or hardback at Barnes and Noble or Amazon.

Let us continue this conversation. Do you believe voluntary frameworks are sufficient to govern healthcare AI, or does the industry need independent, centralized oversight? Share your perspective in the comments.


Sources:

    1. Bean, A. M., Payne, R. E., Parsons, G., et al. “Reliability of LLMs as medical assistants for the general public: a randomized preregistered study.” Nature Medicine, February 9, 2026. https://doi.org/10.1038/s41591-025-04074-y
    2. Rosenbluth, T. “Health Advice From A.I. Chatbots Is Frequently Wrong, Study Shows.” The New York Times, February 9, 2026. https://www.nytimes.com/2026/02/09/well/chatgpt-health-advice.html
    3. Joint Commission & Coalition for Health AI. “The Responsible Use of AI in Healthcare (RUAIH).” 2025. https://digitalassets.jointcommission.org/api/public/content/dcfcf4f1a0cc45cdb526b3cb034c68c2
    4. OpenAI. “Introducing ChatGPT Health.” January 7, 2026. https://openai.com/index/introducing-chatgpt-health/
    5. Kang, C. “Social Media on Trial.” The Daily, The New York Times, January 29, 2026.
    6. Chaiken, B. P. “AI’s 1929 Moment: Building HART for Healthcare Safety.” Future-Primed Healthcare, LinkedIn, 2025.
    7. Chaiken, B. P. “Beyond HART: Capitalism’s Safety Net for AI.” Future-Primed Healthcare, LinkedIn, 2025.


Dr. Barry Speaks

To book Dr. Chaiken for a healthcare or industry keynote presentation, contact – Sharon or Aleise at 804-464-8514 or info@barrychaiken.com
