Who Owns the Data That Trains Healthcare AI?
January 22, 2026
In December 2023, The New York Times filed suit against OpenAI and Microsoft, alleging that millions of its articles had been used to train AI models without permission or compensation. The lawsuit joined a growing wave of legal challenges from authors, musicians, and visual artists demanding recognition for their role in creating the data that powers artificial intelligence. These disputes remain unresolved, but a parallel development signals where the market is heading. In January 2026, Wikipedia announced licensing agreements with Amazon, Meta, Microsoft, Perplexity, and Mistral AI. The nonprofit organization that famously rejected advertising now receives payment for structured access to its data. Google signed a similar agreement in 2022; the rest of the industry followed.
This shift establishes a principle that is only beginning to reach healthcare: those who create the data that trains AI deserve a role in governing its use. Healthcare organizations and their patients now face the same question that media companies have confronted. Who owns the data that trains healthcare AI, and who benefits when that data generates commercial value?
Data Flows Without Governance
AI scribes that transcribe clinical encounters are now deployed at scale across U.S. health systems, with some institutions using these tools for millions of visits annually. To generate draft notes, AI scribes first create full transcripts of patient-clinician conversations. These transcripts represent an unprecedented record of clinical dialogue, yet most U.S. healthcare institutions delete them after notes are finalized. The primary driver is fear of medical malpractice liability. Transcripts create a discoverable record of what was discussed during clinical visits, and institutions destroy them preemptively, before any attorney can request them in litigation.
The consequences of this practice extend beyond individual organizations. Properly validated transcripts provide ground truth to assess the accuracy of AI-generated summaries, yet peer-reviewed studies evaluating hallucination rates in real-world clinical settings remain scarce. A recent report documented an AI scribe hallucinating that a patient had diabetes and suspected heart disease, information that entered the patient’s chart and influenced care. Without transcript retention, such errors cannot be systematically identified, and the models cannot be improved.
While healthcare organizations destroy transcripts to avoid liability, AI companies retain de-identified versions to train the next generation of clinical AI tools. This asymmetry illustrates a broader governance gap. Electronic health record vendors are building AI capabilities using data from their clients’ systems, yet transparency about these practices remains limited.
Epic’s Cosmos dataset aggregates de-identified clinical data from hundreds of millions of patients to support the development of predictive and generative models. Oracle Health positions customer data within its cloud infrastructure as the substrate for AI training and inference. Neither company’s public disclosures detail how clinical data governance operates across customer boundaries or whether aggregated data flows back to corporate model development.
The problem is the transparency gap itself, not any allegation of misconduct. Healthcare organizations and patients cannot make informed decisions about data use when vendor practices remain opaque. The risk of re-identification in large clinical datasets compounds this concern: de-identification techniques may not adequately protect privacy when datasets are extensive and multifaceted.
Malpractice Reform: From Compensation to Prevention
The fear of malpractice liability is driving healthcare organizations to destroy transcripts that could improve AI safety. This calculus prioritizes legal risk avoidance over quality improvement, reversing the proper order of priorities.
The aviation industry offers an instructive parallel. In 1960, commercial aviation experienced 67 accidents per million departures globally. By 2022, this rate had fallen to approximately 1 accident per million departures. Fatal accidents dropped from approximately 40 per year in 1960 to an average of 5 per year by 2022, despite massive increases in air travel. This improvement came from the industry’s commitment to error investigation, data analysis, and robust safety reporting, all supported by a legal framework that encouraged openness rather than defensiveness. The Aviation Safety Reporting System, established in 1976 and managed by NASA, allows aviation professionals to submit confidential reports of safety incidents without fear of punishment or legal action.
Malpractice law should be revised to focus on preventing harm rather than compensating for it. Key reforms include safe harbor provisions for healthcare providers who report safety concerns in good faith, confidentiality protections ensuring that quality improvement data cannot be discovered or used as evidence in malpractice lawsuits, and a non-punitive approach focused on learning rather than blame. These reforms would allow healthcare organizations to retain transcripts for AI quality assurance without exposing themselves to litigation based on those same records.
A Roadmap for Data Stewardship
Under HIPAA, patients hold rights to control and consent to uses of their medical records. These rights include accessing and obtaining copies of their records, requesting amendments, controlling certain disclosures, and receiving an accounting of disclosures. These are not property rights in the traditional sense, but they are rights of control. Because patients can govern disclosure, they can also direct how their data is used for AI training and where any compensation flows.
A roadmap for responsible data stewardship should incorporate several principles. Patients should be able to designate how their information is used and where any compensation is directed. This could be as straightforward as directing any compensation to the healthcare organization’s foundation, aligning patient interests with the institutional mission. Healthcare organizations should receive compensation for data stewardship, tied to foundation-directed benefits rather than direct profit. The Wikipedia licensing model provides a conceptual template, but healthcare requires additional ethical, practical, and legal research to meet HIPAA requirements and address the unique sensitivity of medical information.
Industry-wide contractual standards should shift data control to healthcare organizations and patients. Vendor agreements should require explicit disclosure of data governance practices and obtain consent for AI training use. Neither Epic nor Oracle currently publishes detailed governance protocols that would enable healthcare executives to evaluate how their patients’ data contribute to AI development. This opacity should not continue.
Specifically, healthcare systems should retain AI scribe transcripts to validate model accuracy and detect hallucinations. Transcript retention supports quality improvement, enabling organizations to identify error patterns and work with vendors to improve model performance. The near-term future may offer rapid quality evaluation methods, but only if the source data are available to support such an assessment.
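The kind of quality check that transcript retention enables can be sketched in a few lines. The watch-term list and simple string matching below are illustrative assumptions, not a clinical-grade method; a real pipeline would need clinical NLP that handles negation, synonyms, and context:

```python
# Minimal sketch: flag diagnoses that appear in an AI-generated note
# but are never mentioned in the retained visit transcript.
# Term list and matching logic are illustrative assumptions only.

def flag_unsupported_terms(note: str, transcript: str, watch_terms: list[str]) -> list[str]:
    """Return watch-list terms present in the note but absent from the transcript."""
    note_lower = note.lower()
    transcript_lower = transcript.lower()
    return [
        term for term in watch_terms
        if term in note_lower and term not in transcript_lower
    ]

transcript = "Patient reports knee pain after running. No other complaints discussed."
note = "Patient with diabetes presents with knee pain after running."
watch_terms = ["diabetes", "heart disease", "hypertension"]

print(flag_unsupported_terms(note, transcript, watch_terms))
# → ['diabetes']
```

Even this crude comparison can surface the kind of fabricated diagnosis described above, but only if the transcript still exists to compare against; once transcripts are deleted, no check of any sophistication is possible.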
Human Oversight Remains Essential
AI tools must augment rather than replace clinical judgment. Healthcare professionals remain responsible for interpreting AI-generated results and considering biases in training data. Organizations must maintain human oversight to interpret AI-generated insights and make final decisions on safety interventions. This principle applies equally to AI scribes, clinical decision support, and any tool that influences patient care. The question of who owns training data does not change this fundamental requirement: clinicians remain accountable for the care they provide, regardless of the tools they use.
Establishing Industry Standards
Healthcare organizations should demand transparency from EHR vendors regarding data governance practices and advocate for industry-wide standards governing the use of AI training data. Policymakers should revise malpractice law to support quality improvement rather than obstruct it. Patients should be empowered to direct how their data is used and where any compensation flows.
The creator economy has begun to establish principles for compensating those whose work trains AI. Healthcare must now do the same, with a schema that respects patient autonomy, supports organizational stewardship, and prioritizes harm prevention over fear of litigation.
Sources
Goodman, K.E., & Morgan, D.J. (2026, January 8). Digital Exhaust or Digital Gold? The Value of AI-Generated Clinical Visit Transcripts. New England Journal of Medicine, 394(2), 110-113.
Palmer, S. (2026, January 16). AI Pays Wikipedia. Shelly Palmer. https://shellypalmer.com/2026/01/ai-pays-wikipedia/
Chaiken, B.P. (2024). Future Healthcare 2050, Chapter 9: "Enhancing Quality and Safety."
Chaiken, B.P. (2024). Future Healthcare 2050, Chapter 14: "Aligning AI with Human Needs."
🌐 Explore more insights: https://barrychaiken.com/sl
#HealthcareAI #PatientPrivacy #DigitalHealth #HealthIT #AIGovernance #HealthcareLeadership #FutureOfHealthcare