
Large language models in medicine: A review of current clinical trials across healthcare applications

  • Mahmud Omar (Mahmudomar70@gmail.com)

    Affiliations: Maccabi Health Services, Israel; The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, United States of America

  • Girish N. Nadkarni

    Roles: Conceptualization, Supervision, Validation

    Affiliations: The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, United States of America; The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, United States of America

  • Eyal Klang (contributed equally with Benjamin S. Glicksberg)

    Roles: Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Supervision, Writing – review & editing

    Affiliations: The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, United States of America; The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, United States of America

  • Benjamin S. Glicksberg (contributed equally with Eyal Klang)

    Roles: Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing

    Affiliations: The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, United States of America; The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, United States of America

Abstract

This review analyzes current clinical trials investigating large language models’ (LLMs) applications in healthcare. We identified 27 trials (5 published and 22 ongoing) across 4 main clinical applications: patient care, data handling, decision support, and research assistance. Our analysis reveals diverse LLM uses, from clinical documentation to medical decision-making. Published trials show promise but highlight accuracy concerns. Ongoing studies explore novel applications like patient education and informed consent. Most trials occur in the United States of America and China. We discuss the challenges of evaluating rapidly evolving LLMs through clinical trials and identify gaps in current research. This review aims to inform future studies and guide the integration of LLMs into clinical practice.

Introduction

Large language models (LLMs) are artificial intelligence (AI) systems trained on vast text data to understand and generate human-like language [1]. This technology has emerged as a particularly important recent innovation and is being evaluated in medical practice and research [1–4]. Many studies show promise for LLMs, such as GPT (generative pre-trained transformer) and BERT (bidirectional encoder representations from transformers), in healthcare for patient interaction, administrative tasks, data analysis [5–7], and beyond. However, LLMs can produce inaccurate information and have been shown to propagate bias, raising concerns about their use in clinical settings [1,8].

As with all machine learning-based tools, clinical trials are needed to evaluate the effectiveness and safety of LLMs in real-world medical applications. Clinical trials are research studies that assess new interventions, including technologies like LLMs, within clinical workflows [9]. They follow standardized protocols and are essential for validating healthcare innovations before widespread adoption [10].

Recent papers on LLMs in healthcare often present conflicting results [11], and some clinical trial findings diverge from those of nonclinical studies. For example, while Liu and colleagues concluded that GPT is effective for clinical documentation [12] and Barak-Corren and colleagues reported positive perceptions among pediatric emergency medicine attendings [13], a randomized controlled trial (RCT) by Baker and colleagues revealed errors in 36% of documents [14]. We focus solely on registered clinical trials to maintain coherence, despite the valuable evidence nonclinical studies offer regarding the potential benefits, biases, and inaccuracies of LLMs [1,8].

This review analyzes registered clinical trials exploring LLM applications in medicine. We focus on LLMs due to the growing research interest in their potential impact on healthcare [1]. Our analysis covers trial designs, applications, and outcomes to identify trends and gaps in current LLM research. This review aims to inform future studies and guide the integration of LLMs into clinical practice and research.

Methodology

Search strategy and selection criteria

We systematically screened publications from January 2018 onward, when the first publicly available LLMs debuted. Our search terms included “clinical trials,” “Large Language Models,” “LLMs,” “GPT,” and “BERT.” We searched PubMed, Scopus, Embase, ClinicalTrials.gov, and the International Clinical Trials Registry Platform (ICTRP) to find published and ongoing research, and we used the Rayyan web application [15], a tool designed for streamlined screening of academic papers, to manage screening. Our review follows the PRISMA extension for scoping reviews [16]. The full Boolean strings used for the database searches are provided in the Supporting information (S1 Text).
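To illustrate how the PubMed arm of such a search can be reproduced programmatically, the following is a minimal sketch using Biopython's Entrez interface. The query string is a simplified stand-in for the full Boolean strings in S1 Text, and the contact email is a placeholder.

```python
# A minimal sketch, assuming Biopython is installed; the query is a
# simplified stand-in for the full Boolean strings in S1 Text.
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # placeholder; NCBI asks for a contact email

QUERY = (
    '("large language model"[Title/Abstract] OR GPT[Title/Abstract] OR BERT[Title/Abstract]) '
    'AND ("clinical trial"[Publication Type] OR "randomized controlled trial"[Publication Type]) '
    'AND ("2018/01/01"[dp] : "3000"[dp])'  # publications from January 2018 onward
)

# esearch returns matching PubMed IDs; efetch could then retrieve full records.
handle = Entrez.esearch(db="pubmed", term=QUERY, retmax=500)
record = Entrez.read(handle)
handle.close()

print(f"{record['Count']} records matched; first IDs: {record['IdList'][:5]}")
```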

Inclusion and exclusion criteria

We included registered clinical trials and RCTs that evaluated LLMs in clinical practice or research. This encompassed both published trials and registered, unpublished clinical trials. We selected trials where LLMs were the primary intervention. We define LLMs as neural network-based models trained on large text data sets to generate human-like text [1]. We did not include trials using linear models (e.g., logistic regression) or other machine learning models (e.g., decision trees, non-LLM neural networks).
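As a rough illustration of how these criteria translate into a screening rule, the sketch below encodes them as a simple predicate over a hypothetical trial record. The field names (registered, primary_intervention, model_family) are invented for illustration and do not reflect an actual extraction schema.

```python
# Hypothetical screening predicate mirroring the stated inclusion criteria;
# field names are invented for illustration.
EXCLUDED_MODEL_FAMILIES = {"linear model", "decision tree", "non-LLM neural network"}

def is_eligible(trial: dict) -> bool:
    """True if a trial is registered, uses an LLM as the primary
    intervention, and does not rely on an excluded model family."""
    return (
        trial.get("registered", False)
        and trial.get("primary_intervention") == "LLM"
        and trial.get("model_family") not in EXCLUDED_MODEL_FAMILIES
    )

# Example: a registered RCT evaluating a GPT model as the intervention.
print(is_eligible({"registered": True, "primary_intervention": "LLM", "model_family": "GPT"}))  # True
```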

Screening and data extraction

The initial screening of titles and abstracts was conducted by 2 independent researchers, MO and EK. Following the initial screening, full-text articles and records of registered and currently running clinical trials deemed potentially eligible were retrieved and assessed. Data extraction was performed by one reviewer (MO) and subsequently verified by a second reviewer (EK or BG). Any discrepancies encountered during the data extraction phase were resolved through discussion among the reviewers.

Overview of the included trials and applications

Our review encompasses 27 clinical trials, 5 published and 22 ongoing, employing LLMs across various healthcare applications (Tables 1–3) [14,17–42]. Fig 1 presents the screening process as a PRISMA flowchart.

Despite the diverse and sometimes unique applications of these trials, they can broadly be grouped into 4 categories: Patient Care (11 trials), Data Handling (4 trials), Decision and Diagnostics Aid (8 trials), and Research Assistance (4 trials).

Patient Care encompasses any use-case directly oriented towards patients, such as management or patient education. Data Handling focuses on applications involving data analysis, storage, and related activities. Decision and Diagnostics Aid covers the diagnosis and detection of diseases. Research Assistance pertains to applications related to proofreading, writing, or reviewing research materials.
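For readers who want to sanity-check the breakdown, a minimal tally (using the counts reported above) confirms that the 4 categories sum to the 27 included trials:

```python
from collections import Counter

# Trial counts per application category, as reported above.
categories = Counter({
    "Patient Care": 11,
    "Data Handling": 4,
    "Decision and Diagnostics Aid": 8,
    "Research Assistance": 4,
})

assert sum(categories.values()) == 27  # matches the 27 included trials
for name, n in categories.most_common():
    print(f"{name}: {n}/27 ({n / 27:.0%})")
```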

Concerning the models assessed, all published trials identified the specific models evaluated: 3 trials focused on GPT-4, one on GPT-3 alongside other models such as Llama and PaLM, and one on BERT. Among the ongoing trials, 4 did not specify which LLM would be used; 15 employed various iterations of GPT, with 7 specifying GPT-4. The remaining trials involved other models, including a digital twin approach and BERT (S1 and S2 Tables).

Published clinical trials

It is important to note that not all clinical trial results are published. Often, only trials that confirm researchers’ assumptions or show significant results are published [43].

The review identified 5 published clinical trials (Tables 1 and 2) that examined different applications of LLMs in healthcare: 1 focused on patient care [20], 1 on data handling [14], 1 on decision and diagnostic aid [18], and 2 on research assistance [17,19]. While the trials were broadly categorized, they encompass a wide range of specific applications that can vary significantly even within the same category; specifically, they explored clinical documentation [14], medical decision-making [18], and patient knowledge enhancement tools [20]. The trials were conducted in several countries, including the USA, Italy, Denmark, and Saudi Arabia, and employed models such as GPT-4 and BERT (Fig 2).

Fig 2. Trends in the clinical trials investigating LLMs—Years and countries.

The years in the figure represent either the year of publication (for published trials) or the year of registration (for ongoing trials). “Not recruiting” denotes trials that are registered but have not yet started recruiting.

https://doi.org/10.1371/journal.pdig.0000662.g002

Collectively, these trials showcase areas where LLMs exhibit promising capabilities, albeit with some limitations. For example, Baker and colleagues explored the utility of GPT-4 in enhancing clinical documentation quality in a single-center trial in the USA with 11 participants, reporting that while GPT-4 produced more detailed patient histories, inaccuracies were noted in 36% of cases [14]. In parallel, several ongoing trials are exploring similar applications of LLMs in medical documentation. Notably, NCT06263855, also in the USA, is targeting a larger cohort of 1,015 participants, examining whether LLM-assisted writing of discharge summaries can improve care delivery. Meanwhile, ChiCTR2300078274 in China is exploring the use of ChatGPT for informed consent in knee arthroplasty, and NCT05945004 is comparing the efficacy of ChatGPT with human efforts in drafting preoperative visit sheets.

Civettini and colleagues evaluated LLMs for decision-making in hematopoietic stem cell transplantation in a multicenter Italian study with 6 participants [18]. They tested GPT-4, PaLM 2, and 2 versions of Llama-2 against medical residents and expert consensus. The LLMs achieved a median overall agreement of 58.8% with expert consensus, with kappa values between 0.3 and 0.61; residents outperformed the LLMs, showing 76.5% agreement and kappa values of 0.4 to 0.8. No current trials specifically examine LLMs in stem cell transplantation decisions. However, studies like DRKS00033775 (Germany) and NCT06157944 (USA) are exploring LLMs as diagnostic aids for doctors, focusing on their assistive role rather than standalone capabilities.
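For context on these statistics, the sketch below shows how percent agreement and Cohen's kappa (the chance-corrected agreement measure reported in the trial) can be computed with scikit-learn. The rating vectors are invented toy data, not the trial's actual responses.

```python
from sklearn.metrics import cohen_kappa_score

# Toy ratings: expert consensus vs. an LLM on 5 hypothetical cases.
expert = ["eligible", "not eligible", "eligible", "eligible", "not eligible"]
llm    = ["eligible", "eligible",     "eligible", "not eligible", "not eligible"]

# Raw percent agreement (the trial reported a median of 58.8% for LLMs).
agreement = sum(e == m for e, m in zip(expert, llm)) / len(expert)

# Cohen's kappa corrects that agreement for chance (0.3-0.61 in the trial).
kappa = cohen_kappa_score(expert, llm)
print(f"agreement={agreement:.1%}, kappa={kappa:.2f}")  # agreement=60.0%, kappa=0.17
```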

Deveci and colleagues assessed the capabilities of GPT-4 in writing cover letters for scientific submissions in a single-center study in Denmark involving 36 participants [17]. The findings showed that GPT-4’s letters were comparable in impression and readability to those written by humans, with slight variances in meeting specific criteria.

Lawrence and colleagues found that AI-generated arthroplasty literature was indistinguishable from human-written texts in terms of perceived authorship, though it fell short in perceived quality [19]. Interestingly, no ongoing trials directly investigate the writing aspects of LLMs’ research assistance capabilities.

Lastly, Bitar and colleagues’ study (2022) in Saudi Arabia with 386 participants employed BERT to summarize texts about HPV, aiming to enhance knowledge dissemination [20]. The BERT-generated summaries were effective, though slightly less so than full texts. In a related vein, NCT05789901 is evaluating the provision of health condition information via a chatbot. Although other ongoing trials are investigating the capabilities of LLMs to provide accurate and usable medical information, their focus is primarily on healthcare experts rather than patients: for example, NCT05963802 in Canada is examining AI’s usability and efficacy in health sciences training, and NCT06015178 in China is focused on enhancing medical researchers’ self-learning abilities.

Ongoing trials

Our review of clinical trial registries uncovered 22 registered trials currently exploring the applications of LLMs in healthcare (Table 3). These studies, registered between 2023 and 2024, are in various stages: 9 are actively recruiting participants (41% of the total), 7 have not yet started, and the remaining 6 are either ongoing or nearing completion. The trials are primarily conducted in China and the USA, which together account for 14 of the 22 trials (64%); additional countries include Germany, Italy, and Canada (Fig 2). Sample sizes vary widely, ranging from fewer than 100 to over 1,000 participants.

These trials predominantly utilize different versions of GPT. For example, Dong and colleagues’ multicenter trial enrolls over 1,000 participants to assess LLMs’ effectiveness in providing decision support for gastrointestinal cancer treatments, testing their practical utility in complex medical decision-making scenarios.

Prominent use cases for LLMs in these trials include clinical decision support and enhancing patient care, as in Dong and colleagues’ trial above and Bitar and colleagues’ published study on patient education about HPV. Some trials, however, feature distinctive applications that highlight the innovative potential of LLMs. For example, Shalong and colleagues are investigating how a custom GPT can support self-directed learning among medical students, a novel approach in medical education.

Another distinctive application by Zheng and colleagues explores using LLMs to improve informed decision-making during cataract surgery consultations, aiming to enhance patient understanding and satisfaction. Additionally, Yao and colleagues are assessing the impact of LLMs on discharge summary writing, which could revolutionize administrative tasks in healthcare by improving efficiency and accuracy.

The distribution of these trials provides insights into the current state of LLM integration into clinical research. There is a clear emphasis on employing LLMs for diagnostic and decision support, alongside an interest in using these models to augment medical education and patient care. Notably, the USA and China emerge as leading contributors to this field, signaling a drive towards AI adoption in their healthcare systems.

Discussion

Our review identified 22 ongoing and 5 published clinical trials evaluating LLMs in medicine. These trials aim to assess the effectiveness and safety of LLMs in real-world healthcare settings. The accuracy and reliability standards for LLM use in clinical practice remain undefined. Future research should establish clear acceptance criteria for LLMs in various medical applications.

The reviewed trials cover diverse applications, from clinical documentation to medical decision-making. For example, NCT06263855 examines LLM-assisted writing of discharge summaries with over 1,000 participants. However, only 3 out of 27 trials focus on direct patient education, meaning models that patients themselves can use and interact with for educational purposes. This suggests a need for more research in this area to fully explore LLMs’ potential in improving patient outcomes.

A key barrier to conducting clinical trials on direct patient interactions with LLMs is the lack of HIPAA compliance in many of these technologies [44]. HIPAA regulations protect patient privacy and secure health information [45]. Without proper compliance, LLMs cannot be legally or ethically used to handle patient data in clinical settings [45]. This challenge needs to be addressed to enable more comprehensive research on LLMs in direct patient care.

Trial scales vary significantly. Some, like NCT06157944, are large multicenter studies investigating LLMs in diagnostic processes. Others are smaller, single-center studies. This variation limits our ability to draw generalized conclusions about LLM efficacy and safety across different clinical settings. More large-scale, multicenter trials could provide more robust evidence.

Many trials (15 out of 27) explore multiple LLM applications within a single study. For example, NCT05963802 in Canada evaluates LLMs in various aspects of health sciences training. While this approach offers a broad view of LLM capabilities, it may lack depth in specific applications. Targeted studies focusing on individual clinical tasks could provide more detailed insights into optimizing LLMs for specific healthcare functions.

An important question is how to design larger clinical trials, which typically require long durations and significant effort [10], to evaluate LLMs that are rapidly evolving [1,3]. The current evidence suggests that newer models, developed within very short time frames, have shown significantly better results across almost all use cases [46–50].

This raises an important open issue: how can we design and execute effective, fast-paced clinical trials that can keep up with this swiftly advancing field, while maintaining the rigorous standards of robust investigative tools that are trusted and proven? Such a balance is essential to ensure that as LLMs develop, they are evaluated thoroughly and accurately, allowing healthcare to benefit from the latest advancements without sacrificing reliability.

Current clinical trials often lack specificity regarding the LLMs used, with most focusing on various iterations of GPT. GPT is widely utilized and studied [51], and research consistently shows performance variations among its versions, such as GPT-3.5 and GPT-4 [48]. These performance differences are narrowing in more advanced models, illustrated by minor discrepancies between GPT-4 and its newer counterpart, GPT-4o [52]. This trend underscores the rapid development within the field, necessitating thorough and precise evaluation. Moreover, the field boasts a diverse array of models differing in parameters, design, and overall performance and safety [53].
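One practical mitigation, where an API-served model is used, is to pin an exact, dated model snapshot in the trial protocol and analysis code so the evaluated system does not silently change mid-trial. A minimal sketch, assuming OpenAI's Python client (v1+), an API key in the environment, and an example snapshot name:

```python
# A minimal sketch, assuming the openai Python package (v1+) and an API key
# in the OPENAI_API_KEY environment variable; the prompt is illustrative.
from openai import OpenAI

client = OpenAI()

# Pinning a dated snapshot (e.g., "gpt-4-0613") rather than the floating
# "gpt-4" alias keeps the evaluated model fixed for the trial's duration.
response = client.chat.completions.create(
    model="gpt-4-0613",
    temperature=0,  # reduces output variability across runs
    messages=[{"role": "user", "content": "Summarize this discharge note: ..."}],
)
print(response.choices[0].message.content)
```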

Our review has several limitations. The databases selected might miss relevant studies not included in those sources. We included many yet-to-be-published studies to reflect the field’s rapid advancement, but this introduces uncertainty about the actual outcomes. Additionally, by excluding non-randomized studies and other AI models, we might have missed broader applications and impacts of AI in healthcare. As this field evolves quickly, there is also a risk that the most recent studies were not included. Some registered clinical trials may not use the robust, protocol-coherent designs typically accepted for clinical trials; however, we carefully screened each included study to ensure its design is consistent with evaluating LLM interventions in human subjects.

In conclusion, LLMs offer promise in healthcare but require careful investigation and validation. Future directions should include expanding research into underexplored areas such as direct patient care and education, designing larger multicenter trials, and balancing broad-based LLM applications with targeted studies that probe defined and specific clinical tasks. Effective integration into clinical practice will require standardized protocols that ensure these models enhance, rather than compromise, the quality of care.

Supporting information

S1 Text. Full Booleans for the systematic literature search.

https://doi.org/10.1371/journal.pdig.0000662.s001

(DOCX)

S1 Table. A detailed summary of the included ongoing clinical trials.

https://doi.org/10.1371/journal.pdig.0000662.s002

(DOCX)

S2 Table. A detailed summary of the included published clinical trials.

https://doi.org/10.1371/journal.pdig.0000662.s003

(DOCX)

References

  1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930–1940. pmid:37460753
  2. Beam AL, Drazen JM, Kohane IS, Leong T-Y, Manrai AK, Rubin EJ. Artificial Intelligence in Medicine. N Engl J Med. 2023;388:1220–1221. pmid:36988598
  3. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt J-N, Laleh NG, et al. The future landscape of large language models in medicine. Commun Med (Lond). 2023;3:141. pmid:37816837
  4. Meng X, Yan X, Zhang K, Liu D, Cui X, Yang Y, et al. The application of large language models in medicine: A scoping review. iScience. 2024;27. pmid:38746668
  5. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595. pmid:37215063
  6. Waisberg E, Ong J, Masalkhi M, Kamran SA, Zaman N, Sarker P, et al. GPT-4: a new era of artificial intelligence in medicine. Ir J Med Sci. 2023;192:3197–3200. pmid:37076707
  7. Rasmy L, Xiang Y, Xie Z, Tao C, Zhi D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit Med. 2021;4:86. pmid:34017034
  8. Omiye JA, Lester JC, Spichak S, Rotemberg V, Daneshjou R. Large language models propagate race-based medicine. NPJ Digit Med. 2023;6:195. pmid:37864012
  9. Kandi V, Vadakedath S. Clinical Trials and Clinical Research: A Comprehensive Review. Cureus. 2023;15:e35077. pmid:36938261
  10. Umscheid CA, Margolis DJ, Grossman CE. Key Concepts of Clinical Trials: A Narrative Review. Postgrad Med. 2011;123:194–204. pmid:21904102
  11. Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large Language Models in Medicine: The Potentials and Pitfalls: A Narrative Review. Ann Intern Med. 2024;177:210–220. pmid:38285984
  12. Liu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. J Med Internet Res. 2023;25:e48568. pmid:37379067
  13. Barak-Corren Y, Wolf R, Rozenblum R, Creedon JK, Lipsett SC, Lyons TW, et al. Harnessing the Power of Generative AI for Clinical Summaries: Perspectives From Emergency Physicians. Ann Emerg Med. 2024;S0196-0644(24)00078–7. pmid:38483426
  14. Baker HP, Dwyer E, Kalidoss S, Hynes K, Wolf J, Strelzow JA. ChatGPT’s Ability to Assist with Clinical Documentation: A Randomized Controlled Trial. J Am Acad Orthop Surg. 2024;32:123–129. pmid:37976385
  15. Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan-a web and mobile app for systematic reviews. Syst Rev. 2016;5:210. pmid:27919275
  16. Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann Intern Med. 2018;169:467–473. pmid:30178033
  17. Deveci CD, Baker JJ, Sikander B, Rosenberg J. A comparison of cover letters written by ChatGPT-4 or humans. Dan Med J. 2023;70:A06230412. pmid:38018708
  18. Civettini I, Zappaterra A, Granelli BM, Rindone G, Aroldi A, Bonfanti S, et al. Evaluating the performance of large language models in haematopoietic stem cell transplantation decision-making. Br J Haematol. 2024;204:1523–1528. pmid:38070128
  19. Lawrence KW, Habibi AA, Ward SA, Lajam CM, Schwarzkopf R, Rozell JC. Human versus artificial intelligence-generated arthroplasty literature: A single-blinded analysis of perceived communication, quality, and authorship source. Int J Med Robot. 2024;20:e2621. pmid:38348740
  20. Bitar H, Babour A, Nafa F, Alzamzami O, Alismail S. Increasing Women’s Knowledge about HPV Using BERT Text Summarization: An Online Randomized Study. Int J Environ Res Public Health. 2022;19:8100. pmid:35805761
  21. Lebouche DB. A Master Research Protocol to Adapt and Evaluate an Artificial Intelligence Based Conversational Agent to Provide Information for Different Health Conditions: the MARVIN Chatbots. clinicaltrials.gov; 2024 Mar. Report No.: NCT05789901. https://clinicaltrials.gov/study/NCT05789901.
  22. Lin H. A Randomized Controlled Trial of the Effects of a Large Language Model on Medical Students’ Clinical Questioning Skills. clinicaltrials.gov; 2024 Jan. Report No.: NCT06229379. https://clinicaltrials.gov/study/NCT06229379.
  23. Zhongshan Ophthalmic Center, Sun Yat-sen University. A Superiority Randomized Controlled Trial of the Effect of a Novel Intelligent Language Model on the Self-learning Ability of Medical Researchers. clinicaltrials.gov; 2023 Nov. Report No.: NCT06015178. https://clinicaltrials.gov/study/NCT06015178.
  24. Dong D. Application of Large Language Models in the Recommendation of Treatment Plans for Gastrointestinal Cancers. clinicaltrials.gov; 2023 Sep. Report No.: NCT06002425. https://clinicaltrials.gov/study/NCT06002425.
  25. Veras M. Crossover Randomized Controlled Trial to Evaluate the Efficacy and Usability of Artificial Intelligence (ChatGPT) for Health Sciences Students (AIHSS). clinicaltrials.gov; 2024 Feb. Report No.: NCT05963802. https://clinicaltrials.gov/study/NCT05963802.
  26. Chen J. Diagnostic Reasoning With Large Language Model Chat Bots. clinicaltrials.gov; 2024 Feb. Report No.: NCT06157944. https://clinicaltrials.gov/study/NCT06157944.
  27. Yao X. Effect of Large Language Model in Assisting Discharge Summary Notes Writing for Hospitalized Patients: A Pilot Pragmatic Randomized Controlled Trial. clinicaltrials.gov; 2024 Apr. Report No.: NCT06263855. https://clinicaltrials.gov/study/NCT06263855.
  28. Zheng Y. Effectiveness of Using Interactive Consulting System Based on Large Language Model to Enhance Informed Choice of Cataract Patients: a Non-inferiority Randomized Controlled Trial. clinicaltrials.gov; 2023 Oct. Report No.: NCT04246346. https://clinicaltrials.gov/study/NCT04246346.
  29. Zheng Y. Efficacy of Using Large Language Model to Assist in Diabetic Retinopathy Detection. clinicaltrials.gov; 2024 Jan. Report No.: NCT05231174. https://clinicaltrials.gov/study/NCT05231174.
  30. Shalong W. Enhancement of Self-Directed Learning Through Custom GPT’s AI Facilitation Among Medical Students: An Open-label, Randomized Controlled Trial. clinicaltrials.gov; 2024 Feb. Report No.: NCT06276049. https://clinicaltrials.gov/study/NCT06276049.
  31. Turan EI. Evaluation of the Success of ChatGPT-4 in Predicting Postoperative Intensive Care Needs and Mortality: Prospective Observational Study. clinicaltrials.gov; 2024 Mar. Report No.: NCT06321328. https://clinicaltrials.gov/study/NCT06321328.
  32. National Taiwan University Hospital. Generating Fast and Slow for Entree Level Medical Knowledge. clinicaltrials.gov; 2024 Feb. Report No.: NCT06247475. https://clinicaltrials.gov/study/NCT06247475.
  33. German Clinical Trials Register. [cited 2024 May 5]. https://drks.de/search/en/trial/DRKS00032895.
  34. ICTRP Search Portal. [cited 2024 May 5]. https://trialsearch.who.int/Trial2.aspx?TrialID=NCT06346496.
  35. ICTRP Search Portal. [cited 2024 May 5]. https://trialsearch.who.int/Trial2.aspx?TrialID=ChiCTR2400081938.
  36. ICTRP Search Portal. [cited 2024 May 5]. https://trialsearch.who.int/Trial2.aspx?TrialID=DRKS00033775.
  37. ICTRP Search Portal. [cited 2024 May 5]. https://trialsearch.who.int/Trial2.aspx?TrialID=ChiCTR2300078274.
  38. ICTRP Search Portal. [cited 2024 May 5]. https://trialsearch.who.int/Trial2.aspx?TrialID=ChiCTR2300071774.
  39. ICTRP Search Portal. [cited 2024 May 5]. https://trialsearch.who.int/Trial2.aspx?TrialID=JPRN-UMIN000050398.
  40. Chen J. Management Reasoning With AI Chat Bots. clinicaltrials.gov; 2024 Feb. Report No.: NCT06208423. https://clinicaltrials.gov/study/NCT06208423.
  41. Patel P. Real World Utility of ChatGPT in Pre-vasectomy Counselling in an Office-based Setting. clinicaltrials.gov; 2023 Aug. Report No.: NCT06009783. https://clinicaltrials.gov/study/NCT06009783.
  42. Boston Intelligent Medical Research Center, Shenzhen United Scheme Technology Co., Ltd. Using Natural Language Processing Models for Writing Preoperative Visit Sheets: a Preliminary Study Comparing ChatGPT and Clinicians. clinicaltrials.gov; 2023 Jul. Report No.: NCT05945004. https://clinicaltrials.gov/study/NCT05945004.
  43. Schroter S, Price A, Malički M, Richards T, Clarke M. Frequency and format of clinical trial results dissemination to patients: a survey of authors of trials indexed in PubMed. BMJ Open. 2019;9:e032701. pmid:31636111
  44. Theodos K, Sittig S. Health Information Privacy Laws in the Digital Age: HIPAA Doesn’t Apply. Perspect Health Inf Manag. 2020;18:1l. pmid:33633522
  45. Edemekong PF, Annamaraju P, Haydel MJ. Health Insurance Portability and Accountability Act. StatPearls. Treasure Island (FL): StatPearls Publishing; 2024. http://www.ncbi.nlm.nih.gov/books/NBK500019/.
  46. Meyer A, Riese J, Streichert T. Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study. JMIR Med Educ. 2024;10:e50965. pmid:38329802
  47. Moshirfar M, Altaf AW, Stoakes IM, Tuttle JJ, Hoopes PC. Artificial Intelligence in Ophthalmology: A Comparative Analysis of GPT-3.5, GPT-4, and Human Expertise in Answering StatPearls Questions. Cureus. 2023;15:e40822. pmid:37485215
  48. Katz U, Cohen E, Shachar E, Somer J, Fink A, Morse E, et al. GPT versus Resident Physicians—A Benchmark Based on Official Board Scores. NEJM AI. 2024;1:AIdbp2300192.
  49. Rizzo MG, Cai N, Constantinescu D. The performance of ChatGPT on orthopaedic in-service training exams: A comparative study of the GPT-3.5 turbo and GPT-4 models in orthopaedic education. J Orthop. 2024;50:70–75. pmid:38173829
  50. Li D, Gupta K, Bhaduri M, Sathiadoss P, Bhatnagar S, Chong J. Comparing GPT-3.5 and GPT-4 Accuracy and Drift in Radiology Diagnosis Please Cases. Radiology. 2024;310:e232411. pmid:38226874
  51. Jeyaraman M, Ramasubramanian S, Balaji S, Jeyaraman N, Nallakumarasamy A, Sharma S. ChatGPT in action: Harnessing artificial intelligence potential and addressing ethical challenges in medicine, education, and scientific research. World J Methodol. 2023;13:170–178. pmid:37771867
  52. Hirano Y, Hanaoka S, Nakao T, Miki S, Kikuchi T, Nakamura Y, et al. No improvement found with GPT-4o: results of additional experiments in the Japan Diagnostic Radiology Board Examination. Jpn J Radiol. 2024. pmid:38937409
  53. Sonoda Y, Kurokawa R, Nakamura Y, Kanzawa J, Kurokawa M, Ohizumi Y, et al. Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases. Jpn J Radiol. 2024. pmid:38954192
  53. 53. Sonoda Y, Kurokawa R, Nakamura Y, Kanzawa J, Kurokawa M, Ohizumi Y, et al. Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases. Jpn J Radiol. 2024. pmid:38954192