Breaking News

Op-ed: How well can AI chatbots mimic doctors in a treatment setting? We put 5 to the test

Posted on July 18, 2024 By Admin


Dr. Scott Gottlieb is a physician and served as the 23rd Commissioner of the U.S. Food and Drug Administration. He is a CNBC contributor and serves on the boards of Pfizer and several startups in health and tech. He is also a partner at the venture capital firm New Enterprise Associates. Shani Benezra is a senior research associate at the American Enterprise Institute and a former associate producer at CBS News’ Face the Nation.

Many consumers and medical providers are turning to chatbots, powered by large language models, to answer medical questions and inform treatment choices. We decided to see whether there were major differences between the leading platforms when it came to their clinical aptitude.

To secure a medical license in the United States, aspiring doctors must successfully navigate three stages of the U.S. Medical Licensing Examination (USMLE), with the third and final installment widely regarded as the most challenging. It requires candidates to answer about 60% of the questions correctly, and historically, the average passing score hovered around 75%.

When we subjected the major large language models (LLMs) to the same Step 3 examination, their performance was markedly superior, with scores that significantly outpaced those of many doctors.

But there were some clear differences between the models.

Typically taken after the first year of residency, the USMLE Step 3 gauges whether medical graduates can apply their understanding of clinical science to the unsupervised practice of medicine. It assesses a new doctor’s ability to manage patient care across a broad range of medical disciplines and includes both multiple-choice questions and computer-based case simulations.

We isolated 50 questions from the 2023 USMLE Step 3 sample test to evaluate the clinical proficiency of five different leading large language models, feeding the same set of questions to each of these platforms — ChatGPT, Claude, Google Gemini, Grok and Llama.
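A head-to-head comparison of this kind comes down to grading each platform's chosen answers against one shared answer key. The sketch below is only an illustration of that idea, not the procedure the authors used: the `grade` helper, the three-question key, and the model names are all hypothetical placeholders.

```python
def grade(answer_key, model_answers):
    """Count the questions on which a model's choice matches the key."""
    return sum(
        1
        for question_id, correct_choice in answer_key.items()
        # A missing answer counts as wrong; case and stray spaces are ignored.
        if model_answers.get(question_id, "").strip().upper() == correct_choice
    )


# Hypothetical three-question key and recorded responses (not exam content).
answer_key = {1: "B", 2: "D", 3: "A"}
responses = {
    "model_a": {1: "B", 2: "D", 3: "C"},
    "model_b": {1: "b", 2: "A"},  # question 3 left unanswered
}

# Every model is graded against the same key, so the counts are comparable.
scores = {name: grade(answer_key, answers) for name, answers in responses.items()}
print(scores)
```

Grading every model against the same key is what makes the resulting counts directly comparable across platforms.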

Other studies have gauged these models for their medical proficiency, but to our knowledge, this is the first time these five leading platforms have been compared in a head-to-head evaluation. These results could give consumers and providers some insights on where they should be turning.

Here’s how they scored:

  • ChatGPT-4o (OpenAI): 49/50 questions correct (98%)
  • Claude 3.5 (Anthropic): 45/50 (90%)
  • Gemini Advanced (Google): 43/50 (86%)
  • Grok (xAI): 42/50 (84%)
  • HuggingChat (Llama): 33/50 (66%)
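The percentages above follow directly from the counts out of 50. As a rough check, the sketch below recomputes them and compares each with the roughly 60% passing bar mentioned earlier. The counts are taken from the list. The `summarize` helper and the exact 0.60 cutoff are assumptions made for illustration, since the article gives the bar only as "about 60%" and a 50-question sample is not a full exam.

```python
TOTAL_QUESTIONS = 50
PASS_THRESHOLD = 0.60  # assumption: the article says "about 60%"

# Correct answers out of 50, as reported in the list above.
correct_counts = {
    "ChatGPT-4o": 49,
    "Claude 3.5": 45,
    "Gemini Advanced": 43,
    "Grok": 42,
    "HuggingChat (Llama)": 33,
}


def summarize(counts, total=TOTAL_QUESTIONS, threshold=PASS_THRESHOLD):
    """Return (name, percent, clears_threshold) tuples, best score first."""
    rows = []
    for name, correct in counts.items():
        percent = round(100 * correct / total)
        rows.append((name, percent, correct / total >= threshold))
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, percent, clears in summarize(correct_counts):
    status = "above" if clears else "below"
    print(f"{name}: {percent}% ({status} the ~60% bar)")
```

On these numbers, all five platforms land above the approximate bar, with the lowest score (66%) clearing it by the narrowest margin.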

In our experiment, OpenAI’s ChatGPT-4o emerged as the top performer, achieving a score of 98%. It provided detailed medical analyses, employing language reminiscent of a medical professional. It not only delivered answers with extensive reasoning, but also contextualized its decision-making process, explaining why alternative answers were less suitable.

Claude, from Anthropic, came in second with a score of 90%. It provided more human-like responses with simpler language and a bullet-point structure that might be more approachable to patients. Gemini, which scored 86%, gave answers that weren’t as thorough as those from ChatGPT or Claude, making its reasoning harder to decipher, but its answers were succinct and straightforward.

Grok, the chatbot from Elon Musk’s xAI, scored a respectable 84% but didn’t provide descriptive reasoning during our analysis, making it hard to understand how it arrived at its answers. While HuggingChat — an open-source website built from Meta’s Llama — scored the lowest at 66%, it nonetheless showed good reasoning for the questions it answered correctly, providing concise responses and links to sources.

One question that most of the models got wrong related to a 75-year-old woman with a hypothetical heart condition. The question asked the physicians which was the most appropriate next step as part of her evaluation. Claude was the only model that generated the correct answer.

Another notable question focused on a 20-year-old male patient presenting with symptoms of a sexually transmitted infection. It asked physicians which of five choices was the appropriate next step as part of his workup. ChatGPT correctly determined that the patient should be scheduled for HIV serology testing in three months, but the model went further, recommending a follow-up examination in one week to ensure that the patient’s symptoms had resolved and that the antibiotics covered his strain of infection. To us, the response highlighted the model’s capacity for broader reasoning, expanding beyond the binary choices presented by the exam.

These models weren’t designed for medical reasoning; they’re products of the consumer technology sector, crafted to perform tasks like language translation and content generation. Despite their non-medical origins, they’ve shown a surprising aptitude for clinical reasoning.

Newer platforms are being purposely built to solve medical problems. Google recently introduced Med-Gemini, a refined version of its previous Gemini models that’s fine-tuned for medical applications and equipped with web-based searching capabilities to enhance clinical reasoning.

As these models evolve, their skill in analyzing complex medical data, diagnosing conditions and recommending treatments will sharpen. They may offer a level of precision and consistency that human providers, constrained by fatigue and error, might sometimes struggle to match, and open the way to a future where treatment portals can be powered by machines, rather than doctors.



