Paper-to-Podcast

Paper Summary

Title: Putting ChatGPT’s Medical Advice to the (Turing) Test: Survey Study


Source: JMIR Medical Education


Authors: Oded Nov et al.


Published Date: 2023-01-01





Podcast Transcript

Hello, and welcome to paper-to-podcast. I have read 100 percent of the paper we're discussing today, which is titled "Putting ChatGPT’s Medical Advice to the (Turing) Test: Survey Study" published in JMIR Medical Education by Oded Nov and colleagues.

Now, if you've ever wondered whether you could be fooled by a robot doctor, you're not alone. The authors of this paper put an artificial intelligence chatbot, ChatGPT, to the test by disguising its responses as those of a human healthcare provider. And the results, my friends, were surprising, slightly hilarious, and a bit unsettling.

The 392 people surveyed correctly identified the chatbot responses only about 65.5% of the time, and the human responses about 65.1% of the time. That's right, folks, it's almost a coin toss! But here's the catch: despite this confusion, people still trusted the chatbot to answer lower-risk health questions, with a lukewarm trust score of 3.4 out of 5. As the complexity of the health-related task increased, however, trust in our robotic friends dipped. It seems we're okay with a chatbot telling us to stay hydrated, but when it comes to the serious stuff, we prefer a human touch.

Now, the method behind this madness was simple. The researchers took 10 random patient-doctor interactions, removed all patient info, and had ChatGPT answer the same questions. They built a survey in which each question was followed by either the doctor's response or the chatbot's response, and participants had to guess who wrote it. They were told that five answers were from a doctor and five were from a chatbot. The order of questions and answers was randomized to keep things fair, and participants were given incentives for correct guesses.

The strengths of this study lie in its relevance to the emerging field of AI-based chatbots in healthcare and a solid experimental design that promotes independent decision-making by participants. However, the study wasn't without its limitations. For instance, the chatbot wasn't trained specifically on medical data, and there was no specialized prompting of ChatGPT to make it more empathetic. Also, the survey format might have introduced some bias, as participants knew beforehand that half the responses came from a doctor and half from a chatbot.

Despite these limitations, the potential applications of this research are exciting. The advent of AI chatbots like ChatGPT could potentially ease the burden on healthcare providers by automating responses to patient queries, particularly for managing common chronic diseases. But we need to remember that, as promising as these applications are, they require careful implementation and continuous monitoring. So, until we get there, remember: a chatbot might tell you to drink more water, but for that broken leg, you'll want a human doc.

You can find this paper and more on the paper2podcast.com website.

Supporting Analysis

Findings:
This study put a chatbot to the test, disguising its responses as those of a healthcare provider. The results were both surprising and slightly hilarious. The 392 people who took the survey correctly identified the chatbot responses only 65.5% of the time, and the human responses only 65.1% of the time; participants struggled about equally whether they were reading a bot or a human. But here's the kicker: even though people couldn't reliably tell whether they were chatting with a chatbot, they still trusted it to answer lower-risk health questions. The trust score averaged out to a lukewarm 3.4 out of 5, and trust dipped as the health-related complexity of the task increased. So, while people might not mind a chatbot telling them to drink more water, they'd rather leave the heavy medical advice to the human experts. It's like saying, "You're cool, Chatbot, but not 'performing-my-surgery' cool."
Methods:
Let's break this down like a high school science experiment! The researchers wanted to test whether people could tell the difference between responses to medical questions from a real doctor and an artificial intelligence (AI) chatbot, ChatGPT. They picked 10 random, non-administrative patient-doctor interactions from an electronic health record, removed all the patient info, and had ChatGPT answer the same questions the patients had asked their doctors, keeping the chatbot's answers about as long as the doctors'. The researchers then created a survey in which each patient question was followed by either the doctor's response or the chatbot's response, and participants had to guess who wrote it. Participants were told that five answers were from a doctor and five were from a chatbot, and the order of questions and answers was randomized to keep things fair. Participants were paid to take the survey and got a bonus for each correct guess. They were also asked to rate, on a scale of 1 to 5, how much they trusted chatbots in patient-doctor communication.
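For readers who want to see the mechanics, here is a minimal sketch, in Python, of how a blinded survey like the one described above could be assembled and scored. This is an illustration of the study design as summarized here, not the authors' actual survey software; the function names, the random seed, and the per-participant data structures are all assumptions.

import random

def build_survey(questions, doctor_answers, chatbot_answers, seed=0):
    # Pair each of the 10 patient questions with either the doctor's or the
    # chatbot's answer, keeping the exact 5/5 split participants were told
    # about, then shuffle the presentation order.
    assert len(questions) == len(doctor_answers) == len(chatbot_answers) == 10
    rng = random.Random(seed)
    chatbot_slots = set(rng.sample(range(10), k=5))  # 5 questions get the chatbot answer
    items = []
    for i, question in enumerate(questions):
        source = "chatbot" if i in chatbot_slots else "doctor"
        answer = chatbot_answers[i] if source == "chatbot" else doctor_answers[i]
        items.append({"question": question, "answer": answer, "source": source})
    rng.shuffle(items)  # randomize order so position gives nothing away
    return items

def score_participant(items, guesses, trust_ratings):
    # guesses: one "doctor"/"chatbot" label per item; trust_ratings: 1-5 Likert scores.
    correct = {"doctor": 0, "chatbot": 0}
    totals = {"doctor": 0, "chatbot": 0}
    for item, guess in zip(items, guesses):
        totals[item["source"]] += 1
        if guess == item["source"]:
            correct[item["source"]] += 1
    accuracy = {source: correct[source] / totals[source] for source in totals}
    mean_trust = sum(trust_ratings) / len(trust_ratings)
    return accuracy, mean_trust

Averaging the per-source accuracies across all respondents is what would produce identification rates like the 65.5% (chatbot) and 65.1% (human) figures reported above, and averaging the Likert ratings gives the overall trust score.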
Strengths:
The most compelling aspect of this research is its highly relevant exploration of the emerging field of artificial intelligence (AI)-based chatbots in healthcare. The study is particularly interesting because it examines how well patients can distinguish between a response from a human provider and one from a chatbot, as well as how much they trust the latter. Additionally, the researchers' use of a sizable and diverse sample of participants, recruited from a crowdsourcing platform, adds credibility and generalizability to the findings. The researchers also followed several best practices. They removed identifying details from the patient-provider interactions for privacy and ethical reasons. The equal split of AI and human responses, with participants informed of it, is a solid experimental design choice rooted in Fisher's seminal work; it promotes independent judgments by participants and avoids nudging them toward one source or the other. Financial incentives for correctly identifying the source of responses likely improved the accuracy of the results. Finally, the use of a Likert scale to measure trust in chatbots is a reliable way to gather subjective data.
Limitations:
This research has a couple of "oh, snap" moments. First, the chatbot used (ChatGPT) wasn't trained specifically on medical data, so it might be a tad inferior to medically trained chatbots like the one with the cool name, Med-PaLM. There was also no specialized prompting of ChatGPT to be more empathetic, which might have made its answers sound more human and made patients more willing to accept them. Second, the study didn't account for individual style. Responses from both the human providers and the chatbot were short and impersonal, but what if they weren't? Would that have affected people's ability to tell them apart? Third, the web-based survey might have introduced some bias: participants knew beforehand that five answers were from a human and five were from a chatbot, and that prior knowledge could have influenced their guesses. The study also doesn't mention whether participants had any background in healthcare, which could affect their ability to distinguish between responses. It's like knowing the end of a movie before watching it; it kind of takes away the suspense!
Applications:
This research could potentially revolutionize the healthcare sector, especially in patient-provider communication. With the advent of AI chatbots like ChatGPT, there's a possibility of easing the burden on healthcare providers by automating responses to patient queries. This could be particularly useful for managing common chronic diseases like diabetes, asthma, or high blood pressure, especially when the questions are lower risk. Additionally, since patients showed moderate trust in chatbot responses to lower-risk questions, these tools could also be used to provide basic health advice. However, it's important to note that while these applications are promising, they require careful implementation and continuous monitoring to ensure that the advice given is accurate and beneficial to patients. After all, we wouldn't want our chatbot doctors prescribing chicken soup for a broken leg!