ChatGPT could serve as an academic reference tool for early-career radiologists and researchers, suggest findings published on 10 December in Current Problems in Diagnostic Radiology.
The large language model (LLM) for the most part recommended appropriate machine learning and deep learning algorithms for various radiology tasks, including segmentation, classification, and regression in medical imaging, wrote researchers led by Dr. Dania Daye, PhD, from Massachusetts General Hospital in Boston. However, the model showed mixed results in recommending diverse models and in selecting a gold standard baseline.
“Its ability to bridge the knowledge gap in AI implementation could democratize access to advanced technologies, fostering innovation and improving radiology research quality,” Daye and colleagues wrote.
OpenAI released GPT-4o in May 2024. This iteration of ChatGPT analyzes and generates responses to audio, video, and text prompts. Radiologists continue to explore the potential of LLMs to aid their workflows. And since many have limited training in applying machine learning and deep learning algorithms, the researchers suggested that LLMs could serve as virtual advisers, guiding researchers toward the most appropriate AI models for their studies.
The Daye team evaluated GPT-4o’s performance in recommending appropriate AI implementations for radiology research. The LLM recommended algorithms based on specific details provided by the researchers, including dataset characteristics, modality types, data sizes, and research objectives.
The researchers prompted GPT-4o 30 times with different tasks, imaging modalities, targets, and dataset sizes. They noted that these covered the most common use cases in medical AI.
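The study’s exact prompt wording is not reproduced in this article, but a minimal sketch of how such a structured query might be assembled and sent to GPT-4o, assuming the OpenAI Python SDK and hypothetical field values, looks like this:

```python
# Hypothetical sketch: querying GPT-4o for an AI-model recommendation.
# The fields below (task, modality, target, dataset size) mirror the study
# variables described in the article; the prompt wording is assumed, not
# the authors' actual template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

study = {
    "task": "segmentation",
    "modality": "MRI",
    "target": "brain tumor",
    "dataset_size": "500 annotated exams",
}

prompt = (
    "I am planning a radiology AI study.\n"
    f"Task: {study['task']}\n"
    f"Imaging modality: {study['modality']}\n"
    f"Target: {study['target']}\n"
    f"Dataset size: {study['dataset_size']}\n"
    "Which machine learning or deep learning models would you recommend, "
    "and what gold standard baseline should I compare against?"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```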
Four graders rated LLM responses based on criteria such as response clarity, alignment with the specified task, model diversity in recommendations, and the selection of an appropriate gold standard baseline.
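The article does not detail the graders’ rating scale or scoring sheet. As a rough, hypothetical illustration of how per-criterion “appropriate” percentages like those reported below could be aggregated across graders:

```python
# Hypothetical illustration: averaging graders' per-response ratings into a
# per-criterion "appropriate" percentage. The ratings, scale, and number of
# responses are invented for demonstration; they are not the study's data.
from statistics import mean

CRITERIA = ["clarity", "task_alignment", "model_diversity", "gold_standard"]

# ratings[grader][criterion] -> list of labels, one per graded response
# (three responses per grader here to keep the example short)
ratings = {
    "grader_1": {c: ["appropriate", "appropriate", "partially inappropriate"]
                 for c in CRITERIA},
    "grader_2": {c: ["appropriate", "wholly inappropriate", "appropriate"]
                 for c in CRITERIA},
}

def pct_appropriate(labels):
    """Percentage of responses a grader labeled 'appropriate'."""
    return 100 * sum(label == "appropriate" for label in labels) / len(labels)

for criterion in CRITERIA:
    per_grader = [pct_appropriate(g[criterion]) for g in ratings.values()]
    print(f"{criterion}: {mean(per_grader):.0f}% rated appropriate on average")
```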
While GPT-4o mostly generated clear responses that aligned with researchers’ tasks, it struggled to diversify its algorithm suggestions or pick a gold standard approach.
The researchers highlighted the following findings:
- Graders rated an average of 83% of GPT-4o responses as clear.
- Graders rated 79% of responses as appropriate for aligning with research tasks.
- Graders rated 59% of responses as appropriate for AI model diversity and 54% as appropriate for gold standard selection.
- Across the four criteria, GPT-4o generated wholly inappropriate responses at average rates of 4.2% for response clarity, 5.8% for task alignment, 4.2% for model diversity, and 16% for gold standard selection.
The study authors highlighted that LLMs show promise as a support tool for radiologists and medical researchers starting work with AI algorithms. However, they cautioned that researchers should be wary of their limitations in model diversity and gold standard selection.
“By understanding these strengths and weaknesses, the medical research community can better leverage GPT-4o and similar tools to enhance AI-driven research in radiology,” the authors wrote.