DIAGNOSTIC AND THERAPEUTIC ACCURACY OF LARGE LANGUAGE MODELS IN EMERGENCY DIGESTIVE SURGERY: A PRELIMINARY EVALUATION USING THE ARTIFICIAL INTELLIGENCE PERFORMANCE INSTRUMENT (AIPI) SCORING SYSTEM

Slot ID

366-07

Abstract Title

Author Details

No. of Authors

4

Including the presenting author

Author 1

Marie Desonniaux marie.desonniaux@skynet.be University of Liège Faculty of Medicine Liège Belgium *

Author 2

Tommaso Federico Coppola coppolafederico@live.it Humanitas Gavazzeni University Hospital 2. Department of Minimally Invasive General and Oncologic Surgery Bergamo Italy

Author 3

Jerome R. Lechien Jerome.LECHIEN@umons.ac.be University of Mons Department of Surgery, Faculty of Medicine Mons Belgium

Author 4

Giovanni Dapri giovanni@dapri.net Humanitas Gavazzeni University Hospital 2. Department of Minimally Invasive General and Oncologic Surgery Bergamo Italy

Presenting Author First Name

Marie

Presenting Author Last Name

Desonniaux

Presenting Author Email

marie.desonniaux@skynet.be

Presenting Author Country

Belgium

Abstract

Abstract type

Oral or Poster

Introduction *

Large Language Models (LLMs), including ChatGPT-4, Claude Sonnet 4, and DeepSeek, have gained interest as decision-support tools in medical education. Their performance in high-acuity contexts such as emergency digestive surgery, however, remains under-evaluated. This study presents a preliminary evaluation of the diagnostic and therapeutic performance of these three LLMs in emergency digestive surgery using the Artificial Intelligence Performance Instrument (AIPI), and aims to determine their potential to assist junior surgical trainees in diagnostic reasoning and management planning.

Material & Method *

Data from twenty emergency digestive surgery cases were collected prospectively between May and July 2025. Each case was submitted separately to ChatGPT-4, Claude Sonnet 4, and DeepSeek using identical standardized prompts. Responses were scored independently using the AIPI across four dimensions: primary diagnosis, differential diagnosis, investigations, and treatment. An independent blinded evaluation by expert surgeons is currently being conducted for comparative analysis.

Results *

All three models correctly identified the primary diagnosis in 14/20 cases (70%). Claude Sonnet 4 and ChatGPT-4 achieved 70% accuracy in differential diagnosis, while DeepSeek scored slightly higher at 75%. For complementary investigations, Claude Sonnet 4 scored highest (90%), followed by ChatGPT-4 (85%) and DeepSeek (80%). Treatment recommendations were appropriate in 85% of cases for DeepSeek and in 80% for the other two models.

Conclusion *

This preliminary evaluation suggests that LLMs provide clinically relevant support in emergency surgical decision-making. Their consistent performance across diagnostic and therapeutic tasks indicates potential value in medical education. Ongoing expert validation will clarify their role in complex cases.

International Society of Surgery (ISS)

Société Internationale de Chirurgie (SIC)
