arXiv Open Access 2025

Context Matters: Comparison of commercial large language tools in veterinary medicine

Tyler J Poore, Christopher J Pinard, Aleena Shabbir, Andrew Lagree, Andre Telfer, +1 more

Abstract

Large language models (LLMs) are increasingly used in clinical settings, yet their performance in veterinary medicine remains underexplored. We evaluated three commercially available veterinary-focused LLM summarization tools (Product 1 [Hachiko] and Products 2 and 3) on a standardized dataset of veterinary oncology records. Using a rubric-guided LLM-as-a-judge framework, summaries were scored across five domains: Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization. Product 1 achieved the highest overall performance, with a median average score of 4.61 (IQR: 0.73), compared to 2.55 (IQR: 0.78) for Product 2 and 2.45 (IQR: 0.92) for Product 3. It also received perfect median scores in Factual Accuracy and Chronological Order. To assess the internal consistency of the grading framework itself, we repeated the evaluation across three independent runs. The LLM grader demonstrated high reproducibility, with Average Score standard deviations of 0.015 (Product 1), 0.088 (Product 2), and 0.034 (Product 3). These findings highlight the importance of veterinary-specific commercial LLM tools and demonstrate that LLM-as-a-judge evaluation is a scalable and reproducible method for assessing clinical NLP summarization in veterinary medicine.
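The summary statistics reported in the abstract (a per-summary average over the five rubric domains, the median and IQR of those averages, and the spread across repeated grading runs) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 1-5 score scale, the example scores, the quartile method, and the use of a sample standard deviation are all assumptions, since the abstract does not specify them.

```python
from statistics import mean, median, quantiles, stdev

# The five rubric domains named in the abstract.
DOMAINS = (
    "Factual Accuracy",
    "Completeness",
    "Chronological Order",
    "Clinical Relevance",
    "Organization",
)


def average_score(domain_scores):
    """Mean of the five domain scores for a single summary."""
    return mean(domain_scores[d] for d in DOMAINS)


def median_and_iqr(values):
    """Median and interquartile range (Q3 - Q1) of a list of scores.

    Assumption: quartiles use the 'inclusive' method; the paper may differ.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


def run_to_run_sd(run_values):
    """Sample standard deviation of one statistic across grading runs.

    Assumption: sample (n - 1) standard deviation; the paper may differ.
    """
    return stdev(run_values)


# Hypothetical rubric scores (assumed 1-5 scale) for three summaries.
example = [
    dict(zip(DOMAINS, (5, 4, 5, 4, 5))),
    dict(zip(DOMAINS, (5, 5, 5, 4, 4))),
    dict(zip(DOMAINS, (4, 3, 5, 4, 4))),
]
averages = [average_score(s) for s in example]
med, iqr = median_and_iqr(averages)
print(f"median average score: {med:.2f} (IQR: {iqr:.2f})")
```

Calling `run_to_run_sd` on the same statistic collected from three independent grading runs gives a reproducibility figure of the kind the abstract reports; a value near zero means the grader scored the summaries almost identically each time.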

Topics & Keywords

cs.CL cs.AI

Authors (6)

Tyler J Poore

Christopher J Pinard

Aleena Shabbir

Andrew Lagree

Andre Telfer

Kuan-Chuen Wu

Citation Format

Poore, T. J., Pinard, C. J., Shabbir, A., Lagree, A., Telfer, A., & Wu, K.-C. (2025). Context Matters: Comparison of commercial large language tools in veterinary medicine. https://arxiv.org/abs/2510.01224

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓