The CARE Workshop on Robotics and AI in Medicine, held on December 1, 2025 in Indianapolis, convened leading researchers, clinicians, industry innovators, and federal stakeholders to shape a national vision for advancing robotics and artificial intelligence in healthcare. The event highlighted the accelerating need for coordinated research efforts that bridge engineering innovation with real clinical priorities, emphasizing safety, reliability, and translational readiness, and the role of robotics and AI in achieving that readiness. Across keynotes, panels, and breakout sessions, participants underscored critical gaps in data availability, standardized evaluation methods, regulatory pathways, and workforce training that hinder the deployment of intelligent robotic systems in surgical, diagnostic, rehabilitative, and assistive contexts. Discussions emphasized the transformative potential of AI-enabled robotics to improve precision, reduce provider burden, expand access to specialized care, and enhance patient outcomes, particularly in underserved regions and high-risk procedural domains. Special attention was given to austere, disaster relief, and military settings. The workshop demonstrated broad consensus on the urgency of establishing a national Center for AI and Robotic Excellence in medicine (CARE). Stakeholders identified priority research thrusts including human-robot collaboration, trustworthy autonomy, simulation and digital twins, multimodal sensing, and ethical integration of generative AI into clinical workflows. Participants also articulated the need for high-quality datasets, shared test beds, autonomous surgical systems, clinically grounded benchmarks, and sustained interdisciplinary training mechanisms.
Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM-3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. We evaluate diverse models, including international (e.g., GPT-4o), Chinese (e.g., InternLM), and medical-specific (e.g., PLUSE). Results show a performance hierarchy: all models have limitations in specialized subdomains like Meridian & Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs. Models with Chinese linguistic and cultural priors perform better in classical text interpretation and clinical reasoning. TCM-3CEval sets a standard for AI evaluation in TCM, offering insights for optimizing LLMs in culturally grounded medical domains. The benchmark is available on Medbench's TCM track, aiming to assess LLMs' TCM capabilities in basic knowledge, classic texts, and clinical decision-making through multidimensional questions and real cases.
Large Language Models (LLMs) are increasingly integrated into daily routines, yet they raise significant privacy and safety concerns. Recent research proposes collaborative inference, which outsources the early-layer inference to ensure data locality, and introduces model safety auditing based on inner neuron patterns. Both techniques expose the LLM's Internal States (ISs), which are traditionally considered irreversible to inputs due to optimization challenges and the highly abstract representations in deep layers. In this work, we challenge this assumption by proposing four inversion attacks that significantly improve the semantic similarity and token matching rate of inverted inputs. Specifically, we first develop two white-box optimization-based attacks tailored for low-depth and high-depth ISs. These attacks avoid convergence to local minima, a limitation observed in prior work, through a two-phase inversion process. Then, we extend our optimization attack under more practical black-box weight access by leveraging the transferability between the source and the derived LLMs. Additionally, we introduce a generation-based attack that treats inversion as a translation task, employing an inversion model to reconstruct inputs. Extensive evaluation of short and long prompts from medical consulting and coding assistance datasets and six LLMs validates the effectiveness of our inversion attacks. Notably, a 4,112-token long medical consulting prompt can be nearly perfectly inverted with 86.88 F1 token matching from the middle layer of the Llama-3 model. Finally, we evaluate four practical defenses that we found cannot perfectly prevent ISs inversion and draw conclusions for future mitigation design.
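The abstract above reports inversion quality as an F1 token-matching score but does not define the metric. The following is a minimal sketch of one common definition, a bag-of-tokens F1 between the original and the recovered prompt; the function name `token_f1` and the choice of an order-insensitive match are assumptions for illustration, not the authors' exact evaluation code.

```python
from collections import Counter


def token_f1(reference_tokens, recovered_tokens):
    """Bag-of-tokens F1 between an original prompt and its inverted reconstruction.

    Assumption: matching ignores token order and counts repeated tokens
    up to the number of times they occur in both sequences.
    """
    ref = Counter(reference_tokens)
    rec = Counter(recovered_tokens)
    overlap = sum((ref & rec).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(rec.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: three of four tokens recovered correctly.
score = token_f1(["the", "patient", "reports", "fever"],
                 ["the", "patient", "reports", "cough"])
print(f"{score:.2f}")
```

Under this definition a score of 86.88 (on a 0 to 100 scale) would mean that most tokens of the original prompt reappear in the reconstruction, though not necessarily in the same positions.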
Interdisciplinary research (IDR) is critical for innovation and addressing complex societal issues. We characterise the interdisciplinary knowledge structure of PubMed research articles in medicine as correlation networks of medical concepts and compare the interdisciplinarity of articles between high-ranking (impactful) and less high-ranking (less impactful) medical journals. We found that impactful medical journals tend to publish research that is less interdisciplinary than that published in less impactful journals. Observing that cancer-related concepts bridge distant knowledge clusters in the networks, we find that cancer-related research can be seen as one of the main drivers of interdisciplinarity in medical science. Using signed difference networks, we also investigate the clustering of deviations between high and low impact journal correlation networks. We generally find a mild tendency for strong link differences to be adjacent. Furthermore, we find topic clusters of deviations that shift over time. In contrast, topic clusters in the original networks are static over time and can be seen as the core knowledge structure in medicine. Overall, journals and policymakers should encourage initiatives to accommodate interdisciplinarity within the existing infrastructures to maximise the potential patient benefits from IDR.
Yasmine Aramon, Andrea Pieragostini, Pierre Jean Francin
et al.
Abstract The outcome of multiple myeloma (MM) has tremendously improved over the past decade, due to the development of efficient chemotherapy and, above all, immunotherapy. Yet, some patients still display poor responses and outcome. This could be in part related to the presence of peripheral plasma cells, at levels not compatible with a diagnosis of plasma‐cell leukaemia. Some recent publications have highlighted the prognostic influence of low levels of such cells, around 0.1%. Automated blood cell analysers now include fluorescent staining, which allows identification of cells with higher levels of nucleic acids, such as activated B‐lymphocytes/plasma cells with highly active ribosomal activity related to antibody production. Here, a prospective evaluation of peripheral high fluorescent lymphoid cell (HFLC) levels was carried out on samples from patients with newly diagnosed MM, and data were analysed with regard to patient evolution. HFLC above a 0.1% threshold were identified as strongly correlated with poorer response in a cohort of 127 patients. The 74 patients with low HFLC had a significantly better PFS both in univariate (p = 0.0017) and multivariate (p = 0.0007) analyses. This simple test provides a significant prognostic value for patients with MM and could eventually lead clinicians to consider more aggressive strategies for patients with peripheral HFLC above 0.1%. Trial Registration: The authors have confirmed clinical trial registration is not needed for this submission
Robert D Brook, Pia-Allison Roa, Phillip D Levy
et al.
Background: Fine particulate matter <2.5 µm (PM2.5) air pollution increases cardio-pulmonary morbidity and mortality. To inform the public regarding local air quality and reduce short-term exposures to PM2.5, daily Air Quality Indices (AQIs) with activity guidance are issued across the United States (US). We aimed to estimate the cardio-pulmonary health benefits of the current AQI strategy. Methods: We estimated the number needed to treat (NNT) per county per day to avoid 1 atherosclerotic cardiovascular disease (ASCVD) or pulmonary event during each PM2.5 AQI action day (defined as a day with an AQI >100 for PM2.5) for 3 groups (healthy adults, ASCVD and pulmonary disease patients) in the US during 2022. The theoretical number of adverse events potentially prevented per county was estimated each day as the relevant county population divided by the corresponding NNT. These results were summed for all AQI action days among US counties to equal the potential total number of ASCVD and pulmonary events avoided nationwide. Results: There were 1081 AQI action days for PM2.5 (58.0 ± 32.4 µg/m3) across 236 counties in 2022, accounting for an estimated excess of 164 ASCVD and 3963 pulmonary events. Approximately 31 ASCVD and 796 pulmonary events could be theoretically avoided if all populations followed activity guidelines during every AQI action day. Conclusions: Daily AQI activity guidelines, as currently structured and implemented, likely prevent few cardio-pulmonary events per year in the US. Additional research is warranted to determine how to improve the overall approach to more effectively protect public health.
Diseases of the circulatory (Cardiovascular) system, Public aspects of medicine
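The Methods of the AQI abstract above describe a simple aggregation: for each county and each action day, the number of events potentially prevented is the relevant population divided by the corresponding number needed to treat (NNT), and these values are summed nationwide. A minimal sketch of that arithmetic follows; the function name and the population and NNT values in the example are hypothetical and are not taken from the study.

```python
def events_avoided(action_days):
    """Sum population / NNT over every (county, action day) pair.

    action_days: iterable of (population, nnt) tuples, one per county per
    AQI action day. Both values here are illustrative assumptions.
    """
    return sum(population / nnt for population, nnt in action_days)


# Hypothetical inputs: two county-days with made-up populations and NNTs.
example = [(100_000, 50_000), (250_000, 125_000)]
print(events_avoided(example))

# Share of excess events that the abstract reports as theoretically avoidable.
print(f"ASCVD: {31 / 164:.1%}, pulmonary: {796 / 3963:.1%}")
```

Using only the totals reported in the Results, roughly one fifth of the estimated excess events in each category would be avoided under full adherence, which is consistent with the conclusion that the current strategy prevents few events.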
Background: To date, only limited data are available on right atrium (RA) morphofunctional remodeling in Fabry disease (FD). Purpose: We aimed to investigate RA structural and functional remodeling in patients with FD vs. healthy controls using 2D speckle tracking echocardiography (STE) and to explore whether any differences exist in FD patients with and without left ventricular hypertrophy (LVH). Methods: We prospectively enrolled patients with FD and controls matched for age, sex, and cardiovascular risk factors. Patients with FD were divided into two groups according to the presence/absence of LVH (LVH+: left ventricular wall thickness >12 mm). All patients underwent standard echocardiography and STE analysis investigating the mechanics of all cardiac chambers, including RA reservoir, contractile and conduit strain. Results: A total of 64 patients with FD (50% males; mean age 50 ± 17 years; 51.5% LVH+) and 64 control patients were included in the study. Focusing on right chambers, RA and right ventricular (RV) dimensions were similar between FD and controls. No differences were found for tricuspid annular plane systolic excursion (p = 0.073) and RV fractional area change (p = 0.461), while RV systolic Tissue Doppler velocity was reduced in patients with FD (p = 0.041). STE analysis revealed impaired strain values for all cardiac chambers in FD vs controls, specifically: left ventricular global longitudinal strain (LV-GLS, p < 0.001), left atrial (LA) reservoir strain (p = 0.001), conduit strain (p = 0.012), and contractile strain (p < 0.001), RV-GLS and RV free wall strain (p < 0.001). Similarly, all RA strain phases were significantly reduced in patients with FD compared with control patients (RA reservoir 27.4 ± 11.1 vs. 41.9 ± 8.3%, p < 0.001; RA contractile 9.9 ± 5.1 vs. 18.0 ± 4.9%, p < 0.001; RA conduit 19.1 ± 8.1 vs. 24.1 ± 8.1%, p = 0.001).
When comparing FD patients without LVH to controls, it was found that RA reservoir and contractile strains were significantly reduced in the former (p < 0.001). In multivariable linear regression analyses, LA reservoir strain (p = 0.010) and LV-GLS (p = 0.044) emerged as independent correlates of RA mechanics after adjustments were made for RA dimensions, RV systolic function parameters and hypertrophy, and LV maximal wall thickness. Conclusions: In FD, impaired RA strain is a common finding. RA reservoir and contractile strains are reduced in FD patients even before LVH ensues, as compared to controls. LA reservoir strain and LV-GLS show an independent correlation with RA reservoir strain.
Diseases of the circulatory (Cardiovascular) system
M. Camila Melo-Narvaez, Fee Gölitz, Eshita Jain
et al.
Abstract Background Human precision-cut lung slices (hPCLS) are a unique platform for functional, mechanistic, and drug discovery studies in the field of respiratory research. However, tissue availability, generation, and cultivation time represent important challenges for their usage. Therefore, the present study evaluated the efficacy of a specifically designed tissue preservation solution, TiProtec, complete or in the absence (-) of iron chelators, for long-term cold storage of hPCLS. Methods hPCLS were generated from peritumor control tissues and stored in DMEM/F-12, TiProtec, or TiProtec (-) for up to 28 days. Viability, metabolic activity, and tissue structure were determined. Moreover, bulk-RNA sequencing was used to study transcriptional changes, regulated signaling pathways, and cellular composition after cold storage. Induction of cold storage-associated senescence was determined by transcriptomics and immunofluorescence (IF). Finally, cold-stored hPCLS were exposed to a fibrotic cocktail and early fibrotic changes were assessed by RT-qPCR and IF. Results Here, we found that TiProtec preserves the viability, metabolic activity, transcriptional profile, as well as cellular composition of hPCLS for up to 14 days. Cold storage did not significantly induce cellular senescence in hPCLS. Moreover, TiProtec downregulated pathways associated with cell death, inflammation, and hypoxia while activating pathways protective against oxidative stress. Cold-stored hPCLS remained responsive to fibrotic stimuli and upregulated extracellular matrix-related genes such as fibronectin and collagen 1 as well as alpha-smooth muscle actin, a marker for myofibroblasts. Conclusions Optimized long-term cold storage of hPCLS preserves their viability, metabolic activity, transcriptional profile, and cellular composition for up to 14 days, specifically in TiProtec.
Finally, our study demonstrated that cold-stored hPCLS can be used for on-demand mechanistic studies relevant for respiratory research.
Silvia Palombella, Silvia Lopa, Camilla Recordati
et al.
Abstract Background Osteoarthritis is a common degenerative joint disease marked by cartilage degeneration and inflammation. This study investigates the therapeutic potential of adipose-derived stromal cells (ASCs) and their secretome in a rat model of osteoarthritis. Methods ASCs were extracted from human adipose tissue, cultured, and primed with human platelet lysate. The secretome was collected after 48 h of serum-free culture. Osteoarthritis was induced in rats using monosodium iodoacetate, and after 14 days, they were treated with saline solution, ASCs, or secretome. Over five weeks, body weight and histopathological changes were monitored. Results No clinical complications arose post-treatment, and all rats gained weight similarly. ASC treatment increased histopathological changes associated with osteoarthritis, including severe cartilage necrosis and bone remodeling. Conversely, the secretome treatment resulted in mild to moderate cartilage degeneration, similar to that observed in the control group. These findings suggest that ASCs may contribute to disease progression in this model, while the secretome did not show significant effects on cartilage histology compared to the control group. Further studies are needed to determine whether optimizing the secretome composition or dosing could enhance its therapeutic potential. Conclusions This study highlights the complexity of ASC interactions with the immune system. While the secretome may be a well-tolerated treatment, further studies are needed to determine its potential therapeutic benefits.
Extracting medication names from handwritten doctor prescriptions is challenging due to the wide variability in handwriting styles and prescription formats. This paper presents a robust method for extracting medicine names using a combination of Mask R-CNN and Transformer-based Optical Character Recognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A novel dataset, featuring diverse handwritten prescriptions from various regions of Pakistan, was utilized to fine-tune the model on different handwriting styles. The Mask R-CNN model segments the prescription images to focus on the medicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and Positional Embeddings, transcribes the isolated text. The transcribed text is then matched against a pre-existing database for accurate identification. The proposed approach achieved a character error rate (CER) of 1.4% on standard benchmarks, highlighting its potential as a reliable and efficient tool for automating medicine name extraction.
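The prescription abstract above reports a character error rate (CER) of 1.4%. As a reminder of what that figure measures, here is a minimal sketch assuming the standard definition: the Levenshtein edit distance between the transcription and the reference, divided by the reference length. The function names and the example strings are illustrative and do not come from the paper's dataset.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by reference length (assumed CER definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical OCR output that drops one letter from an 11-character name.
print(f"{character_error_rate('amoxicillin', 'amoxicilin'):.3f}")
```

A CER of 1.4% corresponds to roughly one character error per 70 reference characters, which is why a subsequent database lookup can still recover the intended medicine name in many cases.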
Introduction: This article is part of the Focus Theme of METHODS of Information in Medicine on "Managing Interoperability and Complexity in Health Systems". Background: The proliferation of archetypes as a means to represent information of Electronic Health Records has raised the need to bind terminological codes, such as SNOMED CT codes, to their elements in order to identify them univocally. However, the large size of the terminologies makes it difficult to perform this task manually. Objectives: To establish a baseline of results for the aforementioned problem by using off-the-shelf string comparison-based techniques against which results from more complex techniques could be evaluated. Methods: Nine Typed Comparison Methods were evaluated for binding using a set of 487 archetype elements. Their recall was calculated, and Friedman and Nemenyi tests were applied in order to assess whether any of the methods outperformed the others. Results: Using the qGrams method along with the 'Text' information piece of archetype elements outperforms the other methods if a level of confidence of 90% is considered. A recall of 25.26% is obtained if just one SNOMED CT term is retrieved for each archetype element. This recall rises to 50.51% and 75.56% if 10 and 100 elements are retrieved respectively, which represents a reduction of more than 99.99% of the SNOMED CT code set. Conclusions: The baseline has been established following the above-mentioned results. Moreover, it has been observed that although string comparison-based methods do not outperform more sophisticated techniques, they can still be an alternative for providing a reduced set of candidate terms for each archetype element, from which the final term can be chosen later in the more-than-likely manual supervision task.
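The archetype-binding abstract above ranks terminology terms by string similarity and reports recall when the top 1, 10, or 100 candidates are retrieved. The sketch below illustrates that pipeline under stated assumptions: the q-gram variant (padded character trigrams scored with a Dice-style overlap), the function names, and the candidate strings are all illustrative and are not the implementation or data used in the study.

```python
from collections import Counter


def qgrams(text, q=3):
    """Multiset of character q-grams of the lowercased, '#'-padded text."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_similarity(a, b, q=3):
    """Dice-style overlap of q-gram multisets, in [0, 1] (assumed variant)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    return 2 * sum((ga & gb).values()) / total


def top_k(query, candidates, k):
    """The k candidate terms most similar to the query text."""
    return sorted(candidates, key=lambda c: qgram_similarity(query, c), reverse=True)[:k]


def recall_at_k(queries, gold, candidates, k):
    """Fraction of queries whose gold term appears among the top k candidates."""
    hits = sum(1 for query in queries if gold[query] in top_k(query, candidates, k))
    return hits / len(queries)


# Illustrative strings only; a real run would rank the full terminology.
terms = ["Blood pressure (observable entity)", "Heart rate", "Body temperature"]
print(top_k("blood pressure", terms, k=1))
```

Retrieving a short ranked list rather than a single term matches the use the abstract suggests: the method narrows the terminology to a small candidate set, and a human reviewer picks the final binding.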
Thomas Savage, Stephen Ma, Abdessalem Boukil
et al.
Large Language Model (LLM) fine-tuning is underutilized in the field of medicine. Two of the most common methods of fine-tuning are Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), but there is little guidance informing users when to use either technique. In this investigation, we compare the performance of SFT and DPO for five common natural language tasks in medicine: Classification with text data, Classification with numeric data, Clinical Reasoning, Summarization, and Clinical Triage. We find that SFT alone is sufficient for Classification with text data, whereas DPO improves performance for the more complex tasks of Clinical Reasoning, Summarization and Clinical Triage. Our results establish the role and importance of DPO fine-tuning within medicine, and consequently call attention to current software gaps that prevent widespread deployment of this technique.
Generative artificial intelligence (AI) has brought revolutionary innovations in various fields, including medicine. However, it also exhibits limitations. In response, retrieval-augmented generation (RAG) provides a potential solution, enabling models to generate more accurate content by retrieving external knowledge. With the rapid advancement of generative AI, RAG can pave the way for connecting this transformative technology with medical applications and is expected to bring innovations in equity, reliability, and personalization to health care.
Daniel Schwabe, Katinka Becker, Martin Seyferth
et al.
The adoption of machine learning (ML) and, more specifically, deep learning (DL) applications into all major areas of our lives is underway. The development of trustworthy AI is especially important in medicine due to the large implications for patients' lives. While trustworthiness concerns various aspects including ethical, technical and privacy requirements, we focus on the importance of data quality (training/test) in DL. Since data quality dictates the behaviour of ML products, evaluating data quality will play a key part in the regulatory approval of medical AI products. We perform a systematic review following PRISMA guidelines using the databases PubMed and ACM Digital Library. We identify 2362 studies, out of which 62 records fulfil our eligibility criteria. From this literature, we synthesise the existing knowledge on data quality frameworks and combine it with the perspective of ML applications in medicine. As a result, we propose the METRIC-framework, a specialised data quality framework for medical training data comprising 15 awareness dimensions, along which developers of medical ML applications should investigate a dataset. This knowledge helps to reduce biases as a major source of unfairness, increase robustness, and facilitate interpretability, and thus lays the foundation for trustworthy AI in medicine. Incorporating such systematic assessment of medical datasets into regulatory approval processes has the potential to accelerate the approval of ML products and build the basis for new standards.
Abstract Background Ankylosing spondylitis (AS) is known to have an auto-inflammatory nature; hence, the efficacy of autoantibodies is low. However, studies on autoantibodies are ongoing, with some studies showing associations. Previous studies showed that anti-protein phosphatase magnesium-dependent 1A (PPM1A) IgG was increased in patients with AS and associated with radiographic progression. However, the diagnostic usefulness was limited due to relatively low sensitivity and specificity. This pilot study evaluated the diagnostic utility of anti-PPM1A-IgM and anti-PPM1A-IgG in patients with active AS. Methods Serum samples were obtained from the registry cohort of a single tertiary center in Korea. Serum levels of anti-PPM1A-IgG/IgM were measured by direct ELISA. Receiver operating characteristic (ROC) analysis was used to estimate the diagnostic sensitivity and specificity of serum anti-PPM1A-IgG/IgM. Results Samples were collected from 28 patients with active AS, 16 healthy controls (HCs), and 28 patients with rheumatoid arthritis (RA). Although total serum IgM was lower in the RA and AS groups than in the HC group, anti-PPM1A-IgM was significantly lower in the AS group than in the other groups. In evaluating the diagnostic utility of anti-PPM1A-IgG/IgM for AS patients compared with HCs, the area under the curve (AUC) of anti-PPM1A-IgM was 0.998 (sensitivity 96.4%, specificity 100.0%). When ROC analysis of anti-PPM1A-IgM for AS patients compared with RA patients was conducted, sensitivity was 78.6% and specificity was 71.4%, with an AUC of 0.839. Conclusion Decreased anti-PPM1A-IgM levels in AS patients suggest a potential role for anti-PPM1A-IgM in the diagnosis of active AS.
Diseases of the musculoskeletal system, Immunologic diseases. Allergy
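The ankylosing spondylitis abstract above summarises a biomarker with sensitivity, specificity, and AUC, where lower anti-PPM1A-IgM levels point toward disease. The sketch below shows how these three quantities can be computed for such a "lower is positive" marker. The function names, the threshold, and the measurement values are hypothetical; the study's raw data and cut-off are not given in the abstract.

```python
def sensitivity_specificity(case_values, control_values, threshold):
    """Treat a value below the threshold as a positive test.

    Assumption: lower marker levels indicate disease, as the abstract
    reports for anti-PPM1A-IgM in active AS.
    """
    true_positives = sum(1 for v in case_values if v < threshold)
    true_negatives = sum(1 for v in control_values if v >= threshold)
    return true_positives / len(case_values), true_negatives / len(control_values)


def auc_lower_is_positive(case_values, control_values):
    """AUC as the probability that a random case scores lower than a random control.

    Ties count as half, which matches the area under the empirical ROC curve.
    """
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case < control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Hypothetical optical-density readings, not data from the study.
cases = [0.2, 0.3, 0.4, 0.9]
controls = [0.8, 1.0, 1.2]
print(sensitivity_specificity(cases, controls, threshold=0.5))
print(round(auc_lower_is_positive(cases, controls), 3))
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarises discrimination across all thresholds, which is why the abstract reports both.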
Abstract Background There is increasing evidence that myosteatosis, which is currently not assessed in clinical routine, plays an important role in risk estimation in individuals with impaired glucose metabolism, as it is associated with the progression of insulin resistance. With advances in artificial intelligence, automated and accurate algorithms have become feasible to fill this gap. Methods In this retrospective study, we developed and tested a fully automated deep learning model using data from two prospective cohort studies (German National Cohort [NAKO] and Cooperative Health Research in the Region of Augsburg [KORA]) to quantify myosteatosis on whole‐body T1‐weighted Dixon magnetic resonance imaging as (1) intramuscular adipose tissue (IMAT; the current standard) and (2) quantitative skeletal muscle (SM) fat fraction (SMFF). Subsequently, we investigated the two measures for their discrimination of and association with impaired glucose metabolism beyond baseline demographics (age, sex and body mass index [BMI]) and cardiometabolic risk factors (lipid panel, systolic blood pressure, smoking status and alcohol consumption) in asymptomatic individuals from the KORA study. Impaired glucose metabolism was defined as impaired fasting glucose or impaired glucose tolerance (140–200 mg/dL) or prevalent diabetes mellitus. Results Model performance was high, with Dice coefficients of ≥0.81 for IMAT and ≥0.91 for SM in the internal (NAKO) and external (KORA) testing sets. In the target population (380 KORA participants: mean age of 53.6 ± 9.2 years, BMI of 28.2 ± 4.9 kg/m2, 57.4% male), individuals with impaired glucose metabolism (n = 146; 38.4%) were older and more likely to be men and showed a higher cardiometabolic risk profile, higher IMAT (4.5 ± 2.2% vs. 3.9 ± 1.7%) and higher SMFF (22.0 ± 4.7% vs. 18.9 ± 3.9%) compared to normoglycaemic controls (all P ≤ 0.005).
SMFF showed better discrimination for impaired glucose metabolism than IMAT (area under the receiver operating characteristic curve [AUC] 0.693 vs. 0.582, 95% confidence interval [CI] [0.06–0.16]; P < 0.001) but was not significantly different from BMI (AUC 0.733 vs. 0.693, 95% CI [−0.09 to 0.01]; P = 0.15). In univariable logistic regression, IMAT (odds ratio [OR] = 1.18, 95% CI [1.06–1.32]; P = 0.004) and SMFF (OR = 1.19, 95% CI [1.13–1.26]; P < 0.001) were associated with a higher risk of impaired glucose metabolism. This signal remained robust after multivariable adjustment for baseline demographics and cardiometabolic risk factors for SMFF (OR = 1.10, 95% CI [1.01–1.19]; P = 0.028) but not for IMAT (OR = 1.14, 95% CI [0.97–1.33]; P = 0.11). Conclusions Quantitative SMFF, but not IMAT, is an independent predictor of impaired glucose metabolism, and discrimination is not significantly different from BMI, making it a promising alternative to the currently established approach. Automated methods such as the proposed model may provide a feasible option for opportunistic screening of myosteatosis and, thus, a low‐cost personalized risk assessment solution.
Diseases of the musculoskeletal system, Human anatomy
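The myosteatosis abstract above reports segmentation quality as Dice coefficients (≥0.81 for IMAT, ≥0.91 for SM). For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition on binary masks, twice the overlap divided by the sum of the two mask sizes. The function name and the toy masks are illustrative; real masks would be 3D image volumes flattened to one label per voxel.

```python
def dice_coefficient(predicted, truth):
    """Dice overlap of two binary masks given as equal-length 0/1 sequences."""
    if len(predicted) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(predicted, truth) if p and t)
    size = sum(1 for p in predicted if p) + sum(1 for t in truth if t)
    # Convention (an assumption here): two empty masks agree perfectly.
    return 1.0 if size == 0 else 2 * intersection / size


# Toy five-voxel masks: three voxels each, two of them shared.
print(round(dice_coefficient([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]), 3))
```

Because the denominator is the total size of both masks, small structures such as IMAT tend to score lower than large ones such as whole skeletal muscle for the same absolute number of misclassified voxels, which may partly explain the gap between the two reported values.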
Krystian Strzałka, Szymon Mazurek, Maciej Wielgosz
et al.
This paper explores the innovative use of simulation environments to enhance data acquisition and diagnostics in veterinary medicine, focusing specifically on gait analysis in dogs. The study uses Blender and the BlenderProc library to generate synthetic datasets that reflect diverse anatomical, environmental, and behavioral conditions. The generated data, represented in graph form and standardized for analysis, is used to train machine learning algorithms for identifying normal and abnormal gaits. Two distinct datasets with varying degrees of camera angle granularity are created to further investigate the influence of camera perspective on model accuracy. Preliminary results suggest that this simulation-based approach holds promise for advancing veterinary diagnostics by enabling more precise data acquisition and more effective machine learning models. By integrating synthetic and real-world patient data, the study lays a robust foundation for improving overall effectiveness and efficiency in veterinary medicine.
One of the major barriers to using large language models (LLMs) in medicine is the perception that they use uninterpretable methods to make clinical decisions that are inherently different from the cognitive processes of clinicians. In this manuscript we develop novel diagnostic reasoning prompts to study whether LLMs can perform clinical reasoning to accurately form a diagnosis. We find that GPT-4 can be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy. This is significant because an LLM that can use clinical reasoning to provide an interpretable rationale offers physicians a means to evaluate whether LLMs can be trusted for patient care. Novel prompting methods have the potential to expose the black box of LLMs, bringing them one step closer to safe and effective use in medicine.