The Impact of Racism on Child and Adolescent Health
M. Trent, D. Dooley, J. Dougé
The American Academy of Pediatrics is committed to addressing the factors that affect child and adolescent health with a focus on issues that may leave some children more vulnerable than others. Racism is a social determinant of health that has a profound impact on the health status of children, adolescents, emerging adults, and their families. Although progress has been made toward racial equality and equity, the evidence to support the continued negative impact of racism on health and well-being through implicit and explicit biases, institutional structures, and interpersonal relationships is clear. The objective of this policy statement is to provide an evidence-based document focused on the role of racism in child and adolescent development and health outcomes. By acknowledging the role of racism in child and adolescent health, pediatricians and other pediatric health professionals will be able to proactively engage in strategies to optimize clinical care, workforce development, professional education, systems engagement, and research in a manner designed to reduce the health effects of structural, personally mediated, and internalized racism and improve the health and well-being of all children, adolescents, emerging adults, and their families.
The Lifelong Effects of Early Childhood Adversity and Toxic Stress
J. Shonkoff, A. Garner, B. Siegel
et al.
Advances in fields of inquiry as diverse as neuroscience, molecular biology, genomics, developmental psychology, epidemiology, sociology, and economics are catalyzing an important paradigm shift in our understanding of health and disease across the lifespan. This converging, multidisciplinary science of human development has profound implications for our ability to enhance the life prospects of children and to strengthen the social and economic fabric of society. Drawing on these multiple streams of investigation, this report presents an ecobiodevelopmental framework that illustrates how early experiences and environmental influences can leave a lasting signature on the genetic predispositions that affect emerging brain architecture and long-term health. The report also examines extensive evidence of the disruptive impacts of toxic stress, offering intriguing insights into causal mechanisms that link early adversity to later impairments in learning, behavior, and both physical and mental well-being. The implications of this framework for the practice of medicine, in general, and pediatrics, specifically, are potentially transformational. They suggest that many adult diseases should be viewed as developmental disorders that begin early in life and that persistent health disparities associated with poverty, discrimination, or maltreatment could be reduced by the alleviation of toxic stress in childhood. An ecobiodevelopmental framework also underscores the need for new thinking about the focus and boundaries of pediatric practice. It calls for pediatricians to serve as both front-line guardians of healthy child development and strategically positioned community leaders to inform new science-based strategies that build strong foundations for educational achievement, economic productivity, responsible citizenship, and lifelong health.
ADHD: Clinical Practice Guideline for the Diagnosis, Evaluation, and Treatment of Attention-Deficit/Hyperactivity Disorder in Children and Adolescents
M. Wolraich, J. Hagan, Carla C. Allan
et al.
Attention-deficit/hyperactivity disorder (ADHD) is the most common neurobehavioral disorder of childhood and can profoundly affect the academic achievement, well-being, and social interactions of children. The American Academy of Pediatrics first published clinical recommendations for the diagnosis and evaluation of ADHD in children in 2000; recommendations for treatment followed in 2001.
Year 2000 Position Statement: Principles and Guidelines for Early Hearing Detection and Intervention Programs
Welfare Agencies
Breastfeeding and the Use of Human Milk.
Clinical-Injection Transformer with Domain-Adapted MAE for Lupus Nephritis Prognosis Prediction
Yuewen Huang, Zhitao Ye, Guangnan Feng
et al.
Lupus nephritis (LN) is a severe complication of systemic lupus erythematosus that affects pediatric patients with significantly greater severity and worse renal outcomes compared to adults. Despite the urgent clinical need, predicting pediatric LN prognosis remains unexplored in computational pathology. Furthermore, the only existing histopathology-based approach for LN relies on multiple costly staining protocols and fails to integrate complementary clinical data. To address these gaps, we propose the first multimodal computational pathology framework for three-class treatment response prediction (complete remission, partial response, and no response) in pediatric LN, utilizing only routine PAS-stained biopsies and structured clinical data. Our framework introduces two key methodological innovations. First, a Clinical-Injection Transformer (CIT) embeds clinical features as condition tokens into patch-level self-attention, facilitating implicit and bidirectional cross-modal interactions within a unified attention space. Second, we design a decoupled representation-knowledge adaptation strategy using a domain-adapted Masked Autoencoder (MAE). This strategy explicitly separates self-supervised morphological feature learning from pathological knowledge extraction. Additionally, we introduce a multi-granularity morphological type injection mechanism to bridge distilled classification knowledge with downstream prognostic predictions at both the instance and patient levels. Evaluated on a cohort of 71 pediatric LN patients with KDIGO-standardized labels, our method achieves a three-class accuracy of 90.1% and an AUC of 89.4%, demonstrating its potential as a highly accurate and cost-effective prognostic tool.
Personalized White Matter Bundle Segmentation for Early Childhood
Elyssa M. McMaster, Michael E. Kim, Nancy R. Newlin
et al.
White matter segmentation methods from diffusion magnetic resonance imaging range from streamline clustering-based approaches to bundle mask delineation, but none have proposed a pediatric-specific approach. We hypothesize that a deep learning model with a similar approach to TractSeg will improve similarity between an algorithm-generated mask and an expert-labeled ground truth. Given a cohort of 56 manually labeled white matter bundles, we take inspiration from TractSeg's 2D UNet architecture and modify the inputs to match bundle definitions determined by pediatric experts, the evaluation to use k-fold cross-validation, and the loss function to a masked Dice loss. We evaluate Dice score, volume overlap, and volume overreach of 16 major regions of interest compared to the expert-labeled dataset. To test whether our approach offers statistically significant improvements over TractSeg, we compare Dice voxels, volume overlap, and adjacency voxels with a Wilcoxon signed-rank test followed by false discovery rate correction. We find statistical significance across all bundles for all metrics, with one exception in volume overlap. After running TractSeg and our model, we combine their output masks into a 60-label atlas to evaluate whether the two together can generate a robust, individualized atlas, and we observe smoothed, continuous masks in cases where TractSeg did not produce an anatomically plausible output. With improved white matter pathway segmentation masks, we can further understand neurodevelopment at a population level, and we can produce reliable estimates of individualized anatomy in pediatric white matter diseases and disorders.
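The overlap metrics named in this abstract can be illustrated on toy binary masks. This is a minimal sketch, not the authors' evaluation code: the function name `overlap_metrics`, the flattened-list mask representation, and the exact definitions of volume overlap and overreach (both normalised by ground-truth volume here) are assumptions.

```python
def overlap_metrics(pred, truth):
    """Return (dice, overlap, overreach) for two flattened binary masks.

    Assumed definitions (not taken from the abstract):
      dice      = 2 * |P and T| / (|P| + |T|)
      overlap   = |P and T| / |T|
      overreach = |P but not T| / |T|
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    true_pos = sum(1 for p, t in zip(pred, truth) if p and t)
    false_pos = sum(1 for p, t in zip(pred, truth) if p and not t)
    n_pred = sum(1 for p in pred if p)
    n_truth = sum(1 for t in truth if t)
    if n_truth == 0:
        raise ValueError("ground-truth mask is empty")
    dice = 2 * true_pos / (n_pred + n_truth)
    overlap = true_pos / n_truth
    overreach = false_pos / n_truth
    return dice, overlap, overreach


# Toy six-voxel masks (hypothetical data).
pred_mask = [1, 1, 1, 0, 0, 1]
truth_mask = [1, 1, 0, 1, 0, 0]
dice, overlap, overreach = overlap_metrics(pred_mask, truth_mask)
print(f"Dice={dice:.3f} overlap={overlap:.3f} overreach={overreach:.3f}")
```

A higher Dice and overlap with a lower overreach indicate a predicted mask closer to the expert label.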
Evaluating the readability, understandability, quality, and popularity of online materials about epilepsy in children
Fatma Sargin, Mehmet Alçı, Büşra Kaygusuz Aydemir
et al.
Background. Childhood epilepsy is one of the most common neurological disorders worldwide, significantly affecting cognitive, emotional, and social development. As caregivers often seek medical guidance online, the readability, understandability, and quality of internet-based patient education materials (IPEMs) are crucial for health literacy and decision-making. This study evaluates the readability, understandability, quality, and popularity of online pediatric epilepsy materials in relation to established health communication standards.
Methods. A Google search using the term ‘epilepsy in children’ (July 20, 2025) identified 84 eligible English-language websites. These were classified as (I) academic departments/societies, (II) clinics/hospitals, and (III) miscellaneous healthcare platforms. Readability was measured by seven validated indices and summarized as the Average Reading Level Consensus (ARLC), understandability by Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P), quality by Journal of the American Medical Association (JAMA) benchmarks, and popularity via Similarweb.
Results. The mean ARLC was 11.13 ± 2.10, exceeding the recommended sixth-grade level, with no readability differences across website types (p = 0.167). PEMAT scores were high (median 82.3%) and varied by source (p = 0.022), favoring academic sites. Only 34.52% met high-quality standards (JAMA ≥ 3), with Group I superior (p = 0.002). Readability showed no significant correlation with understandability (r = −0.151, p = 0.170) or quality (r = −0.154, p = 0.161). Most websites had moderate popularity.
Conclusions. Although many pediatric epilepsy websites are understandable, most surpass recommended readability levels and lack key quality indicators. Healthcare professionals should guide families to reliable, accessible resources and promote user-centered digital health communication.
MRI-to-CT Synthesis With Cranial Suture Segmentations Using A Variational Autoencoder Framework
Krithika Iyer, Austin Tapp, Athelia Paulli
et al.
Quantifying normative pediatric cranial development and suture ossification is crucial for diagnosing and treating growth-related cephalic disorders. Computed tomography (CT) is widely used to evaluate cranial and sutural deformities; however, its ionizing radiation is contraindicated in children without significant abnormalities. Magnetic resonance imaging (MRI) offers radiation-free scans with superior soft-tissue contrast, but unlike CT, MRI cannot elucidate cranial sutures, estimate skull bone density, or assess cranial vault growth. This study proposes a deep-learning-driven pipeline for transforming T1-weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), predicting detailed cranial bone segmentation, generating suture probability heatmaps, and deriving direct suture segmentation from the heatmaps. With our in-house pediatric data, sCTs achieved 99% structural similarity and a Fréchet inception distance of 1.01 relative to real CTs. Skull segmentation attained an average Dice coefficient of 85% across seven cranial bones, and sutures achieved 80% Dice. Equivalence of skull and suture segmentation between sCTs and real CTs was confirmed using two one-sided tests (TOST, p < 0.05). To our knowledge, this is the first pediatric cranial CT synthesis framework to enable suture segmentation on sCTs derived from MRI, despite MRI's limited depiction of bone and sutures. By combining robust, domain-specific variational autoencoders, our method generates perceptually indistinguishable cranial sCTs from routine pediatric MRIs, bridging critical gaps in non-invasive cranial evaluation.
MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models
Kangkun Mao, Jinru Ding, Jiayuan Chen
et al.
As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at https://github.com/maokangkun/MedCalc-Eval.
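To make the two task types concrete, the sketch below implements a few of the calculators the abstract names, using their widely published textbook forms. It is illustrative only and is not taken from the MedCalc-Eval code; the function names, the choice of the Mosteller formula for BSA, and the example inputs are assumptions.

```python
import math


def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2 (equation-based task)."""
    return weight_kg / height_m ** 2


def bsa_mosteller(height_cm, weight_kg):
    """Body surface area in m^2, Mosteller formula (assumed variant)."""
    return math.sqrt(height_cm * weight_kg / 3600)


def cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl, female):
    """Estimated creatinine clearance in mL/min (Cockcroft-Gault)."""
    clearance = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    return clearance * 0.85 if female else clearance


def apgar(appearance, pulse, grimace, activity, respiration):
    """Apgar score: sum of five components, each scored 0, 1, or 2 (rule-based task)."""
    parts = (appearance, pulse, grimace, activity, respiration)
    if any(part not in (0, 1, 2) for part in parts):
        raise ValueError("each Apgar component must be 0, 1, or 2")
    return sum(parts)


# Hypothetical example inputs.
print(round(bmi(70, 1.75), 2))
print(round(bsa_mosteller(175, 70), 3))
print(cockcroft_gault(60, 72, 1.0, female=False))
print(apgar(2, 2, 1, 2, 2))
```

Even these short formulas show the failure points the abstract lists: picking the right formula, and keeping units consistent (metres versus centimetres, mg/dL versus µmol/L).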
Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs
Junjie Luo, Rui Han, Arshana Welivita
et al.
Understanding how patients perceive their physicians is essential to improving trust, communication, and satisfaction. We present a large language model (LLM)-based pipeline that infers Big Five personality traits and five patient-oriented subjective judgments. The analysis encompasses 4.1 million patient reviews of 226,999 U.S. physicians from an initial pool of one million. We validate the method through multi-model comparison and human expert benchmarking, achieving strong agreement between human and LLM assessments (correlation coefficients 0.72-0.89) and external validity through correlations with patient satisfaction (r = 0.41-0.81, all p<0.001). National-scale analysis reveals systematic patterns: male physicians receive higher ratings across all traits, with largest disparities in clinical competence perceptions; empathy-related traits predominate in pediatrics and psychiatry; and all traits positively predict overall satisfaction. Cluster analysis identifies four distinct physician archetypes, from "Well-Rounded Excellent" (33.8%, uniformly high traits) to "Underperforming" (22.6%, consistently low). These findings demonstrate that automated trait extraction from patient narratives can provide interpretable, validated metrics for understanding physician-patient relationships at scale, with implications for quality measurement, bias detection, and workforce development in healthcare.
Rapid Bone Scintigraphy Enhancement via Semantic Prior Distillation from Segment Anything Model
Pengchen Liang, Leijun Shi, Huiping Yao
et al.
Rapid bone scintigraphy is crucial for diagnosing skeletal disorders and detecting tumor metastases in children, as it shortens scan duration and reduces discomfort. However, accelerated acquisition often degrades image quality, impairing the visibility of fine anatomical details and potentially compromising diagnosis. To overcome this limitation, we introduce the first application of SAM-based semantic priors for medical image restoration, utilizing the Segment Anything Model (SAM) to enhance pediatric rapid bone scintigraphy. Our approach employs two cascaded networks, $f^{IR1}$ and $f^{IR2}$, supported by three specialized modules: a Semantic Prior Integration (SPI) module, a Semantic Knowledge Distillation (SKD) module, and a Semantic Consistency Module (SCM). The SPI and SKD modules inject domain-specific semantic cues from a fine-tuned SAM, while the SCM preserves coherent semantic feature representations across both cascaded stages. Moreover, we present RBS, a novel Rapid Bone Scintigraphy dataset comprising paired standard (20 cm/min) and rapid (40 cm/min) scans from 137 pediatric patients aged 0.5 - 16 years, making it the first dataset tailored for pediatric rapid bone scintigraphy restoration. Extensive experiments on both a public endoscopic dataset and our RBS dataset demonstrate that our method consistently surpasses existing techniques in PSNR, SSIM, FID, and LPIPS metrics.
Large language models provide unsafe answers to patient-posed medical questions
Rachel L. Draelos, Samina Afreen, Barbara Blasko
et al.
Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots--Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta--on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women's health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.
Borrowing From the Future: Enhancing Early Risk Assessment through Contrastive Learning
Minghui Sun, Matthew M. Engelhard, Benjamin A. Goldstein
Risk assessments for a pediatric population are often conducted across multiple stages. For example, clinicians may evaluate risks prenatally, at birth, and during Well-Child visits. Although predictions made at later stages typically achieve higher precision, it is clinically desirable to make reliable risk assessments as early as possible. Therefore, this study focuses on improving prediction performance in early-stage risk assessments. Our solution, Borrowing From the Future (BFF), is a contrastive multi-modal framework that treats each time window as a distinct modality. In BFF, a model is trained on all data available across time windows, while each risk assessment is performed using only the information available up to that point. This contrastive framework allows the model to "borrow" informative signals from later stages (e.g., Well-Child visits) to implicitly supervise the learning at earlier stages (e.g., prenatal/birth stages). We validate BFF on two real-world pediatric outcome prediction tasks, demonstrating consistent improvements in early risk assessments. The code is available at https://github.com/scotsun/bff.
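The idea of aligning early-stage and later-stage representations of the same patient can be illustrated with a generic InfoNCE-style contrastive loss. This is a hedged sketch of the general technique, not the BFF objective itself, which the abstract does not specify; the function names, the temperature value, and the toy embeddings are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(early, late, temperature=0.5):
    """Mean InfoNCE loss where early[i] and late[i] belong to the same patient.

    For each patient i, the matching late-stage embedding is the positive and
    every other patient's late-stage embedding is a negative.
    """
    n = len(early)
    total = 0.0
    for i in range(n):
        logits = [cosine(early[i], late[j]) / temperature for j in range(n)]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        total += log_denominator - logits[i]  # equals -log softmax(logits)[i]
    return total / n


# Toy 2-D embeddings for two hypothetical patients.
early_stage = [[1.0, 0.0], [0.0, 1.0]]
late_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each patient matches their own later window
late_swapped = [[0.0, 1.0], [1.0, 0.0]]   # later windows assigned to the wrong patient

loss_aligned = info_nce(early_stage, late_aligned)
loss_swapped = info_nce(early_stage, late_swapped)
print(loss_aligned, loss_swapped)
```

The loss is lower when each early-stage embedding is closest to its own patient's later-stage embedding, which is the sense in which later windows can supervise earlier ones.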
Assessing the Quality of AI-Generated Clinical Notes: A Validated Evaluation of a Large Language Model Scribe
Erin Palm, Astrit Manikantan, Mark E. Pepin
et al.
In medical practices across the United States, physicians have begun implementing generative artificial intelligence (AI) tools to perform the function of scribes in order to reduce the burden of documenting clinical encounters. Despite their widespread use, no established methods exist to gauge the quality of AI scribes. To address this gap, we developed a blinded study comparing the relative performance of large language model (LLM)-generated clinical notes with those from field experts based on audio-recorded clinical encounters. Quantitative metrics from the Physician Documentation Quality Instrument (PDQI9) provided a framework to measure note quality, which we adapted to assess the relative performance of AI-generated notes. Clinical experts spanning 5 medical specialties used the PDQI9 tool to evaluate specialist-drafted Gold notes and LLM-authored Ambient notes. Two evaluators from each specialty scored notes drafted from a total of 97 patient visits. We found uniformly high inter-rater agreement (RWG greater than 0.7) between evaluators in general medicine, orthopedics, and obstetrics and gynecology, and moderate (RWG 0.5 to 0.7) to high inter-rater agreement in pediatrics and cardiology. We found a modest yet significant difference in overall note quality, wherein Gold notes achieved a score of 4.25 out of 5 and Ambient notes scored 4.20 out of 5 (p = 0.04). Our findings support the use of the PDQI9 instrument as a practical method to gauge the quality of LLM-authored notes, as compared to human-authored notes.
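The RWG thresholds quoted above can be made concrete with the commonly used single-item within-group agreement index, which compares the observed variance among raters with the variance expected if ratings were uniformly random. This is a sketch under assumptions: the abstract does not state which RWG variant was used, and the function name, the use of sample variance, the clamping of negative values to zero, and the example ratings are illustrative choices.

```python
from statistics import variance


def rwg(ratings, scale_points):
    """Single-item within-group agreement index for one rated item.

    ratings      -- scores given by the raters (at least two)
    scale_points -- number of response options (5 for a 1-to-5 scale)
    """
    if len(ratings) < 2:
        raise ValueError("at least two ratings are required")
    # Variance of a discrete uniform distribution over `scale_points` options.
    expected_variance = (scale_points ** 2 - 1) / 12
    value = 1 - variance(ratings) / expected_variance
    # Negative values are conventionally truncated to zero (assumed here).
    return max(0.0, value)


# Two hypothetical evaluators scoring one note on a 5-point item.
print(rwg([4, 5], 5))   # adjacent scores
print(rwg([4, 4], 5))   # identical scores
print(rwg([1, 5], 5))   # maximal disagreement
```

Under these assumptions, two raters who differ by one point on a 5-point item fall just above the 0.7 threshold the abstract uses for high agreement.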
CLINICAL-IMMUNOLOGICAL RATIONALE FOR TREATMENT OF ACUTE ADHESIVE SMALL BOWEL OBSTRUCTION: PROGNOSTIC SCALE AND EFFICACY OF IMMUNE CORRECTION
Б. Хамдамов, А. Ешчанов, С. Давлатов
et al.
Acute adhesive small bowel obstruction (AASBO) ranks among the most common indications for emergency abdominal surgery. However, conventional approaches to severity assessment and treatment selection do not incorporate the significant immunological disturbances characteristic of this condition.
Objective. To enhance the diagnosis and management of AASBO through the implementation of a clinical-immunological risk stratification scale and to evaluate the impact of immunotherapy on clinical and immune outcomes.
Materials and Methods. A total of 115 patients with AASBO were enrolled and allocated to a control group (n=56; standard care) or a study group (n=59; standard care plus immunotherapy). Comprehensive clinical, laboratory, radiological, and immunological parameters were analyzed. A prognostic scale incorporating 25 variables was developed. Immune response dynamics and complication rates were rigorously assessed. All procedures were conducted in accordance with the World Medical Association's Declaration of Helsinki (2000 amendments). Statistical analyses were performed using SPSS 22.0 and MedCalc software. Normality was assessed with Shapiro-Wilk test. Between-group comparisons utilized Mann-Whitney U test, Student's t-test, Pearson's chi-square test, and Spearman's correlation analysis. Predictive performance was evaluated through ROC analysis calculating the area under the curve (AUC). Statistical significance was defined as p < 0.05.
Funding: This work was carried out within the framework of the research plan of the Bukhara State Medical Institute (05.2022 DSc.135) entitled “Development of new approaches to early diagnosis, treatment, and prevention of pathological conditions affecting the health of the population of the Bukhara region after COVID-19 (2022–2026)”.
Results. Severe disease progression was significantly associated with CD4⁺ counts < 600 cells/µL, HLA-DR⁺ expression < 30%, IL-6 levels > 30 pg/mL, and TNF-α levels > 25 pg/mL. The study group demonstrated a 3.3-fold reduction in mortality, a 28.1% shorter hospital stay, and a 2.4-fold decrease in Clavien-Dindo grade III-V complications. The integrated prognostic scale exhibited high predictive accuracy (AUC = 0.917).
Conclusion. The incorporation of clinical-immunological stratification and immunotherapy into the management algorithm for AASBO significantly improves treatment outcomes, reduces complication rates, and decreases mortality.
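The AUC reported for the prognostic scale can be read as the probability that a randomly chosen patient with the outcome receives a higher score than a randomly chosen patient without it. The sketch below computes that quantity directly from pairwise comparisons; it is illustrative only, the function name is an assumption, and the scores are invented toy values rather than study data.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    if not positive_scores or not negative_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical risk scores: patients with a severe course vs. without.
severe = [0.9, 0.8, 0.4]
not_severe = [0.7, 0.3, 0.2]
auc = roc_auc(severe, not_severe)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.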
Comparison of Different Neonatal Disease Severity Scoring Systems for Predicting Mortality Risk in Neonatal Intensive Care Unit: A Cross-sectional Study
P Selvaraj, KT Muhammed Basheer
Introduction: Birth weight and gestational age have previously been used to predict the risk of mortality among neonates. However, a single parameter was inadequate to predict the severity of illness and outcomes for neonates. Therefore, combinations of parameters have been employed to create disease severity scoring systems aimed at predicting mortality, and various scoring systems have been developed in recent years. There is a need to assess the severity of illness in newborns, provide prognostic information to parents and formulate a new disease severity scoring system for the Neonatal Intensive Care Unit (NICU).
Aim: To evaluate and compare the predictive accuracy of neonatal disease severity scoring systems {Score for Neonatal Acute Physiology-Perinatal Extension II (SNAP-PE II), Transport Risk Index of Physiologic Stability (TRIPS), Mortality Index for Neonatal Transportation (MINT), Transport Related Mortality Score (TREMS) and Sick Neonate Score (SNS)} in assessing neonatal mortality risk upon admission to the NICU.
Materials and Methods: This analytical cross-sectional study was conducted between September 2023 and August 2024 at the Level II and Level III NICUs of Malabar Medical College Hospital and Research Centre, Ulliyeri, Kozhikode, Kerala, India. Data on neonatal characteristics at admission, perinatal characteristics, maternal characteristics and transport information for 400 newborns who met the inclusion criteria were collected. Each parameter from the five disease severity scoring systems was obtained and recorded. The scores for SNAP-PE II, TRIPS, MINT, TREMS and SNS for all cases were then calculated. At the end of the seventh day of admission, outcomes were recorded as survivors and non-survivors.
Results: Out of 390 neonates studied, 330 (84.6%) were survivors and 60 (15.4%) were non-survivors. The median and interquartile range of the SNAP-PE II, TRIPS, MINT, TREMS and SNS scoring systems were higher for non-survivors than for survivors. Admission weight, birth weight, 1-minute and 5-minute Appearance, Pulse, Grimace, Activity and Respiration (APGAR) scores, gestational age and the need for resuscitation were identified as strong predictors of mortality, regardless of age at admission.
Conclusion: Neonatal disease severity scoring systems provide prognostic information, which assists in counselling parents. They also facilitate the evaluation of transport systems.
Multimodal growth and development assessment model
Ying Li, Zichen Song, Zijie Gong
et al.
With socioeconomic development and growing public attention to health, the growth and development of children and adolescents has become an important indicator of national health, and accurate, timely assessment of children's growth and development has become increasingly important. At the same time, global health inequalities, especially child malnutrition and stunting in developing countries, urgently require effective assessment tools for monitoring and intervention. In recent years, the rapid development of technologies such as big data, artificial intelligence, and cloud computing, together with the cross-integration of disciplines such as biomedicine, statistics, and computer science, has promoted the rapid development of large-scale models for growth and development assessment. However, problems remain: evaluation relies on too few factors, diagnostic results are inaccurate, and the models cannot give accurate and reasonable recommendations. The multimodal growth and development assessment model uses the public RSNA (Radiological Society of North America) dataset as the training set and the dataset of the Department of Pediatrics of Huaibei People's Hospital as the open-source test set. The embedded ICL module enables the model to quickly adapt to and identify the task at hand, so that it considers multiple evaluation factors while giving accurate diagnostic results and reasonable medical recommendations, thereby addressing the problems above and promoting the development of the medical field.
A disruption-restoration-based MILP model for elective surgical scheduling in a children's hospital using scenarios
Martina Doneda, Gloria Pelizzo, Sara Costanzo
et al.
We consider the problem of scheduling elective surgeries in a children's hospital, where disruptions due to emergencies and no-shows may arise. We account for two features that occur in many pediatric settings: i) it is not uncommon for pediatric patients to fall ill on the very day of their operation and, consequently, to be unable to undergo surgery, and ii) operating rooms normally reserved for elective surgeries can be used to treat emergency cases. Elective surgeries are scheduled taking into account the time spent on the waiting list and the patient's priority, which considers the severity of their condition and their surgical deadline, generating a nominal schedule. This schedule is optimized in conjunction with a series of back-up schedules, which must be available in advance to guarantee that operating room activity recovers immediately after a disruption. We propose an Integer Linear Programming-based approach for the problem. As there is no consolidated data on the features of either emergencies or no-shows, we enumerate a representative subset of the possible emergency and no-show scenarios and design a back-up plan for each of them. The approach reschedules patients in a way that minimizes disruption with respect to the nominal schedule and applies an as-soon-as-possible policy in case of emergencies to ensure that all patients receive timely care. The approach proves effective in managing disruptions, ensuring that the waiting list is managed properly, with a balanced mix of urgent and less urgent patients. It therefore provides an effective solution for scheduling patients in a pediatric hospital, taking into account the unique features of such facilities.
Evaluation of Bias Towards Medical Professionals in Large Language Models
Xi Chen, Yang Xu, MingKe You
et al.
This study evaluates whether large language models (LLMs) exhibit biases towards medical professionals. Fictitious candidate resumes were created to control for identity factors while maintaining consistent qualifications. Three LLMs (GPT-4, Claude-3-haiku, and Mistral-Large) were tested using a standardized prompt to evaluate resumes for specific residency programs. Explicit bias was tested by changing gender and race information, while implicit bias was tested by changing names while hiding race and gender. Physician data from the Association of American Medical Colleges was used to compare with real-world demographics. 900,000 resumes were evaluated. All LLMs exhibited significant gender and racial biases across medical specialties. Gender preferences varied, favoring male candidates in surgery and orthopedics, while preferring females in dermatology, family medicine, obstetrics and gynecology, pediatrics, and psychiatry. Claude-3 and Mistral-Large generally favored Asian candidates, while GPT-4 preferred Black and Hispanic candidates in several specialties. Tests revealed strong preferences towards Hispanic females and Asian males in various specialties. Compared to real-world data, LLMs consistently chose higher proportions of female and underrepresented racial candidates than their actual representation in the medical workforce. GPT-4, Claude-3, and Mistral-Large showed significant gender and racial biases when evaluating medical professionals for residency selection. These findings highlight the potential for LLMs to perpetuate biases and compromise healthcare workforce diversity if used without proper bias mitigation strategies.