Zhaoyang Jacopo Hu, Alex Ranne, Alaa Eldin Abdelaal
et al.
Automated feedback systems have the potential to provide objective skill assessment for training and evaluation in robot-assisted surgery. In this study, we examine methods to achieve real-time prediction of surgical skill level in real-time based on Objective Structured Assessment of Technical Skills (OSATS) scores. Using data acquired from the da Vinci Surgical System, we carry out three main analyses, focusing on model design, their real-time performance, and their skill-level-based cross-validation training. For the model design, we evaluate the effectiveness of multimodal deep learning models for predicting surgical skill levels using synchronized kinematic and vision data. Our models include separate unimodal baselines and fusion architectures that integrate features from both modalities and are evaluated using mean Spearman's correlation coefficients, demonstrating that the fusion model consistently outperforms unimodal models for real-time predictions. For the real-time performance, we observe the prediction's trend over time and highlight correlation with the surgeon's gestures. For the skill-level-based cross-validation, we separately trained models on surgeons with different skill levels, which showed that high-skill demonstrations allow for better performance than those trained on low-skilled ones and generalize well to similarly skilled participants. Our findings show that multimodal learning allows more stable fine-grained evaluation of surgical performance and highlights the value of expert-level training data for model generalization.
Sergio Micó-Rosa, Concepcion Garcia-Pardo, Matteo Frasson
et al.
The dielectric characterization of human tissues can play a crucial role in the development of new medical diagnostic tools. In particular, the characterization of healthy and pathological tissues can provide vital information for diagnosis. In this paper, preliminary results from a small-scale measurement campaign conducted in 0.5-26.5GHz during real surgeries on healthy and malignant human colon tissues are presented. Those measurements were carried out externally to the colon, without direct contact to the tumor growing inside the colon. Furthermore, different tumor stages are taken into account. Initial findings reveal that advanced tumor stages are related with increased higher values of dielectric properties in malignant tumor tissues compared to the healthy ones.
Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, data-driven motion prior with cross-view geometric consistency and biomechanical constraints, along with a combination of collision-aware interaction constraints for instrument interactions. Building upon OphNet-3D, we establish two challenging benchmarks-bimanual hand pose estimation and hand-instrument interaction reconstruction-and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand-two-instrument interactions. These models leverage a novel spatial reasoning module with weak-perspective camera modeling and collision-aware center-based representation. Both architectures outperform existing methods by substantial margins, achieving improvements of over 2mm in Mean Per Joint Position Error (MPJPE) and up to 23% in ADD-S metrics for hand and instrument reconstruction, respectively.
Scalable quantum computation requires not only quantum codes with low memory overhead but also encoded operations with low space-time overhead. High rate quantum low-density parity-check (qLDPC) codes address the former by achieving a high information-encoding rate, yet existing methods for implementing logical operations often suffer from a low information-processing rate, leading to substantial space-time costs. Here, we introduce high-rate surgery, a general scheme that can perform extensive, addressable logical Pauli-product measurements in parallel on arbitrary qLDPC codes using a shared ancilla system, attaining nearly constant space-time overhead. We develop both algebraic and randomized ancilla constructions and demonstrate, using the $[[144, 12, 12]]$ Gross code and new instances of qLDPC codes (e.g., $[[1125, 245, \leq 10]]$) with encoding rate up to $25\%$, that up to hundreds of randomly sampled logical measurements can be executed simultaneously with a total space-time overhead around a factor of two of that of memory experiments. Our results address a major bottleneck for performing complex, addressable logical operations on qLDPC codes in practice, advancing the prospect of scalable, constant-overhead fault-tolerant quantum computation.
Surgical phase recognition from video enables various downstream applications. Transformer-based sliding window approaches have set the state-of-the-art by capturing rich spatial-temporal features. However, while transformers can theoretically handle arbitrary-length sequences, in practice they are limited by memory and compute constraints, resulting in fixed context windows that struggle with maintaining temporal consistency across lengthy surgical procedures. This often leads to fragmented predictions and limited procedure-level understanding. To address these challenges, we propose Memory of Surgery (MoS), a framework that enriches temporal modeling by incorporating both semantic interpretable long-term surgical history and short-term impressions. MoSFormer, our enhanced transformer architecture, integrates MoS using a carefully designed encoding and fusion mechanism. We further introduce step filtering to refine history representation and develop a memory caching pipeline to improve training and inference stability, mitigating shortcut learning and overfitting. MoSFormer demonstrates state-of-the-art performance on multiple benchmarks. On the Challenging BernBypass70 benchmark, it attains 88.0 video-level accuracy and phase-level metrics of 70.7 precision, 68.7 recall, and 66.3 F1 score, outperforming its baseline with 2.1 video-level accuracy and phase-level metrics of 4.6 precision, 3.6 recall, and 3.8 F1 score. Further studies confirms the individual and combined benefits of long-term and short-term memory components through ablation and counterfactual inference. Qualitative results shows improved temporal consistency. The augmented temporal context enables procedure-level understanding, paving the way for more comprehensive surgical video analysis.
Pooja P Jain, Pietro Mascagni, Giuseppe Massimiani
et al.
Minimally invasive colorectal surgery is characterized by procedural variability, a difficult learning curve, and complications that impact quality and outcomes. Video-based assessment (VBA) offers an opportunity to generate data-driven insights to reduce variability, optimize training, and improve surgical performance. However, existing tools for workflow analysis remain difficult to standardize and implement. This study aims to develop and validate a VBA tool for workflow analysis across minimally invasive colorectal procedures. A Delphi process was conducted to achieve consensus on generalizable workflow descriptors. The resulting framework informed the development of a new VBA tool, ColoWorkflow. Independent raters then applied ColoWorkflow to a multicentre video dataset of laparoscopic and robotic colorectal surgery (CRS). Applicability and inter-rater reliability were evaluated. Consensus was achieved for 10 procedure-agnostic phases and 34 procedure-specific steps describing CRS workflows. ColoWorkflow was developed and applied to 54 colorectal operative videos (left and right hemicolectomies, sigmoid and rectosigmoid resections, and total proctocolectomies) from five centres. The tool demonstrated broad applicability, with all but one label utilized. Inter-rater reliability was moderate, with mean Cohen's K of 0.71 for phases and 0.66 for steps. Most discrepancies arose at phase transitions and step boundary definitions. ColoWorkflow is the first consensus-based, validated VBA tool for comprehensive workflow analysis in minimally invasive CRS. It establishes a reproducible framework for video-based performance assessment, enabling benchmarking across institutions and supporting the development of artificial intelligence-driven workflow recognition. Its adoption may standardize training, accelerate competency acquisition, and advance data-informed surgical quality improvement.
Existing real-time decoders for surface codes are limited to isolated logical qubits and do not support logical operations involving multiple logical qubits. We present DECONET, a first-of-its-kind decoding system that scales to thousands of logical qubits and supports logical operations implemented through lattice surgery. DECONET organizes compute resources in a network-integrated hybrid tree-grid structure, which results in minimal latency increase and no throughput degradation as the system grows. Specifically, DECONET can be scaled to any arbitrary number of $l$ logical qubits by increasing the compute resources by $O(l \times log(l))$, which provides the required $O(l)$ growth in I/O resources while incurring only an $O(log(l))$ increase in latency-a modest growth that is sufficient for thousands of logical qubits. Moreover, we analytically show that the scaling approach preserves throughput, keeping DECONET backlog-free for any number of logical qubits. We report an exploratory prototype of DECONET, called DECONET/HELIOS, built with five VMK-180 FPGAs, that successfully decodes 100 logical qubits of distance five. For 100 logical qubits, under a phenomenological noise rate of 0.1%, the DECONET/HELIOS has an average latency of 2.40 μs and an inverse throughput of 0.84 μs per measurement round.
Muhammad Hanif Lashari, Wafa Batayneh, Ashfaq Khokhar
et al.
Accurately estimating the position of a patient's side robotic arm in real time during remote surgery is a significant challenge, especially within Tactile Internet (TI) environments. This paper presents a new and efficient method for position estimation using a Kalman Filter (KF) combined with the Multivariable Output-Error State Space (MOESP) method for system identification. Unlike traditional approaches that require prior knowledge of the system's dynamics, this study uses the JIGSAW dataset, a comprehensive collection of robotic surgical data, along with input from the Master Tool Manipulator (MTM) to derive the state-space model directly. The MOESP method allows accurate modeling of the Patient Side Manipulator (PSM) dynamics without prior system models, improving the KF's performance under simulated network conditions, including delays, jitter, and packet loss. These conditions mimic real-world challenges in Tactile Internet applications. The findings demonstrate the KF's improved resilience and accuracy in state estimation, achieving over 95 percent accuracy despite network-induced uncertainties.
Research on autonomous surgery has largely focused on simple task automation in controlled environments. However, real-world surgical applications demand dexterous manipulation over extended durations and generalization to the inherent variability of human tissue. These challenges remain difficult to address using existing logic-based or conventional end-to-end learning approaches. To address this gap, we propose a hierarchical framework for performing dexterous, long-horizon surgical steps. Our approach utilizes a high-level policy for task planning and a low-level policy for generating robot trajectories. The high-level planner plans in language space, generating task-level or corrective instructions that guide the robot through the long-horizon steps and correct for the low-level policy's errors. We validate our framework through ex vivo experiments on cholecystectomy, a commonly-practiced minimally invasive procedure, and conduct ablation studies to evaluate key components of the system. Our method achieves a 100\% success rate across eight unseen ex vivo gallbladders, operating fully autonomously without human intervention. This work demonstrates step-level autonomy in a surgical procedure, marking a milestone toward clinical deployment of autonomous surgical systems.
Yarden Sharon, Alex Geftler, Hanna Kossowsky Lev
et al.
Objective: We aim to investigate long-term robotic surgical skill acquisition among surgical residents and the effects of training intervals and fatigue on performance. Methods: For six months, surgical residents participated in three training sessions once a month, surrounding a single 26-hour hospital shift. In each shift, they participated in training sessions scheduled before, during, and after the shift. In each training session, they performed three dry-lab training tasks: Ring Tower Transfer, Knot-Tying, and Suturing. We collected a comprehensive dataset, including videos synchronized with kinematic data, activity tracking, and scans of the suturing pads. Results: We collected a dataset of 972 trials performed by 18 residents of different surgical specializations. Participants demonstrated consistent performance improvement across all tasks. In addition, we found variations in between-shift learning and forgetting across metrics and tasks, and hints for possible effects of fatigue. Conclusion: The findings from our first analysis shed light on the long-term learning processes of robotic surgical skills with extended intervals and varying levels of fatigue. Significance: This study lays the groundwork for future research aimed at optimizing training protocols and enhancing AI applications in surgery, ultimately contributing to improved patient outcomes. The dataset will be made available upon acceptance of our journal submission.
In recent years, Visual Question Localized-Answering in robotic surgery (Surgical-VQLA) has gained significant attention for its potential to assist medical students and junior doctors in understanding surgical scenes. Recently, the rapid development of Large Language Models (LLMs) has provided more promising solutions for this task. However, current methods struggle to establish complex dependencies between text and visual details, and have difficulty perceiving the spatial information of surgical scenes. To address these challenges, we propose a novel method, Surgical-MambaLLM, which is the first to combine Mamba2 with LLM in the surgical domain, that leverages Mamba2's ability to effectively capture cross-modal dependencies and perceive spatial information in surgical scenes, thereby enhancing the LLMs' understanding of surgical images. Specifically, we propose the Cross-modal Bidirectional Mamba2 Integration (CBMI) module to leverage Mamba2 for effective multimodal fusion, with its cross-modal integration capabilities. Additionally, tailored to the geometric characteristics of surgical scenes, we design the Surgical Instrument Perception (SIP) scanning mode for Mamba2 to scan the surgical images, enhancing the model's spatial understanding of the surgical scene. Extensive experiments demonstrate that our Surgical-MambaLLM model outperforms the state-of-the-art methods on the EndoVis17-VQLA and EndoVis18-VQLA datasets, significantly improving the performance of the Surgical-VQLA task.
Ruibo Shang, Matthew Thompson, Matthew D. Carson
et al.
Near-infrared fluorescence (NIRF) can deliver high-contrast, video-rate, non-contact imaging of tumor-targeted contrast agents with the potential to guide surgeries excising solid tumors. However, it has been met with skepticism for wide-margin excision due to sensitivity and resolution limitations at depths larger than ~5 mm in tissue. To address this limitation, fast-sweep photoacoustic-ultrasound (PAUS) imaging is proposed to complement NIRF. In an exploratory in vitro feasibility study using dark-red bovine muscle tissue, we observed that PAUS scanning can identify tozuleristide, a clinical stage investigational imaging agent, at a concentration of 20 uM from the background at depths of up to ~34 mm, highly extending the capabilities of NIRF alone. The capability of spectroscopic PAUS imaging was tested by direct injection of 20 uM tozuleristide into bovine muscle tissue at a depth of ~ 8 mm. It is shown that laser-fluence compensation and strong clutter suppression enabled by the unique capabilities of the fast-sweep approach greatly improve spectroscopic accuracy and the PA detection limit, and strongly reduce image artifacts. Thus, the combined NIRF-PAUS approach can be promising for comprehensive pre- (with PA) and intra- (with NIRF) operative solid tumor detection and wide-margin excision in optically guided solid tumor surgery.
The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.
This paper tackles instrument tracking and 3D visualization challenges in minimally invasive surgery (MIS), crucial for computer-assisted interventions. Conventional and robot-assisted MIS encounter issues with limited 2D camera projections and minimal hardware integration. The objective is to track and visualize the entire surgical instrument, including shaft and metallic clasper, enabling safe navigation within the surgical environment. The proposed method involves 2D tracking based on segmentation maps, facilitating creation of labeled dataset without extensive ground-truth knowledge. Geometric changes in 2D intervals express motion, and kinematics based algorithms process results into 3D tracking information. Synthesized and experimental results in 2D and 3D motion estimates demonstrate negligible errors, validating the method for labeling and motion tracking of instruments in MIS videos. The conclusion underscores the proposed 2D segmentation technique's simplicity and computational efficiency, emphasizing its potential as direct plug-in for 3D visualization in instrument tracking and MIS practices.
Low back pain (LBP) and sciatica may require surgical therapy when they are symptomatic of severe pain. However, there is no effective measures to evaluate the surgical outcomes in advance. This work combined elements of Eastern medicine and machine learning, and developed a preoperative assessment tool to predict the prognosis of lumbar spinal surgery in LBP and sciatica patients. Standard operative assessments, traditional Chinese medicine body constitution assessments, planned surgical approach, and vowel pronunciation recordings were collected and stored in different modalities. Our work provides insights into leveraging modality combinations, multimodals, and fusion strategies. The interpretability of models and correlations between modalities were also inspected. Based on the recruited 105 patients, we found that combining standard operative assessments, body constitution assessments, and planned surgical approach achieved the best performance in 0.81 accuracy. Our approach is effective and can be widely applied in general practice due to simplicity and effective.
Robot-assisted surgery is rapidly developing in the medical field, and the integration of augmented reality shows the potential of improving the surgeons' operation performance by providing more visual information. In this paper, we proposed a markerless augmented reality framework to enhance safety by avoiding intra-operative bleeding which is a high risk caused by the collision between the surgical instruments and the blood vessel. Advanced stereo reconstruction and segmentation networks are compared to find out the best combination to reconstruct the intra-operative blood vessel in the 3D space for the registration of the pre-operative model, and the minimum distance detection between the instruments and the blood vessel is implemented. A robot-assisted lymphadenectomy is simulated on the da Vinci Research Kit in a dry lab, and ten human subjects performed this operation to explore the usability of the proposed framework. The result shows that the augmented reality framework can help the users to avoid the dangerous collision between the instruments and the blood vessel while not introducing an extra load. It provides a flexible framework that integrates augmented reality into the medical robot platform to enhance safety during the operation.
Aoife McDonald-Bowyer, Solène Dietsch, Emmanouil Dimitrakakis
et al.
In robotic-assisted partial nephrectomy, surgeons remove a part of a kidney often due to the presence of a mass. A drop-in ultrasound probe paired to a surgical robot is deployed to execute multiple swipes over the kidney surface to localise the mass and define the margins of resection. This sub-task is challenging and must be performed by a highly skilled surgeon. Automating this sub-task may reduce cognitive load for the surgeon and improve patient outcomes. The overall goal of this work is to autonomously move the ultrasound probe on the surface of the kidney taking advantage of the use of the Pneumatically Attachable Flexible (PAF) rail system, a soft robotic device used for organ scanning and repositioning. First, we integrate a shape-sensing optical fibre into the PAF rail system to evaluate the curvature of target organs in robotic-assisted laparoscopic surgery. Then, we investigate the impact of the stiffness of the material of the PAF rail on the curvature sensing accuracy, considering that soft targets are present in the surgical field. Finally, we use shape sensing to plan the trajectory of the da Vinci surgical robot paired with a drop-in ultrasound probe and autonomously generate an Ultrasound scan of a kidney phantom.
Regine Hartwig, Daniel Ostler, Jean-Claude Rosenthal
et al.
We propose a new benchmark for evaluating stereoscopic visual-inertial computer vision algorithms (SLAM/ SfM/ 3D Reconstruction/ Visual-Inertial Odometry) for minimally invasive surgical (MIS) interventions in the abdomen. Our MITI Dataset available at [https://mediatum.ub.tum.de/1621941] provides all the necessary data by a complete recording of a handheld surgical intervention at Research Hospital Rechts der Isar of TUM. It contains multimodal sensor information from IMU, stereoscopic video, and infrared (IR) tracking as ground truth for evaluation. Furthermore, calibration for the stereoscope, accelerometer, magnetometer, the rigid transformations in the sensor setup, and time-offsets are available. We wisely chose a suitable intervention that contains very few cutting and tissue deformation and shows a full scan of the abdomen with a handheld camera such that it is ideal for testing SLAM algorithms. Intending to promote the progress of visual-inertial algorithms designed for MIS application, we hope that our clinical training dataset helps and enables researchers to enhance algorithms.
Surgical activity recognition and prediction can help provide important context in many Robot-Assisted Surgery (RAS) applications, for example, surgical progress monitoring and estimation, surgical skill evaluation, and shared control strategies during teleoperation. Transformer models were first developed for Natural Language Processing (NLP) to model word sequences and soon the method gained popularity for general sequence modeling tasks. In this paper, we propose the novel use of a Transformer model for three tasks: gesture recognition, gesture prediction, and trajectory prediction during RAS. We modify the original Transformer architecture to be able to generate the current gesture sequence, future gesture sequence, and future trajectory sequence estimations using only the current kinematic data of the surgical robot end-effectors. We evaluate our proposed models on the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) and use Leave-One-User-Out (LOUO) cross-validation to ensure the generalizability of our results. Our models achieve up to 89.3\% gesture recognition accuracy, 84.6\% gesture prediction accuracy (1 second ahead) and 2.71mm trajectory prediction error (1 second ahead). Our models are comparable to and able to outperform state-of-the-art methods while using only the kinematic data channel. This approach can enable near-real time surgical activity recognition and prediction.