Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering
Jingyue Li, André Storhaug
With the advancement of Agentic AI, researchers are increasingly leveraging autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. Furthermore, missing information in evaluation design descriptions frequently renders the reproduction of results infeasible. To synthesize current evaluation practices for Agentic AI in SE, this study analyzes 18 papers on the topic, published at or accepted by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies prevailing approaches to evaluating Agentic AI for SE and their limitations, in both current research and potential future studies. To address these shortcomings, this position paper proposes a set of guidelines and recommendations designed to enable reproducible, explainable, and effective evaluations of Agentic AI in software engineering. In particular, we recommend that Agentic AI researchers make their Thought-Action-Result (TAR) trajectories and LLM interaction data, or summarized versions of these artifacts, publicly accessible. Doing so will enable subsequent studies to analyze the strengths and weaknesses of different Agentic AI approaches more effectively. To demonstrate the feasibility of such comparisons, we present a proof-of-concept case study that illustrates how TAR trajectories can support systematic analysis across approaches.
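To make the sharing recommendation concrete, the following is an illustrative sketch (not a format prescribed by the paper) of how Thought-Action-Result (TAR) trajectories and LLM interactions could be logged and exported for cross-approach analysis; all field names here are assumptions.

```python
# Illustrative TAR trajectory record: one way to log agent steps and LLM
# interactions so that later studies can compare agentic approaches step by step.
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class TARStep:
    step: int            # position in the trajectory
    thought: str         # the agent's reasoning before acting
    action: str          # e.g., a tool call or shell command issued by the agent
    result: str          # observation returned by the environment or tool
    llm_prompt: str      # prompt sent to the underlying LLM (or a summary of it)
    llm_response: str    # raw or summarized LLM response

def export_trajectory(task_id: str, steps: List[TARStep], path: str) -> None:
    """Write a trajectory as JSON so it can be shared alongside a paper's artifacts."""
    record = {"task_id": task_id, "steps": [asdict(s) for s in steps]}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)

# Example usage with made-up content:
steps = [TARStep(1, "The failing test points to utils.py", "open utils.py",
                 "file contents ...", "prompt ...", "response ...")]
export_trajectory("example-issue-42", steps, "trajectory.json")
```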
Developers in the Age of AI: Adoption, Policy, and Diffusion of AI Software Engineering Tools
Mark Looi
The rapid advance of Generative AI into software development prompts this empirical investigation of its perceptual effects on practice. We study the usage patterns of 147 professional developers, examining perceived correlates of AI tool use, the resulting productivity and quality outcomes, and developer readiness for emerging AI-enhanced development. We describe a virtuous adoption cycle in which frequent and broad use of AI tools are the strongest correlates of both Perceived Productivity (PP) and quality, with frequency the strongest. The study finds no perceptual support for the Quality Paradox and shows that PP is positively correlated with Perceived Code Quality (PQ) improvement. Developers thus report both productivity and quality gains. High current usage, breadth of application, frequent use of AI tools for testing, and ease of use correlate strongly with future intended adoption, though security concerns remain a moderate and statistically significant barrier to adoption. Moreover, the adoption of AI testing tools lags that of coding tools, opening a Testing Gap. We identify three developer archetypes (Enthusiasts, Pragmatists, Cautious) that align with an innovation diffusion process in which the virtuous adoption cycle serves as the individual engine of progression. Our findings reveal that organizational adoption of AI tools follows such a process: Enthusiasts push ahead with tools, creating organizational success that converts Pragmatists. The Cautious are held in organizational stasis: without early-adopter examples, they do not enter the virtuous adoption cycle, never accumulate the usage frequency that drives intent, and never attain high efficacy. Policy itself does not predict individuals' intent to increase usage but functions as a marker of maturity, formalizing the successful diffusion of adoption by Enthusiasts while acting as a gateway that the Cautious group has yet to reach.
Online Counseling on Global Issues: Systematic Literature Review
Ifdil Ifdil, Muhammad Fahli Zatrahadi, Darmawati Darmawati
et al.
Integrating counseling expertise with a deep comprehension of contemporary technology is essential. Developing a sustained method is crucial for creating a practical framework to address the psychological ramifications associated with the escalating complexity of global challenges. Therefore, this study explores the use and challenges of online counseling (e-counseling) for global issues using the systematic literature review (SLR) method. The search was carried out in the Scopus database and yielded 637 documents after restricting the publication years to 2020–2023 and the language to English; after quality assessment, 25 articles were analyzed. The results showed that e-counseling was critical in addressing these challenges and reached many individuals in different regions. According to NVivo analysis, the practical implementation of e-counseling encountered several challenges, such as reliance on potentially vulnerable technology, constraints within interpersonal relationships, and the incorporation of different methods.
Hybrid Tree Tensor Networks for Quantum Simulation
Julian Schuhmacher, Marco Ballarin, Alberto Baiardi
et al.
Hybrid tensor networks (hTNs) offer a promising solution for encoding variational quantum states beyond the capabilities of efficient classical methods or noisy quantum computers alone. However, their practical usefulness and many operational aspects of hTN-based algorithms, such as the optimization of hTNs, the generalization of standard contraction rules to a hybrid setting, and the design of application-oriented architectures, have not yet been thoroughly investigated. In this work, we introduce a novel algorithm to perform ground-state optimizations with hybrid tree tensor networks (hTTNs), discuss its advantages and roadblocks, and identify a set of promising applications. We benchmark our approach on two paradigmatic models, namely the Ising model at the critical point and the toric-code Hamiltonian. In both cases, we successfully demonstrate that hTTNs can improve upon their classical equivalents with equal bond dimension in the classical part.
Physics, Computer software
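For reference, the "Ising model at the critical point" used as a tensor-network benchmark is commonly the transverse-field Ising chain written below; whether the paper uses this exact one-dimensional form is an assumption.

```latex
% Transverse-field Ising chain; the model is critical at g/J = 1.
H = -J \sum_{i} \sigma^{z}_{i}\,\sigma^{z}_{i+1} - g \sum_{i} \sigma^{x}_{i},
\qquad g/J = 1 \ \text{(critical point)}.
```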
From ideal to practical: Heterogeneity of student-generated variant lists highlights hidden reproducibility gaps.
Rumeysa Aslıhan Ertürk, Abdullah Asım Emül, Büşra Nur Darendeli-Kiraz
et al.
Next-generation sequencing (NGS) technologies offer detailed and inexpensive identification of the genetic structure of living organisms. The massive data volume necessitates the utilization of advanced computational resources for analyses. However, the rapid accumulation of data and the urgent need for analysis tools have caused the development of imperfect software solutions. Given their immense potential in clinical applications and the recent reproducibility crisis discussions in science and technology, these tools must be thoroughly examined. Typically, NGS data analysis tools are benchmarked under homogeneous conditions, with well-trained personnel and ideal hardware and data environments. However, in the real world, these analyses are done under heterogeneous conditions in terms of computing environments and experience levels. This difference is mostly overlooked, therefore studies that examine NGS workflows generated under various conditions would be highly valuable. Moreover, a detailed assessment of the difficulties faced by the trainees would allow for improved educational programs for better NGS analysis training. Considering these needs, we designed an elective undergraduate bioinformatics course project for computer engineering students at Istanbul Technical University. Students were tasked to perform and compare 12 different somatic variant calling pipelines on the recently published SEQC2 dataset. Upon examining the results, we have realized that despite seeming correct, the final variant lists created by different student groups display a high level of heterogeneity. Notably, the operating systems and installation methods were the most influential factors in variant-calling performance. Here, we present detailed evaluations of our case study and provide insights for better bioinformatics training.
Bio-Volumentations: A Python library for augmentation of volumetric image sequences
Lucia Hradecká, Filip Lux, Samuel Šul’an
et al.
Data augmentation is a widely used technique to increase the generalization ability of deep learning models, especially when dealing with sparse training data. It is also crucial in biomedical applications, where annotated images are extremely rare due to high image dimensionality and expensive data acquisition processes. However, existing image augmentation toolboxes are not suitable for biomedical applications: they usually only support low-dimensional images or very few annotation types. To address this issue, we developed Bio-Volumentations, a Python library for transforming multidimensional biomedical images with image- and point-based annotations. Thanks to its universality, user-friendly interface, and independence from deep learning toolboxes, it facilitates efficient data preprocessing and augmentation in various computer vision tasks.
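As a minimal sketch of the kind of operation such a library automates (plain NumPy code, not Bio-Volumentations' actual API), the snippet below flips a 3D volume, its segmentation mask, and point annotations consistently:

```python
import numpy as np

def random_flip_volume(image, mask, points, axis=0, p=0.5, rng=None):
    """Flip a (Z, Y, X) volume, its mask, and keypoints along one spatial axis."""
    rng = rng or np.random.default_rng()
    if rng.random() >= p:
        return image, mask, points
    image = np.flip(image, axis=axis)
    mask = np.flip(mask, axis=axis)
    points = points.copy()
    points[:, axis] = image.shape[axis] - 1 - points[:, axis]  # mirror coordinates
    return image, mask, points

# Example usage on synthetic data:
img = np.random.rand(16, 64, 64)             # single-channel volume
msk = (img > 0.5).astype(np.uint8)           # toy segmentation mask
pts = np.array([[2, 10, 20], [5, 30, 40]])   # (z, y, x) keypoints
img2, msk2, pts2 = random_flip_volume(img, msk, pts, axis=1)
```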
Survey of Adversarial Attack and Defense for RGB and Infrared Multimodal Object Detection
ZHENG Haibin, LIN Xiuhao, CHEN Jingwen, CHEN Jinyin
Object detection, as a fundamental and classic task in the field of computer vision, has a wide range of applications. Deep learning based object detection algorithms have become the mainstream of current research due to their superior performance. However, most object detection algorithms only perform single-modality detection on visible or infrared images. In general, visible images yield poor imaging in harsh weather, at night, and in scenes where targets are obstructed, leading to a decrease in detection performance. Infrared images can mitigate these issues, but they may miss some details of the target. Therefore, multimodal fusion detection algorithms based on visible-light and infrared images are gradually emerging. However, existing research has focused on improving the performance of multimodal object detection algorithms, and research on their security is relatively scattered. Based on existing research, this paper provides an overview of the security of multimodal object detection in adversarial settings. First, a theoretical analysis of multimodal object detection and of adversarial attack and defense is conducted. Second, multimodal object detection methods are classified and summarized according to when fusion is performed in the detection pipeline. Then, existing adversarial attack and defense methods for object detection are summarized and organized, together with the existing datasets and main evaluation metrics for multimodal object detection. Finally, potential future research directions for multimodal object detection are discussed, to further promote the development and application of multimodal object detection in adversarial security research.
Computer software, Technology (General)
Quantum annealing-assisted lattice optimization
Zhihao Xu, Wenjie Shang, Seongmin Kim
et al.
High-entropy alloys (HEAs) have drawn great interest due to their exceptional properties compared to conventional materials. The configuration of an HEA system is considered key to its superior properties, but exhaustively searching all possible configurations of atom coordinates and species to find the ground-energy state is extremely challenging. In this work, we propose a quantum annealing-assisted lattice optimization (QALO) algorithm, an active-learning framework that integrates a Field-aware Factorization Machine (FFM) as the surrogate model for lattice energy prediction, Quantum Annealing (QA) as the optimizer, and a Machine Learning Potential (MLP) for ground-truth energy calculation. By applying our algorithm to the NbMoTaW alloy, we reproduce the Nb depletion and W enrichment observed in bulk HEAs. We find that our optimized HEAs have superior mechanical properties compared to randomly generated alloy configurations. Our algorithm highlights the potential of quantum computing in materials design and discovery, laying a foundation for further exploring and optimizing structure-property relationships.
Materials of engineering and construction. Mechanics of materials, Computer software
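The loop below is a schematic sketch of the active-learning cycle described in the abstract; the linear surrogate, random-sampling "annealer", and toy energy function are stand-ins for the paper's FFM, quantum annealer, and machine-learning potential.

```python
import random

def surrogate_energy(config, weights):
    """Toy linear surrogate: weighted sum over binary site occupations."""
    return sum(w * x for w, x in zip(weights, config))

def refit_surrogate(dataset, n_sites, lr=0.05, epochs=50):
    """Crude least-squares refit of the linear surrogate on all evaluated configs."""
    weights = [0.0] * n_sites
    for _ in range(epochs):
        for config, energy in dataset:
            err = surrogate_energy(config, weights) - energy
            for i, x in enumerate(config):
                weights[i] -= lr * err * x
    return weights

def propose_config(weights, n_sites, n_samples=500):
    """Stand-in for the annealer: sample candidates, keep the lowest surrogate energy."""
    candidates = [[random.randint(0, 1) for _ in range(n_sites)] for _ in range(n_samples)]
    return min(candidates, key=lambda c: surrogate_energy(c, weights))

def true_energy(config):
    """Stand-in for the machine-learning potential (an arbitrary toy energy)."""
    return sum((x - 0.3) ** 2 for x in config)

n_sites, dataset = 12, []
weights = [0.0] * n_sites
for iteration in range(5):
    candidate = propose_config(weights, n_sites)   # optimizer proposes a configuration
    energy = true_energy(candidate)                # ground-truth evaluation
    dataset.append((candidate, energy))            # grow the training set
    weights = refit_surrogate(dataset, n_sites)    # retrain the surrogate
print("best energy found:", min(e for _, e in dataset))
```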
Automating Docker image deployment across network-segmented environments
Guillermo Bermejo, Ángel Macías, Juan Luis Herrera
et al.
In security-sensitive or regulated environments, such as banking, healthcare, or industrial control systems, strict network segmentation policies prevent direct communication between development and production infrastructure. As a result, software delivery processes in these contexts often rely on manual workflows, including detecting new Docker images, transferring them across isolated domains, and manually applying deployment updates. This paper presents a self-managed, lightweight CI/CD framework specifically designed for such disconnected environments. Rather than managing containers directly, the system automates a critical subset of the DevOps workflow: the detection, transfer, and deployment of updated Docker images across network-isolated zones. It operates from a bastion host with access to both segments, utilizing open-source tools: Diun for monitoring external Docker registries, Skopeo for transferring images securely between registries, and kubectl for updating the corresponding Kubernetes deployments. Notifications are sent via Postfix to maintain traceability at every stage of the process. The main contribution of this work lies in adapting DevOps automation principles to segmented infrastructures without relying on cloud services or central control, a scenario largely unsupported by existing tools. The proposed solution requires no internet access, cloud platforms, or third-party services, making it suitable for environments with strict connectivity restrictions. It is modular, reproducible, and vendor-neutral. Validation in a simulated enterprise scenario confirms the system's reliability across both successful and failure cases. By targeting the image propagation stage of the deployment pipeline, this work contributes a practical, focused automation tool for CI/CD under constrained network conditions. Source code and deployment artifacts are publicly available to facilitate reuse in similarly restricted environments.
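A condensed sketch of the image-propagation step described above, as it might run from the bastion host; registry URLs, deployment, and container names are placeholders, and the Diun detection and Postfix notification stages are omitted.

```python
import subprocess

SRC = "docker://registry.dev.example.local/myapp:1.4.2"     # development-side registry (placeholder)
DST = "docker://registry.prod.example.local/myapp:1.4.2"    # production-side registry (placeholder)

def transfer_image(src: str, dst: str) -> None:
    """Copy the image between isolated registries with Skopeo."""
    subprocess.run(["skopeo", "copy", src, dst], check=True)

def roll_deployment(deployment: str, container: str, image: str) -> None:
    """Point the Kubernetes deployment at the newly transferred image."""
    subprocess.run(
        ["kubectl", "set", "image", f"deployment/{deployment}", f"{container}={image}"],
        check=True,
    )

if __name__ == "__main__":
    transfer_image(SRC, DST)
    roll_deployment("myapp", "myapp", "registry.prod.example.local/myapp:1.4.2")
```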
Small Object Detection Algorithm for Aerial Photography Based on Improved YOLOv3
XI Qi, WANG Mingjie, WEI Jinghe, ZHAO Wei
This study presents an improved You Only Look Once version 3 (YOLOv3) algorithm for small object detection, to address problems such as low detection precision for small objects, missed detections, and false detections. First, in terms of network structure, the feature extraction capability of the backbone network is improved by using DenseNet-121, a Densely Connected Network (DenseNet), to replace the original Darknet-53 as the basic network. Simultaneously, the convolution kernel size is modified to further reduce the loss of feature map information and enhance the robustness of the detection model to small objects. A fourth feature detection layer with a size of 104×104 pixels is added. Second, bilinear interpolation replaces the original nearest-neighbor interpolation for upsampling operations, to mitigate the severe feature loss observed in most detection algorithms. Finally, in terms of the loss function, Generalized Intersection over Union (GIoU) is used instead of Intersection over Union (IoU) to compute the bounding-box regression loss, and the Focal Loss function is introduced as the confidence loss of the bounding box. Experimental results show that the mean Average Precision (mAP) of the improved algorithm on the VisDrone2019 dataset is 63.3%, which is 13.2 percentage points higher than that of the original YOLOv3 detection model, with a detection speed of 52 frames per second on a GTX 1080 Ti device. The improved algorithm has good detection performance for small objects.
Computer engineering. Computer hardware, Computer software
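A minimal sketch of the GIoU term mentioned above, with boxes given as (x1, y1, x2, y2); the Focal Loss confidence term used in the paper is not reproduced here.

```python
def giou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union area
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest enclosing box
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    enclose = (cx2 - cx1) * (cy2 - cy1)
    return iou - (enclose - union) / enclose if enclose > 0 else iou

def giou_loss(box_a, box_b):
    """Bounding-box regression loss: 1 - GIoU, used in place of the IoU loss."""
    return 1.0 - giou(box_a, box_b)

# Example: a predicted box versus a ground-truth box
print(giou_loss((10, 10, 50, 50), (20, 20, 60, 60)))
```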
SWE-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering
Zhimin Zhao
Foundation models (FMs), particularly large language models (LLMs), have shown significant promise in various software engineering (SE) tasks, including code generation, debugging, and requirement refinement. Despite these advances, existing evaluation frameworks are insufficient for assessing model performance in the iterative, context-rich workflows characteristic of SE activities. To address this limitation, we introduce SWE-Arena, an interactive platform designed to evaluate FMs in SE tasks. SWE-Arena provides a transparent, open-source leaderboard, supports multi-round conversational workflows, and enables end-to-end model comparisons. The platform introduces novel metrics, including a model consistency score, which measures the consistency of model outputs through self-play matches, and a conversation efficiency index, which evaluates model performance while accounting for the number of interaction rounds required to reach conclusions. Moreover, SWE-Arena incorporates a new feature called RepoChat, which automatically injects repository-related context (e.g., issues, commits, pull requests) into the conversation, further aligning evaluations with real-world development processes. This paper outlines the design and capabilities of SWE-Arena, emphasizing its potential to advance the evaluation and practical application of FMs in software engineering.
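The abstract does not give exact formulas for its metrics, so the sketch below shows only one plausible reading of a self-play consistency score: how often independently sampled responses of the same model to the same prompt agree.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def pairwise_similarity(a: str, b: str) -> float:
    """Cheap textual agreement proxy between two model responses."""
    return SequenceMatcher(None, a, b).ratio()

def consistency_score(responses: list) -> float:
    """Mean pairwise agreement over repeated runs on the same prompt."""
    pairs = list(combinations(responses, 2))
    return mean(pairwise_similarity(a, b) for a, b in pairs) if pairs else 1.0

# Example with three sampled responses to the same coding prompt:
runs = ["def add(a, b): return a + b",
        "def add(a, b):\n    return a + b",
        "def add(x, y): return x + y"]
print(round(consistency_score(runs), 3))
```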
AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software
Rui Yang, Michael Fu, Chakkrit Tantithamthavorn
et al.
Guardrails are critical for the safe deployment of software powered by Large Language Models (LLMs). Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions, opening the door to jailbreak attacks through user inputs. Guardrails serve as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks can still succeed over 70% of the time, even against advanced models like GPT-4o. While guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows that their performance can drop sharply, to as low as 12%, when confronted with unseen attacks. This highlights a growing software engineering challenge: how can we build a post-deployment guardrail that adapts dynamically to emerging threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. In our empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data after adaptation, outperforming other baselines. These results demonstrate that AdaptiveGuard is a guardrail capable of evolving in response to emerging jailbreak strategies after deployment. We release AdaptiveGuard and the studied datasets at https://github.com/awsm-research/AdaptiveGuard to support further research.
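The snippet below is a schematic sketch of the adapt-on-OOD idea, with a toy distance-based OOD score and a trivial memory update standing in for AdaptiveGuard's actual detector and continual-learning procedure.

```python
import numpy as np

class ToyAdaptiveGuardrail:
    def __init__(self, known_unsafe_embeddings: np.ndarray, ood_threshold: float = 2.0):
        self.memory = known_unsafe_embeddings      # embeddings of known jailbreak prompts
        self.ood_threshold = ood_threshold

    def ood_score(self, emb: np.ndarray) -> float:
        """Distance to the nearest known unsafe example; large = out-of-distribution."""
        return float(np.min(np.linalg.norm(self.memory - emb, axis=1)))

    def check(self, emb: np.ndarray) -> str:
        if self.ood_score(emb) > self.ood_threshold:
            return "flag_for_review"               # novel pattern: escalate, then adapt
        return "block"                             # resembles a known jailbreak

    def adapt(self, new_unsafe_emb: np.ndarray) -> None:
        """Continual-learning stand-in: fold the confirmed new attack into memory."""
        self.memory = np.vstack([self.memory, new_unsafe_emb])

# Example usage with random embeddings in place of real prompt encodings:
rng = np.random.default_rng(0)
guard = ToyAdaptiveGuardrail(rng.normal(size=(5, 8)))
novel = rng.normal(loc=5.0, size=8)
print(guard.check(novel))     # likely "flag_for_review"
guard.adapt(novel)
print(guard.check(novel))     # now "block"
```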
Advanced federated ensemble internet of learning approach for cloud based medical healthcare monitoring system
Rahim Khan, Sher Taj, Xuefei Ma
et al.
Medical imaging machines serve as valuable tools for monitoring and diagnosing a variety of diseases. However, manual and centralized interpretation is error-prone, time-consuming, and vulnerable to malicious attacks. Numerous diagnostic algorithms have been developed to improve precision and prevent poisoning attacks by integrating symptoms, test methods, and imaging data. In today's digital world, however, a global cloud-based diagnostic artificial intelligence model is needed that diagnoses efficiently, prevents poisoning attacks, and can be used for multiple purposes. We propose the Healthcare Federated Ensemble Internet of Learning Cloud Doctor System (FDEIoL) model, which integrates different Internet of Things (IoT) devices to provide precise and accurate interpretation without poisoning-attack problems, thereby facilitating IoT-enabled remote patient monitoring for smart healthcare systems. Furthermore, the FDEIoL system model uses a federated ensemble learning strategy to provide an automatic, up-to-date global prediction model based on local models input by medical specialists. This assures biomedical security by safeguarding patient data and preserving the integrity of diagnostic processes. The FDEIoL system model utilizes local-model feature selection to discriminate between malicious and non-malicious local models, and its ensemble strategy uses positive and negative samples to optimize performance on the test dataset, enhancing its capability for remote patient monitoring. The FDEIoL system model achieved an exceptional accuracy rate of 99.24% on the Chest X-ray dataset and 99.0% on the brain tumor MRI dataset compared to centralized models, demonstrating its ability for precision diagnosis in IoT-enabled healthcare systems.
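As a schematic sketch of federated aggregation with a simple poisoning filter (illustrating the general idea of discarding suspicious local models before averaging, rather than the FDEIoL system's actual selection and ensemble procedure):

```python
import numpy as np

def filter_and_average(local_weights, tolerance=3.0):
    """Drop local models whose weights deviate strongly from the coordinate-wise
    median (a simple poisoning filter), then average the remaining ones."""
    stacked = np.stack(local_weights)                      # (n_clients, n_params)
    center = np.median(stacked, axis=0)
    deviations = np.linalg.norm(stacked - center, axis=1)  # one score per client
    cutoff = tolerance * np.median(deviations)             # robust threshold
    kept = stacked[deviations <= cutoff]
    return kept.mean(axis=0)

# Example: four honest clients plus one "poisoned" outlier
rng = np.random.default_rng(1)
honest = [rng.normal(0.0, 0.1, size=100) for _ in range(4)]
poisoned = rng.normal(5.0, 0.1, size=100)
global_model = filter_and_average(honest + [poisoned])
print(global_model.mean())   # close to 0: the outlier was excluded
```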
Scalable training on scalable infrastructures for programmable hardware
Lorusso Marco, Bonacorsi Daniele, Travaglini Riccardo
et al.
Machine learning (ML) and deep learning (DL) techniques are increasingly influential in High Energy Physics, necessitating effective computing infrastructures and training opportunities for users and developers, particularly concerning programmable hardware such as FPGAs. A gap exists in accessible ML/DL-on-FPGA tutorials catering to diverse hardware specifications. To bridge this gap, collaborative efforts by INFN-Bologna, the University of Bologna, and INFN-CNAF produced a pilot course using virtual machines, in-house cloud platforms, and AWS instances, with Docker containers for interactive exercises. Additionally, the Bond Machine software ecosystem, capable of generating FPGA-synthesizable computer architectures, is explored as a simplified approach for teaching FPGA programming.
Artificial Intelligence Techniques and Pedigree Charts in Oncogenetics: Towards an Experimental Multioutput Software System for Digitization and Risk Prediction
Luana Conte, Emanuele Rizzo, Tiziana Grassi
et al.
Pedigree charts remain essential in oncological genetic counseling for identifying individuals with an increased risk of developing hereditary tumors. However, this valuable data source often remains confined to paper files, going unused. We propose a computer-aided detection/diagnosis system, based on machine learning and deep learning techniques, capable of: (1) assisting genetic oncologists in digitizing paper-based pedigree charts and in generating new digital ones, and (2) automatically predicting genetic predisposition risk directly from these digital pedigree charts. To the best of our knowledge, there are no similar studies in the current literature, and no artificial intelligence software applied to pedigree charts has been made public yet. By incorporating medical images and other data from the omics sciences, there is also fertile ground for training additional artificial intelligence systems, broadening the software's predictive capabilities. We plan to bridge the gap between scientific advancement and practical implementation by modernizing and enhancing existing oncological genetic counseling services. This would mark the pioneering development of an AI-based application designed to enhance various aspects of genetic counseling, leading to improved patient care and advancements in the field of oncogenetics.
Electronic computers. Computer science
Learning the optimal power flow: Environment design matters
Thomas Wolgast, Astrid Nieße
To solve the optimal power flow (OPF) problem, reinforcement learning (RL) has emerged as a promising new approach. However, the RL-OPF literature is strongly divided regarding the exact formulation of the OPF problem as an RL environment. In this work, we collect and implement diverse environment design decisions from the literature regarding training data, observation space, episode definition, and reward function choice. In an experimental analysis, we show the significant impact of these environment design options on RL-OPF training performance. Further, we derive initial recommendations regarding these design choices. The created environment framework is fully open-source and can serve as a benchmark for future research in the RL-OPF field.
Electrical engineering. Electronics. Nuclear engineering, Computer software
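A bare-bones sketch of the environment design choices discussed above (observation content, episode definition, reward shape); the power-flow computation and grid data are stubbed out, and this is not the paper's open-source framework.

```python
import numpy as np

class ToyOPFEnv:
    def __init__(self, n_buses: int = 4, episode_len: int = 1):
        self.n_buses = n_buses
        self.episode_len = episode_len     # design choice: 1-step vs. multi-step episodes
        self.t = 0
        self.loads = None

    def reset(self):
        self.t = 0
        self.loads = np.random.uniform(0.5, 1.5, self.n_buses)   # training-data choice
        return self._observation()

    def _observation(self):
        # Design choice: what the agent observes (here only the current loads).
        return self.loads.copy()

    def step(self, action: np.ndarray):
        # Action: generator setpoints in [0, 1]; the "power flow" result is stubbed.
        cost = float(np.sum(action ** 2))                          # generation-cost proxy
        imbalance = float(abs(np.sum(action) - np.sum(self.loads)))
        reward = -cost - 10.0 * imbalance                          # reward-function choice:
        self.t += 1                                                # cost plus constraint penalty
        done = self.t >= self.episode_len                          # episode-definition choice
        return self._observation(), reward, done, {}

env = ToyOPFEnv()
obs = env.reset()
obs, reward, done, info = env.step(np.full(env.n_buses, 1.0))
print(reward, done)
```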
"So Am I Dr. Frankenstein? Or Were You a Monster the Whole Time?": Mitigating Software Project Failure With Loss-Aversion-Aware Development Methodologies
Junade Ali
Case studies have shown that software disasters snowball from technical issues into catastrophes when humans cover up problems rather than address them, and empirical research has found that the psychological safety of software engineers to discuss and address problems is foundational to improving project success. However, the failure to do so can be attributed to psychological factors such as loss aversion. We conduct a large-scale study of the project success experiences of 600 software engineers in the UK and USA. Empirical evaluation finds that approaches such as ensuring clear requirements before the start of development, when loss aversion is at its lowest, correlate with 97% higher project success. The freedom of software engineers to discuss and address problems correlates with 87% higher success rates. The findings support the development of software development methodologies with a greater focus on human factors in preventing failure.
An Overview and Catalogue of Dependency Challenges in Open Source Software Package Registries
Tom Mens, Alexandre Decan
While open-source software (OSS) has enabled significant levels of reuse to speed up software development, it has also given rise to the dreaded dependency hell that all software practitioners face on a regular basis. This article provides a catalogue of the dependency-related challenges that come with relying on OSS packages or libraries. The catalogue is based on the empirical research literature aimed at understanding, quantifying, and overcoming these challenges. Our overview of this very active research field of package dependency management can be used as a starting point for junior and senior researchers, as well as practitioners, who would like to learn more about research advances in dealing with the challenges that come with the dependency networks of large OSS package registries.
Gender Differences in Self-Efficacy and Attitudes toward Computers
T. Busch
Intelligent Retrieval of Radar Reflectivity Factor With Privacy Protection Under Meteorological Satellite Remote Sensing
Huichao Lin, Xiaolong Xu, Muhammad Bilal
et al.
Meteorological radar data are essential for meteorological monitoring, forecasting, and research, and they play a crucial role in observing and warning of extreme weather risks. However, meteorological radars have some limitations, such as uneven distribution and strong topographical influence. Meteorological remote sensing satellites can partially overcome these limitations by providing a larger observational scope and high spatial and temporal resolution. Using data from meteorological remote sensing satellites to train radar reflectivity factor retrieval models can effectively compensate for missing or poor-quality radar data. However, several challenges remain, such as extracting the features of intense convective weather with unclear coverage from complex multichannel meteorological remote sensing satellite data and removing the interference caused by nonprecipitation clouds in retrieval models. Moreover, the privacy and security of remote sensing data transmission need to be ensured. In this article, we propose a novel method that combines the Advanced Encryption Standard (AES) to protect the transmission of remote sensing data, a multiscale feature fusion module to extract multiscale features from multichannel meteorological remote sensing satellite data, and an attention mechanism to reduce the interference of nonprecipitation clouds in retrieval models. We conduct comparison experiments with multiple indicators to demonstrate that our method has advantages in retrieving radar reflectivity values of different magnitudes. Our method achieves scores of 0.63, 0.36, 0.49, 0.55, and 0.99 in probability of detection, false alarm ratio, critical success index, Heidke skill score, and accuracy, respectively.
Ocean engineering, Geophysics. Cosmic physics
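For reference, the verification scores reported above can be computed from a binary contingency table as sketched below; the thresholding of reflectivity values into events, which the paper would apply first, is assumed and omitted.

```python
def verification_scores(hits: int, false_alarms: int, misses: int, correct_negatives: int):
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    n = a + b + c + d
    pod = a / (a + c) if (a + c) else 0.0                        # probability of detection
    far = b / (a + b) if (a + b) else 0.0                        # false alarm ratio
    csi = a / (a + b + c) if (a + b + c) else 0.0                # critical success index
    hss_denom = (a + c) * (c + d) + (a + b) * (b + d)
    hss = 2 * (a * d - b * c) / hss_denom if hss_denom else 0.0  # Heidke skill score
    acc = (a + d) / n if n else 0.0
    return {"POD": pod, "FAR": far, "CSI": csi, "HSS": hss, "ACC": acc}

# Example contingency table (made-up counts):
print(verification_scores(hits=63, false_alarms=35, misses=37, correct_negatives=865))
```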