With the advancement of Agentic AI, researchers are increasingly leveraging autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. Furthermore, missing information in the evaluation design description frequently renders the reproduction of results infeasible. To synthesize current evaluation practices for Agentic AI in SE, this study analyzes 18 papers on the topic, published or accepted by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies prevailing approaches and their limitations in evaluating Agentic AI for SE, both in current research and potential future studies. To address these shortcomings, this position paper proposes a set of guidelines and recommendations designed to empower reproducible, explainable, and effective evaluations of Agentic AI in software engineering. In particular, we recommend that Agentic AI researchers make their Thought-Action-Result (TAR) trajectories and LLM interaction data, or summarized versions of these artifacts, publicly accessible. Doing so will enable subsequent studies to more effectively analyze the strengths and weaknesses of different Agentic AI approaches. To demonstrate the feasibility of such comparisons, we present a proof-of-concept case study that illustrates how TAR trajectories can support systematic analysis across approaches.
Large Language Models, particularly decoder-only generative models such as GPT, are increasingly used to automate Software Engineering tasks. These models are primarily guided through natural language prompts, making prompt engineering a critical factor in system performance and behavior. Despite their growing role in SE research, prompt-related decisions are rarely documented in a systematic or transparent manner, hindering reproducibility and comparability across studies. To address this gap, we conducted a two-phase empirical study. First, we analyzed nearly 300 papers published at the top-3 SE conferences since 2022 to assess how prompt design, testing, and optimization are currently reported. Second, we surveyed 105 program committee members from these conferences to capture their expectations for prompt reporting in LLM-driven research. Based on the findings, we derived a structured guideline that distinguishes essential, desirable, and exceptional reporting elements. Our results reveal significant misalignment between current practices and reviewer expectations, particularly regarding version disclosure, prompt justification, and threats to validity. We present our guideline as a step toward improving transparency, reproducibility, and methodological rigor in LLM-based SE research.
Tanja E. J. Vos, Tijs van der Storm, Alexander Serebrenik
et al.
Software engineering is the invisible infrastructure of the digital age. Every breakthrough in artificial intelligence, quantum computing, photonics, and cybersecurity relies on advances in software engineering, yet the field is too often treated as a supportive digital component rather than as a strategic, enabling discipline. In policy frameworks, including major European programmes, software appears primarily as a building block within other technologies, while the scientific discipline of software engineering remains largely absent. This position paper argues that the long-term sustainability, dependability, and sovereignty of digital technologies depend on investment in software engineering research. It is a call to reclaim the identity of software engineering.
Mairieli Wessel, Daniel Feitosa, Sangeeth Kochanthara
Rising publication pressure and the routine use of generative AI tools are reshaping how software engineering research is produced, assessed, and taught. While these developments promise efficiency, they also raise concerns about skill degradation, responsibility, and trust in scholarly outputs. This vision paper employs Design Fiction as a methodological lens to examine how such concerns might materialise if current practices persist. Drawing on themes reported in a recent community survey, we construct a speculative artifact situated in a near-future research setting. The fiction is used as an analytical device rather than a forecast, enabling reflection on how automated assistance might impede domain knowledge competence, verification, and mentoring practices. By presenting an intentionally unsettling scenario, the paper invites discussion on how the software engineering research community will, in the future, define proficiency, allocate responsibility, and support learning.
The efficiency and reliability of high-speed hydraulic machinery are affected by cavitation in the bearing. Because of the confinement imposed by the bearing clearance, cavitation bubbles grow two-dimensionally. To uncover the cavitation process under confinement and high-speed shearing, the high-speed cavitating flow of different refrigerants is studied experimentally on a high-speed shearing test rig with a micro-clearance. The influence of thermophysical properties on the growth of cavitation bubbles is evaluated and analyzed. The confinement of the micro-clearance and the high-speed shearing both have a significant influence on the evolution of the cavitation bubbles. A high-speed camera is used to record the morphology of cavitation bubbles for various refrigerants with different thermophysical properties. Furthermore, a thermal-sensitive cavitation model is used to analyze the bubble-foam alternation in the cavitation flow inside the micro-clearance. For the different refrigerants, the cavitation bubble area grows exponentially. Inside the micro-clearance, the cavitation-inducing pressure drops of the different refrigerants are similar because of their similar thermodynamic properties. According to the pressure drop during cavitation, the refrigerants are classified by introducing two dimensionless numbers, σ·Re (Jie et al., 2009) [2] and σ·We. The pressure and temperature drops increase with the dimensionless numbers. Refrigerants with similar thermodynamic properties show a similar relationship between the dimensionless number and the degree of supercooling.
Defects in pipes adversely affect both the jacking construction process and long-term operational safety, yet their specific impacts on mechanical properties remain unclear. This study investigates pipe jacking segments under deflection, using the Changsha Meixi Lake project as a case study. Similarity model tests combined with digital image correlation were employed to examine the evolution of stress and deformation under various deflection angles and defect conditions. The reliability of the laboratory tests was verified through theoretical stress calculations under the non-deflection condition. The credibility of the laboratory test results was further enhanced by employing a numerical model and normalized parameters. Key findings reveal that stress distribution characteristics are jointly determined by the deflection mode and load. Co-directional deflection exhibits a more significant stress concentration effect; under identical load and angle conditions, it results in higher stress levels due to a superposition effect, whereas diagonal deflection shows a weakening effect. Joint deformation progresses through three distinct stages. The linear growth stage exhibits an initial linear strain–load relationship under stable deflection (load < 2 kN). The accelerated deformation stage is characterized by nonlinear strain growth with a slowing deformation rate (2–4 kN). The deformation deceleration stage finally shows a slow linear strain increment (load > 4 kN). Increasing load and deflection angle significantly amplify axial deformation, particularly revealing a “thick-in-the-middle, thin-at-the-sides” compression characteristic in the 45° vault zones. Furthermore, segment defects markedly exacerbate stress non-uniformity. Defect angles ≥ 60° substantially increase the frequency and amplitude of compressive stress in the vault, accelerate the decay of tensile stress at the bottom, and critically reduce structural stability.
These new findings provide significant insights for deflection control and structural safety assessment in pipe jacking engineering. The experimental framework provides fundamental insights into construction operations in upper-soft and lower-hard strata tunneling.
Qiaolin Qin, Ronnie de Souza Santos, Rodrigo Spinola
Context. The rise of generative AI (GenAI) tools like ChatGPT and GitHub Copilot has transformed how software is learned and written. In software engineering (SE) education, these tools offer new opportunities for support, but also raise concerns about over-reliance, ethical use, and impacts on learning. Objective. This study investigates how undergraduate SE students use GenAI tools, focusing on the benefits, challenges, ethical concerns, and instructional expectations that shape their experiences. Method. We conducted a survey with 130 undergraduate students from two universities. The survey combined structured Likert-scale items and open-ended questions to investigate five dimensions: usage context, perceived benefits, challenges, ethical and instructional perceptions. Results. Students most often use GenAI for incremental learning and advanced implementation, reporting benefits such as brainstorming support and confidence-building. At the same time, they face challenges including unclear rationales and difficulty adapting outputs. Students highlight ethical concerns around fairness and misconduct, and call for clearer instructional guidance. Conclusion. GenAI is reshaping SE education in nuanced ways. Our findings underscore the need for scaffolding, ethical policies, and adaptive instructional strategies to ensure that GenAI supports equitable and effective learning.
Mohammed Latif Siddiq, Arvin Islam-Gomes, Natalie Sekerak
et al.
Reproducibility is a cornerstone of scientific progress, yet its state in large language model (LLM)-based software engineering (SE) research remains poorly understood. This paper presents the first large-scale, empirical study of reproducibility practices in LLM-for-SE research. We systematically mined and analyzed 640 papers published between 2017 and 2025 across premier software engineering, machine learning, and natural language processing venues, extracting structured metadata from publications, repositories, and documentation. Guided by four research questions, we examine (i) the prevalence of reproducibility smells, (ii) how reproducibility has evolved over time, (iii) whether artifact evaluation badges reliably reflect reproducibility quality, and (iv) how publication venues influence transparency practices. Using a taxonomy of seven smell categories: Code and Execution, Data, Documentation, Environment and Tooling, Versioning, Model, and Access and Legal, we manually annotated all papers and associated artifacts. Our analysis reveals persistent gaps in artifact availability, environment specification, versioning rigor, and documentation clarity, despite modest improvements in recent years and increased adoption of artifact evaluation processes at top SE venues. Notably, we find that badges often signal artifact presence but do not consistently guarantee execution fidelity or long-term reproducibility. Motivated by these findings, we provide actionable recommendations to mitigate reproducibility smells and introduce a Reproducibility Maturity Model (RMM) to move beyond binary artifact certification toward multi-dimensional, progressive evaluation of reproducibility rigor.
The paper entitled "Qualitative Methods in Empirical Studies of Software Engineering" by Carolyn Seaman was published in TSE in 1999. It has been chosen as one of the most influential papers from the third decade of TSE's 50-year history. In this retrospective, the authors discuss the evolution of the use of qualitative methods in software engineering research, the impact it has had on research and practice, and reflections on what is coming and deserves attention.
Systems engineering (SE) is evolving with the availability of generative artificial intelligence (AI) and the demand for a systems-of-systems perspective, formalized under the purview of mission engineering (ME) in the US Department of Defense. Formulating ME problems is challenging because they are open-ended exercises that involve translating ill-defined problems into well-defined ones that are amenable to engineering development. It remains to be seen to what extent AI could assist problem formulation objectives. To that end, this paper explores the quality and consistency of multi-purpose Large Language Models (LLMs) in supporting ME problem formulation tasks, specifically focusing on stakeholder identification. We identify a relevant reference problem, a NASA space mission design challenge, and document ChatGPT-3.5's ability to perform stakeholder identification tasks. We execute multiple parallel attempts and qualitatively evaluate LLM outputs, focusing on both their quality and variability. Our findings portray a nuanced picture. We find that the LLM performs well in identifying human-focused stakeholders but poorly in recognizing external systems and environmental factors, despite explicit efforts to account for these. Additionally, LLMs struggle with preserving the desired level of abstraction and exhibit a tendency to produce solution-specific outputs that are inappropriate for problem formulation. More importantly, we document great variability among parallel threads, highlighting that LLM outputs should be used with caution, ideally by adopting a stochastic view of their abilities. Overall, our findings suggest that, while ChatGPT could reduce some expert workload, its lack of consistency and domain understanding may limit its reliability for problem formulation tasks.
Shavindra Wickramathilaka, John Grundy, Kashumi Madampe
et al.
The use of diverse mobile applications among senior users is becoming increasingly widespread. However, many of these apps contain accessibility problems that result in negative user experiences for seniors. A key reason is that software practitioners often lack the time or resources to address the broad spectrum of age-related accessibility and personalisation needs. As current developer tools and practices encourage one-size-fits-all interfaces with limited potential to address the diversity of senior needs, there is a growing demand for approaches that support the systematic creation of adaptive, accessible app experiences. To this end, we present AdaptForge, a novel model-driven engineering (MDE) approach that enables advanced design-time adaptations of mobile application interfaces and behaviours tailored to the accessibility needs of senior users. AdaptForge uses two domain-specific languages (DSLs) to address age-related accessibility needs. The first model defines users' context-of-use parameters, while the second defines conditional accessibility scenarios and corresponding UI adaptation rules. These rules are interpreted by an MDE workflow to transform an app's original source code into personalised instances. We also report evaluations with professional software developers and senior end-users, demonstrating the feasibility and practical utility of AdaptForge.
Eduard C. Groen, Kazi Rezoanur Rahman, Nikita Narsinghani
et al.
The farming domain has seen a tremendous shift towards digital solutions. However, capturing farmers' requirements regarding Digital Farming (DF) technology remains a difficult task due to domain-specific challenges. Farmers form a diverse and international crowd of practitioners who use a common pool of agricultural products and services, which means we can consider the possibility of applying Crowd-based Requirements Engineering (CrowdRE) for DF: CrowdRE4DF. We found that online user feedback in this domain is limited, necessitating a way of capturing user feedback from farmers in situ. Our solution, the Farmers' Voice application, uses speech-to-text, Machine Learning (ML), and Web 2.0 technology. A preliminary evaluation with five farmers showed good technology acceptance, and accurate transcription and ML analysis even in noisy farm settings. Our findings help to drive the development of DF technology through in-situ requirements elicitation.
Electronic equipment is indispensable in the Industry 4.0 era. Electromagnetic Compatibility (EMC) issues with electronic devices are of growing concern, and such problems are occurring more and more frequently. The operating quality of electronic equipment is increasingly adversely affected, for example through hesitation in the operation of actuating mechanisms, fires and explosions in electrical equipment, loss of information, and many other negative effects. This paper discusses the relationship of EMC scoring, Electromagnetic Interference (EMI) scoring, and Electromagnetic Susceptibility (EMS) scoring to the performance quality of electronic devices (QUA). We review the regulatory institutions governing EMC for electronic devices. To evaluate the proposed EMC structure and its relationship to electronic devices, we propose using the Partial Least Squares Structural Equation Modeling (PLS-SEM) method. The results of the model show that the electronic device layout conditions and the lack of systematic conditions have a negative impact on the operating quality of electronic equipment, while conditions relating to equipment techniques and to scientific and technological resources have positive and significant impacts.
This review presents an up-to-date account of research in multi-hazard assessment and vibration control of engineering structures. A general discussion of the importance of multi-hazard consideration in structural engineering, as well as recent advances in this area, is presented as a background. In terms of performance assessment and vibration control, various hazards are considered with an emphasis on seismic and wind loads. Although multi-hazard problems in civil engineering structures are generally discussed to some extent, the emphasis is placed on buildings, bridges, and wind turbine towers. The scientific literature in this area is vast with rapidly growing innovations. The literature is, therefore, classified by the structure type, and then, subsequently, by the hazard. Main contributions and conclusions from the reported studies are presented in summarized tables intended to provide readers with a quick reference and convenient navigation to related publications for further research. Finally, a summary of the literature review is provided with some insights on knowledge gaps and research needs.