Daniel Mendez, Paris Avgeriou, Marcos Kalinowski
et al.
Empirical Software Engineering has received much attention in recent years and has become a de facto standard for scientific practice in Software Engineering. However, while extensive guidelines are nowadays available for designing, conducting, reporting, and reviewing empirical studies, similar attention has not yet been paid to teaching empirical software engineering. Closing this gap is the scope of this edited book. In the following editorial introduction, we, the editors, set the foundation by laying out the larger context of the discipline and positioning the remainder of this book within it.
Maximilian Reisecker, Cynthia Marcelino, Thomas Pusztai
et al.
Serverless computing offers elastic scaling and pay-per-use execution, making it well suited for AI workloads. As these workloads run in heterogeneous environments such as the Edge-Cloud-Space 3D Continuum, they often require intensive parallel computation, which GPUs can perform far more efficiently than CPUs. However, current platforms struggle to manage hardware acceleration effectively, as static user-device assignments fail to ensure SLO compliance under varying loads or placements, and one-time dynamic selections often lead to suboptimal or cost-inefficient configurations. To address these issues, we present Gaia, a GPU-as-a-service model and architecture that makes hardware acceleration a platform concern. Gaia combines (i) a lightweight Execution Mode Identifier that inspects function code at deploy time to emit one of four execution modes, and (ii) a Dynamic Function Runtime that continuously reevaluates user-defined SLOs to promote or demote functions between CPU and GPU backends. Our evaluation shows that Gaia seamlessly selects the best hardware acceleration for the workload, reducing end-to-end latency by up to 95%. These results indicate that Gaia enables SLO-aware, cost-efficient acceleration for serverless AI across heterogeneous environments.
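As a rough illustration of the runtime behavior described above, the following Python sketch shows how an SLO-driven controller could promote or demote a function between CPU and GPU backends. The mode names, thresholds, and class names are assumptions for illustration only, not Gaia's actual API.

from dataclasses import dataclass
from enum import Enum
from statistics import mean

class ExecutionMode(Enum):
    # The four mode names below are illustrative assumptions, not Gaia's.
    CPU_ONLY = "cpu-only"
    GPU_ONLY = "gpu-only"
    CPU_PREFERRED = "cpu-preferred"
    GPU_PREFERRED = "gpu-preferred"

@dataclass
class FunctionState:
    mode: ExecutionMode
    latency_slo_ms: float
    recent_latencies_ms: list

def select_backend(state: FunctionState) -> str:
    """Re-evaluate the SLO: promote to GPU on violation, demote to CPU when
    there is ample headroom; pinned modes are never switched."""
    if state.mode is ExecutionMode.CPU_ONLY:
        return "cpu"
    if state.mode is ExecutionMode.GPU_ONLY:
        return "gpu"
    observed = mean(state.recent_latencies_ms)
    if observed > state.latency_slo_ms:
        return "gpu"  # promote: the cheaper backend violates the SLO
    if observed < 0.5 * state.latency_slo_ms:
        return "cpu"  # demote: large headroom, prefer the cheaper backend
    return "cpu" if state.mode is ExecutionMode.CPU_PREFERRED else "gpu"

# Example: a CPU-preferred function whose recent latencies breach a 100 ms SLO.
state = FunctionState(ExecutionMode.CPU_PREFERRED, 100.0, [140.0, 120.0, 160.0])
print(select_backend(state))  # -> "gpu"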
Shvetank Prakash, Andrew Cheng, Arya Tschand
et al.
The field of computer architecture, which bridges high-level software abstractions and low-level hardware implementations, remains absent from current large language model (LLM) evaluations. To address this gap, we present QuArch (pronounced 'quark'), the first benchmark designed to facilitate the development and evaluation of LLM knowledge and reasoning capabilities specifically in computer architecture. QuArch provides a comprehensive collection of 2,671 expert-validated question-answer (QA) pairs covering various aspects of computer architecture, including processor design, memory systems, and interconnection networks. Our evaluation reveals that while frontier models possess domain-specific knowledge, they struggle with skills that require higher-order thinking in computer architecture. Frontier model accuracies vary widely (from 34% to 72%) on these advanced questions, highlighting persistent gaps in architectural reasoning across analysis, design, and implementation QAs. By holistically assessing fundamental skills, QuArch provides a foundation for building and measuring LLM capabilities that can accelerate innovation in computing systems. With over 140 contributors from 40 institutions, this benchmark represents a community effort to set the standard for architectural reasoning in LLM evaluation.
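A toy scoring harness (not part of the QuArch release; field names and the sample question are illustrative assumptions) shows how accuracy over such expert-validated QA pairs could be computed for a candidate model:

def score(qa_pairs, answer_fn):
    """answer_fn maps (question, choices) -> the model's chosen answer string."""
    correct = sum(
        1 for qa in qa_pairs
        if answer_fn(qa["question"], qa["choices"]) == qa["answer"]
    )
    return correct / len(qa_pairs)

qa_pairs = [
    {
        "question": "Which structure reduces average memory access latency?",
        "choices": ["cache", "ALU", "branch predictor", "DMA engine"],
        "answer": "cache",
    },
]
# Trivial baseline that always picks the first choice.
print(score(qa_pairs, lambda question, choices: choices[0]))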
We like and need Information and Communications Technologies (ICT) for data processing. This is measurable in the exponential growth of data processed by ICT, e.g., ICT for cryptocurrency mining and search engines. So far, the energy demand for computing technology has increased by a factor of 1.38 every ten years due to the exponentially increasing use of ICT systems as computing devices. The energy consumption of ICT systems is expected to rise from 1500 TWh (8% of global electricity consumption) in 2010 to 5700 TWh (14% of global electricity consumption) in 2030. A large part of this energy is required for the continuous data transfer between the separate memory and processor units which constitute the main components of ICT computing devices in the von Neumann architecture. At the same time, this massively slows down the computing power of ICT systems in the von Neumann architecture. In addition, due to the increasing complexity of AI compute algorithms, since 2010 the compute demand for AI training has increased tenfold every year, for example in the period from 2010 to 2020 from 1x10^{-6} to 1x10^{+4} Petaflops/Day. It has been theoretically predicted that ICT systems in a neuromorphic computer architecture will circumvent all of this through the use of merged memory and processor units. However, the core hardware element for this has not yet been realized. In this work we discuss the perspectives for non-linear resistive switches as the core hardware element for merged memory and processor units in neuromorphic computers.
Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh
Deep learning (DL) techniques have achieved significant success in various software engineering tasks (e.g., code completion by Copilot). However, DL systems are prone to bugs from many sources, including training data. Existing literature suggests that bugs in training data are highly prevalent, but little research has focused on understanding their impacts on the models used in software engineering tasks. In this paper, we address this research gap through a comprehensive empirical investigation focused on three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based. Using state-of-the-art baselines, we compare the models trained on clean datasets with those trained on datasets with quality issues and without proper preprocessing. By analysing the gradients, weights, and biases from neural networks under training, we identify the symptoms of data quality and preprocessing issues. Our analysis reveals that quality issues in code data cause biased learning and gradient instability, whereas problems in text data lead to overfitting and poor generalisation of models. On the other hand, quality issues in metric data result in exploding gradients and model overfitting, and inadequate preprocessing exacerbates these effects across all three data types. Finally, we demonstrate the validity and generalizability of our findings using six new datasets. Our research provides a better understanding of the impact and symptoms of data bugs in software engineering datasets. Practitioners and researchers can leverage these findings to develop better monitoring systems and data-cleaning methods to help detect and resolve data bugs in deep learning systems.
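As a hedged illustration of the kind of training-time monitoring that can surface such symptoms, the PyTorch sketch below flags exploding or vanishing gradient norms during a training step. The model, thresholds, and data are placeholders, not the paper's actual instrumentation.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

def training_step(x, y, explode_threshold=1e3, vanish_threshold=1e-6):
    """One step of training that also reports the global gradient norm."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    grad_norm = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
    ).item()
    if grad_norm > explode_threshold:
        print(f"warning: exploding gradients (norm={grad_norm:.2e})")
    elif grad_norm < vanish_threshold:
        print(f"warning: vanishing gradients (norm={grad_norm:.2e})")
    optimizer.step()
    return loss.item(), grad_norm

# Example usage with random data standing in for a metric-based dataset.
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss, norm = training_step(x, y)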
Amir Hossein Jalilvand, Faeze S. Banitaba, Seyedeh Newsha Estiri
et al.
Sorting is a fundamental operation in various applications and a traditional research topic in computer science. Improving the performance of sorting operations can have a significant impact on many application domains. For high-performance sorting, much attention has been paid to hardware-based solutions. These are often realized with application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs). Recently, in-memory sorting solutions have also been proposed to address the cost of data movement between memory and processing units, also known as the von Neumann bottleneck. Due to the complexity of sorting algorithms, achieving an efficient hardware implementation for sorting data is challenging. A large body of prior solutions is built on compare-and-swap (CAS) units; these are categorized as comparison-based sorting. Some recent solutions offer comparison-free sorting. In this survey, we review the latest works in the area of hardware-based sorting. We also discuss recent hardware solutions for partial and stream sorting. Finally, we discuss some important concerns that need to be considered in future designs of sorting systems.
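For readers unfamiliar with CAS-based designs, the following Python model (an illustration, not code from the surveyed works) emulates an odd-even transposition network built entirely from compare-and-swap units, the kind of fixed, data-independent structure that maps naturally to hardware:

def compare_and_swap(data, i, j):
    """Single CAS unit: order the pair (data[i], data[j]) in place."""
    if data[i] > data[j]:
        data[i], data[j] = data[j], data[i]

def odd_even_transposition_sort(data):
    """n rounds of fixed, data-independent CAS operations sort n elements."""
    n = len(data)
    for round_idx in range(n):
        start = round_idx % 2          # alternate even and odd pairings
        for i in range(start, n - 1, 2):
            compare_and_swap(data, i, i + 1)
    return data

print(odd_even_transposition_sort([7, 3, 9, 1, 4, 8, 2, 6]))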
Emerging AI applications such as ChatGPT, graph convolutional networks, and other deep neural networks require massive computational resources for training and inference. Contemporary computing platforms such as CPUs, GPUs, and TPUs are struggling to keep up with the demands of these AI applications. Non-coherent optical computing represents a promising approach for light-speed acceleration of AI workloads. In this paper, we show how cross-layer design can overcome challenges in non-coherent optical computing platforms. We describe approaches for optical device engineering, tuning circuit enhancements, and architectural innovations to adapt optical computing to a variety of AI workloads. We also discuss techniques for hardware/software co-design that can intelligently map and adapt AI software to improve its performance on non-coherent optical computing platforms.
Luca Ballotta, Giovanni Peserico, Francesco Zanini
et al.
We consider a network of smart sensors for an edge computing application that sample a time-varying signal and send updates to a base station for remote global monitoring. Sensors are equipped with both sensing and computing capabilities, and can either send raw data or process them on board before transmission. Limited hardware resources at the edge create a fundamental latency-accuracy trade-off: raw measurements are inaccurate but timely, whereas accurate processed updates are available only after a processing delay. Hence, one needs to decide when sensors should transmit raw measurements and when they should rely on local processing to maximize network monitoring performance. To tackle this sensing design problem, we formulate an estimation-theoretic optimization framework that embeds both computation and communication latency, and propose a Reinforcement Learning-based approach that dynamically allocates computational resources at each sensor. The effectiveness of our proposed approach is validated through numerical experiments motivated by smart sensing for the Internet of Drones and self-driving vehicles. In particular, we show that, under constrained computation at the base station, monitoring performance can be further improved by online sensor selection.
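A toy numerical example (not the paper's model; all variances and delays are made up) illustrates the latency-accuracy trade-off: for a random-walk-like signal, staleness adds error proportional to the delay, so whether a raw or processed update minimizes the error at the base station depends on how fast the signal varies.

def remote_error(measurement_var, delay_s, process_noise_var_per_s):
    """Approximate estimation error at the base station for one update:
    measurement noise plus error accumulated while the update is in flight."""
    return measurement_var + process_noise_var_per_s * delay_s

def choose_update(raw_var, raw_delay, proc_var, proc_delay, q):
    err_raw = remote_error(raw_var, raw_delay, q)
    err_proc = remote_error(proc_var, proc_delay, q)
    return ("raw", err_raw) if err_raw <= err_proc else ("processed", err_proc)

# Fast-varying signal (large q): the timely raw measurement wins.
print(choose_update(raw_var=4.0, raw_delay=0.05, proc_var=0.5, proc_delay=0.5, q=10.0))
# Slow-varying signal (small q): the accurate processed update wins.
print(choose_update(raw_var=4.0, raw_delay=0.05, proc_var=0.5, proc_delay=0.5, q=0.1))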
We study the parameterized complexity of computing the tree-partition-width, a graph parameter equivalent to treewidth on graphs of bounded maximum degree. On one hand, we can obtain approximations of the tree-partition-width efficiently: we show that there is an algorithm that, given an $n$-vertex graph $G$ and an integer $k$, constructs a tree-partition of width $O(k^7)$ for $G$ or reports that $G$ has tree-partition-width more than $k$, in time $k^{O(1)}n^2$. We can improve slightly on the approximation factor by sacrificing the dependence on $k$, or on $n$. On the other hand, we show the problem of computing tree-partition-width exactly is XALP-complete, which implies that it is $W[t]$-hard for all $t$. We deduce XALP-completeness of the problem of computing the domino treewidth. Next, we adapt some known results on the parameter tree-partition-width and the topological minor relation, and use them to compare tree-partition-width to tree-cut width. Finally, for the related parameter weighted tree-partition-width, we give a similar approximation algorithm (with ratio now $O(k^{15})$) and show XALP-completeness for the special case where vertices and edges have weight 1.
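For reference, the standard definition of the parameter (not restated in the abstract) is: a tree-partition of a graph $G$ is a pair $(T, \{B_x\}_{x \in V(T)})$, where $T$ is a tree and the bags $B_x$ partition $V(G)$, such that every edge of $G$ either has both endpoints in one bag or has its endpoints in bags $B_x, B_y$ with $xy \in E(T)$; the width of the tree-partition is $\max_{x \in V(T)} |B_x|$, and the tree-partition-width of $G$ is the minimum width over all its tree-partitions.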
Danyllo Valente da Silva, Bruno Pedraça de Souza, Taisa Guidini Gonçalves
et al.
Contemporary software systems (CSS), such as Internet of Things (IoT) based software systems, incorporate concerns and characteristics that conventional software systems do not, involving the network, software, hardware, context awareness, interoperability, and others. In this sense, requirements engineering (RE) plays a fundamental role in ensuring that these software systems are developed correctly with respect to business and end-user needs. Several software technologies supporting RE are available in the literature, but many do not cover all CSS specificities, notably those based on IoT. This research article presents RETIoT (Requirements Engineering Technology for the Internet of Things based software systems), which aims to provide methodological, technical, and tooling support for producing IoT software system requirements documents. It is composed of an IoT scenario description technique, a checklist to verify IoT scenarios, construction processes, and templates for IoT software systems. A feasibility study was carried out in IoT system projects to observe its templates and identify improvement opportunities. The results indicate the feasibility of the RETIoT templates when used to capture IoT characteristics. However, further experimental studies remain research opportunities to strengthen confidence in its elements (construction process, techniques, and templates) and to capture end-user perception.
Privacy and energy are primary concerns for sensor devices that offload compute to a potentially untrusted edge server or cloud. Homomorphic Encryption (HE) enables offload processing of encrypted data. HE offload processing retains data privacy, but is limited by the need for frequent communication between the client device and the offload server. Existing client-aided encrypted computing systems are optimized for performance on the offload server, failing to sufficiently address client costs, and precluding HE offload for low-resource (e.g., IoT) devices. We introduce Client-aided HE for Opaque Compute Offloading (CHOCO), a client-optimized system for encrypted offload processing. CHOCO introduces rotational redundancy, an algorithmic optimization to minimize computing and communication costs. We design Client-Aided HE for Opaque Compute Offloading Through Accelerated Cryptographic Operations (CHOCO-TACO), a comprehensive architectural accelerator for client-side cryptographic operations that eliminates most of their time and energy costs. Our evaluation shows that CHOCO makes client-aided HE offloading feasible for resource-constrained clients. Compared to existing encrypted computing solutions, CHOCO reduces communication cost by up to 2948x. With hardware support, client-side encryption/decryption is faster by 1094x and uses 648x less energy. In our end-to-end implementation of a large-scale DNN (VGG16), CHOCO uses 37% less energy than local (unencrypted) computation.
Nowadays, traditional home gateways must support increasingly complex applications while keeping their cost reasonably low. Software Defined Networking (SDN) would simplify the management of those devices, but such an approach is typically reserved for new hardware devices specifically engineered for this paradigm. As a consequence, a typical SDN-based home gateway performs the switching in software, resulting in non-negligible performance degradation. In this paper, we provide our experience and findings from adding OpenFlow support to a non-OpenFlow-compatible home gateway, exploiting the hardware speedup available in the existing platform. We present our solution, which transparently offloads a portion of the OpenFlow rules into the hardware while keeping the remaining ones in software, and is able to support the presence of multiple hardware tables with different sets of features. Moreover, we illustrate the design choices used to implement the functionalities required by the OpenFlow protocol (e.g., packet-in and packet-out messages) and, finally, we evaluate the resulting architecture, showing the significant performance advantage that can be achieved by exploiting the underlying hardware while maintaining the SDN-style ability to program and instantiate desired network operations from a central controller.
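The rule-placement logic can be pictured with a small hypothetical Python sketch; the table names, supported fields, and capacities below are made-up illustrations and do not reflect the gateway's actual pipeline.

# Split OpenFlow rules between hardware tables with limited match support
# and a software fallback table (hypothetical sketch, not the paper's code).
HW_TABLES = [
    {"name": "hw_l2", "supported_fields": {"eth_dst", "vlan_id"}, "capacity": 256},
    {"name": "hw_l3", "supported_fields": {"ipv4_dst"}, "capacity": 128},
]

def place_rule(rule, hw_tables, hw_usage):
    """Offload a rule to the first hardware table that supports all of its
    match fields and still has free entries; otherwise keep it in software."""
    for table in hw_tables:
        fits = set(rule["match"]) <= table["supported_fields"]
        has_room = hw_usage[table["name"]] < table["capacity"]
        if fits and has_room:
            hw_usage[table["name"]] += 1
            return table["name"]
    return "software"

usage = {t["name"]: 0 for t in HW_TABLES}
rules = [
    {"match": {"eth_dst": "aa:bb:cc:dd:ee:ff"}, "action": "output:1"},
    {"match": {"ipv4_dst": "10.0.0.0/24"}, "action": "output:2"},
    {"match": {"tcp_dst": 80}, "action": "drop"},  # not supported in hardware
]
for r in rules:
    print(place_rule(r, HW_TABLES, usage), r["action"])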
We propose Atomic Active Messages (AAM), a mechanism that accelerates irregular graph computations on both shared- and distributed-memory machines. The key idea behind AAM is that hardware transactional memory (HTM) can be used for simple and efficient processing of irregular structures in highly parallel environments. We illustrate techniques such as coarsening and coalescing that enable hardware transactions to considerably accelerate graph processing. We conduct a detailed performance analysis of AAM on Intel Haswell and IBM Blue Gene/Q and we illustrate various performance tradeoffs between different HTM parameters that impact the efficiency of graph processing. AAM can be used to implement abstractions offered by existing programming models and to improve the performance of irregular graph processing codes such as Graph500 or Galois.
Recent research on robotics has shown significant improvements, spanning from algorithms and mechanics to hardware architectures. Robots, including manipulators, legged robots, drones, and autonomous vehicles, are now widely applied in diverse scenarios. However, the high computation and data complexity of robotic algorithms pose great challenges to their application. On the one hand, CPU platforms are flexible enough to handle multiple robotic tasks, and GPU platforms have higher computational capacity and easy-to-use development frameworks, so they have been widely adopted in several applications. On the other hand, FPGA-based robotic accelerators are becoming increasingly competitive alternatives, especially in latency-critical and power-limited scenarios. With specially designed hardware logic and algorithm kernels, FPGA-based accelerators can surpass CPUs and GPUs in performance and energy efficiency. In this paper, we give an overview of previous work on FPGA-based robotic accelerators covering different stages of the robotic system pipeline. An analysis of software and hardware optimization techniques and the main technical issues is presented, along with some commercial and space applications, to serve as a guide for future work.
There are several approaches for using computers to derive mathematical proofs. To illustrate them, we provide an in-depth study of using computer support for proving one complex combinatorial conjecture -- the correctness of a strategy for the chess KRK endgame. The final, machine-verifiable result presented in this paper is that there is a winning strategy for white in the KRK endgame generalized to an $n \times n$ board (for natural $n$ greater than $3$). We demonstrate that different approaches to computer-based theorem proving work best in synergy, and that the technology currently available is powerful enough to provide significant help to humans deriving complex proofs.
Moritz Schubotz, Norman Meuschke, Thomas Hepp
et al.
Mathematical expressions can be represented as a tree consisting of terminal symbols, such as identifiers or numbers (leaf nodes), and functions or operators (non-leaf nodes). Expression trees are an important mechanism for storing and processing mathematical expressions as well as the most frequently used visualization of the structure of mathematical expressions. Typically, researchers and practitioners manually visualize expression trees using general-purpose tools. This approach is laborious, redundant, and error-prone. Manual visualizations represent a user's notion of what the markup of an expression should be, but not necessarily what the actual markup is. This paper presents VMEXT - a free and open source tool to directly visualize expression trees from parallel MathML. VMEXT simultaneously visualizes the presentation elements and the semantic structure of mathematical expressions to enable users to quickly spot deficiencies in the Content MathML markup that do not affect the presentation of the expression. Identifying such discrepancies previously required reading the verbose and complex MathML markup. VMEXT also allows one to visualize similar and identical elements of two expressions. Visualizing expression similarity can support developers in designing retrieval approaches and enable improved interaction concepts for users of mathematical information retrieval systems. We demonstrate VMEXT's visualizations in two web-based applications. The first application presents the visualizations alone. The second application shows a possible integration of the visualizations in systems for mathematical knowledge management and mathematical information retrieval. The application converts LaTeX input to parallel MathML, computes basic similarity measures for mathematical expressions, and visualizes the results using VMEXT.
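As a minimal illustration (not VMEXT code) of the tree representation described above, the expression a + sin(2 * x) can be stored with operators and functions as inner nodes and identifiers and numbers as leaves:

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []   # empty list => leaf (identifier or number)

    def render(self, depth=0):
        """Print the tree with indentation reflecting the nesting depth."""
        print("  " * depth + self.label)
        for child in self.children:
            child.render(depth + 1)

# Expression tree for a + sin(2 * x).
expr = Node("+", [
    Node("a"),
    Node("sin", [Node("*", [Node("2"), Node("x")])]),
])
expr.render()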