Generative AI (GAI) holds great potential to improve software engineering productivity, but its untrustworthy outputs, particularly in code synthesis, pose significant challenges. The need for extensive verification and validation (V&V) of GAI-generated artifacts may undermine the potential productivity gains. This paper proposes a way of mitigating these risks by exploiting GAI's ability to generate multiple versions of code and tests to facilitate comparative analysis across versions. Rather than relying on the quality of a single test or code module, this "differential GAI" (D-GAI) approach promotes more reliable quality evaluation through version diversity. We introduce the Large-Scale Software Observatorium (LASSO), a platform that supports D-GAI by executing and analyzing large sets of code versions and tests. We discuss how LASSO enables rigorous evaluation of GAI-generated artifacts and propose its application in both software development and GAI research.
Software language design and implementation often involve specifications written in various esoteric meta-languages. Language workbenches generally include support for precise name-based navigation when browsing language specifications locally, but such support is lacking when browsing the same specifications online in code repositories. This paper presents a technique to support precise name-based navigation of language specifications in online repositories using ordinary web browsers. The idea is to generate hyperlinked twins: websites where verbatim copies of specification text are enhanced with hyperlinks between name references and declarations. By generating hyperlinks directly from the name binding analysis used internally in a language workbench, online navigation in hyperlinked twins is automatically consistent with local navigation. The presented technique has been implemented for the Spoofax language workbench, and used to generate hyperlinked twin websites from various language specifications in Spoofax meta-languages. However, the applicability of the technique is not limited to Spoofax, and developers of other language workbenches could presumably implement similar tooling, to make their language specifications more accessible to those who do not have the workbench installed.
Software log analysis can be laborious and time-consuming. Time and labeled data are usually lacking in industrial settings. This paper studies unsupervised and time-efficient methods for anomaly detection. We study two custom and two established models. The custom models are: an OOV (Out-Of-Vocabulary) detector, which counts the terms in the test data that are not present in the training data, and the Rarity Model (RM), which calculates a rarity score for terms based on their infrequency. The established models are KMeans and Isolation Forest. The models are evaluated on four public datasets (BGL, Thunderbird, Hadoop, HDFS) with three different representation techniques for the log messages (Words, character Trigrams, Parsed events). For training, we used both normal-only data, which is free of all anomalies, and unfiltered data, which contains both normal and anomalous instances. We primarily used the AUC-ROC metric for evaluation due to challenges in setting a threshold, but we also include F1-scores for further insight. Different configurations are advised based on specific requirements. When training data is unfiltered, i.e., includes both normal and anomalous instances, the most effective combination is the Isolation Forest with event representation, achieving an AUC-ROC of 0.829. If it is possible to create a normal-only training dataset, combining the Out-Of-Vocabulary (OOV) detector with trigram representation yields the highest AUC-ROC of 0.846. For speed considerations, the OOV detector is optimal for filtered data, while the Rarity Model is the best choice for unfiltered data.
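The two custom detectors lend themselves to very compact implementations. The following Python sketch illustrates both; the function names and the exact rarity scoring rule (negative log relative frequency with smoothing) are our illustrative assumptions, not taken verbatim from the study:

```python
import math
from collections import Counter

def oov_score(train_terms, test_terms):
    # OOV detector: anomaly score = number of test terms unseen during training
    vocab = set(train_terms)
    return sum(1 for t in test_terms if t not in vocab)

def rarity_score(train_terms, test_terms, smoothing=0.5):
    # Rarity Model sketch: rarer training terms contribute higher scores;
    # unseen terms get a smoothed pseudo-count so the logarithm is defined
    counts = Counter(train_terms)
    total = len(train_terms)
    return sum(-math.log(counts.get(t, smoothing) / total) for t in test_terms)
```

Both scorers need only a single pass over the data, which is consistent with the speed advantage reported above.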
Low cost and rapid provisioning capabilities have made the cloud a desirable platform to launch complex scientific applications. However, optimizing resource utilization remains a significant challenge for cloud service providers, since prior work has focused on optimizing resources for the applications that run on the cloud, with little emphasis on optimizing the resource utilization of the cloud's internal processes. Code refactoring has been associated with improving the maintenance and understanding of software code. However, the impact of refactoring the cloud's source code on cloud resource usage requires further analysis. In this paper, we propose a framework called Unified Regression Modelling (URegM), which predicts the impact of code smell refactoring on cloud resource usage. We run our experiments in a real-life cloud environment using a complex scientific application as a workload. Results show that URegM is capable of accurately predicting resource consumption due to code smell refactoring. This will provide cloud service providers with advance knowledge of the impact of refactoring code smells on resource consumption, thus allowing them to plan their resource provisioning and code refactoring more effectively.
Software code quality is a construct with three dimensions: maintainability, reliability, and functionality. Although many firms have incorporated code quality metrics in their operations, evaluating these metrics still lacks consistent standards. We categorized distinct metrics into two types: 1) monotonic metrics that consistently influence code quality; and 2) non-monotonic metrics that lack a consistent relationship with code quality. To evaluate them consistently, we proposed a distribution-based method to derive metric scores. Our empirical analysis includes 36,460 high-quality open-source software (OSS) repositories and their raw metrics from SonarQube and CK. The evaluated scores demonstrate strong explanatory power for software adoption. Our work contributes to the multi-dimensional construct of code quality and its metric measurements, with practical implications for the consistent measurement of both monotonic and non-monotonic metrics.
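A distribution-based score for a monotonic metric can be illustrated as a percentile against a reference corpus of repositories. This is a minimal sketch of the general idea, not the paper's exact procedure; the function name and parameters are our assumptions:

```python
import bisect

def distribution_score(value, reference_values, higher_is_better=True):
    # Score a raw metric value as its percentile within the empirical
    # distribution observed across a reference set of repositories
    ref = sorted(reference_values)
    pct = bisect.bisect_right(ref, value) / len(ref)
    return pct if higher_is_better else 1.0 - pct
```

Anchoring scores to an empirical distribution rather than fixed thresholds is what makes comparisons across metrics with very different ranges consistent.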
Software log anomaly detection with masked event prediction has various technical approaches with countless configurations and parameters. Our objective is to provide a baseline of settings for similar studies in the future. The models we use are the N-Gram model, a classic approach in natural language processing (NLP), and two deep learning (DL) models: long short-term memory (LSTM) and convolutional neural network (CNN). We use four datasets: Profilence, BlueGene/L (BGL), Hadoop Distributed File System (HDFS), and Hadoop. Other settings are the size of the sliding window, which determines how many surrounding events are used to predict a given event; the mask position (the position within the window being predicted); the usage of only unique sequences; and the portion of data used for training. The results show clear indications of settings that can be generalized across datasets. The performance of the DL models does not deteriorate as the window size increases, while the N-Gram model shows worse performance with large window sizes on the BGL and Profilence datasets. Despite the popularity of Next Event Prediction, the results show that in this context it is better not to predict events at the edges of the subsequence, i.e., the first or last event, with the best result coming from predicting the fourth event when the window size is five. Regarding the amount of data used for training, the results show differences across datasets and models. For example, the N-Gram model appears to be more sensitive to a lack of data than the DL models. Overall, for similar experimental setups we suggest the following general baseline: window size 10, mask position second to last, do not filter out non-unique sequences, and use half of the total data for training.
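Masked event prediction with an N-Gram-style model can be sketched as follows; the function names and the frequency-lookup strategy are illustrative assumptions rather than the study's exact implementation:

```python
from collections import Counter, defaultdict

def train_masked_ngram(sequences, window=5, mask_pos=3):
    # Count, for every surrounding context, how often each event fills
    # the masked slot (mask_pos is 0-based within the window; 3 corresponds
    # to predicting the fourth event of a five-event window)
    model = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - window + 1):
            win = seq[i:i + window]
            context = tuple(win[:mask_pos] + win[mask_pos + 1:])
            model[context][win[mask_pos]] += 1
    return model

def predict_masked(model, window_events, mask_pos=3):
    # Return the most frequent event seen for this context, or None if unseen
    context = tuple(window_events[:mask_pos] + window_events[mask_pos + 1:])
    if context in model:
        return model[context].most_common(1)[0][0]
    return None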
Unit-level testing has been widely recognized as an important approach to improving software quality, as it can expose bugs earlier during the development phase. However, manual unit-level test development is often tedious and insufficient. It is also hard for developers to precisely identify, by themselves, the most error-prone code blocks deserving the best test coverage. In this paper, we present the automatic unit-level test (ULT) framework we used for Intel media driver development. It can help us identify the most critical code blocks, provide test coverage recommendations, and automatically generate >80% of the ULT code (~400K lines of test code) as well as ~35% of the test cases (~7K test cases) for the Intel media driver. It helps us greatly shrink the average ULT development effort from ~24 man-hours to ~3 man-hours per 1000 lines of driver source code.
Component-Based Development (CBD) is a popular approach to mitigating the costs of creating software systems. However, it is not clear to what extent the core component selection and adaptation activities of CBD can be implemented to operate automatically in an efficient and reliable manner or in what situations (if any) CBD is preferable to other approaches to software development. In this paper, we use computational complexity analysis to determine and compare the computational characteristics of fully automatic component-based software system creation and reconfiguration by de novo design, component selection, and component selection with adaptation. Our results show that none of these approaches can be implemented to operate both efficiently and reliably in a fully automatic manner either in general or relative to a number of restrictions on software systems, system requirements, components, and component adaptation. We also give restrictions under which all of these approaches can be implemented to operate both efficiently and reliably. As such, this paper illustrates how different types of computational complexity analysis (in particular, parameterized complexity analysis) can be used to systematically explore the algorithmic options for implementing automatic activities in software engineering.
Kwabena Ebo Bennin, Amjed Tahir, Stephen G. MacDonell, et al.
Cross-project defect prediction (CPDP), where data from different software projects are used to predict defects, has been proposed as a way to provide data for software projects that lack historical data. Evaluations of CPDP models using the Nearest Neighbour (NN) Filter approach have shown promising results in recent studies. A key challenge with defect-prediction datasets is class imbalance, that is, highly skewed datasets where non-buggy modules dominate buggy modules. In the past, data resampling approaches have been applied to within-project defect prediction models to help alleviate the negative effects of class imbalance in the datasets. To address the class imbalance issue in CPDP, the authors assess the impact of data resampling approaches on CPDP models after the NN Filter is applied. The impact on prediction performance of five oversampling approaches (MAHAKIL, SMOTE, Borderline-SMOTE, Random Oversampling, and ADASYN) and three undersampling approaches (Random Undersampling, Tomek Links, and One-Sided Selection) is investigated, and results are compared to approaches without data resampling. The authors examined six defect prediction models on 34 datasets extracted from the PROMISE repository. The authors' results show that there is a significant positive effect of data resampling on CPDP performance, suggesting that software quality teams and researchers should consider applying data resampling approaches for improved recall (pd) and g-measure prediction performance. However, if the goal is to improve precision and reduce false alarms (pf), then data resampling approaches should be avoided.
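As a concrete example of the simplest of the examined resampling approaches, random oversampling duplicates minority-class (buggy) instances until the classes are balanced. This sketch is illustrative, not the authors' implementation:

```python
import random

def random_oversample(X, y, minority_label=1, seed=42):
    # Duplicate randomly chosen minority-class instances until the
    # class counts are balanced, then reshuffle the combined data
    rng = random.Random(seed)
    minority = [(x, l) for x, l in zip(X, y) if l == minority_label]
    majority = [(x, l) for x, l in zip(X, y) if l != minority_label]
    while len(minority) < len(majority):
        minority.append(rng.choice(minority))
    data = minority + majority
    rng.shuffle(data)
    X_bal, y_bal = zip(*data)
    return list(X_bal), list(y_bal)
```

SMOTE and its variants differ in that they synthesize new minority instances by interpolating between neighbours rather than duplicating existing ones.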
Context: User involvement is generally considered to contribute to user satisfaction and project success and is central to Agile software development. In theory, the expectations about user involvement, such as the Product Owner's (PO's), are quite demanding in this Agile way of working. But what expectations are seen in practice, and are the expectations of user involvement aligned between the development team and users? Any misalignment could contribute to conflict and miscommunication among stakeholders that may result in ineffective user involvement. Objective: Our aim is to compare and contrast the expectations of two stakeholder groups (software development team, and software users) about user involvement in order to understand the expectations and assess their alignment. Method: We have conducted an exploratory case study of expectations about user involvement in an Agile software development project. Qualitative data was collected through interviews to design a novel method for assessing the alignment of expectations about user involvement by applying Repertory Grids (RG). Results: By aggregating the results from the interviews and RGs, varying degrees of expectation alignment were observed between the development team and user representatives. Conclusion: Alignment of expectations can be assessed in practice using the proposed RG instrument, which can reveal misalignment between user roles and the activities they participate in within Agile software development projects. Although we used the RG instrument retrospectively in this study, we posit that it could also be applied from the start of a project, or proactively as a diagnostic tool throughout a project, to assess and ensure that expectations are aligned.
Md Abdullah Al Alamin, Sanjay Malakar, Gias Uddin, et al.
Low-code software development (LCSD) is an emerging paradigm that combines minimal source code with interactive graphical interfaces to promote rapid application development. LCSD aims to democratize application development for software practitioners with diverse backgrounds. Given that LCSD is a relatively new paradigm, it is vital to learn about the challenges developers face during their adoption of LCSD platforms. The online developer forum Stack Overflow (SO) is popular among software developers for asking for solutions to technical problems. We observe a growing body of posts in SO discussing LCSD platforms. In this paper, we present an empirical study of around 5K SO posts (questions + accepted answers) that contain discussions of nine popular LCSD platforms. We apply topic modeling on the posts to determine the types of topics discussed. We find 13 topics related to LCSD in SO, grouped into four categories: Customization, Platform Adoption, Database Management, and Third-Party Integration. More than 40% of the questions are about customization, i.e., developers frequently face challenges with customizing user interfaces or services offered by LCSD platforms. The topic "Dynamic Event Handling" under the "Customization" category is the most popular (in terms of average view count per question) as well as the most difficult: developers frequently search for customization solutions, such as how to attach dynamic events to a form in a low-code UI, yet most (75.9%) of their questions remain without an accepted answer. We manually label 900 questions from the posts to determine the prevalence of the topics' challenges across LCSD phases. We find that most of the questions are related to the development phase, and that low-code developers also face challenges with automated testing.
Method-level historical information is useful in research on mining software repositories, such as fault-prone module detection or evolutionary coupling identification. An existing technique named Historage converts a Git repository of a Java project to a finer-grained one. In a finer-grained repository, each Java method exists as a single file. Treating Java methods as files has an advantage: Java methods can be tracked with Git mechanisms. The biggest benefit of tracking methods with Git mechanisms is that they can easily connect with any other tools and techniques built on Git infrastructure. However, Historage's tracking has an accuracy issue, especially for small methods. More concretely, when a small method is renamed or moved to another class, Historage has a limited capability to track the method. In this paper, we propose a new technique, FinerGit, to improve the trackability of Java methods with Git mechanisms. We implement FinerGit as a system and apply it to 182 open source software projects, which include 1,768K methods in total. The experimental results show that our tool has a higher capability of tracking methods when methods are renamed or moved to other classes.
In the era of the Big Data revolution, methods for the automatic discovery of regularities in large datasets are becoming essential tools in applied sciences. This article presents an open software package, named MODULO (MODal mULtiscale pOd), to perform the Multiscale Proper Orthogonal Decomposition (mPOD) of numerical and experimental data. This novel decomposition combines Multi-resolution Analysis (MRA) and standard Proper Orthogonal Decomposition (POD) to allow for an optimal compromise between decomposition convergence and spectral purity of its modes. The software is equipped with a Graphical User Interface (GUI) and enriched by numerous examples and video tutorials (see the YouTube channel MODULO mPOD). The MATLAB source code and an executable for Windows users can be downloaded at \url{https://github.com/mendezVKI/MODULO/releases}; a collection of exercises in MATLAB and Python is provided at \url{https://github.com/mendezVKI/MODULO}.
Unit tests are an important artifact that supports the software development process in several ways. For example, when a test fails, its name can provide the first step towards understanding the purpose of the test. Unfortunately, unit tests often lack descriptive names. In this paper, we propose a new, pattern-based approach that can help developers improve the quality of test names of JUnit tests by making them more descriptive. It does this by detecting non-descriptive test names and, in some cases, providing additional information about how the name can be improved. Our approach was assessed using an empirical evaluation on 34,352 JUnit tests. The results of the evaluation show that the approach is feasible, accurate, and useful at discriminating descriptive and non-descriptive names with a 95% true-positive rate.
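A pattern-based check of this kind can be sketched with a small catalog of regular expressions; these two patterns are illustrative assumptions and far smaller than the paper's actual pattern set:

```python
import re

# Illustrative patterns for non-descriptive JUnit test names; the real
# pattern catalog would be larger and more nuanced
NON_DESCRIPTIVE_PATTERNS = [
    re.compile(r"test\d*", re.IGNORECASE),               # test, test1, test42
    re.compile(r"test(case|method|it)\d*", re.IGNORECASE),
]

def is_descriptive(test_name):
    # A name is flagged as non-descriptive if any pattern matches it entirely
    return not any(p.fullmatch(test_name) for p in NON_DESCRIPTIVE_PATTERNS)
```

Because each pattern also identifies *why* a name is non-descriptive, the same catalog can drive suggestions for improving the name.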
Software debloating is an emerging field of study aimed at improving the security and performance of software by removing excess library code and features that are not needed by the end user (called bloat). Software bloat is pervasive, and several debloating techniques have been proposed to address this problem. While these techniques are effective at reducing bloat, they are not practical for the average user, risk creating unsound programs and introducing vulnerabilities, and are not well suited for debloating complex software such as network protocol implementations. In this paper, we propose CARVE, a simple yet effective security-focused debloating technique that overcomes these limitations. CARVE employs static source code annotation to map software features to source code, eliminating the need for advanced software analysis during debloating and reducing the overall level of technical sophistication required of the user. CARVE surpasses existing techniques by introducing debloating with replacement, a technique capable of preserving software interoperability and mitigating the risk of creating an unsound program or introducing a vulnerability. We evaluate CARVE in 12 debloating scenarios and demonstrate security and performance improvements that meet or exceed those of existing techniques.
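The feature-to-code mapping via source annotations can be illustrated as follows; the annotation syntax (`@feature(NAME)` / `@end-feature`) is a hypothetical stand-in, as CARVE's real annotation format is not described here:

```python
import re

# Hypothetical annotation syntax: "// @feature(NAME)" opens a feature
# block and "// @end-feature" closes it (CARVE's real syntax may differ)
FEATURE_START = re.compile(r"//\s*@feature\((\w+)\)")
FEATURE_END = re.compile(r"//\s*@end-feature")

def map_features(source_lines):
    # Map each feature name to the source line numbers it covers,
    # supporting nested feature blocks via a stack
    features, stack = {}, []
    for lineno, line in enumerate(source_lines, start=1):
        m = FEATURE_START.search(line)
        if m:
            stack.append(m.group(1))
        if stack:
            features.setdefault(stack[-1], []).append(lineno)
        if FEATURE_END.search(line) and stack:
            stack.pop()
    return features
```

Given such a mapping, debloating a feature reduces to deleting (or replacing) the lines attributed to it, with no program analysis required.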
Daniel Link, Pooyan Behnamghader, Ramin Moazeni, et al.
The stakeholders of a system are legitimately interested in whether and how its architecture reflects their respective concerns at each point of its development and maintenance processes. Having such knowledge available at all times would enable them to continually adjust their system's structure at each juncture and reduce the buildup of technical debt that can be hard to pay down once it has persisted over many iterations. Unfortunately, software systems often lack reliable and current documentation about their architecture. To remedy this situation, researchers have conceived a number of architectural recovery methods, some of them concern-oriented. However, the design choices underlying most existing recovery methods mean that none of them has a complete set of desirable qualities for the purpose stated above. Tailoring a recovery to a system is either not possible or possible only through iterative experiments with numeric parameters. Furthermore, limitations in their scalability make it prohibitive to apply the existing techniques to large systems. Finally, since several current recovery methods employ non-deterministic sampling, their inconsistent results do not lend themselves well to tracking a system's course over several versions, as needed by its stakeholders. RELAX (RELiable Architecture EXtraction), a new concern-based recovery method that uses text classification, addresses these issues efficiently by (1) assembling the overall recovery result from smaller, independent parts, (2) basing it on an algorithm with linear time complexity, and (3) being tailorable to the recovery of a single system or a sequence thereof through the selection of meaningfully named, semantic topics. An intuitive, informative architectural visualization rounds out RELAX's contributions. RELAX is illustrated on a number of existing open-source systems and compared to other recovery methods.
Thomas Kirchner, Janek Gröhl, Franz Sattler, et al.
Photoacoustic (PA) imaging systems based on clinical linear ultrasound arrays have become increasingly popular in translational PA research. Such systems can be more easily integrated into a clinical workflow due to the simultaneous access to ultrasonic imaging and their familiarity to clinicians. In contrast to more complex setups, hand-held linear probes can be applied to a large variety of clinical use cases. However, most translational work with such scanners is based on proprietary development and as such is not accessible to the community. In this contribution, we present a custom-built, hybrid, multispectral, real-time photoacoustic and ultrasonic imaging system with a linear array probe that is controlled by software developed within the Medical Imaging Interaction Toolkit (MITK), a highly customizable and extendable open-source software platform. Our software offers direct control of both the laser and the ultrasonic system and may serve as a starting point for various translational research projects and developments. To demonstrate the applicability of the platform, we used it to implement a new method for blood oxygenation estimation in the presence of non-rigid inter-frame motion caused by pulsing arteries. Initial results from experiments with healthy human volunteers demonstrate the suitability of the method for the sample clinical application of imaging the common carotid artery as well as peripheral extremity vessels.
Henry Edison, Nina M. Smørsgård, Xiaofeng Wang, et al.
To compete in this age of disruption, large companies cannot rely on cost efficiency, lead time reduction, and quality improvement alone. They are now looking for ways to innovate like startups. Meanwhile, the awareness and use of the Lean startup approach have grown rapidly amongst the software startup community in recent years. This study investigates how Lean internal startup facilitates software product innovation in large companies and identifies its enablers and inhibitors. A multiple case study approach is followed in the investigation. Two software product innovation projects from two large companies are examined, using a conceptual framework that is based on the method-in-action framework and extended with the previously developed Lean-Internal Corporate Venture model. Seven face-to-face in-depth interviews of employees with different roles are conducted. Within-case analysis and cross-case comparison are applied to draw the findings from the cases. A generic process flow summarises the common key processes of Lean internal startups. The findings suggest that an internal startup initiated by management faces different challenges than one initiated by employees. A list of enablers of applying Lean startup in large companies is identified, including top management support and cross-functional teams. The two cases face different inhibitors due to their different processes of inception, team objectives, and product types. Our contributions are threefold. First, this study is one of the first attempts to investigate the use of the Lean startup approach in large companies empirically. Second, the study shows the potential of the method-in-action framework for investigating the Lean startup approach in a non-startup context. Third, we provide a general process of Lean internal startup and evidence of the enablers and inhibitors of implementing it, both theory-informed and empirically grounded.
The development of high-quality software or software-intensive systems requires custom-tailored process models that fit the organizational and project goals as well as the development contexts. These models are a necessary prerequisite for creating project plans that are expected to fulfill business goals. Although project planners require individual process models custom-tailored to their constraints, software or system developing organizations also require generic processes (i.e., reference processes) that capture project-independent knowledge for similar development contexts. The latter is emphasized by assessment approaches (such as CMMI, SPICE) that require explicit process descriptions in order to reach a certain capability or maturity level. Among other concepts such as polymorphism, templates, or generator-based descriptions, software process patterns are used to describe generic process knowledge. Several approaches for describing the architecture of process patterns have already been published (e.g., [7]). However, there is a lack of descriptions of how to compose process patterns for a specific development context in order to gain a custom-tailored process model for a project. This paper focuses on the composition of process patterns in a goal-oriented way. First, the paper describes which information a process pattern should contain so that it can be used for systematic composition. Second, a composition method is sketched. Afterwards, the results of a proof-of-concept evaluation of the method are described. Finally, the paper is summarized and open research questions are sketched.