Generative Software Engineering
Yuan Huang, Yinan Chen, Xiangping Chen
et al.
The rapid development of deep learning techniques, improved computational power, and the availability of vast training data have led to significant advancements in pre-trained models and large language models (LLMs). Pre-trained models based on architectures such as BERT and the Transformer, as well as LLMs like ChatGPT, have demonstrated remarkable language capabilities and found applications in software engineering (SE). SE tasks can be divided into many categories, among which generative tasks attract the most attention from researchers. Pre-trained models and LLMs possess powerful language representation and contextual awareness capabilities, enabling them to leverage diverse training data and adapt to generative tasks through fine-tuning, transfer learning, and prompt engineering. These advantages make them effective tools for generative tasks, in which they have demonstrated excellent performance. In this paper, we present a comprehensive literature review of generative tasks in SE using pre-trained models and LLMs. We categorize SE generative tasks based on software engineering methodologies and summarize the pre-trained models and LLMs involved, as well as the datasets and evaluation metrics used. Additionally, we identify key strengths, weaknesses, and gaps in existing approaches, and propose potential research directions. This review aims to provide researchers and practitioners with an in-depth analysis of, and guidance on, the application of pre-trained models and LLMs to generative tasks within SE.
Identifying Relevant Factors of Requirements Quality: An Industrial Case Study
Julian Frattini
[Context and Motivation]: The quality of requirements specifications impacts subsequent, dependent software engineering activities. Requirements quality defects like ambiguous statements can result in incomplete or wrong features and even lead to budget overrun or project failure. [Problem]: Attempts at measuring the impact of requirements quality have been held back by the vast amount of interacting factors. Requirements quality research lacks an understanding of which factors are relevant in practice. [Principal Ideas and Results]: We conduct a case study considering data from both interview transcripts and issue reports to identify relevant factors of requirements quality. The results include 17 factors and 11 interaction effects relevant to the case company. [Contribution]: The results contribute empirical evidence that (1) strengthens existing requirements engineering theories and (2) advances industry-relevant requirements quality research.
Order-preserving pattern mining with forgetting mechanism
Yan Li, Chenyu Ma, Rong Gao
et al.
Order-preserving pattern (OPP) mining is a type of sequential pattern mining in which a group of ranks of a time series is used to represent an OPP, making it possible to discover frequent trends in time series. Existing OPP mining algorithms consider data points at different times to be equally important; however, newer data usually have a more significant impact, while older data have a weaker impact. We therefore introduce a forgetting mechanism into OPP mining to reduce the importance of older data. This paper explores the mining of OPPs with a forgetting mechanism (OPF) and proposes an algorithm called OPF-Miner that can discover frequent OPFs. OPF-Miner performs two tasks: candidate pattern generation and support calculation. In candidate pattern generation, OPF-Miner employs a maximal support priority strategy and a group pattern fusion strategy to avoid redundant pattern fusions. For support calculation, we propose an algorithm called support calculation with forgetting mechanism, which uses prefix and suffix pattern pruning strategies to avoid redundant support calculations. Experiments are conducted on nine datasets against 12 alternative algorithms. The results verify that OPF-Miner is superior to the competing algorithms. More importantly, because the forgetting mechanism is employed, OPF-Miner yields good clustering performance for time series.
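As a minimal sketch of the two core ideas, the following Python snippet represents a window of a time series by its ranks (an order-preserving pattern) and discounts older occurrences with an exponential decay. The function names, the decay form, and the occurrence definition are our illustrative assumptions, not OPF-Miner's actual strategies:

```python
def rank_pattern(window):
    # Represent a window by the relative order (ranks) of its values,
    # e.g. [5.1, 2.3, 8.7] -> (2, 1, 3): an order-preserving pattern.
    order = sorted(range(len(window)), key=lambda i: window[i])
    ranks = [0] * len(window)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return tuple(ranks)

def weighted_support(series, pattern, decay=0.9):
    # Forgetting mechanism (illustrative): an occurrence ending at time t
    # contributes decay**(n-1-t), so older matches weigh less than recent ones.
    m, n = len(pattern), len(series)
    support = 0.0
    for t in range(m - 1, n):
        if rank_pattern(series[t - m + 1 : t + 1]) == pattern:
            support += decay ** (n - 1 - t)
    return support
```

With `decay=1.0` this reduces to plain occurrence counting; smaller values forget old trends faster.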
Leveraging Large Language Models (LLMs) for Process Mining (Technical Report)
Alessandro Berti, Mahnaz Sadat Qafari
This technical report describes the intersection of process mining and large language models (LLMs), specifically focusing on the abstraction of traditional and object-centric process mining artifacts into textual format. We introduce and explore various prompting strategies: direct answering, where the large language model directly addresses user queries; multi-prompt answering, which allows the model to incrementally build on the knowledge obtained through a series of prompts; and the generation of database queries, facilitating the validation of hypotheses against the original event log. Our assessment considers two large language models, GPT-4 and Google's Bard, under various contextual scenarios across all prompting strategies. Results indicate that these models exhibit a robust understanding of key process mining abstractions, with notable proficiency in interpreting both declarative and procedural process models. In addition, we find that both models demonstrate strong performance in the object-centric setting, which could significantly propel the advancement of the object-centric process mining discipline. Additionally, these models display a noteworthy capacity to evaluate various concepts of fairness in process mining. This opens the door to more rapid and efficient assessments of the fairness of process mining event logs, which has significant implications for the field. The integration of these large language models into process mining applications may open new avenues for exploration, innovation, and insight generation in the field.
AutoAM: An End-To-End Neural Model for Automatic and Universal Argument Mining
Lang Cao
Argument mining analyzes argument structure and extracts important argument information from unstructured text. An argument mining system can help people automatically gain the causal and logical information behind a text. As argumentative corpora gradually grow, for example as more people argue and debate on social media, mining arguments from them is becoming increasingly critical. However, argument mining remains a significant challenge in natural language processing due to its difficulty, and the related techniques are not yet mature. For example, more research on non-tree argument mining is needed; most works focus only on extracting tree-structured argument information. Moreover, current methods cannot accurately describe and capture argument relations and do not predict their types. In this paper, we propose a novel neural model called AutoAM to solve these problems. We first introduce an argument component attention mechanism in our model, which captures the relevant information between argument components so that our model can better perform argument mining. Our model is a universal end-to-end framework that can analyze argument structure without constraints such as tree structure and complete the three subtasks of argument mining in one model. The experimental results show that our model outperforms existing works on several metrics on two public datasets.
Zero-shot Bilingual App Reviews Mining with Large Language Models
Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais
et al.
App reviews from app stores are crucial for improving software requirements. A large number of valuable reviews are continually being posted, describing software problems and expected features. Effectively utilizing user reviews necessitates the extraction of relevant information, as well as its subsequent summarization. Due to the substantial volume of user reviews, manual analysis is arduous. Various approaches based on natural language processing (NLP) have been proposed for automatic user review mining. However, the majority of them require a manually crafted dataset to train their models, which limits their usage in real-world scenarios. In this work, we propose Mini-BAR, a tool that integrates large language models (LLMs) to perform zero-shot mining of user reviews in both English and French. Specifically, Mini-BAR is designed to (i) classify the user reviews, (ii) cluster similar reviews together, (iii) generate an abstractive summary for each cluster, and (iv) rank the user review clusters. To evaluate the performance of Mini-BAR, we created a dataset containing 6,000 English and 6,000 French annotated user reviews and conducted extensive experiments. Preliminary results demonstrate the effectiveness and efficiency of Mini-BAR in requirements engineering through the analysis of bilingual app reviews. (A replication package containing the code, dataset, and experiment setups is available at https://github.com/Jl-wei/mini-bar )
Multi-granularity Argument Mining in Legal Texts
Huihui Xu, Kevin Ashley
In this paper, we explore legal argument mining using multiple levels of granularity. Argument mining has usually been conceptualized as a sentence classification problem. In this work, we conceptualize argument mining as a token-level (i.e., word-level) classification problem. We use a Longformer model to classify the tokens. Results show that token-level text classification identifies certain legal argument elements more accurately than sentence-level text classification. Token-level classification also provides greater flexibility to analyze legal texts and to gain more insight into what the model focuses on when processing a large amount of input data.
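To make the token-level framing concrete, here is a small illustrative helper (the function name and tag scheme are our assumptions; the paper itself obtains token labels from a Longformer model) showing how BIO-style token tags recover argument spans that sentence-level labels cannot delimit:

```python
def spans_from_bio(tokens, tags):
    # Recover argument spans from token-level BIO tags: "B" begins a span,
    # "I" continues it, "O" is outside. This is the finer-grained alternative
    # to labeling whole sentences as argumentative or not.
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append(" ".join(tokens[start:i]))
            start = i
        elif tag == "O" and start is not None:
            spans.append(" ".join(tokens[start:i]))
            start = None
    if start is not None:
        spans.append(" ".join(tokens[start:]))
    return spans
```

A sentence classifier would label the whole sentence with one class; the token tags above can mark just the argumentative sub-span inside it.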
FRI-Miner: Fuzzy Rare Itemset Mining
Yanling Cui, Wensheng Gan, Hong Lin
et al.
Data mining is a widely used technology for various real-life applications of data analytics and is important for discovering valuable association rules in transaction databases. Interesting itemset mining plays an important role in many real-life applications, such as markets, e-commerce, finance, and medical treatment. To date, various data mining algorithms based on frequent patterns have been widely studied, but only a few algorithms focus on mining infrequent or rare patterns. In some cases, infrequent or rare itemsets and rare association rules also play an important role in real-life applications. In this paper, we introduce a novel fuzzy-based rare itemset mining algorithm called FRI-Miner, which discovers valuable and interesting fuzzy rare itemsets in a quantitative database by applying fuzzy theory with linguistic meaning. Additionally, FRI-Miner utilizes the fuzzy-list structure to store important information and applies several pruning strategies to reduce the search space. The experimental results show that the proposed FRI-Miner algorithm can discover fewer but more interesting itemsets by considering the quantitative values that occur in reality. Moreover, it significantly outperforms state-of-the-art algorithms in terms of effectiveness (w.r.t. different types of derived patterns) and efficiency (w.r.t. running time and memory usage).
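As an illustrative sketch of the fuzzy side of this setting (the membership functions, term boundaries, and names below are our assumptions, not FRI-Miner's actual design), quantities in a transaction database can be mapped to linguistic terms, and a fuzzy support computed with the minimum operator:

```python
def triangular(x, a, b, c):
    # Triangular membership function: rises from a to peak b, falls to c.
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical linguistic terms for purchase quantities.
TERMS = {
    "low": (0, 1, 4),
    "middle": (1, 4, 7),
    "high": (4, 7, 10),
}

def fuzzy_support(db, itemset):
    # Fuzzy support of an itemset like [("apple", "low")]: for each
    # transaction take the minimum membership over its (item, term) pairs,
    # then sum these degrees over the whole database.
    total = 0.0
    for txn in db:
        degrees = []
        for item, term in itemset:
            qty = txn.get(item, 0)
            degrees.append(triangular(qty, *TERMS[term]))
        total += min(degrees)
    return total
```

A *rare* fuzzy itemset would then be one whose fuzzy support falls below a maximum threshold while still satisfying an interestingness criterion.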
Process Mining-Driven Analysis of the COVID19 Impact on the Vaccinations of Victorian Patients
Adriano Augusto, Timothy Deitz, Noel Faux
et al.
Process mining is a discipline sitting between data mining and process science, whose goal is to provide theoretical methods and software tools to analyse process execution data, known as event logs. Although process mining was originally conceived to facilitate business process management activities, research studies have shown the benefit of leveraging process mining tools in different contexts, including healthcare. However, applying process mining tools to analyse healthcare process execution data is not straightforward. In this paper, we report the analysis of an event log recording more than 30 million events capturing the general practice healthcare processes of more than one million patients in Victoria, Australia, over five years. Our analysis allowed us to understand the benefits and limitations of state-of-the-art process mining techniques when dealing with highly variable processes and large datasets. While we provide solutions to the identified limitations, the overarching goal of this study was to detect differences between the patients' health services utilization pattern observed in 2020--during the COVID-19 pandemic and mandatory lockdowns--and the one observed in the prior four years, 2016 to 2019. By using a combination of process mining techniques and traditional data mining, we were able to demonstrate that vaccinations in Victoria did not drop drastically, as other interactions did. On the contrary, we observed a surge of influenza and pneumococcus vaccinations in 2020, contradicting the findings of similar studies conducted in different geographical areas.
cgSpan: Pattern Mining in Conceptual Graphs
Adam Faci, Marie-Jeanne Lesot, Claire Laudy
Conceptual Graphs (CGs) are a graph-based knowledge representation formalism. In this paper we propose cgSpan, a CG frequent pattern mining algorithm. It extends the DMGM-GSM algorithm, which takes taxonomy-based labeled graphs as input, and incorporates three more kinds of knowledge from the CG formalism: (a) the fixed arity of relation nodes, handling graphs of neighborhoods centered on relations rather than graphs of nodes; (b) the signatures, avoiding patterns with concept types more general than the maximal types specified in signatures; and (c) the inference rules, applying them during the pattern mining process. The experimental study highlights that cgSpan is a functional CG frequent pattern mining algorithm and that including CG specificities results in a faster algorithm with more expressive results and less redundancy with respect to the vocabulary.
An Eye for an Eye: Economics of Retaliation in Mining Pools
Yujin Kwon, Hyoungshick Kim, Yung Yi
et al.
Currently, miners typically join mining pools to solve cryptographic puzzles together, and mining pools are in high competition. This has led to the development of several attack strategies, such as block withholding (BWH) and fork after withholding (FAW) attacks, that can weaken the health of PoW systems while maximizing the attacking pools' profits. In this paper, we present strategies called Adaptive Retaliation Strategies (ARS) to mitigate not only BWH attacks but also FAW attacks. In ARS, each pool cooperates with other pools in the normal situation and adaptively executes either an FAW or a BWH attack, for the purpose of retaliation, only when attacked. In addition, for rational pools to adopt ARS, ARS should strike an adaptive balance between retaliation and selfishness, because the pools consider their payoff even when they retaliate. We theoretically and numerically show that ARS would not only induce a no-attack state among mining pools but also achieve this adaptive balance between retaliation and selfishness.
Privacy Preserving Utility Mining: A Survey
Wensheng Gan, Jerry Chun-Wei Lin, Han-Chieh Chao
et al.
In the big data era, the collected data usually contain rich information and hidden knowledge. Utility-oriented pattern mining and analytics have shown a powerful ability to explore these ubiquitous data, which may be collected from various fields and applications, such as market basket analysis, retail, click-stream analysis, medical analysis, and bioinformatics. However, analysis of these data with sensitive private information raises privacy concerns. To achieve a better trade-off between utility maximization and privacy preservation, Privacy-Preserving Utility Mining (PPUM) has become a critical issue in recent years. In this paper, we provide a comprehensive overview of PPUM. We first present the background of utility mining, privacy-preserving data mining, and PPUM, then introduce the related preliminaries and the problem formulation of PPUM, as well as some key evaluation criteria for PPUM. In particular, we present and discuss the current state-of-the-art PPUM algorithms, as well as their advantages and deficiencies, in detail. Finally, we highlight and discuss some technical challenges and open directions for future research on PPUM.
Development of formal models, algorithms, procedures, engineering and functioning of the software system "Instrumental complex for ontological engineering purpose"
A. V. Palagin, N. G. Petrenko, V. Yu. Velychko
et al.
This paper presents a generalized model representation of the software system "Instrumental complex for ontological engineering purpose" and describes the complete process of developing the software system. The relevant formal models of the system are developed and represented as mathematical expressions and UML diagrams, and the three-tier architecture of the software system in a client-server environment is described.
MMM: May I Mine Your Mind?
Diego Sempreboni, Luca Viganò
Consider the following set-up for the plot of a possible future episode of the TV series Black Mirror: human brains can be connected directly to the net, and MiningMind Inc. has developed a technology that merges a reward system with a cryptojacking engine that uses the human brain to mine cryptocurrency (or to carry out some other mining activity). Part of our brain will be committed to cryptographic calculations (mining), leaving the remaining part untouched for everyday operations, i.e., for our brain's normal daily activity. In this short paper, we briefly argue why this set-up might not be so far-fetched after all, and explore the impact that such a technology could have on our lives and our society.
The Essence Theory of Software Engineering - Large-Scale Classroom Experiences from 450+ Software Engineering BSc Students
Kai-Kristian Kemell, Anh Nguyen-Duc, Xiaofeng Wang
et al.
Software Engineering as an industry is highly diverse in terms of development methods and practices. Practitioners employ a myriad of methods and tend to further tailor them by, e.g., omitting some practices or rules. This diversity in development methods poses a challenge for software engineering education, creating a gap between education and industry. General theories such as the Essence Theory of Software Engineering can help bridge this gap by presenting software engineering students with higher-level frameworks upon which to build an understanding of software engineering methods and practical project work. In this paper, we study Essence in an educational setting to evaluate its usefulness for software engineering students while also investigating barriers to its adoption in this context. To this end, we observe 102 student teams utilize Essence in practical software engineering projects during a semester-long, project-based course.
Interconnected Linguistic Architecture
Johannes Härtel, Lukas Härtel, Ralf Lämmel
et al.
The context of the reported research is the documentation of software technologies such as object/relational mappers, web-application frameworks, or code generators. We assume that documentation should model a macroscopic view on usage scenarios of technologies in terms of involved artifacts, leveraged software languages, data flows, conformance relationships, and others. In previous work, we referred to such documentation also as 'linguistic architecture'. The corresponding models may also be referred to as 'megamodels' while adopting this term from the technological space of modeling/model-driven engineering. This work is an inquiry into making such documentation less abstract and more effective by means of connecting (mega)models, systems, and developer experience in several ways. To this end, we adopt an approach that is primarily based on prototyping (i.e., implementation of a megamodeling infrastructure with all conceivable connections) and experimentation with showcases (i.e., documentation of concrete software technologies). The knowledge gained by this research is a notion of interconnected linguistic architecture on the grounds of connecting primary model elements, inferred model elements, static and runtime system artifacts, traceability links, system contexts, knowledge resources, plugged interpretations of model elements, and IDE views. A corresponding suite of aspects of interconnected linguistic architecture is systematically described. As to the grounding of this research, we describe a literature survey which tracks scattered occurrences and thus demonstrates the relevance of the identified aspects of interconnected linguistic architecture. Further, we describe the MegaL/Xtext+IDE infrastructure which realizes interconnected linguistic architecture.
The importance of this work lies in providing more formal (ontologically rich, navigable, verifiable) documentation of software technologies helping developers to better understand how to use technologies in new systems (prescriptive mode) or how technologies are used in existing systems (descriptive mode).
DAMEWARE - Data Mining & Exploration Web Application Resource
Massimo Brescia, Stefano Cavuoti, Francesco Esposito
et al.
Astronomy is undergoing a methodological revolution triggered by an unprecedented wealth of complex and accurate data. We present DAMEWARE (DAta Mining & Exploration Web Application REsource), a general-purpose, Web-based, Virtual Observatory compliant, distributed data mining framework specialized in the exploration of massive data sets with machine learning methods, which allows the scientific community to perform data mining and exploratory experiments on massive data sets using a simple web browser. DAMEWARE offers several tools that can be seen as working environments in which to choose data analysis functionalities such as clustering, classification, regression, and feature extraction, together with models and algorithms.
Are Delayed Issues Harder to Resolve? Revisiting Cost-to-Fix of Defects throughout the Lifecycle
Tim Menzies, William Nichols, Forrest Shull
et al.
Many practitioners and academics believe in a delayed issue effect (DIE); i.e., the longer an issue lingers in the system, the more effort it requires to resolve. This belief is often used to justify major investments in new development processes that promise to retire more issues sooner. This paper tests for the delayed issue effect in 171 software projects conducted around the world in the period 2006--2014. To the best of our knowledge, this is the largest study yet published on this effect. We found no evidence for the delayed issue effect; i.e., the effort to resolve issues in a later phase was not consistently or substantially greater than when issues were resolved soon after their introduction. This paper documents the above study and explores reasons for this mismatch between this common rule of thumb and empirical data. In summary, DIE is not some constant across all projects. Rather, DIE might be a historical relic that occurs intermittently only in certain kinds of projects. This is a significant result, since it predicts that new development processes that promise to retire more issues faster will not have a guaranteed return on investment (depending on the context where they are applied), and that a long-held truth in software engineering should not be considered a global truism.
Using Answer Set Programming for pattern mining
Thomas Guyet, Yves Moinard, René Quiniou
Serial pattern mining consists of extracting the frequent sequential patterns from a single sequence of itemsets. This paper explores the ability of a declarative language, such as Answer Set Programming (ASP), to solve this task efficiently. We propose several ASP implementations of the frequent sequential pattern mining task: a non-incremental and an incremental resolution. The results show that the incremental resolution is more efficient than the non-incremental one, but both ASP programs are less efficient than dedicated algorithms. Nonetheless, this approach can be seen as a first step toward a generic framework for sequential pattern mining with constraints.
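The containment relation at the heart of this task can be sketched procedurally in Python. This is an illustrative embedding check only, not the declarative ASP encodings the paper proposes, and the function name is our own:

```python
def occurs(pattern, sequence):
    # Greedy embedding test for a serial pattern: each pattern itemset
    # must be a subset of some itemset of the sequence, and the matched
    # itemsets must respect the serial (left-to-right) order.
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:  # subset test
            i += 1
    return i == len(pattern)
```

Frequent serial pattern mining then amounts to enumerating the patterns for which the number of such embeddings (under some occurrence semantics) exceeds a support threshold, which is exactly what the ASP rules express declaratively.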
Simple linear algorithms for mining graph cores
Yang Xiang
Batagelj and Zaversnik proposed a linear algorithm for the well-known $k$-core decomposition problem. However, when $k$-cores are desired for a given $k$, we find that a simple linear algorithm requiring no sorting works for mining $k$-cores. In addition, this algorithm can be extended to mine $(k_1, k_2,\ldots, k_p)$-cores from $p$-partite graphs in linear time, and this mining approach can be efficiently implemented in a distributed computing environment with a lower message complexity bound in comparison with the best known method of distributed $k$-core decomposition.