Results for "Standardization. Simplification. Waste"

Showing 20 of ~454,937 results · from DOAJ, CrossRef, arXiv, Semantic Scholar

arXiv Open Access 2026
Evaluating Small Decoder-Only Language Models for Grammar Correction and Text Simplification

Anthony Lamelas

Large language models (LLMs) have become extremely popular due to their strong performance on a variety of tasks, such as text generation and rewriting, but their size and computational cost make them difficult to access, deploy, and secure in many settings. This paper investigates whether small, decoder-only language models (SLMs) can provide an efficient alternative for grammar correction and text simplification. The experiments in this paper test small language models out of the box, fine-tuned, and run sequentially on the JFLEG and ASSET datasets using established metrics. The results show that while SLMs can learn certain behaviors well, their performance remains below strong baselines and current LLMs, and that they struggle with meaning preservation and hallucination. These findings suggest that despite their efficiency advantages, current SLMs are not yet competitive with modern LLMs for rewriting, and further advances in training are required to close the performance gap with today's LLMs.

en cs.CL
arXiv Open Access 2026
Eco-Friendly Supercapacitor Architecture Based on Cotton Textile Waste and Biopolymer-Based Electrodes

Luis Torres Quispe, Clemente Luyo Caycho, Javier Quino-Favero et al.

This study presents an eco-friendly and bio-based symmetric supercapacitor using cotton textile waste-derived hydrogels as electrolytes and chitosan-based carbon electrodes as metal-free charge-storage components. Cotton-derived hydrogels were synthesized via an alkaline dissolution-gelation route and modified with ammonium thiocyanate (NH4SCN) to enhance ionic conductivity. The ionic modification increased the hydrogel conductivity from 17.1 to 37.8 mS cm^-1, confirming a more than twofold improvement in ion transport efficiency. The resulting hydrogel exhibited improved thermal stability, homogeneous ionic transport, and strong polymer-ion interactions confirmed by FTIR and TGA analyses. In a symmetric device, the ion-modified hydrogel enables reduced equivalent series resistance, faster charge-transfer kinetics, and a short time constant (tau = 3.2 s), comparable to commercial energy-storage systems. After 1000 cycles, the device exhibits a 12.3% increase in specific capacitance, confirming stable proof-of-concept operation. Cycling leads to a moderate increase in R_ESR (18 to 22 ohm) and tau (3.2 to 4.1 s), indicating slower charge-ion redistribution. Notably, this R_ESR includes the contribution of the test-cell setup; in compact coin-type configurations, the resistance would be considerably lower. EIS reveals a concurrent rise in interfacial resistive terms, consistent with post-cycling hydrogel darkening and FTIR evidence of Fe-SCN coordination, suggesting that resistance buildup mainly originates from minor Fe-SCN interactions when the expelled liquid reaches the stainless-steel collector, rather than from loss of capacitive functionality. Overall, these results demonstrate the viability of cotton waste-derived hydrogels and chitosan-based electrodes as sustainable components for green energy storage, offering a recyclable and eco-friendly alternative to conventional systems.

en cond-mat.mtrl-sci
CrossRef Open Access 2025
Topological Graph Simplification Solutions to the Street Intersection Miscount Problem

Geoff Boeing

Street intersection counts and densities are ubiquitous measures in transport geography and planning. However, typical street network data and typical street network analysis tools can substantially overcount them. This article explains the three main reasons why this happens and presents solutions to each. It contributes algorithms to automatically simplify spatial graphs of urban street networks (via edge simplification and node consolidation), resulting in faster, more parsimonious models and more accurate network measures such as intersection counts and densities, street segment lengths, and node degrees. These algorithms' information compression improves the memory and runtime efficiency of downstream graph analytics, boosting analytical tractability without loss of model fidelity. Finally, this article validates these algorithms and empirically assesses intersection count biases worldwide to demonstrate the problem's widespread prevalence. Without consolidation, traditional methods would overestimate the median urban area intersection count by 14%. However, this bias varies drastically across regions, underscoring these algorithms' importance for consistent comparative empirical analyses.

arXiv Open Access 2025
LEANCODE: Understanding Models Better for Code Simplification of Pre-trained Large Language Models

Yan Wang, Ling Ding, Tien N Nguyen et al.

Large Language Models for code often entail significant computational complexity, which grows rapidly with the length of the input code sequence. We propose LeanCode for code simplification to reduce training and prediction time, leveraging code contexts when utilizing attention scores to represent tokens' importance. We advocate the selective removal of tokens based on average context-aware attention scores rather than average scores across all inputs. LeanCode uses the attention scores of `CLS' tokens within the encoder for classification tasks, such as code search. It also employs the encoder-decoder attention scores to determine token significance for sequence-to-sequence tasks like code summarization. Our evaluation shows LeanCode's superiority over the state-of-the-art methods DietCode and Slimcode, with improvements of 60% and 16% for code search, and 29% and 27% for code summarization, respectively.

en cs.SE, cs.LG
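The core idea of attention-guided token removal described in this abstract can be sketched in a few lines. This is a hypothetical illustration of score-based pruning, not LeanCode's actual implementation; `prune_tokens` and its parameters are invented for the example:

```python
import numpy as np

def prune_tokens(tokens, attn_scores, keep_ratio=0.6):
    """Keep only the highest-scoring tokens, preserving their original
    order -- a toy version of attention-guided code simplification."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = np.argsort(attn_scores)[::-1][:k]  # indices of the k largest scores
    keep = np.sort(top)                      # restore source order
    return [tokens[i] for i in keep]

# Tokens with high attention survive; low-scoring punctuation is dropped.
code = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
scores = [0.9, 0.8, 0.1, 0.6, 0.05, 0.6, 0.1, 0.1, 0.9, 0.7, 0.8, 0.7]
print(prune_tokens(code, scores, keep_ratio=0.5))
# → ['def', 'add', 'return', 'a', '+', 'b']
```

In the real system the scores would come from encoder `CLS` attention or encoder-decoder attention, depending on the task.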
arXiv Open Access 2025
Pattern Recognition of Illicit E-Waste Misclassification in Global Trade Data

Muhammad Sukri Bin Ramli

The global trade in electronic and electrical goods is complicated by the challenge of identifying e-waste, which is often misclassified to evade regulations. Traditional analysis methods struggle to discern the underlying patterns of this illicit trade within vast datasets. This research proposes and validates a robust, data-driven framework to segment products and identify goods exhibiting an anomalous "waste signature": a trade pattern defined by a clear inverse price-volume relationship. The core of the framework is an Outlier-Aware Segmentation method, an iterative K-Means approach that first isolates extreme outliers to prevent data skewing and then re-clusters the remaining products to reveal subtle market segments. To quantify risk, a "Waste Score" is developed using a Logistic Regression model that identifies products whose trade signatures are statistically similar to scrap. The findings reveal a consistent four-tier market hierarchy in both Malaysian and global datasets. A key pattern emerged from a comparative analysis: Malaysia's market structure is defined by high-volume bulk commodities, whereas the global market is shaped by high-value capital goods, indicating a unique national specialization. The framework successfully flags finished goods, such as electric generators (HS 8502), that are traded like scrap, providing a targeted list for regulatory scrutiny.

en econ.GN
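The two-stage "isolate outliers, then re-cluster" idea is easy to sketch. The following is a minimal NumPy-only illustration under assumed details (z-score outlier detection, plain Lloyd's K-Means); the function name and thresholds are invented, and the paper's actual method may differ:

```python
import numpy as np

def outlier_aware_segmentation(X, k=4, z_thresh=3.0, iters=50, seed=0):
    """Two-stage clustering sketch: flag extreme outliers first so they
    cannot skew centroids, then run K-Means on the remaining points."""
    rng = np.random.default_rng(seed)
    # Stage 1: flag points with any feature more than z_thresh stds from the mean.
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    outlier = (np.abs(z) > z_thresh).any(axis=1)
    core = X[~outlier]
    # Stage 2: plain Lloyd's K-Means on the core points.
    centroids = core[rng.choice(len(core), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(core[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = core[labels == j].mean(axis=0)
    return outlier, labels, centroids
```

On trade data, the features would be something like log unit price and log volume, with the "Waste Score" fitted afterwards on the cleaned segments.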
arXiv Open Access 2025
A New Exploration into Chinese Characters: from Simplification to Deeper Understanding

Wen G. Gong

This paper presents a novel approach to Chinese characters through the lens of physics, network analysis, and natural systems. Computational analysis of over 6,000 characters identified 422 elemental characters as fundamental building blocks. Using a physics-inspired "Zi-Matrix" model, we analyzed character structure across eleven spatial positions, revealing systematic patterns in component relationships and semantic extension. Our research demonstrates that Chinese characters exhibit properties of natural systems: emergent complexity, self-organization, and adaptive resilience. The Fibonacci sequence provides an organizing framework for understanding character evolution, from simple pictographs to sophisticated abstractions. Case studies of character families and semantic networks show how meaning radiates from concrete to abstract domains while maintaining coherent principles. By viewing Chinese characters as a living system, this research transcends mere simplification to reveal how human cognition organizes and transmits knowledge. While the elemental character set reduces memorization burden, it also illuminates profound connections between language, thought, and natural patterns. Chinese characters emerge not just as tools for communication, but as windows into human understanding. This perspective, combined with AI-assisted learning approaches, promises to transform language education from knowledge mastery to meaning discovery, bridging traditional wisdom with modern computational methods.

en physics.soc-ph
arXiv Open Access 2025
Less is More: Empowering GUI Agent with Context-Aware Simplification

Gongwei Chen, Xurui Zhou, Rui Shao et al.

The research focus of GUI agents is shifting from text-dependent to pure-vision-based approaches, which, though promising, prioritize comprehensive pre-training data collection while neglecting contextual modeling challenges. We probe the characteristics of element and history contextual modeling in GUI agents and observe that: 1) the high density and loose relations of element context mean many unrelated elements exert a negative influence; 2) the high redundancy of history context reveals inefficient history modeling in current GUI agents. In this work, we propose a context-aware simplification framework for building an efficient and effective GUI agent, termed SimpAgent. To mitigate potential interference from numerous unrelated elements, we introduce a masking-based element pruning method that circumvents intractable relation modeling through an efficient masking mechanism. To reduce the redundancy in historical information, we devise a consistency-guided history compression module, which enhances implicit LLM-based compression through innovative explicit guidance, achieving an optimal balance between performance and efficiency. With these components, SimpAgent reduces FLOPs by 27% and achieves superior GUI navigation performance. Comprehensive navigation experiments across diverse web and mobile environments demonstrate the effectiveness and potential of our agent.

en cs.CV, cs.AI
arXiv Open Access 2025
CAMEO: Autocorrelation-Preserving Line Simplification for Lossy Time Series Compression

Carlos Enrique Muñiz-Cuza, Matthias Boehm, Torben Bach Pedersen

Time series data from a variety of sensors and IoT devices need effective compression to reduce storage and I/O bandwidth requirements. While most time series databases and systems rely on lossless compression, lossy techniques offer even greater space-saving with a small loss in precision. However, the unknown impact on downstream analytics applications requires a semi-manual trial-and-error exploration. We initiate work on lossy compression that provides guarantees on complex statistical features (which are strongly correlated with the accuracy of the downstream analytics). Specifically, we propose a new lossy compression method that provides guarantees on the autocorrelation and partial-autocorrelation functions (ACF/PACF) of a time series. Our method leverages line simplification techniques as well as incremental maintenance of aggregates, blocking, and parallelization strategies for effective and efficient compression. The results show that our method improves compression ratios by 2x on average and up to 54x on selected datasets, compared to previous lossy and lossless compression methods. Moreover, we maintain -- and sometimes even improve -- the forecasting accuracy by preserving the autocorrelation properties of the time series. Our framework is extensible to multivariate time series and other statistical features of the time series.

en cs.DB, cs.IR
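The notion of line simplification constrained by an autocorrelation guarantee can be illustrated with a toy greedy variant: drop a point only if linearly interpolating across it keeps the lag-1 ACF within a budget. This is an assumed simplification of the idea, not CAMEO's algorithm, and the function names are invented:

```python
import numpy as np

def acf(x, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    x = x - x.mean()
    return float((x[:-lag] * x[lag:]).sum() / (x * x).sum())

def simplify_with_acf_bound(t, x, eps):
    """Greedy line simplification sketch: remove a point only when the
    linearly interpolated reconstruction keeps the lag-1 ACF within eps
    of the original series' ACF."""
    target = acf(x)
    keep = list(range(len(x)))
    i = 1
    while i < len(keep) - 1:
        cand = keep[:i] + keep[i + 1:]
        recon = np.interp(t, t[cand], x[cand])
        if abs(acf(recon) - target) <= eps:
            keep = cand   # point removable within the ACF budget
        else:
            i += 1        # point needed; keep it and move on
    return keep
```

A perfectly linear series collapses to its two endpoints, since interpolation reconstructs it exactly; noisier series retain more points as eps shrinks.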
arXiv Open Access 2025
Lifting residual Galois representations with the same semi-simplification

Stefan Nikoloski

If a $p$-adic Galois representation $\rho_{f,\nu}\colon \Gamma_{\mathbb{Q}} \to \mathrm{GL}_2(E_{f,\nu})$ attached to some eigenform $f$ is residually reducible, it will have two non-isomorphic reductions, which have the same semi-simplification. In this paper, we answer a version of the inverse question, first raised by Toby Gee and Alice Pozzi. Starting with two mod $p$ non-semi-simple representations $\overline{\rho_1},\overline{\rho_2}\colon \Gamma_{\mathbb{Q}} \to \mathrm{GL}_2(k)$ that have the same semi-simplification, we show, under some mild conditions, that they are reductions of representations attached to newforms of the same weight $r \ge 2$, the same level $N \ge 1$, and the same Nebentypus character $\varepsilon$.

en math.NT
arXiv Open Access 2025
Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining

Matthew Theodore Roque, Dan John Velasco

Most studies on language model pretraining focus on large datasets, leaving open questions about optimization in data-constrained settings. In such settings, the effects of training data order and of including alternative versions of the same text remain underexplored. We address this by studying curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification. We ask: (1) Does simplifying texts enhance representation quality more than reusing the original data? and (2) Does ordering data by text complexity yield better representations? To answer, we build on a pair of parallel corpora where human-written paragraphs are aligned with LLM-simplified variants, and test four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved. We analyze models' representation quality from a sample-efficiency perspective via fine-tuning, as well as their zero-shot performance on linguistic knowledge, entity tracking, world knowledge, and commonsense reasoning. Our findings show that adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline: smaller models benefit from low-to-high complexity, while larger models perform better with interleaved ordering.

en cs.CL, cs.AI
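The four data schedules named in the abstract can be written out concretely. The helper below is hypothetical (the abstract does not specify such a function), serving only to make the orderings explicit:

```python
def build_schedule(original, simplified, mode):
    """Hypothetical sketch of the four pretraining data schedules; each
    input is a list of documents, and 'complexity' is simply original
    (harder) vs. LLM-simplified (easier) text."""
    if mode == "repeated":       # repeated exposure: originals twice
        return original + original
    if mode == "low_to_high":    # simplified first, then originals
        return simplified + original
    if mode == "high_to_low":    # originals first, then simplified
        return original + simplified
    if mode == "interleaved":    # alternate simplified and original
        return [doc for pair in zip(simplified, original) for doc in pair]
    raise ValueError(f"unknown schedule: {mode}")
```

All four schedules see the same token budget; only the order (and the substitution of simplified variants for repeats) differs, which is what the study varies.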
arXiv Open Access 2024
Avoid Wasted Annotation Costs in Open-set Active Learning with Pre-trained Vision-Language Model

Jaehyuk Heo, Pilsung Kang

Active learning (AL) aims to enhance model performance by selectively collecting highly informative data, thereby minimizing annotation costs. However, in practical scenarios, unlabeled data may contain out-of-distribution (OOD) samples, which are not used for training, leading to wasted annotation costs if data is incorrectly selected. Therefore, to make active learning feasible in real-world applications, it is crucial to consider not only the informativeness of unlabeled samples but also their purity to determine whether they belong to the in-distribution (ID). Recent studies have applied AL under these assumptions, but challenges remain due to the trade-off between informativeness and purity, as well as the heavy dependence on OOD samples. These issues lead to the collection of OOD samples, resulting in a significant waste of annotation costs. To address these challenges, we propose a novel query strategy, VLPure-AL, which minimizes cost losses while reducing dependence on OOD samples. VLPure-AL sequentially evaluates the purity and informativeness of data. First, it utilizes a pre-trained vision-language model to detect and exclude OOD data with high accuracy by leveraging linguistic and visual information of ID data. Second, it selects highly informative data from the remaining ID data, and then the selected samples are annotated by human experts. Experimental results on datasets with various open-set conditions demonstrate that VLPure-AL achieves the lowest cost loss and highest performance across all scenarios. Code is available at https://github.com/DSBA-Lab/OpenAL.

en cs.CV, cs.AI
arXiv Open Access 2024
Jacobi Set Simplification for Tracking Topological Features in Time-Varying Scalar Fields

Dhruv Meduri, Mohit Sharma, Vijay Natarajan

The Jacobi set of a bivariate scalar field is the set of points where the gradients of the two constituent scalar fields align with each other. It captures the regions of topological changes in the bivariate field. The Jacobi set is a bivariate analog of critical points, and may correspond to features of interest. In the specific case of time-varying fields and when one of the scalar fields is time, the Jacobi set corresponds to temporal tracks of critical points, and serves as a feature-tracking graph. The Jacobi set of a bivariate field or a time-varying scalar field is complex, resulting in cluttered visualizations that are difficult to analyze. This paper addresses the problem of Jacobi set simplification. Specifically, we use the time-varying scalar field scenario to introduce a method that computes a reduced Jacobi set. The method is based on a stability measure called robustness that was originally developed for vector fields and helps capture the structural stability of critical points. We also present a mathematical analysis for the method, and describe an implementation for 2D time-varying scalar fields. Applications to both synthetic and real-world datasets demonstrate the effectiveness of the method for tracking features.

en math.NA, cs.CG
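The defining condition of the Jacobi set (points where the two gradients align) can be checked numerically on a grid: the 2D cross product of the gradients vanishes where they are parallel. This is a simple discrete sketch under assumed details (finite differences, a hard tolerance), not the paper's robustness-based method:

```python
import numpy as np

def jacobi_set_mask(f, g, tol=1e-6):
    """Mark grid points where the gradients of two 2D scalar fields are
    (nearly) parallel, i.e. where their 2D cross product vanishes."""
    fy, fx = np.gradient(f)   # np.gradient returns per-axis derivatives
    gy, gx = np.gradient(g)
    cross = fx * gy - fy * gx
    return np.abs(cross) < tol
```

For a time-varying scalar field, g would be the time coordinate itself, so the mask traces temporal tracks of critical points.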
arXiv Open Access 2023
Democratizing Making: Scaffolding Participation Using e-Waste to Engage Under-resourced Communities in Technology Design

Dhaval Vyas, Awais Hameed Khan, Anabelle Cooper

Maker culture and DIY practices are central to democratizing the design of technology, enabling non-designers (future end-users) to participate actively in the design process. However, little is known about how individuals from under-resourced communities and low socioeconomic status (SES) backgrounds can practically leverage maker practices to design technology, creating value for themselves or their communities. To investigate this, we collaborated with an e-waste recycling centre, involving 24 participants (staff and low-SES volunteers) in two participatory maker workshop activities. Participants were provided with a generative e-waste toolkit, through which they repurposed e-waste materials and developed novel technology prototypes that created value from their perspectives and agendas. Our findings unpack three factors that influenced their making: balancing personal and community needs; incorporating convenience and productivity; and re-thinking sustainability and connection. We also discuss strategies for scaffolding the participation and engagement of under-resourced communities in making, using an e-waste generative toolkit to democratize technology design.

arXiv Open Access 2022
Conciseness: An Overlooked Language Task

Felix Stahlberg, Aashish Kumar, Chris Alberti et al.

We report on novel investigations into training models that make sentences concise. We define the task and show that it is different from related tasks such as summarization and simplification. For evaluation, we release two test sets, consisting of 2000 sentences each, that were annotated by two and five human annotators, respectively. We demonstrate that conciseness is a difficult task for which zero-shot setups with large neural language models often do not perform well. Given the limitations of these approaches, we propose a synthetic data generation method based on round-trip translations. Using this data to either train Transformers from scratch or fine-tune T5 models yields our strongest baselines that can be further improved by fine-tuning on an artificial conciseness dataset that we derived from multi-annotator machine translation test sets.

en cs.CL
arXiv Open Access 2022
Forecasting Sensor Values in Waste-To-Fuel Plants: a Case Study

Bor Brecelj, Beno Šircelj, Jože M. Rožanec et al.

In this research, we develop machine learning models to predict future sensor readings of a waste-to-fuel plant, which would enable proactive control of the plant's operations. We developed models that predict sensor readings 30 and 60 minutes into the future. The models were trained using historical data, and predictions were made based on sensor readings taken at a specific time. We compare three types of models: (a) a naïve prediction that considers only the last predicted value, (b) neural networks that make predictions based on past sensor data (we consider different time window sizes for making a prediction), and (c) a gradient boosted tree regressor built on a set of features that we developed. We developed and tested our models on a real-world use case at a waste-to-fuel plant in Canada. We found that approach (c) provided the best results, while approach (b) gave mixed results and was not able to outperform the naïve baseline consistently.

en cs.AI, cs.LG
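The naïve last-value baseline and the sliding-window inputs for approaches (b) and (c) are standard constructions, sketched below with NumPy. The function names and window conventions are assumptions for illustration, not the paper's code:

```python
import numpy as np

def naive_forecast(series, horizon):
    """Baseline sketch: predict the most recent value for every future step."""
    return np.full(horizon, series[-1])

def window_features(series, window):
    """Build (past-window, next-value) training pairs, as a windowed
    model would consume them."""
    X = np.lib.stride_tricks.sliding_window_view(series, window)[:-1]
    y = series[window:]
    return X, y
```

Each row of `X` is one window of past readings and `y` holds the reading one step after that window; longer horizons shift `y` further ahead.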
arXiv Open Access 2021
ZX-Calculus and Extended Wolfram Model Systems II: Fast Diagrammatic Reasoning with an Application to Quantum Circuit Simplification

Jonathan Gorard, Manojna Namuduri, Xerxes D. Arsiwalla

This article presents a novel algorithmic methodology for performing automated diagrammatic deductions over combinatorial structures, using a combination of modified equational theorem-proving techniques and the extended Wolfram model hypergraph rewriting formalism developed by the authors in previous work. We focus especially upon the application of this new algorithm to the problem of automated circuit simplification in quantum information theory, using Wolfram model multiway operator systems combined with the ZX-calculus formalism for enacting fast diagrammatic reasoning over linear transformations between qubits. We show how to construct a generalization of the deductive inference rules for Knuth-Bendix completion in which equation matches are selected on the basis of causal edge density in the associated multiway system, before proceeding to demonstrate how to embed the higher-order logic of the ZX-calculus rules within this first-order equational framework. After showing explicitly how the (hyper)graph rewritings of both Wolfram model systems and the ZX-calculus can be effectively realized within this formalism, we proceed to exhibit comparisons of time complexity vs. proof complexity for this new algorithmic approach when simplifying randomly-generated Clifford circuits down to pseudo-normal form, as well as when reducing the number of T-gates in randomly-generated non-Clifford circuits, with circuit sizes ranging up to 3000 gates, illustrating that the method performs favorably in comparison with existing circuit simplification frameworks, and also exhibiting the approximately quadratic speedup obtained by employing the causal edge density optimization. Finally, we present a worked example of an automated proof of correctness for a simple quantum teleportation protocol, in order to demonstrate more clearly the internal operations of the theorem-proving procedure.

en cs.LO, cs.DM
arXiv Open Access 2021
Learning about the reduction of food waste using Blockchain technology

Monica-Paula Marin, Iuliana Marin, Livia Vidu

Farmers need to be efficient and dedicate a lot of time to sustain the quality of the animals in their care. The animal feed with the most convenient quality-price ratio should be chosen. Blockchain is used in a virtual space to store and share information over a network of users, here via the open-source Hyperledger Fabric platform. Transactions can be viewed by all other users in real time and are stored as JSON documents in the CouchDB NoSQL database, which supports queries over large volumes of data. Using this technology, the farmer can know with whom the animal-feed supplier collaborated. The transaction history is not saved in just one place, making it more difficult to hack or to introduce implausible information. An e-learning platform was created where the farm's user can post information, that is, new blocks about an animal's birth, vaccinations, and medicines, including the location of the livestock. The same e-learning platform is accessible from a mobile phone. By using blockchain technology, anyone, including the customer in the shop, can learn much about the origin of the products. Fake food origins are much harder to hide, and fraud is limited. The system monitored the traceability of dairy products in a Romanian farm. Data about fodder provider and quality, cow productive performance and health, and the dairy production process were obtained and analyzed by students who will become specialists at all levels of the food chain. In the case of dairy product contamination, blockchain allows the originating farm to be traced in just a couple of seconds, so only the affected batch of dairy products is removed from distribution, leading to the reduction of food waste.

arXiv Open Access 2021
Understanding Neural Code Intelligence Through Program Simplification

Md Rafiqul Islam Rabin, Vincent J. Hellendoorn, Mohammad Amin Alipour

A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of "transparent/interpretable-AI". However, these approaches are often specific to a particular set of network architectures, even requiring access to the network's parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND's extracted features may help understand neural CI systems' predictions and learned behavior.

en cs.SE, cs.LG
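The delta-debugging idea behind SIVAND (shrink the input while the model's prediction is preserved) can be shown with a minimal ddmin-style sketch. This is an illustration of the general technique, not SIVAND's implementation; the predicate stands in for "the CI model still makes the same prediction":

```python
def reduce_input(tokens, keeps_prediction):
    """ddmin-style reduction sketch: repeatedly delete chunks of the input
    while keeps_prediction (a stand-in for 'model prediction unchanged')
    still holds, refining granularity until no single token is removable."""
    assert keeps_prediction(tokens)
    n = 2
    while len(tokens) >= 2:
        chunk = max(1, len(tokens) // n)
        reduced = False
        i = 0
        while i < len(tokens):
            cand = tokens[:i] + tokens[i + chunk:]
            if cand and keeps_prediction(cand):
                tokens = cand                # deletion kept the prediction
                reduced = True
            else:
                i += chunk                   # deletion broke it; skip ahead
        if not reduced:
            if n >= len(tokens):
                break                        # 1-minimal: nothing removable
            n = min(n * 2, len(tokens))      # retry with finer granularity
    return tokens
```

On a real CI model, `keeps_prediction` would re-run inference on the candidate program, and the surviving tokens approximate the model's critical input features.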
arXiv Open Access 2021
Analyzing a Knowledge Graph of Industry 4.0 Standards

Irlan Grangel-Gonzalez, Maria-Esther Vidal

In this article, we tackle the problem of standard interoperability across different standardization frameworks and devise a knowledge-driven approach that allows for the description of standards and standardization frameworks in an Industry 4.0 knowledge graph (I40KG). The STO ontology represents properties of standards and standardization frameworks, as well as relationships among them. The I40KG integrates more than 200 standards and four standardization frameworks. To populate the I40KG, the landscape of standards was analyzed from a semantic perspective, and the resulting I40KG represents knowledge expressed in more than 200 industry-related documents, including technical reports, research articles, and white papers. Additionally, the I40KG has been linked to existing knowledge graphs, and automated reasoning has been implemented to reveal implicit relations between standards as well as mappings across standardization frameworks. We analyze both the number of discovered relations between standards and the accuracy of these relations. The observed results indicate that the reasoning and linking processes increase the connectivity in the knowledge graph by up to 80%, while up to 96% of the relations can be validated. These outcomes suggest that integrating standards and standardization frameworks into the I40KG enables the resolution of semantic interoperability conflicts, improving communication in smart factories.

en cs.DL, cs.DB

Page 33 of 22,747