Results for "Germanic languages. Scandinavian languages"

Showing 20 of ~317,839 results · from CrossRef, arXiv

arXiv Open Access 2025
MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema

Jugal Gajjar, Kamalasankari Subramaniakuppusamy

We introduce the MultiLang Code Parser Dataset (MLCPD), a large-scale, language-agnostic dataset unifying syntactic and structural representations of code across ten major programming languages. MLCPD contains over seven million parsed source files normalized under our proposed universal Abstract Syntax Tree (AST) schema, enabling consistent cross-language reasoning, structural learning, and multilingual software analysis. Unlike existing corpora that focus purely on token-level code or isolated parsers, MLCPD provides both hierarchical tree representations and rich metadata for every file, ensuring lossless syntactic coverage and structural uniformity. Each entry includes a normalized schema, language-level metadata, and abstracted node semantics stored in Parquet format for scalable retrieval. Empirical analyses reveal strong cross-language structural regularities, demonstrating that syntactic graphs from languages as diverse as Python, Java, and Go can be aligned under a shared schema. We release the dataset publicly on Hugging Face and the accompanying codebase on GitHub, which includes complete pipelines for dataset reproduction, grammar compilation, and a visualization tool for exploring the unified AST across languages. Together, these resources establish MLCPD as an open, reproducible foundation for future research in cross-language representation learning and program analysis.
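To make the idea of a universal AST schema concrete, here is a minimal Python sketch that normalizes a language-specific parse tree into a generic node type. The field names (`node_type`, `children`) are illustrative assumptions, not MLCPD's actual schema:

```python
import ast
from dataclasses import dataclass, field

@dataclass
class UASTNode:
    # Hypothetical universal-AST fields; MLCPD's real schema may differ.
    node_type: str
    children: list = field(default_factory=list)

def to_uast(node: ast.AST) -> UASTNode:
    """Normalize a Python parse tree into a language-agnostic node tree."""
    return UASTNode(
        node_type=type(node).__name__,
        children=[to_uast(c) for c in ast.iter_child_nodes(node)],
    )

tree = to_uast(ast.parse("x = 1 + 2"))
print(tree.node_type)               # Module
print(tree.children[0].node_type)   # Assign
```

MLCPD stores such normalized trees in Parquet for scalable retrieval; this sketch only shows the normalization step for one language.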

en cs.SE, cs.LG
arXiv Open Access 2025
Qutes: A High-Level Quantum Programming Language for Simplified Quantum Computing

Simone Faro, Francesco Pio Marino, Gabriele Messina

Quantum computing leverages the principles of quantum mechanics to perform computations far beyond the capabilities of classical systems, particularly in fields such as cryptography and optimization. However, current quantum programming languages often require low-level implementation, posing significant barriers for many developers due to their steep learning curve and limited abstraction. In response, we introduce Qutes, a high-level quantum programming language designed to simplify quantum algorithm development while maintaining the flexibility required for advanced applications. By abstracting complex quantum operations and allowing intuitive expressions through high-level constructs, Qutes enables users to write efficient quantum programs without extensive knowledge of quantum mechanics or circuit design. Built upon Qiskit, Qutes translates its syntax directly into executable quantum code, facilitating seamless integration with quantum hardware. This paper provides an overview of the language's architecture, core functionalities, and its ability to unify classical and quantum operations within a single framework. Additionally, we demonstrate Qutes' application in key quantum algorithms, showcasing its potential to make quantum programming more accessible and practical for a wider range of developers and researchers.
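To give a sense of the low-level state-vector manipulation that a high-level language like Qutes abstracts away, here is a minimal single-qubit sketch in plain Python (illustrative only; this is neither Qutes nor Qiskit syntax):

```python
import math

def hadamard(state):
    """Apply the Hadamard gate to a one-qubit state [a, b]:
    H maps it to [(a+b)/sqrt(2), (a-b)/sqrt(2)]."""
    a, b = state
    s = 1 / math.sqrt(2)
    return [s * (a + b), s * (a - b)]

# |0> becomes an equal superposition; measurement probabilities are |amp|^2.
state = hadamard([1.0, 0.0])
probs = [abs(amp) ** 2 for amp in state]
print(probs)
```

Even this one-gate example shows why circuit-level programming scales poorly by hand, which is the barrier Qutes targets.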

en cs.PL
arXiv Open Access 2025
Towards Efficient Hash Maps in Functional Array Languages

William Henrich Due, Martin Elsman, Troels Henriksen

We present a systematic derivation of a data-parallel implementation of two-level, static and collision-free hash maps, by giving a functional formulation of the Fredman et al. construction, and then flattening it. We discuss the challenges of providing a flexible, polymorphic, and abstract interface to hash maps in a functional array language, with particular attention paid to the problem of dynamically sized keys, which we address by associating each hash map with an arbitrary context. The algorithm is implemented in Futhark, and the achieved GPU execution performance is compared on simple benchmark problems. We find that our hash maps outperform conventional tree/search-based approaches. Furthermore, our implementation is compared against the state-of-the-art cuCollections library, which is significantly faster for hash map construction, and to a lesser degree for lookups. We explain to what extent the performance difference is due to low-level code generation limitations in the Futhark compiler, and to what extent it can be attributed to the data-parallel programming vocabulary not providing the constructs necessary to express the equivalent of the algorithms used by cuCollections. We end by reflecting on the extent to which the functional array language programming model could, or should, be extended to address these weaknesses.
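The Fredman et al. (FKS) construction can be sketched sequentially in Python; the paper's contribution is a flattened, data-parallel Futhark version, so this sequential toy only illustrates the two-level, collision-free idea (the hash family and its parameters are arbitrary assumptions, and keys are assumed to be integers):

```python
import random

P = 2_147_483_647  # large prime for the universal hash family

def make_hash(m, rng):
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def build_fks(keys, seed=0):
    """Static, collision-free two-level table in the style of Fredman et al."""
    rng = random.Random(seed)
    n = max(len(keys), 1)
    h = make_hash(n, rng)                 # first level: hash into n buckets
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h(k)].append(k)
    tables = []
    for bucket in buckets:
        m = max(len(bucket) ** 2, 1)      # quadratic size makes a perfect
        while True:                       # second-level hash easy to find
            g = make_hash(m, rng)
            slots = [None] * m
            ok = True
            for k in bucket:
                i = g(k)
                if slots[i] is not None:
                    ok = False
                    break
                slots[i] = k
            if ok:
                tables.append((g, slots))
                break
    return h, tables

def member(key, h, tables):
    g, slots = tables[h(key)]
    return slots[g(key)] == key

h, tables = build_fks([3, 17, 42, 99])
print(member(42, h, tables), member(5, h, tables))  # True False
```

The retry loop when a second-level hash collides is exactly the data-dependent control flow that makes a flat, data-parallel formulation non-trivial.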

en cs.PL, cs.DS
arXiv Open Access 2025
Language-Integrated Recursive Queries

Anna Herlihy, Amir Shaikhha, Anastasia Ailamaki et al.

Performance-critical industrial applications, including large-scale program, network, and distributed system analyses, rely on fixed-point computations. The introduction of recursive common table expressions (CTEs) using the WITH RECURSIVE keyword in SQL:1999 extended the ability of relational database systems to handle fixed-point computations, unlocking significant performance advantages by allowing computation to move closer to the data. Yet with recursion, SQL becomes a Turing-complete programming language and, with that, unrecoverable safety and correctness risks. SQL itself lacks a fixed semantics, as the SQL specification is written in natural language, full of ambiguities that database vendors resolve in divergent ways. As a result, reasoning about the correctness of recursive SQL programs must rely on isolated mathematical properties of queries rather than wrestling a unified formal model out of a language with notoriously inconsistent semantics. To address these challenges, we propose a calculus that automatically derives mathematical properties from embedded recursive queries and, depending on the database backend, rejects queries that may lead to the three classes of recursive query errors - database errors, incorrect results, and non-termination. We introduce TyQL, a practical implementation in Scala for safe, recursive language-integrated query. Using Named-Tuples and type-level pattern matching, TyQL ensures query portability and safety, showing no performance penalty compared to raw SQL strings while unlocking a three-orders-of-magnitude speedup over non-recursive SQL queries.
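A minimal example of the WITH RECURSIVE fixed-point computation discussed above, run on SQLite through Python's standard library (illustrative only; TyQL itself is a Scala library):

```python
import sqlite3

# Recursive CTE computing reachability from node 2 in a small edge relation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edge (src INTEGER, dst INTEGER);
    INSERT INTO edge VALUES (1, 2), (2, 3), (3, 4);
""")
rows = conn.execute("""
    WITH RECURSIVE reach(n) AS (
        SELECT 2
        UNION
        SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.n
    )
    SELECT n FROM reach ORDER BY n
""").fetchall()
print(rows)  # [(2,), (3,), (4,)]
```

Note that UNION (rather than UNION ALL) deduplicates rows, which is what guarantees termination here; whether such properties hold, and how each backend interprets the recursion, is exactly what a calculus like TyQL's must reason about.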

en cs.PL
arXiv Open Access 2024
A Language-Theoretic Approach to the Heapability of Signed Permutations

Gabriel Istrate

We investigate a signed version of the Hammersley process, a discrete process on words related to a property of integer sequences called heapability (Byers et al., ANALCO 2011). The specific version that we investigate corresponds to a version of this property for signed sequences. We give a characterization of the words that can appear as images of the signed Hammersley process. In particular, we show that the language of such words is the intersection of two deterministic one-counter languages.
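For readers unfamiliar with the machine class, a deterministic one-counter automaton is a pushdown automaton whose stack is a single counter. The classic example language {a^n b^n : n >= 0} can be recognized as follows (an illustration of the class, not of the paper's Hammersley language):

```python
def accepts_anbn(word):
    """Deterministic one-counter automaton for {a^n b^n}:
    increment on 'a', decrement on 'b', reject on a counter underflow
    or an 'a' after a 'b', accept iff the counter ends at zero."""
    counter, seen_b = 0, False
    for ch in word:
        if ch == 'a':
            if seen_b:
                return False
            counter += 1
        elif ch == 'b':
            seen_b = True
            counter -= 1
            if counter < 0:
                return False
        else:
            return False
    return counter == 0

print(accepts_anbn("aabb"), accepts_anbn("aab"))  # True False
```

A language that is the intersection of two such machines, as in the paper's result, need not itself be one-counter, which is what makes the characterization informative.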

en math.CO, cs.DM
arXiv Open Access 2023
Language Models for German Text Simplification: Overcoming Parallel Data Scarcity through Style-specific Pre-training

Miriam Anschütz, Joshua Oehms, Thomas Wimmer et al.

Automatic text simplification systems help to reduce textual information barriers on the internet. However, for languages other than English, little parallel data exists to train these systems. We propose a two-step approach to overcome this data scarcity issue. First, we fine-tuned language models on a corpus of German Easy Language, a specific style of German. Then, we used these models as decoders in a sequence-to-sequence simplification task. We show that the language models adapt to the style characteristics of Easy Language and output more accessible texts. Moreover, with the style-specific pre-training, we reduced the number of trainable parameters in text simplification models. Hence, less parallel data is sufficient for training. Our results indicate that pre-training on unaligned data can reduce the required parallel data while improving the performance on downstream tasks.

arXiv Open Access 2022
Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray et al.

Back-translation is widely known for its effectiveness in neural machine translation when there is little to no parallel data. In this approach, a source-to-target model is coupled with a target-to-source model trained in parallel. The target-to-source model generates noisy sources, while the source-to-target model is trained to reconstruct the targets and vice versa. Recent developments of multilingual pre-trained sequence-to-sequence models for programming languages have been very effective for a broad spectrum of downstream software engineering tasks. Hence, training them to build programming language translation systems via back-translation is compelling. However, these models cannot be further trained via back-translation since they learn to output sequences in the same language as the inputs during pre-training. As an alternative, we propose performing back-translation via code summarization and generation. In code summarization, a model learns to generate natural language (NL) summaries given code snippets. In code generation, the model learns to do the opposite. Therefore, target-to-source generation in back-translation can be viewed as a target-to-NL-to-source generation. We show that our proposed approach performs competitively with state-of-the-art methods. We have made the code publicly available.

en cs.CL, cs.PL
arXiv Open Access 2022
Benchmarking Large Language Models for Automated Verilog RTL Code Generation

Shailja Thakur, Baleegh Ahmad, Zhenxing Fan et al.

Automating hardware design could remove much of the error-prone human effort from the engineering process and so lead to fewer bugs. Verilog is a popular hardware description language to model and design digital systems, thus generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code (25.9% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5% overall). Training/evaluation scripts and LLM checkpoints are available: https://github.com/shailja-thakur/VGen.

en cs.PL, cs.LG
arXiv Open Access 2021
Properties of Graphs Specified by a Regular Language

Volker Diekert, Henning Fernau, Petra Wolf

Traditionally, graph algorithms get a single graph as input, and then they should decide if this graph satisfies a certain property $\Phi$. What happens if this question is modified in a way that we get a possibly infinite family of graphs as an input, and the question is if there is a graph satisfying $\Phi$ in the family? We approach this question by using formal languages for specifying families of graphs, in particular by regular sets of words. We show that certain graph properties can be decided by studying the syntactic monoid of the specification language $L$ if a certain torsion condition is satisfied. This condition holds trivially if $L$ is regular. More specifically, we use a natural binary encoding of finite graphs over a binary alphabet $\Sigma$, and we define a regular set $\mathbb{G}\subseteq \Sigma^*$ such that every nonempty word $w\in \mathbb{G}$ defines a finite and nonempty graph. Also, graph properties can then be syntactically defined as languages over $\Sigma$. Then, we ask whether the automaton $\mathcal{A}$ specifies some graph satisfying a certain property $\Phi$. Our structural results show that we can answer this question for all "typical" graph properties. In order to show our results, we split $L$ into a finite union of subsets and every subset of this union defines in a natural way a single finite graph $F$ where some edges and vertices are marked. The marked graph in turn defines an infinite graph $F^\infty$ and therefore the family of finite subgraphs of $F^\infty$ where $F$ appears as an induced subgraph. This yields a geometric description of all graphs specified by $L$ based on splitting $L$ into finitely many pieces; then using the notion of graph retraction, we obtain an easily understandable description of the graphs in each piece.

en cs.FL
arXiv Open Access 2021
Categorical semantics of a simple differential programming language

Geoffrey Cruttwell, Jonathan Gallagher, Dorette Pronk

With the increased interest in machine learning, and deep learning in particular, the use of automatic differentiation has become more widespread in computation. There have been two recent developments to provide theoretical support for this type of structure. One approach, due to Abadi and Plotkin, provides a simple differential programming language. Another approach is the notion of a reverse differential category. In the present paper we bring these two approaches together. In particular, we show how an extension of reverse derivative categories models Abadi and Plotkin's language, and describe how this categorical model allows one to consider potential improvements to the operational semantics of the language.

en math.CT, cs.PL
arXiv Open Access 2020
Reusing Static Analysis across Different Domain-Specific Languages using Reference Attribute Grammars

Johannes Mey, Thomas Kühn, René Schöne et al.

Context: Domain-specific languages (DSLs) enable domain experts to specify tasks and problems themselves, while enabling static analysis to elucidate issues in the modelled domain early. Although language workbenches have simplified the design of DSLs and extensions to general purpose languages, static analyses must still be implemented manually. Inquiry: Moreover, static analyses, e.g., complexity metrics, dependency analysis, and declaration-use analysis, are usually domain-dependent and cannot be easily reused. Therefore, transferring existing static analyses to another DSL incurs a huge implementation overhead. However, this overhead is not always intrinsically necessary: in many cases, while the concepts of the DSL on which a static analysis is performed are domain-specific, the underlying algorithm employed in the analysis is actually domain-independent and thus can be reused in principle, depending on how it is specified. While current approaches either implement static analyses internally or with an external Visitor, the implementation is tied to the language's grammar and cannot be reused easily. Thus far, a commonly used approach that achieves reusable static analysis relies on the transformation into an intermediate representation upon which the analysis is performed. This, however, entails a considerable additional implementation effort. Approach: To remedy this, it has been proposed to map the necessary domain-specific concepts to the algorithm's domain-independent data structures, yet without a practical implementation and the demonstration of reuse. Thus, to make static analysis reusable again, we employ relational Reference Attribute Grammars (RAGs) by creating such a mapping to a domain-independent overlay structure using higher-order attributes. 
Knowledge: We describe how static analysis can be specified on analysis-specific data structures, how relational RAGs can help with the specification, and how a mapping from the domain-specific language can be performed. Furthermore, we demonstrate how a static analysis for a DSL can be externalized and reused in another general purpose language. Grounding: The approach was evaluated using the RAG system JastAdd. To illustrate reusability, we implemented two analyses with two addressed languages each: a cycle detection analysis used in a small state machine DSL and for detecting circular dependencies in Java types and packages, and an analysis of variable shadowing, applied to both Java and the Modelica modelling language. Thereby, we demonstrate the reuse of two analysis algorithms in three completely different domains. Additionally, we use the cycle detection analysis to evaluate the efficiency by comparing our external analysis to an internal reference implementation analysing all Java programs in the Qualitas Corpus and thereby are able to show that an externalized analysis incurs only minimal overhead. Importance: We make static analysis reusable again, showing the practicality and efficiency of externalizing static analysis for both DSLs and general purpose languages using relational RAGs.

arXiv Open Access 2019
Sliding window property testing for regular languages

Moses Ganardi, Danny Hucke, Markus Lohrey et al.

We study the problem of recognizing regular languages in a variant of the streaming model of computation, called the sliding window model. In this model, we are given a size of the sliding window $n$ and a stream of symbols. At each time instant, we must decide whether the suffix of length $n$ of the current stream ("the active window") belongs to a given regular language. Recent works showed that the space complexity of an optimal deterministic sliding window algorithm for this problem is either constant, logarithmic or linear in the window size $n$ and provided natural language theoretic characterizations of the space complexity classes. Subsequently, those results were extended to randomized algorithms to show that any such algorithm admits either constant, double logarithmic, logarithmic or linear space complexity. In this work, we take an important step forward and combine the sliding window model with the property testing setting, which results in ultra-efficient algorithms for all regular languages. Informally, a sliding window property tester must accept the active window if it belongs to the language and reject it if it is far from the language. We consider deterministic and randomized sliding window property testers with one-sided and two-sided errors. In particular, we show that for any regular language, there is a deterministic sliding window property tester that uses logarithmic space and a randomized sliding window property tester with two-sided error that uses constant space.
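The problem itself is easy to state; the following naive Python baseline stores the whole active window and reruns a DFA at each step, using the O(n) space that the algorithms in the paper improve to logarithmic or even constant:

```python
from collections import deque

def sliding_window_membership(stream, n, delta, start, accepting):
    """Naive baseline: keep the last n symbols and rerun the DFA each step.
    delta is a transition table: delta[state][symbol] -> state."""
    window = deque(maxlen=n)  # the active window (suffix of length n)
    results = []
    for sym in stream:
        window.append(sym)
        state = start
        for s in window:
            state = delta[state][s]
        # Only a full window of length n counts as a proper active window.
        results.append(len(window) == n and state in accepting)
    return results

# DFA over {a, b} accepting words that end in 'b'.
delta = {0: {'a': 0, 'b': 1}, 1: {'a': 0, 'b': 1}}
print(sliding_window_membership("abab", 2, delta, 0, {1}))
# [False, True, False, True]
```

The per-step rerun also costs O(n) time; the streaming algorithms characterized in the paper avoid storing the window at all for many languages.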

en cs.DS, cs.CL
arXiv Open Access 2019
TryLinks: An interactive tutorial system for a cross-tier Web programming language

Junao Wu, Arek Mikolajczak, James Cheney

Links is a web programming language under development in Edinburgh aimed at simplifying web development. Conventional multi-tier applications involve programming in several languages for different layers, and the mismatches between these layers and abstractions need to be handled by the programmer, which can lead to costly errors or security vulnerabilities. In Links, programs combine all of the code of a web application in a single program, and the implementation generates appropriate JavaScript and HTML for the client, and SQL queries for the database. However, installing and using Links is non-trivial, making it difficult for new programmers to get started and learn about Links. This paper reports on a Web-based "TryLinks" system, allowing anyone to experiment with Links without going through the installation process. TryLinks was designed with two major functionalities: an interactive Links shell that teaches the basic syntax of Links and acts as a playground, as well as a short tutorial series on how Links is used in practical web development. Tutorials can also be created or modified by administrators. We present the design and implementation of TryLinks, and conclude with discussion of lessons learned from this project and remaining challenges for Web-based tutorials for Web programming languages.

arXiv Open Access 2019
Effects Without Monads: Non-determinism -- Back to the Meta Language

Oleg Kiselyov

We reflect on programming with complicated effects, recalling an undeservingly forgotten alternative to monadic programming and checking to see how well it can actually work in modern functional languages. We adopt and argue the position of factoring an effectful program into a first-order effectful DSL with a rich, higher-order 'macro' system. Not all programs can be thus factored. Although the approach is not general-purpose, it does admit interesting programs. The effectful DSL is likewise rather problem-specific and lacks general-purpose monadic composition, or even functions. On the upside, it expresses the problem elegantly, is simple to implement and reason about, and lends itself to non-standard interpretations such as code generation (compilation) and abstract interpretation. A specialized DSL is liable to be frequently extended; the experience with the tagless-final style of DSL embedding has shown that the DSL evolution can be made painless, with the maximum code reuse. We illustrate the argument on a simple but representative example of a rather complicated effect -- non-determinism, including committed choice. Unexpectedly, it turns out we can write interesting non-deterministic programs in an ML-like language just as naturally and elegantly as in the functional-logic language Curry -- and not only run them but also statically analyze, optimize and compile. The richness of the Meta Language does, in reality, compensate for the simplicity of the effectful DSL. The key idea goes back to the origins of ML as the Meta Language for the Edinburgh LCF theorem prover. Instead of using ML to build theorems, we now build (DSL) programs.
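The tagless-final idea of giving one DSL term several interpretations can be sketched even in Python: the non-deterministic program below is written against an abstract interface, and two interpreters give it a list-of-all-results semantics and a pretty-printing semantics (a loose illustration; the paper works in an ML-like language):

```python
class ListSem:
    """Interpreter: non-determinism as the list of all results."""
    def lit(self, x): return [x]
    def amb(self, a, b): return a + b                     # both choices
    def add(self, a, b): return [x + y for x in a for y in b]

class PrintSem:
    """A second, non-standard interpretation of the same term: printing."""
    def lit(self, x): return str(x)
    def amb(self, a, b): return f"({a} | {b})"
    def add(self, a, b): return f"({a} + {b})"

def program(s):
    # 'amb' non-deterministically picks 1 or 2; the term is interpreter-agnostic.
    return s.add(s.amb(s.lit(1), s.lit(2)), s.lit(10))

print(program(ListSem()))   # [11, 12]
print(program(PrintSem()))  # ((1 | 2) + 10)
```

The second interpreter stands in for the "non-standard interpretations" the abstract mentions, such as compilation or abstract interpretation of the same DSL program.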

en cs.PL, cs.LO
arXiv Open Access 2016
Just-in-Time Static Type Checking for Dynamic Languages

Brianna M. Ren, Jeffrey S. Foster

Dynamic languages such as Ruby, Python, and JavaScript have many compelling benefits, but the lack of static types means subtle errors can remain latent in code for a long time. While many researchers have developed various systems to bring some of the benefits of static types to dynamic languages, prior approaches have trouble dealing with metaprogramming, which generates code as the program executes. In this paper, we propose Hummingbird, a new system that uses a novel technique, just-in-time static type checking, to type check Ruby code even in the presence of metaprogramming. In Hummingbird, method type signatures are gathered dynamically at run-time, as those methods are created. When a method is called, Hummingbird statically type checks the method body against current type signatures. Thus, Hummingbird provides thorough static checks on a per-method basis, while also allowing arbitrarily complex metaprogramming. For performance, Hummingbird memoizes the static type checking pass, invalidating cached checks only if necessary. We formalize Hummingbird using a core, Ruby-like language and prove it sound. To evaluate Hummingbird, we applied it to six apps, including three that use Ruby on Rails, a powerful framework that relies heavily on metaprogramming. We found that all apps typecheck successfully using Hummingbird, and that Hummingbird's performance overhead is reasonable. We applied Hummingbird to earlier versions of one Rails app and found several type errors that had been introduced and then fixed. Lastly, we demonstrate using Hummingbird in Rails development mode to typecheck an app as live updates are applied to it.
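A loose, much-simplified sketch of the just-in-time checking discipline described above: signatures are gathered as methods are defined (including metaprogrammed ones), checks are memoized, and redefining a method invalidates its cached check. All names here are hypothetical; Hummingbird itself works on Ruby and checks method bodies, not just call sites:

```python
signatures = {}   # method name -> tuple of argument types, gathered at run time
checked = set()   # memoized: methods whose check has already passed

def define(name, arg_types):
    """Registering a method (however it was generated) records its signature
    and invalidates any cached check for it."""
    signatures[name] = arg_types
    checked.discard(name)

def check_call(name, args):
    """Type-check a call against the current signature, memoizing on success."""
    arg_types = signatures[name]
    ok = len(args) == len(arg_types) and \
         all(isinstance(a, t) for a, t in zip(args, arg_types))
    if ok:
        checked.add(name)
    return ok

define("area", (int, int))
print(check_call("area", (3, 4)))    # True
print(check_call("area", (3, "x")))  # False
```

The invalidation in `define` mirrors how Hummingbird re-checks only when metaprogramming changes the type environment, rather than on every call.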

en cs.PL
arXiv Open Access 2015
Pushdown Control-Flow Analysis for Free

Thomas Gilray, Steven Lyde, Michael D. Adams et al.

Traditional control-flow analysis (CFA) for higher-order languages, whether implemented by constraint-solving or abstract interpretation, introduces spurious connections between callers and callees. Two distinct invocations of a function will necessarily pollute one another's return-flow. Recently, three distinct approaches have been published which provide perfect call-stack precision in a computable manner: CFA2, PDCFA, and AAC. Unfortunately, CFA2 and PDCFA are difficult to implement and require significant engineering effort. Furthermore, all three are computationally expensive; for a monovariant analysis, CFA2 is in $O(2^n)$, PDCFA is in $O(n^6)$, and AAC is in $O(n^9 \log n)$. In this paper, we describe a new technique that builds on these but is both straightforward to implement and computationally inexpensive. The crucial insight is an unusual state-dependent allocation strategy for the addresses of continuations. Our technique imposes only a constant-factor overhead on the underlying analysis and, with monovariance, costs only $O(n^3)$ in the worst case. This paper presents the intuitions behind this development, a proof of the precision of this analysis, and benchmarks demonstrating its efficacy.

arXiv Open Access 2013
A fast method for implementation of the property lists in programming languages

Hassan Rashidi

One of the major challenges in programming languages is to support different data structures and their variations in both static and dynamic aspects. One of these data structures is the property list, which applications use as a convenient way to store, organize, and access standard types of data. In this paper, the standard methods for implementing property lists, including the static array, linked list, hash, and tree, are reviewed. Then an efficient method for implementing the property list is presented. The experimental results show that our method is fast compared with the existing methods.
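Two of the reviewed representations, the association (linked) list and the hash table, can be contrasted in a short sketch (illustrative only; the paper's own implementation and benchmarks are separate):

```python
class AssocPropertyList:
    """Linked/association-list style: O(n) lookup by linear scan."""
    def __init__(self):
        self.pairs = []
    def put(self, key, value):
        for i, (k, _) in enumerate(self.pairs):
            if k == key:
                self.pairs[i] = (key, value)  # update existing property
                return
        self.pairs.append((key, value))
    def get(self, key, default=None):
        for k, v in self.pairs:
            if k == key:
                return v
        return default

class HashPropertyList:
    """Hash-backed style: expected O(1) lookup."""
    def __init__(self):
        self.table = {}
    def put(self, key, value):
        self.table[key] = value
    def get(self, key, default=None):
        return self.table.get(key, default)

for plist in (AssocPropertyList(), HashPropertyList()):
    plist.put("color", "red")
    plist.put("color", "blue")  # later put overrides the earlier value
    print(plist.get("color"), plist.get("size", 0))  # blue 0
```

The interface is identical, which is precisely why the choice of backing structure is a performance question rather than an API question.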

en cs.PL, cs.DB
CrossRef Open Access 2011
Reading across the Germanic languages: Is equal access just wishful thinking?

Nicole Marx

To meet increasing needs for multilingual speakers, varied concepts such as Intercomprehension Pedagogy have evolved. Here, speakers of one language concurrently develop reading skills in multiple etymologically related languages. A prerequisite of such projects is information about mutual comprehensibility, including which languages might be easier to comprehend in which situations and for which learners. The present study investigated 73 German university students who read a short text in an unknown Germanic language (Danish, Dutch, Icelandic, Norwegian or Swedish) and then answered specific questions on content and structure. Results showed significantly higher success rates on all measures for Dutch, with isolated significant differences between the other languages and higher success in decoding lexical, in comparison to structural, information. No individual learner variables were found to be correlated with success. These data have implications for planning courses in Germanic intercomprehension, especially regarding the time allotted to the various languages.

1 citation en

Page 55 of 15892