Results for "Greek philology and language"

Showing 20 of ~1,458,489 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

arXiv Open Access 2025
Logios: An open source Greek Polytonic Optical Character Recognition system

Konstantinos Perifanos, Dionisis Goutsos

In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.

en cs.CV, cs.CL
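
The convolutional-plus-recurrent design this abstract describes is the standard CRNN recipe for line-level OCR. A minimal sketch of that pattern in PyTorch follows; all layer sizes, the CTC-style blank class, and the character count are illustrative assumptions, not the released Logios model.

```python
# Minimal CRNN sketch for line-level OCR: CNN feature extractor,
# BiLSTM sequence model, per-timestep character logits for CTC.
# All sizes are illustrative; this is not the Logios architecture.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_chars: int, img_height: int = 32):
        super().__init__()
        # Convolutional stack: downsample, keep width as the time axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_dim = 128 * (img_height // 4)  # channels * remaining height
        self.rnn = nn.LSTM(feat_dim, 256, bidirectional=True, batch_first=True)
        self.head = nn.Linear(512, num_chars + 1)  # +1 for the CTC blank

    def forward(self, x):  # x: (batch, 1, H, W) grayscale line images
        f = self.cnn(x)                                 # (batch, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # width -> time steps
        out, _ = self.rnn(f)
        return self.head(out)                           # (batch, time, chars+1)

# num_chars=300 is a placeholder for a large polytonic Greek alphabet.
logits = CRNN(num_chars=300)(torch.randn(2, 1, 32, 256))
print(logits.shape)  # torch.Size([2, 64, 301])
```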
arXiv Open Access 2025
Steering Language Models in Multi-Token Generation: A Case Study on Tense and Aspect

Alina Klerings, Jannik Brinkmann, Daniel Ruffinelli et al.

Large language models (LLMs) are able to generate grammatically well-formed text, but how do they encode their syntactic knowledge internally? While prior work has focused largely on binary grammatical contrasts, in this work, we study the representation and control of two multidimensional hierarchical grammar phenomena - verb tense and aspect - and for each, identify distinct, orthogonal directions in residual space using linear discriminant analysis. Next, we demonstrate causal control over both grammatical features through concept steering across three generation tasks. Then, we use these identified features in a case study to investigate factors influencing effective steering in multi-token generation. We find that steering strength, location, and duration are crucial parameters for reducing undesirable side effects such as topic shift and degeneration. Our findings suggest that models encode tense and aspect in structurally organized, human-like ways, but effective control of such features during generation is sensitive to multiple factors and requires manual tuning or automated optimization.

en cs.CL
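
A minimal sketch of the two-step recipe in the abstract: fit a linear discriminant on residual-stream activations to obtain a unit direction for a grammatical contrast, then add a scaled copy of that direction to hidden states during generation. The shapes, the synthetic data, and the fixed steering strength are assumptions for illustration, not the authors' code.

```python
# Sketch: derive a steering direction with LDA, then nudge hidden states.
# Activations X (n_samples, d_model) with labels y (e.g. past vs. present)
# are assumed to have been collected from a model's residual stream.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_direction(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
    w = lda.coef_[0]                 # normal of the separating hyperplane
    return w / np.linalg.norm(w)     # unit steering direction

def steer(hidden: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Add the scaled direction to every token position's hidden state."""
    return hidden + strength * direction

# Toy demo with synthetic activations:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 64)), rng.normal(1, 1, (100, 64))])
y = np.array([0] * 100 + [1] * 100)
d = fit_direction(X, y)
h = rng.normal(0, 1, (10, 64))          # (seq_len, d_model) at one layer
print(steer(h, d, strength=4.0).shape)  # (10, 64)
```

As the abstract notes, the strength, location (layer), and duration (how many generated tokens receive the nudge) all need tuning to avoid topic shift and degeneration.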
arXiv Open Access 2025
Data Augmentation With Back translation for Low Resource languages: A case of English and Luganda

Richard Kimera, Dongnyeong Heo, Daniela N. Rim et al.

In this paper, we explore the application of Back translation (BT) as a semi-supervised technique to enhance Neural Machine Translation (NMT) models for the English-Luganda language pair, specifically addressing the challenges faced by low-resource languages. The purpose of our study is to demonstrate how BT can mitigate the scarcity of bilingual data by generating synthetic data from monolingual corpora. Our methodology involves developing custom NMT models using both publicly available and web-crawled data, and applying Iterative and Incremental Back translation techniques. We strategically select datasets for incremental back translation across multiple small datasets, which is a novel element of our approach. The results of our study show significant improvements, with translation performance for the English-Luganda pair exceeding previous benchmarks by more than 10 BLEU score units across all translation directions. Additionally, our evaluation incorporates comprehensive assessment metrics such as SacreBLEU, ChrF2, and TER, providing a nuanced understanding of translation quality. The conclusion drawn from our research confirms the efficacy of BT when strategically curated datasets are utilized, establishing new performance benchmarks and demonstrating the potential of BT in enhancing NMT models for low-resource languages.
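
The back-translation loop the abstract builds on can be stated compactly: translate monolingual target-side text with a reverse model, pair the outputs with their originals as synthetic bitext, retrain, and iterate. A schematic sketch, where `train` is a hypothetical stand-in for a real NMT toolkit's training API:

```python
# Schematic iterative back-translation loop. `train` is a hypothetical
# callable: it takes (source, target) pairs and returns a translate
# function. This is the general technique, not the paper's code.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source_sentence, target_sentence)

def iterative_back_translation(
    bitext: List[Pair],                 # real English-Luganda pairs
    mono_target: List[str],             # monolingual Luganda sentences
    train: Callable[[List[Pair]], Callable[[str], str]],
    rounds: int = 3,
) -> Callable[[str], str]:
    reverse_pairs = [(t, s) for s, t in bitext]
    for _ in range(rounds):
        back_model = train(reverse_pairs)           # target -> source
        synthetic = [(back_model(t), t) for t in mono_target]
        fwd_model = train(bitext + synthetic)       # real + synthetic data
        # Iterative variant: fold synthetic pairs into the reverse
        # direction too, so the next round's back-model improves as well.
        reverse_pairs = [(t, s) for s, t in bitext + synthetic]
    return fwd_model
```

The paper's incremental variant would additionally stage `mono_target` as several small curated datasets folded in one at a time.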

CrossRef Open Access 2024
Greek Motifs in the Works of M.Yu. Lermontov

I. S. Yukhnova

Aim. To highlight the specifics of Lermontov's perception of the history and culture of Greece and to analyze Greek motifs in his work. Methodology. The article uncovers the sources which shaped Lermontov's perception of Greece and the circle of works that use plots and images of classical antiquity and contain a response to the events of contemporary Greek history. The author uses biographical, cultural-historical, and historical-functional methods of research. Results. The article comprehensively presents the features of the artistic interpretation of the image of Greece in Lermontov's works and demonstrates that both the past (the classical era) and the present of the country are equally important to the author. The poet's works interpret mythological plots and images, as well as the Greek folk song "Olympus", the lyrical plot of which is qualitatively rethought. Research implications. The article systematically presents the Greek theme in Lermontov's works, which allows us to more fully reveal the problem of intercultural dialogue in the literature of the first half of the 19th century.

DOAJ Open Access 2024
Epigrams 4.4 and 6.93 as an Example of Martial's Self-Inspiration

Joanna Pieczonka

The article presents Martial's epigrams 4.4 and 6.93, which form a mini-cycle. They share a common topic, the same motifs, and an analogous form, the priamel. In both poems the bad scent of the protagonist (Bassa and Thais, respectively) is compared to other unpleasant smells. Among these comparisons certain odours recur: sulphur, fish, goat, leather, dyed material, and bad breath. Epigram 4.4 is built as a synthetic priamel, while 6.93 is an analytical one. The comparison of this pair of epigrams with another mini-cycle, 3.65 and 11.8, leads the author to the conclusion that Martial imitates his earlier poems in the ones written later. This phenomenon may be called a peculiar imitatio et aemulatio.

Philology. Linguistics, Greek language and literature. Latin language and literature
DOAJ Open Access 2024
The Humoral Map of the Brain in Medieval Uroscopy: The Example of the Treatise "De iudiciis urine" by the Fourteenth-Century Physician Thomas of Wroclaw

Karolina Szula

The article discusses assumptions about brain structure based on the example of a passage from the 14th-century medical treatise De iudiciis urine by Thomas of Wroclaw. Medieval scientists used to explain the functions of the brain by relying on humoral theory and uroscopy. They divided the brain into four vessels containing four elements: blood, phlegm, bile, and melancholy. This article explains why the structure of the brain was perceived this way and briefly discusses ancient beliefs about this organ which influenced medieval theories. The paper also presents a transcription and a Polish translation of a passage from Thomas's work containing a description of the brain's anatomy.

Philology. Linguistics, Greek language and literature. Latin language and literature
arXiv Open Access 2024
A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios

Samuel Ackerman, Ella Rabinovich, Eitan Farchi et al.

We evaluate the robustness of several large language models on multiple datasets. Robustness here refers to the relative insensitivity of the model's answers to meaning-preserving variants of its input. Benchmark datasets are constructed by introducing naturally occurring, non-malicious perturbations, or by generating semantically equivalent paraphrases of input questions or statements. We further propose a novel metric for assessing model robustness, and demonstrate its benefits in the non-adversarial scenario by empirical evaluation of several models on the created datasets.

en cs.CL, stat.AP
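
One simple way to operationalize "relative insensitivity to meaning-preserving variants" is an agreement rate over paraphrase groups. The sketch below is a generic consistency score of that kind, not necessarily the metric the paper proposes.

```python
# Generic consistency score over paraphrase groups: for each group of
# meaning-preserving variants, measure how often the model's answer
# matches its answer on the original phrasing. Illustrative robustness
# proxy only, not necessarily the paper's proposed metric.
from typing import Callable, List

def consistency(model: Callable[[str], str],
                groups: List[List[str]]) -> float:
    """groups[i][0] is the original input; the rest are paraphrases."""
    agree, total = 0, 0
    for variants in groups:
        reference = model(variants[0])
        for v in variants[1:]:
            agree += int(model(v) == reference)
            total += 1
    return agree / total if total else 1.0

# Toy model that answers by input length parity:
toy = lambda s: "yes" if len(s) % 2 == 0 else "no"
print(consistency(toy, [["ab", "ba", "abc"], ["x", "y"]]))  # 0.666...
```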
arXiv Open Access 2024
Training Data for Large Language Model

Yiming Ju, Huanhuan Ma

In 2022, with the release of ChatGPT, large-scale language models gained widespread attention. ChatGPT not only surpassed previous models in terms of parameters and the scale of its pretraining corpus but also achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data. This progress has led enterprises and research institutions to recognize that building smarter and more powerful models relies on rich and high-quality datasets. Consequently, the construction and optimization of datasets have become a critical focus in the field of artificial intelligence. This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models, covering aspects such as data scale, collection methods, data types and characteristics, processing workflows, and provides an overview of available open-source datasets.

en cs.AI
arXiv Open Access 2024
Filipino Benchmarks for Measuring Sexist and Homophobic Bias in Multilingual Language Models from Southeast Asia

Lance Calvin Lim Gamboa, Mark Lee

Bias studies on multilingual models confirm the presence of gender-related stereotypes in masked models processing languages with high NLP resources. We expand on this line of research by introducing Filipino CrowS-Pairs and Filipino WinoQueer: benchmarks that assess both sexist and anti-queer biases in pretrained language models (PLMs) handling texts in Filipino, a low-resource language from the Philippines. The benchmarks consist of 7,074 new challenge pairs resulting from our cultural adaptation of English bias evaluation datasets, a process that we document in detail to guide similar forthcoming efforts. We apply the Filipino benchmarks on masked and causal multilingual models, including those pretrained on Southeast Asian data, and find that they contain considerable amounts of bias. We also find that for multilingual models, the extent of bias learned for a particular language is influenced by how much pretraining data in that language a model was exposed to. Our benchmarks and insights can serve as a foundation for future work analyzing and mitigating bias in multilingual models.

en cs.CL
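
CrowS-Pairs-style benchmarks compare a model's preference between a stereotyping sentence and a minimally different variant; for masked models a common scoring choice is pseudo-log-likelihood. A hedged sketch of that standard recipe (the model name and scoring details are assumptions, not the paper's exact protocol):

```python
# Pseudo-log-likelihood (PLL) of a sentence under a masked LM: mask
# each token in turn and sum the log-probability of the true token.
# Standard recipe for CrowS-Pairs-style bias evaluation; the model
# choice and details here are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "xlm-roberta-base"  # assumption: any multilingual masked LM
tok = AutoTokenizer.from_pretrained(name)
mlm = AutoModelForMaskedLM.from_pretrained(name).eval()

@torch.no_grad()
def pll(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip special tokens
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# A pair counts toward bias if the stereotyping sentence scores higher:
prefers_stereotype = pll("stereotyping sentence") > pll("minimal variant")
```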
DOAJ Open Access 2023
Inscription from Caria Mentioning Antiochus III, Ptolemy V, and Philip V

Alessandro Rossini

The ‘Three Kings’ Inscription’ bears the fragment of a Rhodian arbitration between Bargylia and another town in Caria. It is an important document, for it describes major political events which affected western Caria on the eve of the Second Macedonian War (200‑197 BC). Above all, it makes it evident that Rhodes eventually came to know about the same Syro-Macedonian connivance against Egypt and the child king Ptolemy V which Polybius (3.2.8; 15.20) later described in grand terms. After a detailed status quaestionis, the importance of comparing the Rhodian, Ptolemaic, and Polybian perspectives on Antiochus III and Philip V's conduct is stressed here.

Ancient history, Greek philology and language
arXiv Open Access 2023
Unveiling Safety Vulnerabilities of Large Language Models

George Kour, Marcel Zalmanovici, Naama Zwerdling et al.

As large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ, designed to provoke such harmful or inappropriate responses. We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions - input semantic areas for which the model is likely to produce harmful outputs. This is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.

en cs.CL, cs.AI
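
A generic rendition of "vulnerable semantic region" discovery: embed the attack inputs, append a weighted harmfulness score as an extra feature so clusters respect both signals, cluster, and flag clusters with high mean harm. The weighting, clustering algorithm, and threshold are assumptions, not the paper's specialized technique.

```python
# Generic sketch of vulnerable-region discovery: cluster attack inputs
# by semantic embedding augmented with a harmfulness feature, then flag
# clusters whose mean harmfulness is high. The weighting and clustering
# choices below are assumptions, not the paper's specialized method.
import numpy as np
from sklearn.cluster import KMeans

def vulnerable_regions(embeddings: np.ndarray,   # (n, d) attack embeddings
                       harm: np.ndarray,         # (n,) harmfulness in [0, 1]
                       k: int = 8,
                       harm_weight: float = 2.0,
                       threshold: float = 0.5):
    features = np.hstack([embeddings, harm_weight * harm[:, None]])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    regions = []
    for c in range(k):
        members = labels == c
        if harm[members].mean() > threshold:
            regions.append((c, members.nonzero()[0]))  # cluster id, indices
    return regions

rng = np.random.default_rng(1)
emb = rng.normal(size=(200, 16))
harm = rng.uniform(size=200)
print([c for c, _ in vulnerable_regions(emb, harm)])
```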
arXiv Open Access 2023
Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP

Wei Du, Laksh Advani, Yashmeet Gambhir et al.

Large language models (LLMs) have demonstrated significant capability to generalize across a large number of NLP tasks. For industry applications, it is imperative to assess the performance of the LLM on unlabeled production data from time to time to validate it for a real-world setting. Human labeling to assess model error requires considerable expense and time delay. Here we demonstrate that ensemble disagreement scores work well as a proxy for human labeling for language models in zero-shot, few-shot, and fine-tuned settings, per our evaluation on the keyphrase extraction (KPE) task. We measure the fidelity of the results by comparing to true error measured from human-labeled ground truth. We contrast this with the alternative of using another LLM as a source of machine labels, or silver labels. Results across various languages and domains show that disagreement scores provide a better estimation of model performance, with mean average error (MAE) as low as 0.4% and on average 13.8% better than using silver labels.

en cs.CL
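
The idea is that where ensemble members disagree, true error tends to be higher, so the disagreement rate can stand in for human-labeled error. A sketch using a plain pairwise-disagreement rate and its MAE against a human-measured error value; this exact score definition is an illustrative assumption, not necessarily the paper's.

```python
# Pairwise disagreement rate across an ensemble as an error proxy, and
# its MAE against true error from human labels. The plain pairwise
# definition is an illustrative choice, not the paper's exact score.
from itertools import combinations
from typing import List

def disagreement(predictions: List[List[str]]) -> float:
    """predictions[m][i] = model m's output on example i."""
    n = len(predictions[0])
    pairs = list(combinations(range(len(predictions)), 2))
    mismatches = sum(predictions[a][i] != predictions[b][i]
                     for a, b in pairs for i in range(n))
    return mismatches / (len(pairs) * n)

def mae(estimate: float, true_error: float) -> float:
    return abs(estimate - true_error)

preds = [["x", "y", "z", "x"],   # model 1
         ["x", "y", "w", "x"],   # model 2
         ["x", "q", "z", "x"]]   # model 3
est = disagreement(preds)        # 4 mismatches / (3 pairs * 4 examples)
print(est, mae(est, true_error=0.25))
```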
arXiv Open Access 2021
Schrödinger's Tree -- On Syntax and Neural Language Models

Artur Kulmizev, Joakim Nivre

In the last half-decade, the field of natural language processing (NLP) has undergone two major transitions: the switch to neural networks as the primary modeling paradigm and the homogenization of the training regime (pre-train, then fine-tune). Amidst this process, language models have emerged as NLP's workhorse, displaying increasingly fluent generation capabilities and proving to be an indispensable means of knowledge transfer downstream. Due to the otherwise opaque, black-box nature of such models, researchers have employed aspects of linguistic theory in order to characterize their behavior. Questions central to syntax -- the study of the hierarchical structure of language -- have factored heavily into such work, yielding invaluable insights about models' inherent biases and their ability to make human-like generalizations. In this paper, we attempt to take stock of this growing body of literature. In doing so, we observe a lack of clarity across numerous dimensions, which influences the hypotheses that researchers form, as well as the conclusions they draw from their findings. To remedy this, we urge researchers to make careful considerations when investigating coding properties, selecting representations, and evaluating via downstream tasks. Furthermore, we outline the implications of the different types of research questions exhibited in studies on syntax, as well as the inherent pitfalls of aggregate metrics. Ultimately, we hope that our discussion adds nuance to the prospect of studying language models and paves the way for a less monolithic perspective on syntax in this context.

en cs.CL
arXiv Open Access 2021
One to rule them all: Towards Joint Indic Language Hate Speech Detection

Mehar Bhatia, Tenzin Singhay Bhotia, Akshat Agarwal et al.

This paper is a contribution to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) 2021 shared task. Social media today is a hotbed of toxic and hateful conversations in various languages. Recent news reports have shown that current models struggle to automatically identify hate posted in minority languages. Therefore, efficiently curbing hate speech is a critical challenge and problem of interest. We present a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection across three languages, namely English, Hindi, and Marathi. On the provided testing corpora, we achieve Macro F1 scores of 0.7996, 0.7748, 0.8651 for sub-task 1A and 0.6268, 0.5603 during the fine-grained classification of sub-task 1B. These results show the efficacy of exploiting a multilingual training scheme.

en cs.CL
arXiv Open Access 2021
Differentially Private Fine-tuning of Language Models

Da Yu, Saurabh Naik, Arturs Backurs et al.

We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve the state-of-the-art privacy versus utility tradeoffs on many standard NLP tasks. We propose a meta-framework for this problem, inspired by the recent success of highly parameter-efficient methods for fine-tuning. Our experiments show that differentially private adaptations of these approaches outperform previous private algorithms in three important dimensions: utility, privacy, and the computational and memory cost of private training. On many commonly studied datasets, the utility of private models approaches that of non-private models. For example, on the MNLI dataset we achieve an accuracy of $87.8\%$ using RoBERTa-Large and $83.5\%$ using RoBERTa-Base with a privacy budget of $\varepsilon = 6.7$. In comparison, absent privacy constraints, RoBERTa-Large achieves an accuracy of $90.2\%$. Our findings are similar for natural language generation tasks. Privately fine-tuning GPT-2-Small, GPT-2-Medium, GPT-2-Large, and GPT-2-XL on DART achieves BLEU scores of 38.5, 42.0, 43.1, and 43.8 respectively (privacy budget of $\varepsilon = 6.8$, $\delta = 10^{-5}$), whereas the non-private baseline is $48.1$. All our experiments suggest that larger models are better suited for private fine-tuning: while they are well known to achieve superior accuracy non-privately, we find that they also better maintain their accuracy when privacy is introduced.

en cs.LG, cs.CL
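
The core mechanism behind differentially private fine-tuning is DP-SGD: clip each example's gradient, average, and add calibrated Gaussian noise. A from-scratch sketch of one such step; real use would go through a library such as Opacus, combine it with the parameter-efficient methods the paper studies, and properly account the (epsilon, delta) budget.

```python
# One DP-SGD step from first principles: clip each example's gradient
# to max_norm, sum, add Gaussian noise, average, and apply. Minimal
# sketch of the mechanism only; no privacy accounting is done here.
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y,
                lr=1e-3, max_norm=1.0, noise_multiplier=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):          # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in params]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, max_norm / (norm + 1e-12))    # clip to max_norm
        for s, g in zip(summed, grads):
            s += g * scale
    n = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(p) * noise_multiplier * max_norm
            p -= lr * (s + noise) / n

model = torch.nn.Linear(8, 2)
x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))
dp_sgd_step(model, torch.nn.CrossEntropyLoss(), x, y)
```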
arXiv Open Access 2021
Primer: Searching for Efficient Transformers for Language Modeling

David R. So, Wojciech Mańke, Hanxiao Liu et al.

Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer and other variants for auto-regressive language modeling. Primer's improvements can be mostly attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We also verify empirically that Primer can be dropped into different codebases to significantly speed up training without additional tuning. For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X. Furthermore, the reduced training cost means Primer needs much less compute to reach a target one-shot performance. For instance, in a 1.9B parameter configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer. We open source our models and several comparisons in T5 to help with reproducibility.

en cs.LG, cs.AI
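
The two Primer modifications the abstract names are easy to state concretely: square the ReLU activations in the feed-forward block, and follow each Q, K, and V projection with a depthwise convolution over the sequence. A sketch of both in PyTorch; the dimensions, kernel size, and per-channel (rather than per-head) convolution are illustrative assumptions.

```python
# The two Primer modifications named in the abstract: (1) squared ReLU
# for the feed-forward block, (2) a causal depthwise convolution over
# the sequence after a Q/K/V projection. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    return F.relu(x) ** 2

class DepthwiseConvProjection(nn.Module):
    """Linear projection followed by a causal depthwise conv over time."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              groups=d_model,               # depthwise
                              padding=kernel_size - 1)      # causal pad

    def forward(self, x):                  # x: (batch, seq, d_model)
        h = self.proj(x).transpose(1, 2)   # (batch, d_model, seq)
        h = self.conv(h)[..., :x.size(1)]  # trim the causal overhang
        return h.transpose(1, 2)

q_proj = DepthwiseConvProjection(d_model=64)
print(q_proj(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```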

Page 21 of 72,925