Results for "Greek philology and language"

Showing 20 of ~1,456,626 results · from DOAJ, CrossRef, Semantic Scholar, arXiv

arXiv Open Access 2025
Large language models have learned to use language

Gary Lupyan

Acknowledging that large language models have learned to use language can open doors to breakthrough language science. Achieving these breakthroughs may require abandoning some long-held ideas about how language knowledge is evaluated and reckoning with the difficult fact that we have entered a post-Turing test era.

en cs.CL
arXiv Open Access 2025
Natural Language Generation

Emiel van Miltenburg, Chenghua Lin

This article provides a brief overview of the field of Natural Language Generation. The term Natural Language Generation (NLG), in its broadest definition, refers to the study of systems that verbalize some form of information through natural language. That information could be stored in a large database or knowledge graph (in data-to-text applications), but NLG researchers may also study summarisation (text-to-text) or image captioning (image-to-text), for example. As a subfield of Natural Language Processing, NLG is closely related to other sub-disciplines such as Machine Translation (MT) and Dialog Systems. Some NLG researchers exclude MT from their definition of the field, since there is no content selection involved where the system has to determine what to say. Conversely, dialog systems do not typically fall under the header of Natural Language Generation since NLG is just one component of dialog systems (the others being Natural Language Understanding and Dialog Management). However, with the rise of Large Language Models (LLMs), different subfields of Natural Language Processing have converged on similar methodologies for the production of natural language and the evaluation of automatically generated text.

en cs.CL
arXiv Open Access 2025
Small Language Models Reshape Higher Education: Courses, Textbooks, and Teaching

Jian Zhang, Jia Shao

While large language models (LLMs) have introduced novel paradigms in science and education, their adoption in higher education is constrained by inherent limitations. These include a tendency to produce inaccuracies and high computational requirements, which compromise the strict demands for accurate and reliable knowledge essential in higher education. Small language models (MiniLMs), by contrast, offer distinct advantages in professional education due to their lightweight nature and precise retrieval capabilities. This research takes "Atmospheric Physics" as an example. We established a specialized corpus and image repository by gathering over 550,000 full-text PDFs from over 130 well-respected international journals in Earth and environmental science. From this collection, we extracted over 100 million high-quality sentence-level corpus entries and more than 3 million high-resolution academic images. Using MiniLMs, these resources were organized into a high-dimensional vector library for precise retrieval and efficient utilization of extensive educational content. Consequently, we systematically redesigned the courses, textbooks, and teaching strategies for "Atmospheric Physics" based on MiniLMs. The course is designed as an "interdisciplinary-frontier" system, breaking down traditional boundaries between atmospheric science, space science, hydrology, and remote sensing. Teaching materials are transformed from static, lagging text formats into a dynamic digital resource library powered by MiniLM. For teaching methods, we have designed a question-based learning pathway. This paradigm promotes a shift from passive knowledge transfer to active cognitive development. Consequently, this MiniLM-driven "Atmospheric Physics" course demonstrates a specific avenue for "AI for education".

en physics.ed-ph, cs.CL
arXiv Open Access 2025
Searching for the Most Human-like Emergent Language

Brendon Boldt, David Mortensen

In this paper, we design a signalling game-based emergent communication environment to generate state-of-the-art emergent languages in terms of similarity to human language. This is done with hyperparameter optimization, using XferBench as the objective function. XferBench quantifies the statistical similarity of emergent language to human language by measuring its suitability for deep transfer learning to human language. Additionally, we demonstrate the predictive power of entropy on the transfer learning performance of emergent language as well as corroborate previous results on the entropy-minimization properties of emergent communication systems. Finally, we report generalizations regarding what hyperparameters produce more realistic emergent languages, that is, ones which transfer better to human language.

en cs.CL
DOAJ Open Access 2024
Juegos de enamorados en De nuptiis Philologiae et Mercurii de Marciano Capela

Julieta Cardigni

In his work De nuptiis Philologiae et Mercurii, Martianus Capella presents an allegorical epithalamium (Books I and II) in which Mercury seeks a wife and Philology is chosen for the role. Books III to IX recount the wedding banquet, at which the Liberal Arts, Philology's bridesmaids, expound the contents of the disciplines they represent. Alongside the frequent “direct” readings of the work, which treat it as a valuable source of information, other readings focus on elements such as fiction and parody, which turn this handbook into a controversial Menippean satire that criticizes and ridicules models of knowledge rather than proposing them. In line with this interpretation, this paper focuses on the opening section of the allegorical epithalamium (II.99-109), in which Philology plays a numerical-literary game to find out whether her name and that of her betrothed are compatible for married life. We seek to show that the mathematical game, of Pythagorean and Hermetic tradition, is subverted by Martianus's pen and put at the service of the game of love as one more device in his parodic composition.

Philology. Linguistics, Greek language and literature. Latin language and literature
arXiv Open Access 2024
Morphological evaluation of subwords vocabulary used by BETO language model

Óscar García-Sierra, Ana Fernández-Pampillón Cesteros, Miguel Ortega-Martín

Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always align with real morphemes, potentially impacting the models' performance, though it remains uncertain when this might occur. In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language. Our evaluation method was built on three quality measures, relevance, cohesion, and morphological accuracy, and a procedure for their assessment. By applying this method to vocabularies created by three subword tokenization algorithms, BPE, Wordpiece, and Unigram, we concluded that these vocabularies generally exhibit very low morphological quality. In this article, we apply this evaluation to the tokenizer of BETO, a BERT language model trained on large Spanish corpora. This evaluation, along with our previous results, helped us conclude that its vocabulary has a low morphological quality, and we also found that training the tokenizer in a larger corpus does not improve the morphological quality of the generated vocabulary. Additionally, this evaluation helps clarify the algorithm used by the tokenizer, that is, Wordpiece, given the inconsistencies between the authors' claims and the model's configuration.

en cs.CL, cs.AI
arXiv Open Access 2024
Manipulating language models' training data to study syntactic constraint learning: the case of English passivization

Cara Su-Yi Leong, Tal Linzen

Grammatical rules in natural languages are often characterized by exceptions. How do language learners learn these exceptions to otherwise general patterns? Here, we study this question through the case study of English passivization. While passivization is in general quite productive, there are cases where it cannot apply (cf. the following sentence is ungrammatical: *One hour was lasted by the meeting). Using neural network language models as theories of language acquisition, we explore the sources of indirect evidence that a learner can leverage to learn whether a verb can be passivized. We first characterize English speakers' judgments of exceptions to the passive, and confirm that speakers find some verbs more passivizable than others. We then show that a neural network language model's verb passivizability judgments are largely similar to those displayed by humans, suggesting that evidence for these exceptions is available in the linguistic input. Finally, we test two hypotheses as to the source of evidence that language models use to learn these restrictions: frequency (entrenchment) and semantics (affectedness). We do so by training models on versions of the corpus that have had sentences of the types implicated by each hypothesis removed, altered, or introduced. We find support for both hypotheses: entrenchment and affectedness make independent contributions to a verb's passivizability. From a methodological point of view, this study highlights the utility of altering a language model's training data for answering questions where complete control over a learner's input is vital.

en cs.CL
arXiv Open Access 2024
A Survey of Large Language Models for Arabic Language and its Dialects

Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks, such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and underscores the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.

en cs.CL, cs.AI
arXiv Open Access 2024
Model-based Large Language Model Customization as Service

Zhaomin Wu, Jizhou Guo, Junyi Hou et al.

Prominent Large Language Model (LLM) services from providers like OpenAI and Google excel at general tasks but often underperform on domain-specific applications. Current customization services for these LLMs typically require users to upload data for fine-tuning, posing significant privacy risks. While differentially private (DP) data synthesis presents a potential alternative, its application commonly results in low effectiveness due to the introduction of excessive noise on data for DP. To overcome this, we introduce Llamdex, a novel framework that facilitates LLM customization as a service, where the client uploads pre-trained domain-specific models rather than data. This client-uploaded model, optionally protected by DP with much lower noise, is inserted into the base LLM via connection modules. Significantly, these connecting modules are trained without requiring sensitive domain data, enabling clients to customize LLM services while preserving data privacy. Experiments demonstrate that Llamdex improves domain-specific accuracy by up to 26% over state-of-the-art private data synthesis methods under identical privacy constraints and, by obviating the need for users to provide domain context within queries, maintains inference efficiency comparable to the original LLM service.

en cs.LG, cs.AI
arXiv Open Access 2024
RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning

Yinpei Dai, Jayjun Lee, Nima Fazeli et al.

Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLBench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks and zero-shot unseen tasks, achieving superior performance in both simulated and real-world environments. Videos and code are available at: https://rich-language-failure-recovery.github.io.

en cs.RO, cs.CL
arXiv Open Access 2024
Instruct-Tuning Pretrained Causal Language Models for Ancient Greek Papyrology and Epigraphy

Eric Cullhed

This article presents an experiment in fine-tuning a pretrained causal language model (Meta's Llama 3.1 8B Instruct) to assist with restoring missing or illegible characters in ancient Greek inscriptions and documentary papyri. Utilizing a straightforward instruction-based approach and a 95%/5% train/test split, the papyrus restoration model achieved a character error rate (CER) of 14.9%, a top-1 accuracy of 73.5%, and a top-20 accuracy of 86.0% for sequences up to 10 characters. A model was also fine-tuned for geographic attribution, reaching a top-1 accuracy of 66.4% and a top-3 accuracy of 79.9%. In chronological attribution, it demonstrated an average deviation of 21.7 years from the actual terminus post/ante quem, with a median deviation of 0 years. For inscriptions, the restoration model achieved a CER of 20.5%, a top-1 accuracy of 63.7%, and a top-20 accuracy of 83.0% for sequences up to 10 characters. In geographic attribution, it attained a top-1 accuracy of 75.0% and a top-3 accuracy of 83.7%, while in dating, it had an average deviation of 37.1 years and a median deviation of 3 years from the actual date range. Benchmarked against the state-of-the-art model (Ithaca) on a shared test set and on recently edited inscriptions, the instruction-tuned models excelled in text restoration, while also offering the practical advantage of ignoring spaces during reconstruction, which aligns with the scriptio continua of ancient textual artifacts. However, their performance in geographic and chronological attribution was lower than Ithaca's. To evaluate the approach in a more even setup, the instruction model was retrained with an 80%/10%/10% train-validation-test split, and still outperformed Ithaca in text restoration. The results suggest that fine-tuning larger pretrained causal language models using instruction templates for emendations and conjectures to ancient texts holds promise.

en cs.CL, cs.AI
arXiv Open Access 2023
GreekT5: A Series of Greek Sequence-to-Sequence Models for News Summarization

Nikolaos Giarelis, Charalampos Mastrokostas, Nikos Karacapilidis

Text summarization (TS) is a natural language processing (NLP) subtask pertaining to the automatic formulation of a concise and coherent summary that covers the major concepts and topics from one or multiple documents. Recent advancements in deep learning have led to the development of abstractive summarization transformer-based models, which outperform classical approaches. In any case, research in this field focuses on high resource languages such as English, while the corresponding work for low resource languages is still underdeveloped. Taking the above into account, this paper proposes a series of novel TS models for Greek news articles. The proposed models were thoroughly evaluated on the same dataset against GreekBART, which is the state-of-the-art model in Greek abstractive news summarization. Our evaluation results reveal that most of the proposed models significantly outperform GreekBART on various evaluation metrics. We make our evaluation code public, aiming to increase the reproducibility of this work and facilitate future research in the field.

en cs.CL, cs.AI
DOAJ Open Access 2022
Spiegare l’errore. A proposito di alcune recenti linee di ricerca

Chiara Rover

This contribution aims to present and critically discuss some of the nine essays in the collection Axiological Confusion and Its Causes («ΠΗΓΗ / FONS Revista de estudios sobre la civilización clásica y su recepción» 5, Special Issue, Universidad Carlos III de Madrid. Instituto de Estudios Clásicos “Lucio Anneo Séneca”, Madrid 2020), edited by E. Malaspina and J. Wildberger. The note pays particular attention to certain problems of Epicurean epistemology touched on by the authors, with the aim of reflecting once again on highly controversial topics such as ἐπιβολή and πρόληψις, an understanding of which proves fundamental to explaining how error arises, both epistemological and, consequently, ethical.

Philology. Linguistics, Greek language and literature. Latin language and literature
arXiv Open Access 2022
Inferring Rewards from Language in Context

Jessy Lin, Daniel Fried, Dan Klein et al.

In classic instruction following, language like "I'd like the JetBlue flight" maps to actions (e.g., selecting that flight). However, language also conveys information about a user's underlying reward function (e.g., a general preference for JetBlue), which can allow a model to carry out desirable actions in new contexts. We present a model that infers rewards from language pragmatically: reasoning about how speakers choose utterances not only to elicit desired actions, but also to reveal information about their preferences. On a new interactive flight-booking task with natural language, our model more accurately infers rewards and predicts optimal actions in unseen environments, in comparison to past work that first maps language to actions (instruction following) and then maps actions to rewards (inverse reinforcement learning).

en cs.CL, cs.AI
arXiv Open Access 2022
Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

Rodolfo Zevallos, Luis Camacho, Nelsi Melgarejo

The Huqariq corpus is a multilingual collection of speech from native Peruvian languages. The transcribed corpus is intended for the research and development of speech technologies to preserve endangered languages in Peru. Huqariq is primarily designed for the development of automatic speech recognition, language identification and text-to-speech tools. In order to achieve corpus collection sustainably, we employ the crowdsourcing methodology. Huqariq includes four native languages of Peru, and it is expected that by the end of the year 2022, it can reach up to 20 native languages out of the 48 native languages in Peru. The corpus has 220 hours of transcribed audio recorded by more than 500 volunteers, making it the largest speech corpus for native languages in Peru. In order to verify the quality of the corpus, we present speech recognition experiments using 220 hours of fully transcribed audio.

en cs.CL
arXiv Open Access 2022
A Survey in Mathematical Language Processing

Jordan Meadows, Andre Freitas

Informal mathematical text underpins real-world quantitative reasoning and communication. Developing sophisticated methods of retrieval and abstraction from this dual modality is crucial in the pursuit of the vision of automating discovery in quantitative science and mathematics. We track the development of informal mathematical language processing approaches across five strategic sub-areas in recent years, highlighting the prevailing successful methodological elements along with existing limitations.

en cs.CL
