This volume presents the results of a collaboration between researchers at the Scuola Normale Superiore, Pisa, and Princeton University. It introduces readers to Logion, the first deep neural network designed to support philologists in editing Greek texts that have reached us via manuscript transmission. The model offers suggestions for how to fill lacunae; it also helps identify and emend portions of text that seem to have been corrupted in the process of textual transmission. The different chapters explore the theory and historical practice of philology, present the AI model we designed, and make use of Logion in elucidating passages of premodern Greek. The focus is on the vast corpus of the Byzantine polymath Michael Psellos, which represents an excellent case study from both a philological and a technical perspective. Overall, the aim of this volume is both practical and intellectual. It seeks to encourage a broader dialogue between artificial intelligence and Greek philology and to present a concrete model for the use of machine learning in elucidating premodern texts.
Results for "Greek philology and language"
Showing 20 of ~1,458,009 results · from CrossRef, DOAJ, arXiv
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov
Natural Language Processing (NLP) is becoming a dominant subset of artificial intelligence as the need to help machines understand human language becomes indispensable. Several NLP applications are ubiquitous, partly due to the myriad of datasets being churned out daily through media such as social networking sites. However, this growing development has not been evident in most African languages due to persisting resource limitations, among other issues. The Yorùbá language, a tonal and morphologically rich African language, suffers a similar fate, resulting in limited NLP usage. To encourage further research towards improving this situation, this systematic literature review comprehensively analyses studies addressing NLP development for Yorùbá, identifying challenges, resources, techniques, and applications. A well-defined search string from a structured protocol was employed to search, select, and analyse 105 primary studies published between 2014 and 2024 from reputable databases. The review highlights the scarcity of annotated corpora, the limited availability of pre-trained language models, and linguistic challenges such as tonal complexity and diacritic dependency as significant obstacles. It also identifies the most prominent techniques, including rule-based methods. The findings reveal a growing body of multilingual and monolingual resources, even though the field is constrained by socio-cultural factors such as code-switching and the abandonment of the language in digital contexts. This review synthesises existing research, providing a foundation for advancing NLP for Yorùbá and African languages generally. It aims to guide future research by identifying gaps and opportunities, thereby contributing to the broader inclusion of Yorùbá and other under-resourced African languages in global NLP advancements.
Neelabh Sinha
Language Models (LMs) have revolutionized natural language processing, enabling high-quality text generation through prompting and in-context learning. However, models often struggle with long-context summarization due to positional biases, leading to suboptimal extraction of critical information. Existing remedies involve fine-tuning, pipelining, or other complex techniques, each with its own challenges. To solve these challenges, we propose QA-prompting - a simple prompting method for summarization that utilizes question-answering as an intermediate step prior to summary generation. Our method extracts key information and enriches the context of text to mitigate positional biases and improve summarization in a single LM call per task without requiring fine-tuning or pipelining. Experiments on multiple datasets belonging to different domains using ten state-of-the-art pre-trained models demonstrate that QA-prompting outperforms baseline and other state-of-the-art methods, achieving up to 29% improvement in ROUGE scores. This provides an effective and scalable solution for summarization and highlights the importance of domain-specific question selection for optimal performance.
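The core idea above (answer guiding questions first, then summarize, all in one LM call) can be sketched as a prompt-assembly function. The question wording and structure here are illustrative assumptions, not the paper's actual prompts:

```python
def qa_prompt(document: str, questions: list[str]) -> str:
    """Assemble a single QA-prompting request: the model first answers
    guiding questions about the document, then writes the summary in
    the same call. Question wording is hypothetical."""
    qa_block = "\n".join(f"- {q}" for q in questions)
    return (
        "Read the document below.\n\n"
        f"Document:\n{document}\n\n"
        "Step 1. Answer these questions using only the document:\n"
        f"{qa_block}\n\n"
        "Step 2. Using your answers to surface key information from all "
        "parts of the document, write a concise summary."
    )

# Example: domain-specific questions for a news-style document.
prompt = qa_prompt(
    "ACME Corp reported record revenue in the third quarter...",
    ["Who is the document about?", "What happened?", "When did it happen?"],
)
```

Because the questions force the model to attend to specific facts before summarizing, information buried mid-context is surfaced into the (recent) answer text, which is one plausible reading of how the method mitigates positional bias.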
Keliang Liu, Dingkang Yang, Ziyun Qian et al.
In recent years, training methods centered on Reinforcement Learning (RL) have markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs), particularly in understanding human intents, following user instructions, and bolstering inferential strength. Although existing surveys offer overviews of RL-augmented LLMs, their scope is often limited, failing to provide a comprehensive summary of how RL operates across the full lifecycle of LLMs. We systematically review the theoretical and practical advancements whereby RL empowers LLMs, especially Reinforcement Learning with Verifiable Rewards (RLVR). First, we briefly introduce the basic theory of RL. Second, we thoroughly detail application strategies for RL across various phases of the LLM lifecycle, including pre-training, alignment fine-tuning, and reinforced reasoning. In particular, we emphasize that RL methods in the reinforced reasoning phase serve as a pivotal driving force for advancing model reasoning to its limits. Next, we collate existing datasets and evaluation benchmarks currently used for RL fine-tuning, spanning human-annotated datasets, AI-assisted preference data, and program-verification-style corpora. Subsequently, we review the mainstream open-source tools and training frameworks available, providing clear practical references for subsequent research. Finally, we analyse the future challenges and trends in the field of RL-enhanced LLMs. This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs, with the goal of fostering the evolution of LLMs that are more intelligent, generalizable, and secure.
Daniel Teubner
Xudong Lu, Qi Liu, Yuhui Xu et al.
A pivotal advancement in the progress of large language models (LLMs) is the emergence of Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs, MoE LLMs can achieve higher performance with fewer active parameters, but they remain hard to deploy due to their immense total parameter sizes. Different from previous weight pruning methods that rely on specifically designed hardware, this paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques. Specifically, we propose, for the first time to the best of our knowledge, post-training approaches for task-agnostic and task-specific expert pruning and skipping of MoE LLMs, tailored to improve deployment efficiency while maintaining model performance across a wide range of tasks. Extensive experiments show that our proposed methods can simultaneously reduce model sizes and increase the inference speed, while maintaining satisfactory performance. Data and code will be available at https://github.com/Lucky-Lance/Expert_Sparsity.
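A minimal sketch of expert-level pruning: score each expert by its aggregate routing weight over a small calibration set, then retain only the top-scoring experts. The scoring rule here is a simplification for illustration, not the paper's exact criterion:

```python
def prune_experts(routing_weights, keep: int):
    """Task-agnostic expert-pruning sketch: score each expert by its
    total routing weight over calibration tokens and keep the top
    `keep` experts. `routing_weights` is a list of per-token lists,
    one weight per expert. (Simplified illustration, not the paper's
    exact criterion.)"""
    num_experts = len(routing_weights[0])
    scores = [0.0] * num_experts
    for token_weights in routing_weights:
        for e, w in enumerate(token_weights):
            scores[e] += w
    ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:keep])  # indices of experts to retain

# Toy calibration pass: 3 tokens routed over 4 experts.
kept = prune_experts(
    [[0.7, 0.1, 0.1, 0.1],
     [0.6, 0.2, 0.1, 0.1],
     [0.1, 0.1, 0.7, 0.1]],
    keep=2,
)
# kept == [0, 2]: experts 1 and 3 are rarely routed to and can be dropped.
```

Because pruning happens post-training and only removes whole experts, the remaining network runs unchanged on standard hardware, which is the "plug-and-play" property the abstract emphasizes.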
Gihun Lee, Minchan Jeong, Yujin Kim et al.
While learning to align Large Language Models (LLMs) with human preferences has shown remarkable success, aligning these models to meet diverse user preferences presents further challenges in preserving previous knowledge. This paper examines the impact of personalized preference optimization on LLMs, revealing that the extent of knowledge loss varies significantly with preference heterogeneity. Although previous approaches have utilized the KL constraint between the reference model and the policy model, we observe that they fail to maintain general knowledge and alignment when facing personalized preferences. To this end, we introduce Base-Anchored Preference Optimization (BAPO), a simple yet effective approach that utilizes the initial responses of the reference model to mitigate forgetting while accommodating personalized alignment. BAPO effectively adapts to diverse user preferences while minimally affecting global knowledge or general alignment. Our experiments demonstrate the efficacy of BAPO in various setups.
Giuseppe De Gregorio, Lavinia Ferretti, Rodrigo C. G. Pena et al.
The study of Greek papyri from ancient Egypt is fundamental for understanding Graeco-Roman Antiquity, offering insights into various aspects of ancient culture and textual production. Palaeography, traditionally used for dating these manuscripts, relies on identifying chronologically relevant features in handwriting styles yet lacks a unified methodology, resulting in subjective interpretations and inconsistencies among experts. Recent advances in digital palaeography, which leverage artificial intelligence (AI) algorithms, have introduced new avenues for dating ancient documents. This paper presents a comparative analysis between an AI-based computational dating model and human expert palaeographers, using a novel dataset named Hell-Date, comprising Greek papyri from the Hellenistic period with secure, fine-grained dates. The methodology involves training a convolutional neural network on visual inputs from Hell-Date to predict precise dates of papyri. In addition, experts provide palaeographic dating for comparison. To compare the two, we developed a new framework for error analysis that reflects the inherent imprecision of the palaeographic dating method. The results indicate that the computational model achieves performance comparable to that of human experts. These elements will provide a more solid basis for assessing future developments of computational algorithms for dating Greek papyri.
Todd K Moon, Jacob H. Gunther
Over the years there has been ongoing interest in detecting the authorship of a text from its statistical properties, such as the occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine the authorship of all of The Federalist Papers. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted with a statistical natural language parser. The paper provides a proof of concept, testing author classification based on grammatical structure on a set of "proof texts", The Federalist Papers and Sanditon, which have been used as test cases in previous authorship detection studies. Several features extracted from the statistical natural language parser were explored: all subtrees of some depth from any level; rooted subtrees of some depth; part of speech; and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower-dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
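One of the feature families named above, part of speech by level in the parse tree, can be sketched with a toy tree representation. The nested-tuple encoding and the tag set are illustrative assumptions; the paper works with the output of a full statistical parser:

```python
from collections import Counter

def pos_by_level(tree, level=0, counts=None):
    """Count syntactic labels by depth in a parse tree, one of the
    feature families described above. A tree is (label, children...);
    leaves are (POS, word). Toy representation, for illustration."""
    if counts is None:
        counts = Counter()
    label, *children = tree
    counts[(level, label)] += 1
    for child in children:
        if isinstance(child, tuple):  # skip word strings at the leaves
            pos_by_level(child, level + 1, counts)
    return counts

# "The cat sleeps" as a toy parse:
tree = ("S", ("NP", ("DT", "The"), ("NN", "cat")), ("VP", ("VBZ", "sleeps")))
features = pos_by_level(tree)
# e.g. features[(1, "NP")] == 1 and features[(2, "DT")] == 1
```

Counting such (depth, label) pairs over a whole corpus yields a high-dimensional feature vector per document, which motivates the projection into a lower-dimensional space that the paper reports as helpful.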
Harish Tayyar Madabushi
We present a novel extension to Retrieval Augmented Generation with the goal of mitigating factual inaccuracies in the output of large language models. Specifically, our method draws on the cognitive linguistic theory of frame semantics for the indexing and retrieval of factual information relevant to helping large language models answer queries. We conduct experiments to demonstrate the effectiveness of this method both in terms of retrieval effectiveness and in terms of the relevance of the frames and frame relations automatically generated. Our results show that this novel mechanism of Frame Semantic-based retrieval, designed to improve Retrieval Augmented Generation (FS-RAG), is effective and offers potential for providing data-driven insights into frame semantics theory. We provide open access to our program code and prompts.
Massinelli, Rebecca
The contribution offers the first edition of a small lead olla preserved in the storerooms of the National Archaeological Museum of Taranto, made from a bivalve matrix and intended for the trade in medicinal ointments or eye drops. The fairly well preserved object bears a matrix inscription consisting of an idionym in the genitive case: Παμφίλου, from Pamphilos, identifying the φαρμακοπώλης according to whose recipe the medicamentum was made. The name is abundantly attested in Attica, and the object therefore seems referable, in terms of its form and material of manufacture, to the ‘Athenian or Central-Mediterranean’ area.
Alexander Ramos Lopes
Marucci, Marta
The metrical inscription, discovered at Thyrreion in Acarnania and transcribed by a local inhabitant, has since proved untraceable. The text narrates the heroic deeds of the soldier Timocritus, who fell in a clash with the Aetolians; the final elegiac couplet contains an explicit reference to the poet Tyrtaeus. Ever since Klaffenbach, its first editor, the epigram has been unanimously dated to the end of the third century BC on the basis of style and content: it has been conjectured that the military event is to be connected with one of the assaults that the Aetolians carried out against Acarnania during the Social War of 220‑217 BC. A re-examination of the text and a comparison with the other metrical inscriptions found on site and with the historical sources make it possible to advance an alternative dating proposal. According to Livy, the polis of Thyrreion was attacked again in 192‑191 BC by the Aetolian-Syrian troops of Antiochus III. In the absence of an autopsy of the stone or of images of the find, it therefore seems plausible to ascribe the inscription either to the end of the third century BC, linked to the events of the Social War of 220‑217 BC, or to the beginning of the second century BC, connected with the Aetolian-Syrian attack of 192‑191 BC.
Hanwool Lee, Jonghyun Choi, Sohyeon Kwon et al.
This paper presents our participation in the FinNLP-2023 shared task on multi-lingual environmental, social, and corporate governance issue identification (ML-ESG). The task's objective is to classify news articles based on the 35 ESG key issues defined by the MSCI ESG rating guidelines. Our approach focuses on the English and French subtasks, employing the CerebrasGPT, OPT, and Pythia models, along with zero-shot and GPT3Mix augmentation techniques. We utilize various encoder models, such as RoBERTa, DeBERTa, and FinBERT, subjecting them to knowledge distillation and additional training. Our approach yielded exceptional results, securing the first position in the English text subtask with an F1-score of 0.69 and the second position in the French text subtask with an F1-score of 0.78. These outcomes underscore the effectiveness of our methodology in identifying ESG issues in news articles across different languages. Our findings contribute to the exploration of ESG topics and highlight the potential of leveraging advanced language models for ESG issue identification.
Hannah Rose Kirk, Andrew M. Bean, Bertie Vidgen et al.
Human feedback is increasingly used to steer the behaviours of Large Language Models (LLMs). However, it is unclear how to collect and incorporate feedback in a way that is efficient, effective and unbiased, especially for highly subjective human preferences and values. In this paper, we survey existing approaches for learning from human feedback, drawing on 95 papers primarily from the ACL and arXiv repositories. First, we summarise the past, pre-LLM trends for integrating human feedback into language models. Second, we give an overview of present techniques and practices, as well as the motivations for using feedback; conceptual frameworks for defining values and preferences; and how feedback is collected and from whom. Finally, we encourage a better future of feedback learning in LLMs by raising five unresolved conceptual and practical challenges.
Maria Frangi
Elena Köstner
In 66 BC Cicero took over the legal representation of Cluentius Habitus minor, who had been accused of murdering his stepfather Oppianicus maior. This study examines strategies for coping with contingency and for future-oriented action with regard to wills and within the framework of a defence strategy in court, taking Pro A. Cluentio Habito as its starting point. If contingency can be understood as the possible or the alternative to existing reality, then strategies of coping with contingency and of future-oriented action are to be regarded as the kind of action that coordinates individual acts in advance, reduces options for action, and in this way makes future action more predictable. In this sense a will can be understood as the testator's plan for a future without him or her: the testator makes provisions and arrangements for family members and friends. Responding adequately to court proceedings likewise requires a plan in the sense of a strategy for refuting the arguments of the prosecution, presenting the accused person in the right light, and producing an alternative suspect. These two levels are inseparably interwoven in the court speech, since the accusation of the murder of Oppianicus maior placed the habitus of the persons involved particularly in focus. In various writings Cicero evokes the principle that, for him, the voluntas testatoris had to be fulfilled unconditionally, even when it did not always conform to the laws. Such a self-imposed principle offered orientation and at the same time testified to pietas towards the testator.
Ying Yin, Ivan Habernal
Pre-training large transformer models with in-domain data improves domain adaptation and helps gain performance on domain-specific downstream tasks. However, sharing models pre-trained on potentially sensitive data is prone to adversarial privacy attacks. In this paper, we ask to what extent we can guarantee the privacy of pre-training data and, at the same time, achieve better downstream performance on legal tasks without the need for additional labeled data. We extensively experiment with scalable self-supervised learning of transformer models under the formal paradigm of differential privacy and show that under specific training configurations we can improve downstream performance without sacrificing privacy protection for the in-domain data. Our main contribution is utilizing differential privacy for large-scale pre-training of transformer language models in the legal NLP domain, which, to the best of our knowledge, has not been addressed before.
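The standard mechanism for differentially private training of neural models is DP-SGD: clip each per-example gradient, average, and add calibrated Gaussian noise. The sketch below shows that core step in plain Python; the paper's exact training configuration (clip norm, noise multiplier, accounting) is not given here and the values are illustrative:

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng=random):
    """One differentially private gradient step (DP-SGD sketch): clip
    each per-example gradient to L2 norm `clip_norm`, sum, add Gaussian
    noise with std `noise_multiplier * clip_norm`, then average.
    Standard DP-SGD recipe; hyperparameters here are illustrative."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += g[i] * scale  # clipped contribution
    n = len(per_example_grads)
    sigma = noise_multiplier * clip_norm
    return [(summed[i] + rng.gauss(0.0, sigma)) / n for i in range(dim)]

# With noise_multiplier=0 the step reduces to plain clipped averaging:
g = dp_sgd_step([[3.0, 4.0]], clip_norm=1.0, noise_multiplier=0.0)
# g == [0.6, 0.8]: the gradient of norm 5 was rescaled to norm 1.
```

Clipping bounds each example's influence on the update, and the added noise masks the remainder, which is what makes the released model's parameters formally private with respect to the pre-training data.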
Klaus Freitag
Atharva Kulkarni, Amey Hengle, Rutuja Udyawar
The recent surge of complex attention-based deep learning architectures has led to extraordinary results in various downstream NLP tasks in the English language. However, such research for resource-constrained and morphologically rich Indian vernacular languages has been relatively limited. This paper proffers team SPPU_AKAH's solution for TechDOfication 2020 subtask-1f, which focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language. Leveraging the large dataset at hand, a hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification. Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models on the given task, giving the best validation accuracy of 89.57% and an f1-score of 0.8875. Furthermore, the solution resulted in the best system submission for this subtask, giving a test accuracy of 64.26% and an f1-score of 0.6157, transcending the performances of other teams as well as the baseline system given by the organizers of the shared task.
Page 27 of 72901