Results for "Philology. Linguistics"

Showing 20 of ~793,913 results · from arXiv, CrossRef, DOAJ, Semantic Scholar

arXiv Open Access 2026
Almost Clinical: Linguistic properties of synthetic electronic health records

Serge Sharoff, John Baker, David Francis Hunt et al.

This study evaluates the linguistic and clinical suitability of synthetic electronic health records in mental health. First, we describe the rationale and the methodology for creating the synthetic corpus. Second, we examine expressions of agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals, and Care plans) with the aim of understanding how LLMs grammatically construct medical authority and patient agency through linguistic choices. While LLMs produce coherent, terminology-appropriate texts that approximate clinical practice, systematic divergences remain, including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures. The results show both the potential and limitations of synthetic corpora for enabling large-scale linguistic research otherwise impossible with genuine patient records.

en cs.CL
arXiv Open Access 2025
MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Jaap Jumelet, Leonie Weissweiler, Joakim Nivre et al.

We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.

en cs.CL
arXiv Open Access 2025
Modeling Professionalism in Expert Questioning through Linguistic Differentiation

Giulia D'Agostino, Chung-Chi Chen

Professionalism is a crucial yet underexplored dimension of expert communication, particularly in high-stakes domains like finance. This paper investigates how linguistic features can be leveraged to model and evaluate professionalism in expert questioning. We introduce a novel annotation framework to quantify structural and pragmatic elements in financial analyst questions, such as discourse regulators, prefaces, and request types. Using both human-authored and large language model (LLM)-generated questions, we construct two datasets: one annotated for perceived professionalism and one labeled by question origin. We show that the same linguistic features correlate strongly with both human judgments and authorship origin, suggesting a shared stylistic foundation. Furthermore, a classifier trained solely on these interpretable features outperforms gemini-2.0 and SVM baselines in distinguishing expert-authored questions. Our findings demonstrate that professionalism is a learnable, domain-general construct that can be captured through linguistically grounded modeling.

en cs.CL
CrossRef Open Access 2025
Language representations of the promotion strategy: metadiscourse markers in academic articles on linguistics

Ruonan Zhao

The article is devoted to the study of metadiscourse markers as a tool for implementing the promotion strategy in Russian-language academic linguistic discourse. The aim of the work is to identify the features of the application and role of metadiscourse markers based on articles published in the leading Russian journal “Acta Linguistica Petropolitana”. The theoretical basis of the study was K. Hyland’s metadiscourse model, which views an academic text as a space for interaction between the author and the reader. The scientific novelty lies in the fact that a comprehensive analysis of the means of implementing the promotion strategy in the corpus of the specified linguistic journal is carried out for the first time, which makes it possible to identify trends characteristic of modern Russian academic rhetoric. By employing quantitative analysis, metadiscourse markers were classified and the most frequent linguistic means used by authors to attract the attention of readers, position the novelty and significance of their work, and present the results convincingly were identified. It has been proven that metadiscourse markers are an integral component of the promotion strategy, contributing to effective scientific communication.

CrossRef Open Access 2025
Sound (grapheme)-color associativity in financial branding: linguistic marketing or theoretical linguistics

Svetlana Sergeevna Shlyakhova

The purpose of the study is to establish the specifics of the congruence of sound (grapheme)-color associativity of verbal and color visual elements of the brand. The novelty of the research: for the first time, a brand is understood as a multimodal/creolized text, the effectiveness of which is related to the congruence of cross-modal correspondences in various modes of encoding brand information. For the first time, the congruence of the sound (grapheme)-color associativity of the brand name and the color of the brand logo is compared. The results showed that banks that have ceased to exist have the lowest degree of congruence of the sound (grapheme)-color associativity of verbal and visual brand elements (≈30%); operating banks have the highest degree (≈70%); financial pyramids occupy an intermediate position (≈62%). Statistical analysis suggests that these coincidences are not accidental. The rebranding of logos at the same time as the new name of the financial institution leads to increased congruence between the verbal and visual elements of the brand. The data obtained offer promising material for theoretical (sound symbolism) and applied (linguistic marketing) linguistics.

DOAJ Open Access 2025
"Oscillating" between Creativity and Routine? On the Bodily Performativity of the Creative in Everyday Foreign Language Teaching

Matthias Grein

The relationship between creativity and routine remains rather vague in publications on foreign language didactics and is described as oscillation. The article contrasts didactic expectations of creative, physically activating (literature) teaching with a practice-theoretical understanding of creativity, routine, and embodiment. On this view, the concepts are theoretically intertwined rather than opposed; empirical findings point to specifically late-modern expectations of creativity and to routines regarded as necessary for teaching. Using videographic examples from French lessons, the article shows how deeply creativity and routine are interwoven and how much their assessment depends on the perspective taken. Not every creative act in the classroom appears didactically desirable, yet practices that sometimes seem didactically questionable are functional within the institution of school and its constraints. At the same time, it becomes clear that subject-didactic norms remain a necessary evaluative standard for subject teaching. Overall, creativity appears less as a clear-cut didactic program than as a social necessity in everyday teaching, one that should be shaped didactically, since otherwise it emerges in didactically undesirable ways.

Philology. Linguistics
arXiv Open Access 2024
Swap distance minimization beyond entropy minimization in word order variation

Víctor Franco-Sánchez, Arnau Martí-Llobet, Ramon Ferrer-i-Cancho

Consider a linguistic structure formed by $n$ elements, for instance, subject, direct object and verb ($n=3$) or subject, direct object, indirect object and verb ($n=4$). We investigate whether the frequency of the $n!$ possible orders is constrained by two principles. First, entropy minimization, a principle that has been suggested to shape natural communication systems at distinct levels of organization. Second, swap distance minimization, namely a preference for word orders that require fewer swaps of adjacent elements to be produced from a source order. We present average swap distance, a novel score for research on swap distance minimization. We find strong evidence of pressure for entropy minimization and swap distance minimization with respect to a die rolling experiment in distinct linguistic structures with $n=3$ or $n=4$. Evidence with respect to a Polya urn process is strong for $n=4$ but weaker for $n=3$. We still find evidence consistent with the action of swap distance minimization when word order frequencies are shuffled, indicating that swap distance minimization effects are beyond pressure to reduce word order entropy.

en cs.CL, physics.soc-ph
arXiv Open Access 2024
CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean

Eunsu Kim, Juyoung Suk, Philhoon Oh et al.

Despite the rapid development of large language models (LLMs) for the Korean language, there remains an obvious lack of benchmark datasets that test the requisite Korean cultural and linguistic knowledge. Because many existing Korean benchmark datasets are derived from the English counterparts through translation, they often overlook the different cultural contexts. For the few benchmark datasets that are sourced from Korean data capturing cultural knowledge, only narrow tasks such as bias and hate speech detection are offered. To address this gap, we introduce a benchmark of Cultural and Linguistic Intelligence in Korean (CLIcK), a dataset comprising 1,995 QA pairs. CLIcK sources its data from official Korean exams and textbooks, partitioning the questions into eleven categories under the two main categories of language and culture. For each instance in CLIcK, we provide fine-grained annotation of which cultural and linguistic knowledge is required to answer the question correctly. Using CLIcK, we test 13 language models to assess their performance. Our evaluation uncovers insights into their performances across the categories, as well as the diverse factors affecting their comprehension. CLIcK offers the first large-scale comprehensive Korean-centric analysis of LLMs' proficiency in Korean culture and language.

en cs.CL
arXiv Open Access 2024
Linguistic Knowledge Can Enhance Encoder-Decoder Models (If You Let It)

Alessio Miaschi, Felice Dell'Orletta, Giulia Venturi

In this paper, we explore the impact of augmenting pre-trained Encoder-Decoder models, specifically T5, with linguistic knowledge for the prediction of a target task. In particular, we investigate whether fine-tuning a T5 model on an intermediate task that predicts structural linguistic properties of sentences modifies its performance in the target task of predicting sentence-level complexity. Our study encompasses diverse experiments conducted on Italian and English datasets, employing both monolingual and multilingual T5 models at various sizes. Results obtained for both languages and in cross-lingual configurations show that linguistically motivated intermediate fine-tuning has generally a positive impact on target task performance, especially when applied to smaller models and in scenarios with limited data availability.

en cs.CL
arXiv Open Access 2024
Filtered Corpus Training (FiCT) Shows that Language Models can Generalize from Indirect Evidence

Abhinav Patil, Jaap Jumelet, Yu Ying Chiu et al.

This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora with certain linguistic constructions filtered out from the training data, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.

en cs.CL, cs.AI
DOAJ Open Access 2024
Preservation of Health - constructive or destructive „Interference“

Ivana Marasović Šušnjara, Maja Vejić, Julija Vejić

According to the World Health Organization, health is one of the fundamental rights of every human being, regardless of race, religion, political beliefs, or economic and social status. Although we tend to take it for granted, health is not self-evident and is also conditioned by factors beyond our control. Today, many of the leading public health problems are shared globally. Various interventions are carried out to solve them, often without achieving the intended results. Improving and preserving health is a major challenge; it is the responsibility of the individual, the health system, and other participants, and the result of their interaction. But is the action of these participants coherent? Is their interaction constructive? What can be done to prevent destruction? These are just some of the questions that must be answered if interventions are to respond adequately to the challenges.

Communication. Mass media, Philosophy. Psychology. Religion
arXiv Open Access 2023
Analyzing Hong Kong's Legal Judgments from a Computational Linguistics point-of-view

Sankalok Sen

Analysis and extraction of useful information from legal judgments using computational linguistics was one of the earliest problems posed in the domain of information retrieval. Presently, several commercial vendors automate such tasks. However, a crucial bottleneck arises in the form of exorbitant pricing and a lack of resources for analyzing judgments handed down by Hong Kong's legal system. This paper attempts to bridge this gap by providing several statistical, machine learning, deep learning, and zero-shot learning based methods to effectively analyze legal judgments from Hong Kong's court system. The methods proposed consist of: (1) Citation Network Graph Generation, (2) PageRank Algorithm, (3) Keyword Analysis and Summarization, (4) Sentiment Polarity, and (5) Paragraph Classification, in order to extract key insights from individual judgments as well as groups of judgments. This makes the overall analysis of judgments in Hong Kong less tedious and more automated, so that insights can be extracted quickly using fast inferencing. We also benchmark our results against Large Language Models, making robust use of the HuggingFace ecosystem.

en cs.CL, cs.IR
arXiv Open Access 2023
Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition

Boqiang Zhang, Hongtao Xie, Yuxin Wang et al.

Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task. However, because they lack linguistic knowledge and information, recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized as the linguistic insensitive drift (LID) problem in this paper; (2) the visual feature is suboptimal for recognition in some vision-missing cases (e.g. occlusion). To address these issues, we propose a $\textbf{L}$inguistic $\textbf{P}$erception $\textbf{V}$ision model (LPV), which explores the linguistic capability of vision models for accurate text recognition. To alleviate the LID problem, we introduce a Cascade Position Attention (CPA) mechanism that obtains high-quality and accurate attention maps through step-wise optimization and linguistic information mining. Furthermore, a Global Linguistic Reconstruction Module (GLRM) is proposed to improve the representation of visual features by perceiving the linguistic information in the visual space, gradually converting visual features into semantically rich ones during the cascade process. Unlike previous methods, our method obtains SOTA results while keeping low complexity (92.4% accuracy with only 8.11M parameters). Code is available at https://github.com/CyrilSterling/LPV.

en cs.CV
arXiv Open Access 2023
Disentangling the Linguistic Competence of Privacy-Preserving BERT

Stefan Arnold, Nils Kemmerzell, Annika Schreiner

Differential Privacy (DP) has been tailored to address the unique challenges of text-to-text privatization. However, text-to-text privatization is known for degrading the performance of language models when trained on perturbed text. Employing a series of interpretation techniques on the internal representations extracted from BERT trained on perturbed pre-text, we intend to disentangle at the linguistic level the distortion induced by differential privacy. Experimental results from a representational similarity analysis indicate that the overall similarity of internal representations is substantially reduced. Using probing tasks to unpack this dissimilarity, we find evidence that text-to-text privatization affects the linguistic competence across several formalisms, encoding localized properties of words while falling short at encoding the contextual relationships between spans of words.

en cs.CL
CrossRef Open Access 2023
Theory of Mundane Linguistics: Systematic Review of the Current Research Direction

Mikhail Aleksandrovitch Kravchenko

This study presents the results of a scientific review of the theory of mundane linguistics, whose ontological grounds lie in the reflexivity of consciousness and the interpretive activity of native speakers. The aim of the study is to assess the current state of the theory of mundane linguistics and the possible prospects for developing this scientific field. The novelty of the study is conditioned by the analysis of the actual state of the theory and the identification of its development prospects. For the first time, the latest research in mundane linguistics theory is systematically reviewed and differentiated according to the problems studied and the peculiarities of its empirical basis; the prospects for developing this field are considered in light of a possible methodological breakthrough connected with the growing influence of neo-structuralist ideas. As a result, a prediction of the two-way development of mundane linguistics theory is formulated: the 'inertial' direction rests on the support and enlargement of the established scientific tradition, while the 'dynamic' direction implies a change of methodological framework following the birth of new ideas.

DOAJ Open Access 2023
Maghrebians (or Mauri) speak Magharibi, not Arabic

Abdou Elimam

After defeating and pushing aside the Carthaginians, the Romans named the territory of North Africa Mauretania and its population, Mauri. Later on, historians have had to reflect on the origin of the term Mauri used to designate the population, or Mauretania used to name the territory. Besides the signification of the Latin/Greek word « mauri », which means « black », it is admitted as well that the word Mauri is a transcription, into the Latin alphabet, of a Punic word meaning « the west » or « the westerners ». If the latter meaning is 2000 years old, the other attribution is relatively modern and suspected of ideological biases. The Punic word maġaribis, transcribed mauri/ma'ari/mahauri, really makes sense since it means the west in Punic, the 15-century-long North African lingua franca.

Received: 11 February 2023
Accepted: 21 February 2023
Published: 20 March 2023

Language. Linguistic theory. Comparative grammar, Communication. Mass media
arXiv Open Access 2022
Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings

Yiren Jian, Chongyang Gao, Soroush Vosoughi

Semantic representation learning for sentences is an important and well-studied problem in NLP. The current trend for this task involves training a Transformer-based sentence encoder through a contrastive objective with text, i.e., clustering sentences with semantically similar meanings and scattering others. In this work, we find the performance of Transformer models as sentence encoders can be improved by training with multi-modal multi-task losses, using unpaired examples from another modality (e.g., sentences and unrelated image/audio data). In particular, besides learning by the contrastive loss on text, our model clusters examples from a non-linguistic domain (e.g., visual/audio) with a similar contrastive loss at the same time. The reliance of our framework on unpaired non-linguistic data makes it language-agnostic, enabling it to be widely applicable beyond English NLP. Experiments on 7 semantic textual similarity benchmarks reveal that models trained with the additional non-linguistic (images/audio) contrastive objective lead to higher quality sentence embeddings. This indicates that Transformer models are able to generalize better by doing a similar task (i.e., clustering) with unpaired examples from different modalities in a multi-task fashion.

en cs.CL
arXiv Open Access 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan et al.

We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.

en cs.CL, cs.CV

Page 22 of 39,696