K. Wexler, P. Culicover
Results for "Language. Linguistic theory. Comparative grammar"
Showing 20 of ~4,432,503 results · from DOAJ, arXiv, Semantic Scholar, CrossRef
H. Goodluck
Yuchen Zhang, Ravi Shekhar, Haralambos Mouratidis
Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.
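The family-based sharing idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the language-to-family mapping, the `ConnectorBank` name, and the plain weight-matrix stand-in for a trained connector are all assumptions for the example.

```python
# Illustrative language -> family mapping (assumed for this sketch).
LANGUAGE_FAMILY = {
    "es": "romance", "it": "romance", "fr": "romance",
    "de": "germanic", "nl": "germanic", "sv": "germanic",
}

class ConnectorBank:
    """One lightweight connector per language *family* instead of per language."""

    def __init__(self, families, dim_in, dim_out):
        # Stand-in for a trainable projection: one weight matrix per family.
        self.connectors = {
            fam: [[0.0] * dim_out for _ in range(dim_in)] for fam in families
        }

    def for_language(self, lang):
        # Languages in the same family resolve to the same shared connector,
        # which is what reduces the parameter count.
        return self.connectors[LANGUAGE_FAMILY[lang]]
```

With six languages in two families, this bank holds two connectors rather than six; adding a language to a known family adds no parameters at all.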
Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar et al.
Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
Niccolò Scaffai
In this issue of Between, which focuses on post-apocalyptic narratives, the In Discussion column is dedicated to several recent novels that place images of disaster in the foreground or background, but always in relation to the realistic portrayal of the daily lives of individuals and communities. The works in question are The Deluge by Stephen Markley (2022), The Bee Sting by Paul Murray (2023) and What We Can Know by Ian McEwan (2025).
Zheng-Lin Lin, Yu-Fei Shih, Shu-Kai Hsieh
This paper investigates the utilization of Large Language Models (LLMs) for solving complex linguistic puzzles, a domain requiring advanced reasoning and adept translation capabilities akin to human cognitive processes. We explore specific prompting techniques designed to enhance the ability of LLMs to reason and elucidate their decision-making pathways, with a focus on Input-Output Prompting (IO), Chain-of-Thought Prompting (CoT), and Solo Performance Prompting (SPP). Utilizing datasets from the Puzzling Machine Competition and various Linguistics Olympiads, we employ a comprehensive set of metrics to assess the performance of GPT-4 0603, a prominent LLM, across these prompting methods. Our findings illuminate the potential of LLMs in linguistic reasoning and complex translation tasks, highlighting their capabilities and identifying limitations in the context of linguistic puzzles. This research contributes to the broader field of Natural Language Processing (NLP) by providing insights into the optimization of LLM applications for improved reasoning and translation accuracy, thereby enriching the ongoing dialogue in NLP advancements.
Alicia DeVrio, Myra Cheng, Lisa Egede et al.
Recent attention to anthropomorphism -- the attribution of human-like qualities to non-human objects or entities -- of language technologies like LLMs has sparked renewed discussions about potential negative impacts of anthropomorphism. To productively discuss the impacts of this anthropomorphism and in what contexts it is appropriate, we need a shared vocabulary for the vast variety of ways that language can be anthropomorphic. In this work, we draw on existing literature and analyze empirical cases of user interactions with language technologies to develop a taxonomy of textual expressions that can contribute to anthropomorphism. We highlight challenges and tensions involved in understanding linguistic anthropomorphism, such as how all language is fundamentally human and how efforts to characterize and shift perceptions of humanness in machines can also dehumanize certain humans. We discuss ways that our taxonomy supports more precise and effective discussions of and decisions about anthropomorphism of language technologies.
Enora Rice, Ali Marashian, Hannah Haynie et al.
Cross-lingual transfer learning is an invaluable tool for overcoming data scarcity, yet selecting a suitable transfer language remains a challenge. The precise roles of linguistic typology, training data, and model architecture in transfer language choice are not fully understood. We take a holistic approach, examining how both dataset-specific and fine-grained typological features influence transfer language selection for part-of-speech tagging, considering two different sources for morphosyntactic features. While previous work examines these dynamics in the context of bilingual biLSTMs, we extend our analysis to a more modern transfer learning pipeline: zero-shot prediction with pretrained multilingual models. We train a series of transfer language ranking systems and examine how different feature inputs influence ranker performance across architectures. Word overlap, type-token ratio, and genealogical distance emerge as top features across all architectures. Our findings reveal that a combination of typological and dataset-dependent features leads to the best rankings, and that good performance can be obtained with either feature group on its own.
Jarosław A. Chudziak, Michał Wawer
This paper presents ElliottAgents, a multi-agent system leveraging natural language processing (NLP) and large language models (LLMs) to analyze complex stock market data. The system combines AI-driven analysis with the Elliott Wave Principle to generate human-comprehensible predictions and explanations. A key feature is the natural language dialogue between agents, enabling collaborative analysis refinement. The LLM-enhanced architecture facilitates advanced language understanding, reasoning, and autonomous decision-making. Experiments demonstrate the system's effectiveness in pattern recognition and generating natural language descriptions of market trends. ElliottAgents contributes to NLP applications in specialized domains, showcasing how AI-driven dialogue systems can enhance collaborative analysis in data-intensive fields. This research bridges the gap between complex financial data and human understanding, addressing the need for interpretable and adaptive prediction systems in finance.
Zhivar Sourati, Farzan Karimi-Malekabadi, Meltem Ozcan et al.
Language is far more than a communication tool. A wealth of information - including but not limited to the identities, psychological states, and social contexts of its users - can be gleaned through linguistic markers, and such insights are routinely leveraged across diverse fields ranging from product development and marketing to healthcare. In four studies utilizing experimental and observational methods, we demonstrate that the widespread adoption of large language models (LLMs) as writing assistants is linked to notable declines in linguistic diversity and may interfere with the societal and psychological insights language provides. We show that while the core content of texts is retained when LLMs polish and rewrite texts, not only do they homogenize writing styles, but they also alter stylistic elements in a way that selectively amplifies certain dominant characteristics or biases while suppressing others - emphasizing conformity over individuality. By varying LLMs, prompts, classifiers, and contexts, we show that these trends are robust and consistent. Our findings highlight a wide array of risks associated with linguistic homogenization, including compromised diagnostic processes and personalization efforts, the exacerbation of existing divides and barriers to equity in settings like personnel selection where language plays a critical role in assessing candidates' qualifications, communication skills, and cultural fit, and the undermining of efforts for cultural preservation.
Yufei Gao, Jiaying Fei, Nuo Chen et al.
Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce "thin descriptions", they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing "thick descriptions". We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.
Jianfei Chen
The current study investigates the use of a dramatic text to teach spoken Chinese grammar in casual conversation. By analyzing the language used in The Teahouse (1957), a classic Chinese play, this study identifies features of spoken Chinese grammar with a focus on the distinct structures of functionally motivated casual conversation. Utilizing Cullen and Kuo’s categorization of spoken grammar (2007), the analysis offers a new perspective on spoken Chinese, highlighting the interpersonal nature of casual conversation and its implications for teaching spoken grammar. The findings indicate that The Teahouse can serve as a valuable resource for presenting advanced spoken Chinese to students, providing rich models and contextual knowledge. This suggests that incorporating the play as an authentic source of casual conversation in class may significantly enhance students’ understanding and production of spoken Chinese grammar in a meaningful way.
Claudia Cagninelli
This paper focuses on the circulation of digital discourses and their virality in relation to the influence of the technological environment and its affordances. The aim is to identify linguistic and (techno)discursive features that correspond to different degrees of discourse virality. The study is carried out on a thematic corpus of tweets, distinguishing four main levels of discourse virality. Adopting a discursive approach, this study analyzes the linguistic and discursive features that are typical of each level and that could influence the potential dimension of discourse virality.
Canaan Breiss, Alexis Ross, Amani Maina-Kilaas et al.
We propose an interactive approach to language learning that utilizes linguistic acceptability judgments from an informant (a competent language user) to learn a grammar. Given a grammar formalism and a framework for synthesizing data, our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies, asks the informant for a binary judgment, and updates its own parameters in preparation for the next query. We demonstrate the effectiveness of our model in the domain of phonotactics, the rules governing what kinds of sound-sequences are acceptable in a language, and carry out two experiments, one with typologically-natural linguistic data and another with a range of procedurally-generated languages. We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, and sometimes greater than, fully supervised approaches.
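One family of information-theoretic policies mentioned above can be sketched concretely. The sketch below (an assumption for illustration, not the paper's exact formulation) maintains a posterior over candidate grammars, each of which deterministically accepts or rejects a string, and queries the informant with the string whose accept/reject outcome the current posterior is most uncertain about:

```python
import math

def predictive_entropy(posterior, judgments):
    """Binary entropy of the model's predicted informant judgment.

    posterior: {grammar_id: probability}, summing to 1.
    judgments: {grammar_id: bool}, whether each grammar accepts the string.
    """
    p_accept = sum(p for g, p in posterior.items() if judgments[g])
    if p_accept in (0.0, 1.0):
        return 0.0  # outcome is already certain; the query is uninformative
    return -(p_accept * math.log2(p_accept)
             + (1 - p_accept) * math.log2(1 - p_accept))

def select_query(posterior, candidates, accepts):
    """Pick the candidate string with maximum predictive uncertainty.

    accepts[g][s] -> bool: grammar g's judgment of string s.
    """
    return max(candidates, key=lambda s: predictive_entropy(
        posterior, {g: accepts[g][s] for g in posterior}))
```

A string every surviving grammar agrees on scores zero and is never asked; a string that splits the posterior evenly scores the maximum of one bit, so each answer prunes as much of the hypothesis space as possible.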
Dang Nguyen, Jiuhai Chen, Tianyi Zhou
Large language models (LLMs), despite their breakthroughs on many challenging benchmark tasks, tend to generate verbose responses and lack controllability of output complexity, which human users usually prefer in practice. In this paper, we study how to precisely control multiple linguistic complexities of LLM output by finetuning using off-the-shelf data. To this end, we propose multi-control tuning (MCTune), which includes multiple linguistic complexity values of ground-truth responses as controls in the input for instruction tuning. We finetune LLaMA2-7B on the Alpaca-GPT4 and WizardLM datasets. Evaluations on widely used benchmarks demonstrate that our method not only substantially improves LLMs' multi-complexity controllability but also retains or even enhances the quality of the responses as a side benefit.
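The controls-in-the-input idea behind multi-control tuning can be sketched as a preprocessing step: complexity values measured on the ground-truth response are serialized into the instruction so the finetuned model learns to condition on them. The metric names and prompt template below are assumptions for illustration, not the paper's exact format.

```python
def complexity_controls(response):
    """Measure simple linguistic complexity proxies of a response (illustrative metrics)."""
    words = response.split()
    sentences = [s for s in response.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {
        "words_per_sentence": round(len(words) / max(len(sentences), 1), 1),
        "chars_per_word": round(sum(len(w) for w in words) / max(len(words), 1), 1),
    }

def build_training_input(instruction, response):
    """Prepend the response's complexity values as controls to the instruction."""
    ctrl = ", ".join(f"{k}={v}" for k, v in complexity_controls(response).items())
    return f"[controls: {ctrl}]\n{instruction}"
```

At inference time, a user would fill in desired control values instead of measured ones, steering the model toward shorter sentences or simpler words.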
Zhuang Qiu, Xufeng Duan, Zhenguang G. Cai
Large language models (LLMs) have demonstrated exceptional performance across various linguistic tasks. However, it remains uncertain whether LLMs have developed human-like fine-grained grammatical intuition. This preregistered study (https://osf.io/t5nes) presents the first large-scale investigation of ChatGPT's grammatical intuition, building upon a previous study that collected laypeople's grammatical judgments on 148 linguistic phenomena that linguists judged to be grammatical, ungrammatical, or marginally grammatical (Sprouse, Schütze, & Almeida, 2013). Our primary focus was to compare ChatGPT with both laypeople and linguists in the judgment of these linguistic constructions. In Experiment 1, ChatGPT assigned ratings to sentences based on a given reference sentence. Experiment 2 involved rating sentences on a 7-point scale, and Experiment 3 asked ChatGPT to choose the more grammatical sentence from a pair. Overall, our findings demonstrate convergence rates ranging from 73% to 95% between ChatGPT and linguists, with an overall point-estimate of 89%. Significant correlations were also found between ChatGPT and laypeople across all tasks, though the correlation strength varied by task. We attribute these results to the psychometric nature of the judgment tasks and the differences in language processing styles between humans and LLMs.
Nathan A. Chi, Teodor Malchev, Riley Kong et al.
We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems. Solving these puzzles necessitates inferring aspects of a language's grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, modeLing has no risk of appearing in the training data of existing AI systems: this ameliorates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability which cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that modeLing can be used to measure further progress in linguistic reasoning.
Mikail Bülbül
The standardization of a language involves several main stages, among them the selection of a base linguistic region, graphization, the codification of grammatical and lexical norms, and modernization. Preparing an orthographic, i.e. standard, dictionary is also one of the main tasks that falls within the codification work of a language. In this study we focus on the principles of preparing an orthographic dictionary for Kurmanji Kurdish. In Kurmanji Kurdish, as in many other languages and dialects, words have many variant forms. In preparing a standard dictionary, choosing one form from among the variants of a word is a task that is both important and delicate. For such work, the variants of words must first be collected from the different local dialects and a decision-making committee must be established. That committee must choose one variant for standard Kurmanji from among the variants of each word. In order to make its decisions, it must also establish decision-making criteria. The prevalence of a variant form, its commonality across dialects, the form of a borrowed word, and the Kurdishized form are important criteria for deciding on a variant. The aim of this study is to discuss and set out the main principles of an orthographic dictionary for a standard dictionary of Kurmanji Kurdish.
Mariangela Cerullo, Paolo Izzo
This paper presents the results of intensive research on the phenomenon of split intransitivity in Campanian varieties. The first part presents the phenomenon according to the different theoretical approaches considered, in order to analyse the considerable amount of data presented in the following sections. The second part is devoted to the central Campanian varieties, presenting data from the metropolitan area of Naples, Salerno and the hinterland. The peculiarity of the Acerno variety, which is more isolated than the others, leads us to present its data separately. The last part presents the data from the south of the region, the Cilento, divided into two areas that differ markedly from a sociolinguistic point of view. For each dialectal area, an attempt is made to identify the internal parameters that characterise variation in the choice of the auxiliary and, where necessary, the authors have recourse to sociolinguistic variation or to contact among different varieties that may have influenced the phenomenon.
Gabriel Borowski
Page 16 of 221,626