J. B. Lovins
Results for "Philology. Linguistics"
Showing 19 of ~794273 results · from arXiv, DOAJ, Semantic Scholar, CrossRef
J. Milroy, L. Milroy
Reem I. Masoud, Chen Feng, Shunta Asano et al.
The global deployment of large language models (LLMs) has raised concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain poorly understood. We adopt a dataset-centric view of cultural alignment and ask which linguistic properties of fine-tuning data are associated with cultural performance, whether these properties are predictive prior to training, and how these effects vary across models. We compute lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and apply principal component analysis separately within each language. This design ensures that the resulting components capture variation among datasets written in the same language rather than differences between languages. The resulting components correspond to broadly interpretable axes related to semantic coherence, surface-level lexical and syntactic diversity, and lexical or structural richness, though their composition varies across languages. We fine-tune three major LLM families (LLaMA, Mistral, DeepSeek) and evaluate them on benchmarks of cultural knowledge, values, and norms. While PCA components correlate with downstream performance, these associations are strongly model-dependent. Through controlled subset interventions, we show that lexical-oriented components (PC3) are the most robust, yielding more consistent performance across models and benchmarks, whereas emphasizing semantic or diversity extremes (PC1-PC2) is often neutral or harmful.
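The within-language PCA design described above can be sketched as follows. This is a minimal illustration with synthetic metric matrices, not the paper's actual metrics or pipeline: running PCA separately per language ensures the components capture variation among same-language datasets rather than between languages.

```python
import numpy as np

def pca_within_group(X, n_components=3):
    """Center the metric matrix and project onto its top principal components."""
    Xc = X - X.mean(axis=0)                      # center each metric column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # dataset scores on PC1..PCn

# Hypothetical metric vectors (rows = datasets, cols = linguistic metrics)
rng = np.random.default_rng(0)
datasets_by_language = {
    "Arabic":   rng.normal(size=(10, 5)),
    "Chinese":  rng.normal(size=(12, 5)),
    "Japanese": rng.normal(size=(8, 5)),
}

# PCA is run separately per language, so each set of components reflects
# within-language variation only.
scores = {lang: pca_within_group(X) for lang, X in datasets_by_language.items()}
```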
Erik J Schlicht
Consumption of misinformation can lead to negative consequences that impact the individual and society. To help mitigate the influence of misinformation on human beliefs, algorithmic labels providing context about content accuracy and source reliability have been developed. Since the linguistic features used by algorithms to estimate information accuracy can change across time, it is important to understand their temporal dynamics. This study therefore uses natural language processing to analyze PolitiFact statements spanning 2010 to 2024 and quantify how the sources and linguistic features of misinformation change between five-year time periods. The results show that statement sentiment has decreased significantly over time, reflecting a generally more negative tone in PolitiFact statements. Moreover, statements associated with misinformation exhibit significantly lower sentiment than accurate statements. Additional analysis shows that recent time periods are dominated by sources from online social networks and other digital forums, such as blogs and viral images, that contain high levels of misinformation with negative sentiment. In contrast, most statements during early time periods are attributed to individual sources (i.e., politicians) that are relatively balanced in accuracy ratings and contain statements with neutral or positive sentiment. Named-entity recognition shows that presidential incumbents and candidates are relatively more prevalent in statements containing misinformation, while US states tend to be present in accurate information. Finally, entity labels associated with people and organizations are more common in misinformation, while accurate statements are more likely to contain numeric entity labels, such as percentages and dates.
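The period-bucketing step of such an analysis can be sketched as below. The statements, sentiment scores, and labels here are invented for illustration; the study's actual data come from PolitiFact.

```python
from statistics import mean

# Hypothetical records: (year, sentiment score in [-1, 1], is_misinformation)
statements = [
    (2011, 0.3, False), (2013, -0.2, True), (2016, -0.4, True),
    (2018, 0.1, False), (2021, -0.6, True), (2023, -0.1, False),
]

def five_year_period(year, start=2010):
    """Bucket a year into a five-year period label, e.g. '2010-2014'."""
    lo = start + ((year - start) // 5) * 5
    return f"{lo}-{lo + 4}"

# Group statements by period, then compare mean sentiment of
# misinformation vs. accurate statements within each period.
by_period = {}
for year, sentiment, is_misinfo in statements:
    by_period.setdefault(five_year_period(year), []).append((sentiment, is_misinfo))

period_means = {
    period: {
        "misinfo": mean([s for s, m in rows if m]) if any(m for _, m in rows) else None,
        "accurate": mean([s for s, m in rows if not m]) if any(not m for _, m in rows) else None,
    }
    for period, rows in by_period.items()
}
```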
N. G. Krivulya
INTRODUCTION. As society changes, so do the forms and strategies of education, the challenges faced by educators, and the perception of their role. These changes are reflected not only in sociological research, statistical data, and media sources, but also in works of screen art. Examining the portrayal of teachers in animation is relevant in light of the transformation of their role, functions, and significance in society.
MATERIALS AND METHODS. The research material includes more than 50 domestic animated films and series created from 1927 to 2024. The study employs the methods of frequency and comparative analysis, as well as contextual, thematic, and iconographic approaches, which allow for examining the iconography of images and the frequency of use of specific details, comparing images, identifying constant and changing elements, and tracking the transformational changes in the portrayal of teachers in domestic animation.
RESULTS AND DISCUSSION. Throughout the 20th and 21st centuries, the perception of the teacher's image has changed, transforming from an impersonal and sacred figure to a caricatured and mocked one. The roles of the teacher have also evolved: from mentor to guide into a new world, and from knowledge-bearer and organizer of the learning process to supervisor and punisher of academic negligence. The gendered portrait of the teacher is changing: the image of a male teacher, dominant in animation until the 1970s, is gradually being replaced by that of a female teacher. In portraying teachers, animation tends to use various comedic techniques and intonations, ranging from gentle irony and parody to outright mockery.
CONCLUSION. In domestic animation, the teacher's portrait is presented schematically and is based on stereotypical representations. The image of the teacher as the main character of a story has not yet been developed. In films, the teacher is almost always confined to secondary or episodic roles. Modern Russian animation conveys negative perceptions of teachers, which is associated with the decline of the profession's authority in society.
R. Dale, H. Moisl, H. Somers
Contents: symbolic approaches to natural language processing; tokenisation and sentence segmentation; lexical analysis; parsing techniques; semantic analysis; discourse structure and intention recognition; natural language generation; intelligent writing assistance; database interfaces; information extraction; the generation of reports from databases; the generation of multimedia presentations; machine translation; dialogue systems - from theory to practice in TRAINS-96; empirical approaches to natural language processing; corpus creation for data-intensive linguistics; part-of-speech tagging; alignment; contextual word similarity; computing similarity; collocations; statistical parsing; authorship identification and computational stylometry; lexical knowledge acquisition; example-based machine translation; word-sense disambiguation; NLP based on artificial neural networks - introduction.
Guoliang Dong, Haoyu Wang, Jun Sun et al.
By training on text in various languages, large language models (LLMs) typically possess multilingual support and demonstrate remarkable capabilities in solving tasks described in different languages. However, LLMs can exhibit linguistic discrimination due to the uneven distribution of training data across languages: they struggle to keep responses consistent when the same task is described in different languages. In this study, we first explore the consistency of LLM outputs to queries in various languages from two aspects: safety and quality. We conduct this analysis with two datasets (AdvBench and NQ) based on four LLMs (Llama2-13b, Gemma-7b, GPT-3.5-turbo and Gemini-pro). The results show that LLMs exhibit stronger human alignment capabilities with queries in English, French, Russian, and Spanish (only 1.04% of harmful queries successfully jailbreak on average) compared to queries in Bengali, Georgian, Nepali and Maithili (27.7% of harmful queries jailbreak successfully on average). Moreover, for queries in English, Danish, Czech and Slovenian, LLMs tend to produce responses with a higher quality (0.1494 F1 score on average) compared to the other languages. Based on these findings, we propose LDFighter, a similarity-based voting mechanism, to mitigate linguistic discrimination in LLMs. LDFighter ensures consistent service for speakers of different languages. We evaluate LDFighter with both benign and harmful queries. The results show that LDFighter not only significantly reduces the jailbreak success rate but also improves response quality on average, demonstrating its effectiveness.
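LDFighter is described here only as similarity-based voting. A minimal sketch of that idea follows, using Jaccard token overlap as a stand-in for whatever similarity measure the method actually uses; the responses are hypothetical:

```python
def jaccard(a, b):
    """Token-overlap similarity between two responses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def similarity_vote(responses):
    """Pick the response most similar, on average, to all the others.
    An outlier response (e.g. a jailbroken answer in one language) gets
    low average similarity and is voted out."""
    def avg_sim(r):
        others = [x for x in responses if x is not r]
        return sum(jaccard(r, o) for o in others) / len(others)
    return max(responses, key=avg_sim)

# Responses to the same query asked in different languages (hypothetical)
responses = [
    "The capital of France is Paris",
    "Paris is the capital of France",
    "I cannot help with that request",
]
best = similarity_vote(responses)
```

The two consistent answers share all their tokens, so either dominates the vote, while the inconsistent refusal scores near zero similarity and is discarded.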
Priyanka Sukumaran, Conor Houghton, Nina Kazanina
Humans can learn a new word and infer its grammatical properties from very few examples. They have an abstract notion of linguistic properties like grammatical gender and agreement rules that can be applied to novel syntactic contexts and words. Drawing inspiration from psycholinguistics, we conduct a noun learning experiment to assess whether an LSTM and a decoder-only transformer can achieve human-like abstraction of grammatical gender in French. Language models were tasked with learning the gender of a novel noun embedding from a few examples in one grammatical agreement context and predicting agreement in another, unseen context. We find that both language models effectively generalise novel noun gender from one to two learning examples and apply the learnt gender across agreement contexts, albeit with a bias for the masculine gender category. Importantly, the few-shot updates were only applied to the embedding layers, demonstrating that models encode sufficient gender information within the word embedding space. While the generalisation behaviour of models suggests that they represent grammatical gender as an abstract category, like humans, further work is needed to explore the details of how exactly this is implemented. For a comparative perspective with human behaviour, we conducted an analogous one-shot novel noun gender learning experiment, which revealed that native French speakers, like language models, also exhibit a masculine gender bias and are likewise not excellent one-shot learners.
Nikola Ljubešić, Taja Kuzman
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, thereby covering the whole spectrum of official languages in the South Slavic language space. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million documents. The comparability of the corpora is ensured by a comparable crawling setup and the usage of identical crawling and post-processing technology. All the corpora were linguistically annotated with the state-of-the-art CLASSLA-Stanza linguistic processing pipeline, and enriched with document-level genre information via the Transformer-based multilingual X-GENRE classifier, which further enhances comparability at the level of linguistic annotation and metadata enrichment. The genre-focused analysis of the resulting corpora shows a rather consistent distribution of genres throughout the seven corpora, with variations in the most prominent genre categories being well explained by the economic strength of each language community. A comparison of the distribution of genre categories across the corpora indicates that web corpora from less developed countries primarily consist of news articles. Conversely, web corpora from economically more developed countries exhibit a smaller proportion of news content, with a greater presence of promotional and opinionated texts.
Артём Александрович Красин
The article analyses the functions of abbreviation on the material of the diaries of two educated Britons written in the mid-20th century: the diaries of George Orwell (Eric Arthur Blair), a writer, and of Charles Dalrymple Belgrave, a diplomat. Based on the studied material, the following functions of abbreviation have been identified and analysed: symbolization, encryption, linguistic economy, and the indication of familiarity. The diaries manifest these functions of abbreviation clearly due to the special pragmatics of this type of discourse (the focus on capturing real-life events, the fact that the author and the reader are the same person, the lack of expressive tools). At the same time, other functions, such as creating stylistic variability or a comic effect, are only weakly manifested in these personal diaries. However, the general characteristics of the functions of abbreviation in diary discourse support the conclusions formulated on this issue by scholars from the Department of English Linguistics, Faculty of Philology, Lomonosov Moscow State University.
Juuso Eronen, Michal Ptaszynski, Fumito Masui
We study the selection of transfer languages for different Natural Language Processing tasks, specifically sentiment analysis, named entity recognition and dependency parsing. In order to select an optimal transfer language, we propose to utilize different linguistic similarity metrics to measure the distance between languages and make the choice of transfer language based on this information instead of relying on intuition. We demonstrate that linguistic similarity correlates with cross-lingual transfer performance for all of the proposed tasks. We also show that there is a statistically significant difference in choosing the optimal language as the transfer source instead of English. This allows us to select a more suitable transfer language which can be used to better leverage knowledge from high-resource languages in order to improve the performance of language applications lacking data. For the study, we used datasets from eight different languages from three language families.
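Selecting a transfer source by aggregating linguistic distance metrics, rather than defaulting to English, might look like the following sketch. The languages, metric names, and distance values are hypothetical, not taken from the study:

```python
# Hypothetical per-metric distances (lower = more similar) between a
# low-resource target language and candidate high-resource transfer sources.
distances = {
    "English": {"syntactic": 0.7, "phonological": 0.6, "geographic": 0.5},
    "Spanish": {"syntactic": 0.3, "phonological": 0.4, "geographic": 0.3},
    "Russian": {"syntactic": 0.5, "phonological": 0.5, "geographic": 0.4},
}

def select_transfer_language(distances):
    """Pick the candidate whose mean distance across metrics is smallest."""
    mean = lambda d: sum(d.values()) / len(d)
    return min(distances, key=lambda lang: mean(distances[lang]))

best = select_transfer_language(distances)
```

With these toy numbers the closest candidate is chosen over English, mirroring the paper's finding that the optimal transfer source is often not the English default.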
A. Baikulova, A. Degaltseva
The article discusses the 85th anniversary of M. A. Kormilitsyna, Doctor of Philology and Professor of Saratov State University. A well-known scholar in Russia and abroad, Margarita Anatolyevna has devoted her whole life to scientific activity. Together with her teacher and friend, O. B. Sirotinina, M. A. Kormilitsyna was at the origin of the Saratov Linguistic School of the functioning of the Russian language. The article highlights the main stages of her scientific activity and presents her areas of scientific interest: syntax, stylistics and speech culture of the Russian language, colloquial speech, the language of the media, etc. All of Margarita Anatolyevna's scientific and educational activities are based on the principles of the Saratov Linguistic School, whose main methodological principles are an attentive and unbiased attitude to language and its facts, accuracy in their analysis, and a striving not for a superficial but for a deep understanding of the essence of linguistic phenomena. The article reveals the significance of the works of M. A. Kormilitsyna for modern linguistics: her works on colloquial speech, the language of the media, the effectiveness of communication, and overcoming communicative risks have elicited numerous responses in the scientific community. M. A. Kormilitsyna devotes much attention to editorial activities (work on the interuniversity collection of scientific papers Problems of Speech Communication, participation in the editorial board of the journal Media Linguistics, etc.). The article reveals not only the scientific but also the pedagogical activity of the scholar: students and teachers know Margarita Anatolyevna as a highly erudite, creative teacher. M. A. Kormilitsyna has headed the Department of the Russian Language, Speech Communication and Russian as a Foreign Language (the department's new name) for 25 years.
Colleagues respect Margarita Anatolyevna for her honesty, integrity, and friendly attitude towards people.
L. Shevchenko
The article examines the changing of scientific paradigms and strategies of analysis and the formation of new research areas in modern times. Current claims of Western epistemology regarding the transformation of the scientific worldview from traditional spheres of knowledge to the functional, communicative, and syncretic ones typical of the information age are commented upon. The logic of Jacqueline Russ's argument concerning the radical changes in modern science, commensurate with the development of humanity and societies, is clarified. The essence of intellectual processes in current linguistics is analyzed: since the end of the 19th century, a strong interest in the functional capabilities of language has formed in place of the traditional attention to the systemic and structural description of language. The functional resource of language is interpreted in the article as an immanent property revealed in stylistic modifications of genres and texts. At the same time, the stylistic polyphony of texts and the cognition of their communicative harmony with the purpose and objectives of communication give rise to new research directions in linguistics: media linguistics, political and business linguistics, linguistic imageology, etc. Linguistic imageology as a special research field within philology is still being formed as a systematic direction. Syncretic in essence, linguistic imageology determines the main parameters of the discipline (object, subject of analysis, strategies, tactics, methods of language activity, specific methods of text analysis, etc.) in discussions and ambiguous formulations. The article presents the first publication in this field, "Linguistic imageology: dictionary of terms and concepts", as a model and possible criterion for systematic research in linguistic imageology.
The edition conceptually forms the main categories and concepts of the direction and presents scientific definitions based on the researched locus of the functional and communicative language resource of the information world, correlated with other directions of the humanities. Keywords: media linguistics, linguistic imageology, integral links of linguistic imageology, strategies of linguistic imageology, tactics of linguistic imageology, information age, Jacqueline Russ. Information about the author: Shevchenko Larysa Ivanivna, Corresponding Member of NAS of Ukraine, Doctor of Philology, Professor; Head of the Department of Stylistics and Language Communication; Educational and Scientific Institute of Philology; Taras Shevchenko National University of Kyiv. ScopusID
Martin Haspelmath
Communicare
D. Nguyen, A. Seza Doğruöz, C. Rosé et al.
Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of “computational sociolinguistics” that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction, and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions used in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges.
Zachary Dair, Ryan Donovan, Ruairi O'Reilly
This work explores the effect of gender and linguistic-based vocal variations on the accuracy of emotive expression classification. Emotive expressions are considered from the perspective of spectral features in speech (Mel-frequency Cepstral Coefficient, Mel-spectrogram, Spectral Contrast). Emotions are considered from the perspective of Basic Emotion Theory. A convolutional neural network is utilised to classify emotive expressions in emotive audio datasets in English, German, and Italian. Vocal variations for spectral features are assessed by (i) a comparative analysis identifying suitable spectral features, (ii) the classification performance for mono-, multi- and cross-lingual emotive data and (iii) an empirical evaluation of a machine learning model to assess the effects of gender and linguistic variation on classification accuracy. The results showed that spectral features provide a potential avenue for improving emotive expression classification accuracy. Additionally, the accuracy of emotive expression classification was high within mono- and cross-lingual emotive data, but poor in multi-lingual data. Similarly, there were differences in classification accuracy between gender populations. These results demonstrate the importance of accounting for population differences to enable accurate speech emotion recognition.
Sagnik Ray Choudhury, Nikita Bhutani, Isabelle Augenstein
There have been many efforts to understand what grammatical knowledge (e.g., the ability to identify the part of speech of a token) is encoded in large pre-trained language models (LMs). This is done through 'Edge Probing' (EP) tests: supervised classification tasks that predict the grammatical properties of a span (e.g., whether it has a particular part of speech) using only the token representations coming from the LM encoder. However, most NLP applications fine-tune these LM encoders for specific tasks. Here, we ask: if an LM is fine-tuned, does the encoding of linguistic information in it change, as measured by EP tests? Specifically, we focus on the task of Question Answering (QA) and conduct experiments on multiple datasets. We find that EP test results do not change significantly when the fine-tuned model performs well, nor in adversarial situations where the model is forced to learn wrong correlations. Based on similar findings, some recent papers conclude that fine-tuning does not change linguistic knowledge in encoders, but they do not provide an explanation. We find that EP models themselves are susceptible to exploiting spurious correlations in the EP datasets. When this dataset bias is corrected, we do see an improvement in the EP test results as expected.
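An edge-probing classifier is just a lightweight supervised model trained on frozen encoder representations. A minimal sketch of the idea, with synthetic features standing in for LM token representations (the data and probe here are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for frozen LM token representations (X) and a binary
# span property to probe for (y), e.g. "is this token a noun".
X = rng.normal(size=(200, 16))
w_true = rng.normal(size=16)
y = (X @ w_true > 0).astype(float)

# The encoder is never updated: only the probe's own weights are trained,
# here a logistic-regression probe fit by plain gradient descent.
w = np.zeros(16)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))          # probe predictions
    w -= 0.1 * X.T @ (p - y) / len(y)       # logistic-loss gradient step

accuracy = float(((1 / (1 + np.exp(-(X @ w))) > 0.5) == y).mean())
```

High probe accuracy is taken as evidence that the property is linearly recoverable from the representations; the paper's caveat is that the probe itself can also exploit spurious correlations in the probing dataset.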
Minjin Choi, Sunkyung Lee, Eunseong Choi et al.
Automated metaphor detection is a challenging task to identify metaphorical expressions of words in a sentence. To tackle this problem, we adopt pre-trained contextualized models, e.g., BERT and RoBERTa. To this end, we propose a novel metaphor detection model, namely metaphor-aware late interaction over BERT (MelBERT). Our model not only leverages contextualized word representation but also benefits from linguistic metaphor identification theories to distinguish between the contextual and literal meaning of words. Our empirical results demonstrate that MelBERT outperforms several strong baselines on four benchmark datasets, i.e., VUA-18, VUA-20, MOH-X, and TroFi.