Simone Marques dos Santos Ribeiro, Renato Rodrigues-Pereira
Within the lexical universe of the Portuguese language, we highlight the learning of phraseological units (PUs), especially idiomatic expressions (IEs), since they constitute complex lexical units whose global meaning does not correspond to the sum of the individual components of the expression, thus potentially raising doubts for dictionary users. Considering the importance of the school dictionary as a functional teaching material in the process of teaching and learning IEs, we set out to develop this work. This article therefore aims to present the results of our analysis of the lemmatization given to IEs in two type-4 school dictionaries of the Programa Nacional do Livro Didático (PNLD/2012), namely Aulete (Geiger, 2011) and Unesp (Borba, 2011). To this end, guided by theoretical and methodological principles of Pedagogical Lexicography, Phraseology, and Phraseography, in particular the contributions of Corpas Pastor (1996), Santamaría Pérez (2000), Xatara (1998), Olímpio de Oliveira Silva (2011), Krieger (2011), Ortíz Álvarez (2000; 2011), Welker (2011), Rodrigues-Pereira (2020), among others, we established the following specific objectives: i) to examine the information about the registration of IEs in the front matter of the two dictionaries, in order to verify whether the information available in this part of the works is consistent with the entries in the macrostructure; ii) to verify whether variants and contour elements of the IEs are recorded and, if so, how they are presented in the works analyzed; and iii) to compare the lemmatization of the IEs recorded in the two lexicographic repertoires selected for analysis, with a view to identifying parameters for the registration of IEs. The analyses reveal heterogeneity in the lemmatization of the expressions, with different selection strategies for inserting the IEs into the microstructure.
Both dictionaries include variants and contour elements, but with distinct procedures, which affects the identification of fixed and variable components. Both the analyses carried out and the literature review indicate that the lexicographic treatment of IEs still requires further metalexicographic reflection, so as to enable lexicographic repertoires organized in a more didactic way.
Alzheimer's Disease (AD) dementia is a progressive neurodegenerative disease that negatively impacts patients' cognitive ability. Previous studies have demonstrated that changes in naturalistic language samples can be useful for early screening of AD dementia. However, the nature of language deficits often requires test administrators to use various speech elicitation techniques during spontaneous language assessments to obtain enough propositional utterances from dementia patients. This can lead to an "observer's effect" on the downstream analysis that has not been fully investigated. Our study seeks to quantify the influence of test administrators on linguistic features in dementia assessment using two English corpora: "Cookie Theft" picture description datasets collected at different locations, whose test administrators show different levels of involvement. Our results show that the level of test administrator involvement significantly impacts observed linguistic features in patient speech. These results suggest that many of the significant linguistic features in the downstream classification task may be partially attributable to differences in test administration practices rather than solely to participants' cognitive status. The variations in test administrator behavior can lead to systematic biases in linguistic data, potentially confounding research outcomes and clinical assessments. Our study suggests the need for a more standardized test administration protocol in the development of responsible clinical speech analytics frameworks.
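A minimal sketch of the kind of between-site feature comparison this abstract describes: compute Welch's t statistic for one linguistic feature (a hypothetical words-per-utterance measure with invented values) across two corpora whose administrators intervene to different degrees. The feature, numbers, and choice of test are illustrative assumptions, not the study's actual analysis.

```python
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means of two independent samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    se = (var_a / len(sample_a) + var_b / len(sample_b)) ** 0.5
    return (mean_a - mean_b) / se

# Hypothetical words-per-utterance values from two collection sites whose
# administrators show different levels of involvement (invented numbers).
low_involvement = [9.1, 8.7, 10.2, 9.5, 8.9, 9.8]
high_involvement = [6.4, 7.0, 6.8, 7.3, 6.1, 6.9]

t = welch_t(low_involvement, high_involvement)
```

A large |t| would indicate that the feature's distribution differs systematically between sites, i.e. a candidate administrator effect rather than a purely patient-driven signal.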
Sankalp Tattwadarshi Swain, Anshika Krishnatray, Dhruv Kumar
et al.
Existing evaluation studies on the linguistic competence of large language model (LLM) agents have focused primarily on vocabulary learning, morphological rule induction, syntactic generalization, pragmatic inference, and cross-linguistic transfer. However, none assess whether LLM agents can acquire a language through pattern recognition and interactive feedback, a central feature of human language acquisition. We propose a novel experimental framework in which an LLM agent is evaluated on its ability to acquire and use a newly constructed language (Tinkatongue) in conversation with a bot that understands only Tinkatongue. Our findings show that LLM agents fail to establish a conversation within 100 responses, yet they adopt distinct strategies that mirror human approaches to language learning. The results suggest a new direction for evaluation benchmarks and open pathways to model designs that learn more effectively from interactive feedback.
The aim of the paper is to conduct a brief theoretical and analytical review of the work of Russian, European, and Chinese linguists on political media discourse and the methodology of its study. The scientific novelty of this review lies in presenting the results of research on the various methods and techniques for studying political media discourse in Russian, European, and Chinese linguistics, which gives the fullest picture of the development and functioning of political communication. The review and analysis of scholarly works devoted to political media discourse in these three traditions confirm the relevance and novelty of the study of political communication; moreover, they provide a necessary basis for its comprehensive study. The review finds that the study of political communication, in particular its methodology and the peculiarities of the language of politics, receives considerable attention both globally and domestically. Hence there is a need for further interdisciplinary study of political communication, the development of discourse analysis, and the synthesis of knowledge already accumulated in political linguistics, together with its practical application, in order to improve the effectiveness of speech interaction not only in the political sphere but also in other spheres of social life.
In this section of the journal Ñawi we find research papers that address the fundamental principles of brand design and management. They reflect on the use of the many visual elements involved in constructing messages capable of conveying values and emotions. They also examine how a brand identity is communicated through different channels. Likewise, they explore the world of experiential branding, as brands seek to generate memorable and meaningful experiences for their potential audiences. On this occasion, we have the opportunity to read two interesting articles. One of them proposes an analysis of the visual representations of a Mexican beer brand, since the evolution of those representations reflects various modulations of Mexican identity. This shows that the visual components of a given commercial brand sometimes embody important features of cultural identity. In the other paper, we see that the process of creating a brand can also be put at the service of projects with a more social focus and local roots.
Drawing. Design. Illustration, Communication. Mass media
Inspired by cognitive neuroscience studies, we introduce a novel 'decoding probing' method that uses the minimal-pairs benchmark BLiMP to probe internal linguistic characteristics in neural language models layer by layer. By treating the language model as the 'brain' and its representations as 'neural activations', we decode grammaticality labels of minimal pairs from the intermediate layers' representations. This approach reveals: 1) Self-supervised language models capture abstract linguistic structures in intermediate layers that GloVe and RNN language models cannot learn. 2) Information about syntactic grammaticality is robustly captured through the first third of GPT-2's layers and is also distributed in later layers. As sentence complexity increases, more layers are required for learning grammatical capabilities. 3) Morphological and semantics/syntax interface-related features are harder to capture than syntax. 4) For Transformer-based models, both embeddings and attentions capture grammatical features, but with distinct patterns. Different attention heads exhibit similar tendencies toward various linguistic phenomena, but with varied contributions.
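The layer-by-layer decoding idea can be sketched in miniature: fit a linear probe on each layer's representations of minimal pairs and record a decoding accuracy per layer. The synthetic two-dimensional 'activations' and the perceptron probe below are stand-ins assumed for illustration; the study probes real language-model layers on BLiMP.

```python
import random

def train_linear_probe(reps, labels, epochs=50, lr=0.1):
    """Train a simple perceptron probe to decode binary labels from representations."""
    dim = len(reps[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(reps, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:
                step = lr if y == 1 else -lr
                w = [wi + step * xi for wi, xi in zip(w, x)]
                b += step
    return w, b

def probe_accuracy(w, b, reps, labels):
    """Fraction of items whose grammaticality label the probe decodes correctly."""
    hits = sum(
        1 for x, y in zip(reps, labels)
        if (1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == y
    )
    return hits / len(labels)

# Synthetic stand-in for per-layer activations over minimal pairs: grammatical
# items cluster around +1, ungrammatical items around -1 (invented data).
random.seed(0)
layers = {
    f"layer_{i}": (
        [[random.gauss(1, 0.3), random.gauss(1, 0.3)] for _ in range(50)]
        + [[random.gauss(-1, 0.3), random.gauss(-1, 0.3)] for _ in range(50)]
    )
    for i in range(3)
}
labels = [1] * 50 + [0] * 50

accuracies = {
    name: probe_accuracy(*train_linear_probe(reps, labels), reps, labels)
    for name, reps in layers.items()
}
```

Comparing `accuracies` across layers is the core of the method: layers from which grammaticality decodes well are taken to carry the corresponding linguistic information.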
Speech is a multiplexed signal displaying levels of complexity, organizational principles, and perceptual units of analysis at distinct timescales. This acoustic signal, critical for human communication, is thus characterized at distinct representational and temporal scales, related to distinct linguistic features, from acoustic to supra-lexical. This chapter presents an overview of experimental work devoted to the characterization of the speech signal at different timescales, beyond its acoustic properties. The functional relevance of these different levels of analysis for speech processing is discussed. We advocate that studying speech perception through the prism of multi-timescale representations effectively integrates work from various research areas into a coherent picture and contributes significantly to increasing our knowledge of the topic. Finally, we discuss how these experimental results fit with neural data and current dynamical models of speech perception.
We present LingGen, a controlled text generation model that allows fine-grained control over a large number of real-valued linguistic attributes. It encodes target attribute values with a dedicated linguistic attribute encoder and conditions the language model by injecting the resulting representation into the language model using the beginning-of-sequence (BOS) embeddings. To improve robustness when controlling different attribute subsets, we introduce P-MASKING, which samples per-example attribute masking rates from a truncated Pareto distribution during training. Across 1-40 control attributes, LingGen achieves the lowest average control error among evaluated methods, while remaining efficient at inference and receiving the highest fluency scores in human evaluation. Ablations show that Pareto-sampled masking and BOS-based injection are effective choices compared to alternative masking and integration variants.
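P-MASKING as described can be sketched as follows: draw a per-example masking rate from a Pareto distribution truncated to a bounded range, then mask each target attribute independently at that rate. The shape parameter, truncation bound, attribute names, and function names are illustrative assumptions; the abstract does not specify the paper's exact parameterization.

```python
import random

def p_masking_rate(alpha=2.0, upper=0.9):
    """Sample a per-example masking rate from a Pareto(alpha) distribution
    truncated to [0, upper] via rejection sampling."""
    while True:
        # paretovariate has support [1, inf); shift it to [0, inf).
        r = random.paretovariate(alpha) - 1.0
        if r <= upper:
            return r

def mask_attributes(attrs, rate):
    """Independently drop each target attribute with probability `rate`;
    dropped attributes are left uncontrolled (None)."""
    return {name: (None if random.random() < rate else value)
            for name, value in attrs.items()}

random.seed(0)
# Hypothetical real-valued linguistic control attributes (invented names/values).
attrs = {"avg_word_length": 4.2, "type_token_ratio": 0.61, "clause_depth": 2.0}
rate = p_masking_rate()
masked = mask_attributes(attrs, rate)
```

The Pareto shape concentrates mass near zero, so most training examples keep most attributes while a heavy tail occasionally masks many at once, which is the robustness-to-varying-subset behavior the model is trained for.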
Annika Marie Schoene, John E. Ortega, Rodolfo Joel Zevallos
et al.
Recent years have seen a marked increase in research that aims to identify or predict risk, intention, or ideation of suicide. The majority of new tasks, datasets, language models, and other resources focus on English and on suicide in the context of Western culture. However, suicide is a global issue, and reducing the suicide rate by 2030 is one of the key targets of the UN's Sustainable Development Goals. Previous work has translated English dictionaries related to suicide into different target languages due to the lack of other available resources. Naturally, this leads to a variety of ethical tensions (e.g. linguistic misrepresentation) where the discourse around suicide is not present in a particular culture or country. In this work, we introduce the 'Lexicography Saves Lives Project' to address this issue and make three distinct contributions. First, we outline ethical considerations and provide overview guidelines to mitigate harm in developing suicide-related resources. Next, we translate an existing dictionary related to suicidal ideation into 200 different languages and conduct human evaluations on a subset of the translated dictionaries. Finally, we introduce a public website to make our resources available and enable community participation.
This study investigates the use of hyperbole as a satirical device in English-language science fiction, examining its stylistic conveyance and translation into Russian. Employing linguo-stylistic analysis, translation analysis, and functional-linguistic component examination, various definitions of hyperbole are explored. The results indicate that hyperbole serves as an artistic tool for deliberate exaggeration, capturing the reader’s attention and emphasizing the author’s stance. Focusing on Harry Harrison’s science fiction novel “Bill, the Galactic Hero” and its two translations by V. P. Kovalevsky, the research scrutinizes hyperbole’s contribution to satirical effects and identifies two types of hyperbolic exaggerations: contextual and hidden. Hidden hyperbole is further expressed through stylistic contrast and enumeration. In the Russian translations, the satirical effect is largely preserved, albeit with some losses. The study underscores the significance of hyperbole in engaging readers’ critical thinking and directing their focus towards real-world issues within the science fiction narrative. For professional translators, recognizing and conveying hyperbole is crucial in accurately representing the author’s intended meaning.
Theory and practice of education, Philology. Linguistics
Dark patterns are deceptive and malicious properties of user interfaces that lead the end-user to do something different from intended or expected. While now a key topic in critical computing, most work has been conducted in Western contexts. Japan, with its booming app market, is a relatively uncharted context that offers culturally- and linguistically-sensitive differences in design standards, contexts of use, values, and language, all of which could influence the presence and expression of dark patterns. In this work, we analyzed 200 popular mobile apps in the Japanese market. We found that most apps had dark patterns, with an average of 3.9 per app. We also identified a new class of dark pattern: "Linguistic Dead-Ends" in the forms of "Untranslation" and "Alphabet Soup." We outline the implications for design and research practice, especially for future cross-cultural research on dark patterns.
One of the biggest challenges in Artificial Intelligence (AI) development and application is the lack of consideration for human enhancement as a cornerstone for its operationalization. Nor is there a universally accepted approach that guides best practices in this field. However, the behavioral science field offers suggestions on how to develop a sustainable and enriching relationship between humans and intelligent machines. This paper provides a three-level (micro, meso and macro) framework on how to humanize AI with the intention of enhancing human properties and experiences. It argues that humanizing AI will help make intelligent machines not just more efficient but will also make their application more ethical and human-centric. Suggestions to policymakers, organizations, and developers are made on how to implement this framework to fix existing issues in AI and create a more symbiotic relationship between humans and machines moving into the future.
Computational linguistics. Natural language processing, Electronic computers. Computer science
Recent advancements in neural end-to-end TTS models have shown high-quality, natural synthesized speech in a conventional sentence-based TTS. However, it is still challenging to reproduce similar high quality when a whole paragraph is considered in TTS, where a large amount of contextual information needs to be considered in building a paragraph-based TTS model. To alleviate the difficulty in training, we propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training. Three sub-modules, including linguistics-aware, prosody-aware and sentence-position networks, are trained together with a modified Tacotron2. Specifically, to learn the information embedded in a paragraph and the relations among the corresponding component sentences, we utilize linguistics-aware and prosody-aware networks. The information in a paragraph is captured by encoders and the inter-sentence information in a paragraph is learned with multi-head attention mechanisms. The relative sentence position in a paragraph is explicitly exploited by a sentence-position network. Trained on a storytelling audio-book corpus (4.08 hours), recorded by a female Mandarin Chinese speaker, the proposed TTS model demonstrates that it can produce rather natural and good-quality speech paragraph-wise. The cross-sentence contextual information, such as break and prosodic variations between consecutive sentences, can be better predicted and rendered than the sentence-based model. Tested on paragraph texts, of which the lengths are similar to, longer than, or much longer than the typical paragraph length of the training data, the TTS speech produced by the new model is consistently preferred over the sentence-based model in subjective tests and confirmed in objective measures.
We highlight the problem of the dream in the poetics of G. Gazdanov's prose. We consider the dream from the perspective of the pragmatics of creativity and self-knowledge, and examine its role in the architectonics of the text. We trace the dynamics of the development of the artistic representation of the dream as an intertextual phenomenon. In the light of literary associations with the works of A.S. Pushkin, F.M. Dostoevsky, and E. Poe, the dream in G. Gazdanov's works is defined as a development of the mythologeme of Narcissus in A. Gide's interpretation in his essay of the same name. We analyze the convergence of the dream theme with the delusiveness of the real world, and the influence of oneiric divinology on the narrative as a whole. The anarrative elements of the metanarrative that undermine the evidence of events are described as a system opposing the current situation to eternity. The author's position in the novels related to the Civil War and emigration is considered in terms of M.M. Bakhtin's work "Toward a Philosophy of the Act". The author's "outsideness" to the everyday world, delimiting the inverted world of violence from culture, manifests itself as "other" to the narrated world. The dream appears as a universal not only in terms of the poetics of possible fictional worlds, but also as a totality in the characters' development. The acquisition of reality becomes possible only in the writer's last works, in altruism as the "awakening" of the characters and in the striving toward the state of samadhi. Nirvana, as a mode of the author's "outsideness", "descends" into the characters' world as a result of a "movement of feelings" or "spiritual fire", the prevailing feature of G. Gazdanov's works. Thus the dream of delusiveness, as a generalizing formant of the metanarrative, weakens as one moves away from the "horrors of history" as the traumatic experience of participation in a war and the ensuing unsettled, "unnoticed" image of the author.
The polyphony of the voices of the author and the characters in the late novels forms a single narrative field, and here and now it acquires the status of reality as enlightenment or, in G. Gazdanov's thesaurus, "rebirth".
Topic classification systems on spoken documents usually consist of two modules: an automatic speech recognition (ASR) module to convert speech into text and a text topic classification (TTC) module to predict the topic class from the decoded text. In this paper, instead of using the ASR transcripts, the fusion of deep acoustic and linguistic features is used for topic classification on spoken documents. More specifically, a conventional CTC-based acoustic model (AM) using phonemes as output units is first trained, and the outputs of the layer before the linear phoneme classifier in the trained AM are used as the deep acoustic features of spoken documents. Furthermore, these deep acoustic features are fed to a phoneme-to-word (P2W) module to obtain deep linguistic features. Finally, a local multi-head attention module is proposed to fuse these two types of deep features for topic classification. Experiments conducted on a subset selected from Switchboard corpus show that our proposed framework outperforms the conventional ASR+TTC systems and achieves a 3.13% improvement in ACC.
English research articles (RAs) are an essential genre in academia, so attempts to employ NLP to assist the development of academic writing ability have received considerable attention in the last two decades. However, no study has employed feature engineering techniques to investigate the linguistic features of RAs with different academic impacts (i.e., papers with high/moderate citation counts published in journals with high/moderate impact factors). This study attempts to extract micro-level linguistic features in high- and moderate-impact journal RAs using feature engineering methods. We extracted 25 highly relevant features from the Corpus of English Journal Articles through feature selection methods. All papers in the corpus deal with COVID-19 medical empirical studies. The selected features were then validated for classification performance, in terms of consistency and accuracy, through supervised machine learning methods. Results showed that 24 linguistic features, such as the overlap of content words between adjacent sentences and the use of third-person pronouns, auxiliary verbs, tense, and emotional words, provide consistent and accurate predictions for journal articles with different academic impacts. Lastly, the random forest model proved the best fit for the relationship between these 24 features and journal articles with high and moderate impacts. These findings can inform academic writing courses and lay the foundation for developing automatic evaluation systems for L2 graduate students.
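The feature-selection step can be illustrated with a toy filter method: score each candidate feature by the absolute Pearson correlation between its values and a binary impact label, then keep the top k. The feature names and values below are invented; the study selected its 25 features from a larger set with its own selection methods.

```python
def feature_label_correlation(values, labels):
    """Pearson correlation between one feature column and binary labels."""
    n = len(values)
    mv, ml = sum(values) / n, sum(labels) / n
    cov = sum((v - mv) * (l - ml) for v, l in zip(values, labels))
    var_v = sum((v - mv) ** 2 for v in values)
    var_l = sum((l - ml) ** 2 for l in labels)
    if var_v == 0 or var_l == 0:
        return 0.0
    return cov / (var_v ** 0.5 * var_l ** 0.5)

def select_features(rows, labels, names, k):
    """Rank features by |correlation| with the label and keep the top k."""
    scored = sorted(
        names,
        key=lambda name: abs(feature_label_correlation(
            [row[name] for row in rows], labels)),
        reverse=True,
    )
    return scored[:k]

# Invented per-article feature rows; 1 = high-impact, 0 = moderate-impact.
rows = [
    {"third_person_pronouns": 8, "sentence_length": 21},
    {"third_person_pronouns": 7, "sentence_length": 19},
    {"third_person_pronouns": 2, "sentence_length": 20},
    {"third_person_pronouns": 1, "sentence_length": 22},
]
labels = [1, 1, 0, 0]
top = select_features(rows, labels, ["third_person_pronouns", "sentence_length"], k=1)
```

A selected subset like `top` would then be handed to a supervised classifier (a random forest in the study) to check consistency and accuracy of the prediction.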
R. Thomas McCoy, Erin Grant, Paul Smolensky
et al.
How do learners acquire languages from the limited data available to them? This process must involve some inductive biases - factors that affect how a learner generalizes - but it is unclear which inductive biases can explain observed patterns in language acquisition. To facilitate computational modeling aimed at addressing this question, we introduce a framework for giving particular linguistic inductive biases to a neural network model; such a model can then be used to empirically explore the effects of those inductive biases. This framework disentangles universal inductive biases, which are encoded in the initial values of a neural network's parameters, from non-universal factors, which the neural network must learn from data in a given language. The initial state that encodes the inductive biases is found with meta-learning, a technique through which a model discovers how to acquire new languages more easily via exposure to many possible languages. By controlling the properties of the languages that are used during meta-learning, we can control the inductive biases that meta-learning imparts. We demonstrate this framework with a case study based on syllable structure. First, we specify the inductive biases that we intend to give our model, and then we translate those inductive biases into a space of languages from which a model can meta-learn. Finally, using existing analysis techniques, we verify that our approach has imparted the linguistic inductive biases that it was intended to impart.
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS does not require manually annotated and complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully designed and well optimized so that it can implicitly extract useful linguistic features from the input features. In this paper we investigate under what conditions the neural sequence-to-sequence TTS can work well in Japanese and English, along with comparisons with deep neural network (DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline systems also use autoregressive probabilistic modeling and a neural vocoder. We investigated systems from three aspects: a) model architecture, b) model parameter size, and c) language. For the model architecture aspect, we adopt modified Tacotron systems that we previously proposed and their variants using an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we investigate two model parameter sizes. For the language aspect, we conduct listening tests in both Japanese and English to see if our findings can be generalized across languages. Our experiments suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should also use a powerful encoder when it takes characters as inputs, and c) the encoder still has room for improvement and needs an improved architecture to learn supra-segmental features more appropriately.
A standard approach to evaluating language models analyzes how models assign probabilities to valid versus invalid syntactic constructions (i.e. is a grammatical sentence more probable than an ungrammatical sentence). Our work uses ambiguous relative clause attachment to extend such evaluations to cases of multiple simultaneous valid interpretations, where stark grammaticality differences are absent. We compare model performance in English and Spanish to show that non-linguistic biases in RNN LMs advantageously overlap with syntactic structure in English but not Spanish. Thus, English models may appear to acquire human-like syntactic preferences, while models trained on Spanish fail to acquire comparable human-like preferences. We conclude by relating these results to broader concerns about the relationship between comprehension (i.e. typical language model use cases) and production (which generates the training data for language models), suggesting that necessary linguistic biases are not present in the training signal at all.
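The evaluation paradigm in the first sentence, comparing the probability a model assigns to each member of a minimal pair, can be sketched with a toy add-one-smoothed bigram model standing in for the RNN LMs (corpus and sentences invented for illustration):

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Train an add-one-smoothed bigram LM over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])                 # context counts
        bigrams.update(zip(toks[:-1], toks[1:]))   # transition counts
    return unigrams, bigrams, len(vocab)

def log_prob(model, sentence):
    """Log probability of a sentence under the smoothed bigram model."""
    unigrams, bigrams, v = model
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
        for a, b in zip(toks[:-1], toks[1:])
    )

# Tiny invented training corpus.
corpus = ["the dog barks", "the dogs bark", "a dog barks", "the cat sleeps"]
model = train_bigram(corpus)

# Minimal pair differing only in subject-verb agreement; the evaluation asks
# whether the model assigns higher probability to the grammatical member.
grammatical = "the dog barks"
ungrammatical = "the dog bark"
```

The paper's point is that this paradigm breaks down for ambiguous relative clause attachment, where both readings are grammatical and there is no "wrong" member to score lower.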