Abstract: As living standards improve, tourism users increasingly expect high-quality, personalized services, yet current tourism recommenders suffer from problems such as a lack of diversity in planned routes. To address these issues, this study constructs a tourism recommendation model that integrates multiple algorithms and heterogeneous data sources. The model combines an interactive genetic algorithm with the marine predator algorithm for guided scenic-spot search, and then applies point cloud technology for scenic-spot feature extraction. The experimental results show that the hybrid algorithm outperforms the marine predator algorithm, the grey wolf optimization algorithm, and a deep learning optimization algorithm in recall, image recognition accuracy, and route revenue, with values of 97.88%, 98.12%, and 0.94, respectively. In addition, empirical analysis of the constructed attraction recommendation model finds that the normalized discounted cumulative gain generally lies between 0.8 and 0.9, and the recommendation module's predictions match actual outcomes, providing users with a good experience. These results indicate that the proposed tourism recommendation model predicts accurately and extracts scenic-spot feature information effectively, which benefits attraction recommendation. This study can help translate the actual needs of future tourism users and further enhance intelligent tourism service platforms.
Computational linguistics. Natural language processing, Electronic computers. Computer science
Abstract: The use of SMS (Short Message Service) has become widespread among students at the University of Kinshasa, particularly those in the Faculty of Letters and Human Sciences. This fast, informal mode of communication has a significant impact, both positive and negative, on their language learning. On the positive side, texting encourages daily writing practice, which can strengthen certain linguistic skills, such as the ability to synthesize information, to express oneself concisely, and to enrich the vocabulary used in informal contexts. Students also develop a certain linguistic creativity through abbreviations, neologisms, and codes specific to their community. However, excessive and unsupervised use of SMS can erode linguistic norms: spelling mistakes, approximate grammar, and excessive simplification of language can carry over into academic writing. This phenomenon raises the question of the balance between linguistic adaptation to new technologies and respect for the rules of normative written language. In short, SMS is a double-edged linguistic tool: it can support informal language learning, but it requires pedagogical guidance to limit its negative effects on students' mastery of academic French.
Keywords: Communication, Pragmatics, Simplification, Creativity, Motivation
Arts in general, Computational linguistics. Natural language processing
Koena Ronny Mabokela, Tim Schlippe, Mpho Raborife et al.
Sentiment analysis is a helpful task for automatically analysing opinions and emotions on various topics in areas such as AI for Social Good, AI in Education, or marketing. While many sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources such as text labelled with corresponding sentiment classes. One reason is that manually labelling text data is time-consuming and expensive, so automatic, rapid processes are needed to reduce the manual effort and make labelling as efficient as possible. In this paper, we present and analyze an automatic, language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted on tweets in English, Sepedi, and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach labels the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
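The core labelling idea — propagate sentiment from unambiguous emojis and lexicon words to the whole tweet, and leave ambiguous tweets for manual correction — can be sketched as follows; the mini-lexicons here are illustrative stand-ins, not SAfriSenti's actual resources:

```python
# Hypothetical mini-lexicons; a real system would use full, per-language lists
# of sentiment-bearing words plus a language-independent emoji lexicon.
POS = {"😍", "😂", "❤️", "love", "great", "happy"}
NEG = {"😡", "😢", "💔", "hate", "terrible", "sad"}

def auto_label(tweet):
    """Language-independent labelling from sentiment-bearing emojis/words."""
    tokens = tweet.lower().split()
    pos = sum(t in POS for t in tokens)
    neg = sum(t in NEG for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None  # ambiguous: left for manual annotation
```

Tweets that the heuristic cannot decide fall through to human annotators, which is how the manual effort shrinks to correcting only the mislabelled fraction.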
Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To systematically investigate it, we introduce SURGE, a comprehensive benchmark with $1160$ problems covering $8$ key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes. The benchmark and evaluation framework are available at https://github.com/Imbernoulli/SURGE.
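A surrogate-evaluation harness of the kind SURGE implies can be sketched in a few lines: the ground truth is real execution, and any predictor (an LLM in the benchmark, a mock function here) is scored on exact output match. The helper names are illustrative, not SURGE's actual API:

```python
import io
import contextlib

def run_code(code: str) -> str:
    """Ground truth: actually execute the snippet and capture its stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def surrogate_accuracy(predict, problems):
    """Fraction of snippets whose predicted output matches real execution."""
    hits = sum(predict(code) == run_code(code) for code in problems)
    return hits / len(problems)
```

In the benchmark setting, `predict` would wrap an LLM prompted to simulate execution; the point of the surrogate is that `predict` is much cheaper than `run_code` for high-cost programs.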
Jean-Philippe Corbeil, Asma Ben Abacha, George Michalopoulos et al.
Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks - structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations - remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.
Nuria Rodríguez-Barroso, Eugenio Martínez Cámara, Jose Camacho Collados et al.
Abstract The annotation of ambiguous or subjective NLP tasks is usually addressed by various annotators. In most datasets, these annotations are aggregated into a single ground truth. However, this omits divergent opinions of annotators, hence missing individual perspectives. We propose FLEAD (Federated Learning for Exploiting Annotators’ Disagreements), a methodology built upon federated learning to independently learn from the opinions of all the annotators, thereby leveraging all their underlying information without relying on a single ground truth. We conduct an extensive experimental study and analysis in diverse text classification tasks to show the contribution of our approach with respect to mainstream approaches based on majority voting and other recent methodologies that also learn from annotator disagreements.
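The FLEAD idea — one federated client per annotator, with a shared global model aggregated by weight averaging — can be sketched with plain logistic regression and FedAvg. The toy data, client count, and hyperparameters below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_update(w, X, y, lr=0.5, steps=10):
    """A client's local logistic-regression gradient steps on its own labels."""
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w = w - lr * grad
    return w

# Toy task: the true label is 1 when the first feature is positive.
X = rng.normal(size=(300, 2))
y_true = (X[:, 0] > 0).astype(float)

# Three annotators share the texts (features) but disagree on some labels:
# each one flips a different random 10% of the labels.
clients = []
for _ in range(3):
    y = y_true.copy()
    flip = rng.random(len(y)) < 0.1
    y[flip] = 1 - y[flip]
    clients.append(y)

w = np.zeros(2)
for _ in range(20):                      # federated rounds
    local_ws = [local_update(w.copy(), X, y) for y in clients]
    w = np.mean(local_ws, axis=0)        # FedAvg aggregation

preds = (sigmoid(X @ w) > 0.5).astype(float)
acc = (preds == y_true).mean()
```

No single ground truth is ever materialized: each client trains on its annotator's labels, and only the averaged weights carry the aggregated signal.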
Abstract: Multidimensional and polyphonic, the city offers pedestrians and motorists a host of written forms: advertising posters, tags and graffiti, shop signs, road signage, the toponymy of streets and public places. Disseminated on a wide variety of licit or illicit media, these writings constitute urban mass communication, characterized by diverse themes and stakes depending on their licit or illicit, institutional or private nature. They also contribute to the planning of urban space by meeting the many social, economic, and political needs of the city's inhabitants. They exploit and combine varied codes, such as linguistic text and the polysemic image, and they fulfil several functions: referential, expressive, conative, and aesthetic.
Keywords: Urban writing – Images – Semiology – Advertising – Communication
Arts in general, Computational linguistics. Natural language processing
Abstract: In engineering fields, the time-varying matrix inversion (TVMI) problem is often encountered, and the zeroing neural network (ZNN) has been extensively employed to solve it. Nevertheless, the original ZNN (OZNN) and the integral-enhanced ZNN (IEZNN) usually fail on the TVMI problem under unbounded noises, such as linear noises, so a neural network model that can handle TVMI under linear-noise interference is urgently needed. This paper develops a double integral-enhanced ZNN (DIEZNN) model based on a novel integral-type design formula with inherent linear-noise tolerance, and its convergence and robustness are rigorously proven. For comparison and verification, the OZNN and IEZNN models are also used to solve the TVMI problem under identical noise environments. The experiments show that the DIEZNN model has clear advantages in solving TVMI problems under linear noises: its errors remain below 1 × 10⁻³, whereas the error norms of the OZNN and IEZNN models do not converge to zero. To our knowledge, the DIEZNN model is proposed here for the first time. In addition, these models are applied to the control of a controllable permanent magnet synchronous motor chaotic system to demonstrate the superiority of the DIEZNN.
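For context, the baseline (noise-free) ZNN dynamics that DIEZNN builds on can be simulated directly. With error E(t) = A(t)X(t) − I and the design formula Ė = −γE, substituting X ≈ A⁻¹ yields the implementable dynamics Ẋ = −XȦX − γX(AX − I). A minimal Euler-integrated sketch with an illustrative A(t) (not one from the paper):

```python
import numpy as np

def A(t):
    # Illustrative well-conditioned time-varying matrix.
    return np.array([[2 + np.sin(t), np.cos(t)],
                     [-np.cos(t), 2 + np.sin(t)]])

def A_dot(t):
    # Analytical time derivative of A(t).
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t), np.cos(t)]])

gamma, dt, T = 10.0, 1e-3, 5.0
X = np.linalg.inv(A(0.0))              # start at the true inverse at t = 0
for k in range(int(T / dt)):
    t = k * dt
    E = A(t) @ X - np.eye(2)           # tracking error A(t)X - I
    # ZNN update: tracking term -X A_dot X plus error-decay term -gamma X E
    X = X + dt * (-X @ A_dot(t) @ X - gamma * X @ E)
residual = np.linalg.norm(A(T) @ X - np.eye(2))
```

The DIEZNN adds double-integral terms to this design formula so that the residual also stays bounded under linear noise, which the plain dynamics above cannot guarantee.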
Computational linguistics. Natural language processing, Computer software
Abstract: This study falls within the field of morphological and semantic analysis of the Quran, using Surah Nuh as a case study. Its objective is to explore the role and implications of morphological patterns in the surah to uncover the intended Quranic meanings. The study's findings reveal that past tense verbs are used to highlight the rejection of Nuh’s message by his people and to emphasize the abundance of Allah's blessings. Present tense verbs are employed to call for piety and repentance, while imperative verbs are linked to Allah’s command to Nuh to warn his people. Additionally, derived nouns and forms of exaggeration reflect meanings of permanence and continuity, highlighting the characteristics of Nuh’s people such as disbelief and injustice.
Keywords: Morphological Structure, Semantics, Quran, Surah Nuh.
Arts in general, Computational linguistics. Natural language processing
The alternation between syndetic (je pense qu’il ne va pas venir) and asyndetic (je pense Ø c’est à onze heures) sequences has been quantitatively studied in English (Jaeger, 2010) and in Quebec French (Liang et al., 2021), but not in metropolitan French. Thus, focusing on the verbs croire and penser gathered under the “weak verbs” category, this article presents a quantitative study conducted on two spoken French corpora: MPF (“Multicultural Paris French”) and CEFC (“Corpus d’étude du français contemporain”). 1,249 occurrences of these verbs were manually annotated for a set of linguistic (person, subject category, presence of negation, etc.) and non-linguistic (communicative proximity/distance) factors. Using a logistic regression model, the significance of a set of heterogeneous factors in the choice of construction is established. This article focuses on the communication factor: interactions characterized by proximity and connivance display a greater proportion of asyndetic constructions.
Philology. Linguistics, Computational linguistics. Natural language processing
This monograph presents a theoretical background and a broad introduction to the Min-Max Framework for Majorization-Minimization (MM4MM), an algorithmic methodology for solving minimization problems by formulating them as min-max problems and then employing majorization-minimization. The monograph lays out the mathematical basis of the approach used to reformulate a minimization problem as a min-max problem. With the prerequisites covered, including multiple illustrations of the formulations for convex and non-convex functions, this work serves as a guide for developing MM4MM-based algorithms for solving non-convex optimization problems in various areas of signal processing. As special cases, we discuss using the majorization-minimization technique to solve min-max problems encountered in signal processing applications and min-max problems formulated using the Lagrangian. Lastly, we present detailed examples of using MM4MM in ten signal processing applications such as phase retrieval, source localization, independent vector analysis, beamforming and optimal sensor placement in wireless sensor networks. The devised MM4MM algorithms are free of hyper-parameters and enjoy the advantages inherited from the use of the majorization-minimization technique such as monotonicity.
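The majorization-minimization step at the heart of MM4MM is easy to see on a scalar example: to minimize f(x) = (x − 3)² + 2|x|, majorize |x| at the current iterate via |x| ≤ x²/(2|x_k|) + |x_k|/2 and minimize the resulting quadratic surrogate in closed form. This toy problem is my own illustration, not one of the monograph's applications:

```python
def mm_minimize(x0=1.0, iters=60):
    """MM for f(x) = (x - 3)**2 + 2*abs(x).

    Majorizer of |x| at x_k (x_k != 0): x**2 / (2*|x_k|) + |x_k| / 2.
    Setting the surrogate's derivative 2*(x - 3) + 2*x/|x_k| to zero
    gives the closed-form update below.
    """
    x = x0
    for _ in range(iters):
        x = 3.0 * abs(x) / (abs(x) + 1.0)
    return x
```

The iterates decrease f monotonically (the hallmark of MM) and converge to the true minimizer x* = 2, where the subgradient condition 2(x − 3) + 2 = 0 holds.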
Parkinson's disease is the second most prevalent neurodegenerative disorder, with over ten million active cases worldwide and one million new diagnoses per year. Detecting and subsequently diagnosing the disease is challenging because of symptom heterogeneity in complexity as well as in the type and timing of phenotypic manifestations. Language impairment can typically present in the prodromal phase and precede motor symptoms, suggesting that a linguistics-based approach could serve as a diagnostic method for incipient Parkinson's disease. Additionally, improved linguistic models may enhance other approaches through ensemble techniques. The field of large language models is advancing rapidly, presenting the opportunity to explore the use of these new models for detecting Parkinson's disease and to improve on current linguistic approaches with high-dimensional representations of language. We evaluate the application of state-of-the-art large language models to detect Parkinson's disease automatically from spontaneous speech, achieving up to 73% accuracy.
Evaluating large language models (LLMs) on their linguistic reasoning capabilities is an important task to understand the gaps in their skills that may surface during large-scale adoption. In this work, we investigate the abilities of such models to perform abstract multilingual reasoning through the lens of linguistic puzzles on extremely low-resource languages. As these translation tasks involve inductive and deductive reasoning from reference instances, we examine whether diverse auxiliary demonstrations can be automatically induced from seed exemplars, through analogical prompting. We employ a two-stage procedure, first generating analogical exemplars with a language model, and then applying them in-context along with provided target language exemplars. Our results on the modeLing dataset show that analogical prompting is effective in eliciting models' knowledge of language grammar similarities, boosting the performance of GPT-4o by as much as 8.1% and Llama-3.1-405B-Instruct by 5.9% over chain-of-thought approaches. These gains are attributable to the analogical demonstrations, both when self-generated as well as when produced by weaker multilingual models. Furthermore, we demonstrate that our method generalizes to other tasks present in Linguistics Olympiad competitions, achieving sizable improvements across all problem types and difficulty levels included in the LINGOLY dataset with GPT-4o. We also report several findings about interesting phenomena which drive linguistic reasoning performance, suggesting that such puzzles are a valuable benchmark for new reasoning methods.
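The two-stage procedure reduces to prompt construction around an arbitrary model call; in the sketch below, `generate` stands in for any LLM and the prompt wording is hypothetical:

```python
def analogical_prompt(generate, seed_exemplars, puzzle):
    """Two-stage analogical prompting for linguistic puzzles.

    Stage 1: ask a model to induce auxiliary demonstrations analogous
    to the seed exemplars. Stage 2: place both the self-generated
    auxiliary exemplars and the target-language exemplars in-context
    before the puzzle to solve.
    """
    aux = generate(
        "Produce translation exemplars from languages with grammar "
        "similar to these:\n" + "\n".join(seed_exemplars)
    )
    return (
        "Auxiliary exemplars:\n" + aux + "\n"
        "Target-language exemplars:\n" + "\n".join(seed_exemplars) + "\n"
        "Solve:\n" + puzzle
    )
```

The auxiliary demonstrations can come from the solver model itself or from a weaker multilingual model, which is why the abstract reports gains in both settings.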
Speculative decoding is a powerful technique that attempts to circumvent the autoregressive constraint of modern Large Language Models (LLMs). The aim of speculative decoding techniques is to improve the average inference time of a large, target model without sacrificing its accuracy, by using a more efficient draft model to propose draft tokens which are then verified in parallel. The number of draft tokens produced in each drafting round is referred to as the draft length and is often a static hyperparameter chosen based on the acceptance rate statistics of the draft tokens. However, a static draft length can hurt performance, especially in scenarios where drafting is expensive and the number of accepted tokens has high variance. Adaptive Entropy-based Draft Length (AdaEDL) is a simple, training- and parameter-free criterion that allows early stopping of the token drafting process by approximating a lower bound on the expected acceptance probability of the drafted token based on the currently observed entropy of the drafted logits. We show that AdaEDL consistently outperforms static draft-length speculative decoding by 10%-57%, as well as other training-free draft-stopping techniques by up to 10%, in a variety of settings and datasets. At the same time, we show that AdaEDL is more robust than these techniques and preserves performance in high-sampling-temperature scenarios. Since it is training-free, in contrast to techniques that rely on training dataset-specific draft-stopping predictors, AdaEDL can seamlessly be integrated into a variety of pre-existing LLM systems.
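The stopping rule can be illustrated with a toy entropy check. The bound form 1 − λ√H and the constants below are a hypothetical instantiation in the spirit of AdaEDL, not the paper's exact expression:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a drafted-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def keep_drafting(probs, lam=0.5, threshold=0.4):
    """Continue drafting while an entropy-based estimate of a lower
    bound on the token's acceptance probability stays above threshold.

    A confident (low-entropy) draft distribution suggests the target
    model will likely accept the token, so drafting continues; a
    diffuse distribution triggers early stopping of the round.
    """
    est_acceptance = 1.0 - lam * math.sqrt(entropy(probs))
    return est_acceptance >= threshold
```

Because the rule only reads the draft model's already-computed logits, it adds no training and essentially no runtime cost, matching the "training- and parameter-free" claim.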
With the development of large language models (LLMs), social biases in these LLMs have become a pressing issue. Although there are various benchmarks for social biases across languages, the extent to which Japanese LLMs exhibit social biases has not been fully investigated. In this study, we construct the Japanese Bias Benchmark dataset for Question Answering (JBBQ) based on the English bias benchmark BBQ, and analyze social biases in Japanese LLMs. The results show that while current open Japanese LLMs with more parameters achieve higher accuracies on JBBQ, their bias scores also increase. In addition, prompts with a warning about social biases and chain-of-thought prompting reduce the effect of biases in model outputs, but there is room for improvement in extracting the correct evidence from contexts in Japanese. Our dataset is available at https://github.com/ynklab/JBBQ_data.
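Since JBBQ inherits BBQ's scoring, the bias scores reduce to simple counts over model answers. The sketch below follows the BBQ formulation as commonly described (a disambiguated-context score from the share of stereotype-aligned answers, scaled by the error rate for ambiguous contexts) and should be treated as an approximation of the benchmark's exact definitions:

```python
def bias_scores(n_biased, n_non_unknown, accuracy):
    """BBQ-style bias scores from answer counts.

    n_biased       -- answers aligned with the social stereotype
    n_non_unknown  -- answers other than "unknown"
    accuracy       -- model accuracy in the ambiguous contexts
    """
    # Disambiguated-context score: +1 = always stereotype-aligned,
    # -1 = always anti-stereotype, 0 = balanced.
    s_dis = 2.0 * n_biased / n_non_unknown - 1.0
    # Ambiguous-context score scales s_dis by the error rate, so a
    # model that correctly answers "unknown" shows no measured bias.
    s_amb = (1.0 - accuracy) * s_dis
    return s_dis, s_amb
```

This makes the abstract's finding concrete: accuracy and bias score are separate axes, so a model can improve on one while worsening on the other.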
Abstract: This study deals with the right to information. Arrested persons must be informed of their rights provided for in Article 18 of the Constitution of the Democratic Republic of the Congo. In practice, however, this right is very often ignored, and that ignorance has serious consequences, since unawareness of one's status greatly facilitates the violation of other rights. The right to be informed immediately of the reasons for one's arrest and of the charges brought against one, and the right to immediate contact with one's family, are not always respected, nor is the right to obtain the assistance of a lawyer: one need only look inside the "amigo" (police holding cell) to see that these rights are violated. In a significant number of cases, as in Kinshasa and Bandundu, the limits on the duration of detention and the requirement of arrest warrants were not respected.
Keywords: rights, accused person, pre-trial judicial investigation
Arts in general, Computational linguistics. Natural language processing
Abstract: The history of decentralization in Côte d'Ivoire highlights the defiance of state authority through the development and management of urban transport spaces by unions, notwithstanding the disorder and insecurity this situation causes. Since 2020, however, transport operators in Koumassi have been working out of the Yaya Fofana station, a space developed and managed by the local authorities. The aim of this study is to analyse the social determinants of the transport unions' support for this urban development project. Documentary research combined with a field survey shows that the transport unions perceive the development as a participatory project: through the negotiation initiated by the local authority, the transporters' needs are met. The resources mobilized to reach this negotiated outcome nevertheless show that it constitutes a strategy for asserting state authority, one that legitimizes urban planning rules and secures support for the project.
Keywords: urban planning, participation, support, power, authority
Arts in general, Computational linguistics. Natural language processing
The Transformer is a state-of-the-art model in the field of natural language processing (NLP). Current NLP models primarily stack more Transformer layers to improve processing performance, but this technique requires substantial training resources such as computing capacity. In this paper, a novel Transformer structure is proposed, featuring full layer normalization, weighted residual connections, positional encoding exploiting reinforcement learning, and zero-masked self-attention. The proposed model, called the Enhanced Transformer, is validated by the bilingual evaluation understudy (BLEU) score obtained on the Multi30k translation dataset, where it achieves a 202.96% higher BLEU score than the original Transformer.
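Two of the listed ingredients are easy to sketch in isolation: layer normalization applied around every sublayer ("full layer normalization") and a weighted residual connection that blends the sublayer output with its input rather than summing them unweighted. The α/β parameterization below is an illustrative guess at the idea, not the paper's exact design:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def weighted_residual(x, sublayer_out, alpha=0.8, beta=0.2):
    """Weighted residual connection followed by layer normalization.

    A plain Transformer computes layer_norm(x + sublayer(x)); here the
    two branches are blended with scalar weights (hypothetical values),
    which in a trained model could be learned per layer.
    """
    return layer_norm(alpha * sublayer_out + beta * x)
```

In a real implementation the sublayer would be attention or a feed-forward block and α, β would be trainable; the sketch only shows how the blend replaces the unweighted sum.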