W. Mann
Results for "Computational linguistics. Natural language processing"
Showing 20 of ~30,233 results · from DOAJ, arXiv, Semantic Scholar
Yang-Hui He, Vishnu Jejjala, B. Nelson
We apply techniques from natural language processing, computational linguistics, and machine learning to investigate papers in hep-th and four related sections of the arXiv: hep-ph, hep-lat, gr-qc, and math-ph. All of the titles of papers in each of these sections, from the inception of the arXiv until the end of 2017, are extracted and treated as a corpus which we use to train the neural network Word2Vec. A comparative study of common n-grams, linear syntactic identities, word clouds, and word similarities is carried out. We find notable scientific and sociological differences between the fields. In conjunction with support vector machines, we also show that the syntactic structures of the titles in different sub-fields of high energy and mathematical physics are sufficiently different that a neural network can perform a binary classification of formal versus phenomenological sections with 87.1% accuracy, and a finer five-fold classification across all sections with 65.1% accuracy.
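The common-n-gram comparison described above can be sketched with the standard library alone; the titles here are invented stand-ins, not the actual arXiv corpus:

```python
from collections import Counter

def ngrams(title, n):
    """Return the list of word n-grams in a lowercased title."""
    words = title.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Toy stand-ins for titles from two arXiv sections (not real data).
hep_th_titles = ["Black hole entropy in string theory",
                 "Entanglement entropy and black hole horizons"]
hep_ph_titles = ["Higgs boson production at the LHC",
                 "Dark matter signals at the LHC"]

def common_ngrams(titles, n, k=3):
    """Top-k most frequent n-grams across a list of titles."""
    counts = Counter(g for t in titles for g in ngrams(t, n))
    return counts.most_common(k)

print(common_ngrams(hep_th_titles, 2))
```

Comparing the top lists per section gives the kind of field-by-field contrast the abstract reports; the actual study additionally trains Word2Vec on the full title corpus.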
Vincent Ooi
Nisar Hussain, Amna Qasim, Gull Mehak et al.
This study introduces a new model for detecting insults in Roman Urdu, filling an important gap in natural language processing (NLP) for low-resource languages. The transliterated nature of Roman Urdu also poses specific challenges from a computational linguistics perspective, including non-standardized grammar, variation in spellings of the same word, and high levels of code-mixing with English, which together make automated insult detection for Roman Urdu a highly complex problem. To address these problems, we created a large-scale dataset of 46,045 labeled comments from social media websites such as Twitter, Facebook, and YouTube. This is the first dataset for insult detection in Roman Urdu annotated with insulting and non-insulting content. The study applies advanced preprocessing methods such as text cleaning, text normalization, and tokenization, as well as TF–IDF feature extraction over unigrams (Uni), bigrams (Bi), trigrams (Tri), and their union (Uni+Bi+Trigram). We compared ten machine learning algorithms (including logistic regression, support vector machines, random forest, gradient boosting, AdaBoost, and XGBoost) and three deep learning topologies (CNN, LSTM, and Bi-LSTM). Among the models compared, ensemble methods achieved the highest F1-scores, reaching 97.79%, 97.78%, and 95.25% for the AdaBoost, decision tree, and TF–IDF Uni+Bi+Trigram configurations, respectively. Deep learning models performed on par, with CNN achieving an F1-score of 97.01%. Overall, the results highlight the utility of n-gram features combined with robust classifiers for detecting insults. This study advances NLP for Roman Urdu and lays a foundation for future research on pre-trained transformers and hybrid approaches, which could overcome the limitations of existing systems and platforms.
This study also has practical implications, mainly for the construction of automated moderation tools that create safer online spaces, especially on South Asian social media websites.
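The TF–IDF feature extraction over combined unigram and bigram features (Uni+Bi) mentioned above can be illustrated in plain Python; this is a minimal sketch with a smoothed IDF and toy English comments, not the authors' exact pipeline:

```python
import math
from collections import Counter

def features(text):
    """Uni+Bi features: unigrams plus adjacent-word bigrams."""
    words = text.lower().split()
    return words + [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

def tfidf(docs):
    """Per-document TF-IDF weights over Uni+Bi features (smoothed IDF)."""
    doc_feats = [Counter(features(d)) for d in docs]
    n = len(docs)
    df = Counter()                       # document frequency per feature
    for c in doc_feats:
        df.update(set(c))
    out = []
    for c in doc_feats:
        total = sum(c.values())
        out.append({f: (cnt / total) * math.log((1 + n) / (1 + df[f]))
                    for f, cnt in c.items()})
    return out

# Toy comments standing in for the Roman Urdu dataset.
weights = tfidf(["you are great", "you are awful troll"])
```

Features shared by every document (like "you" here) get zero weight under the smoothed IDF, while discriminative n-grams are up-weighted, which is what makes these vectors useful inputs to the classifiers compared in the study.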
Brenda Salenave Santana, Ricardo Campos, Evelin Amorim et al.
Narratives are present in many forms of human expression and can be understood as a fundamental means of communication between people. Computational understanding of the underlying story of a narrative, however, may be a rather complex task for both linguists and computational linguists. Such a task can be approached using natural language processing techniques to automatically extract narratives from texts. In this paper, we present an in-depth survey of narrative extraction from text, establishing a framework and roadmap for the study of this area as a whole and consolidating a view of this line of research. We aim to fill the current gap by identifying important research efforts at the crossroads between linguistics and computer science. In particular, we highlight the importance and complexity of the annotation process as a crucial step for the training stage. Next, we detail methods and approaches for identifying and extracting narrative components, linking them, and understanding their likely inherent relationships, before detailing formal narrative representation structures as an intermediate step for visualization and data exploration. We then move on to aspects of the narrative evaluation task, and conclude this survey by highlighting important open issues in narrative extraction from texts that are yet to be explored.
Matthias Schoffel, Marinus Wiedner, E. Arias et al.
Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora, hagiographical and medical texts, we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.
Xiaoyi Tang, Hongwei Chen, Daoyu Lin et al.
Recent advancements in natural language processing, computational linguistics, and Artificial Intelligence (AI) have propelled the use of Large Language Models (LLMs) in Automated Essay Scoring (AES), offering efficient and unbiased writing assessment. This study assesses the reliability of LLMs in AES tasks, focusing on scoring consistency and alignment with human raters. We explore the impact of prompt engineering, temperature settings, and multi-level rating dimensions on the scoring performance of LLMs. Results indicate that prompt engineering significantly affects the reliability of LLMs, with GPT-4 showing marked improvement over GPT-3.5 and Claude 2, achieving 112% and 114% increases in scoring accuracy under the criteria- and sample-referenced justification prompts. Temperature settings also influence the output consistency of LLMs, with lower temperatures producing scores more in line with human evaluations, which is essential for maintaining fairness in large-scale assessment. Regarding multi-dimensional writing assessment, results indicate that GPT-4 performs well in the Ideas (QWK=0.551) and Organization (QWK=0.584) dimensions under well-crafted prompt engineering. These findings pave the way for a comprehensive exploration of LLMs' broader educational implications, offering insights into their capability to refine and potentially transform writing instruction, assessment, and the delivery of diagnostic and personalized feedback in the AI-powered educational age. While this study focused on the reliability and alignment of LLM-powered multi-dimensional AES, future research should broaden its scope to encompass diverse writing genres and a more extensive sample from varied backgrounds.
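The QWK figures cited are quadratic weighted kappa scores, the standard agreement statistic for ordinal ratings; a minimal pure-Python implementation of the textbook formula:

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, num_ratings):
    """QWK between two integer rating lists on a 0..num_ratings-1 scale."""
    n = len(a)
    obs = Counter(zip(a, b))                 # observed joint rating counts
    hist_a, hist_b = Counter(a), Counter(b)  # marginal counts per rater
    num = den = 0.0
    for i in range(num_ratings):
        for j in range(num_ratings):
            w = (i - j) ** 2 / (num_ratings - 1) ** 2   # quadratic penalty
            num += w * obs[(i, j)]
            den += w * hist_a[i] * hist_b[j] / n         # expected by chance
    return 1.0 - num / den

# Perfect agreement gives kappa = 1.
print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2], 3))
```

Values like the QWK=0.551 reported for the Ideas dimension thus mean moderate agreement between the LLM's scores and the human raters' after correcting for chance.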
Natalia Borza
The publication of the first storybook for children about LGBTQ people in the Hungarian language gave rise to controversy. The study seeks to delineate the social perception of Gender Education (GE) for children in Hungary by analyzing the voice of the Hungarian people who showed a genuine interest in the issue. The research draws on the online comments (N=340) which were displayed on the website that shared the official video about the LGBTQ children’s book launch in a five-month period. In the online comments, social normativity was uncovered by using Van Leeuwen’s (2008) taxonomy of discursive (de)legitimation strategies. The findings show that both the supporters and the opponents of GE for children apply a wide variety of discursive constructions of (de)legitimation to express their competing values. However, there is a degree of overlap between the arguments of the two groups, which hold opposing positions with regard to norms.
Miguel Martín Echarri, Rocío Quintanilla López Tafall
Against the idea that foreign accent requires no treatment in the Spanish-as-a-foreign-language (ELE) classroom as long as it does not cause communication problems, our hypothesis was that the comprehensibility of utterances by speakers with a foreign accent would be more affected by noise than that of native speakers. To test it, auditory perception tests were conducted with non-expert Spanish-speaking judges and ELE teachers, applying statistical analyses of normality (Shapiro-Wilk), group comparison (Mann-Whitney U), analysis of variance (ANOVA), and correlation (Spearman). The results show that foreign-accented productions suffer a significant reduction in comprehensibility when noise is superimposed, especially on first listening. Moreover, while the ratings of Spanish learners showed a moderate correlation between non-expert Spanish-speaking judges and ELE teachers, no correlation was found in the ratings of native speakers, indicating differences in evaluation criteria depending on the judges' linguistic experience. This empirically confirms the impact of foreign accent on comprehensibility under adverse conditions and identifies significant differences in accent perception between ELE experts and non-experts.
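Of the statistics applied, the Spearman correlation between judge groups has a simple closed form when ranks are untied; a sketch on invented ratings (not the study's data):

```python
def spearman_rho(x, y):
    """Spearman rank correlation, no-ties formula: 1 - 6*sum(d^2)/(n(n^2-1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Invented comprehensibility ratings from two judge groups (not study data);
# these happen to rank the speakers identically, so rho = 1.
naive_judges = [3.1, 4.2, 2.5, 4.8, 3.9]
ele_teachers = [3.0, 4.5, 2.2, 4.9, 3.6]
print(spearman_rho(naive_judges, ele_teachers))
```

With real tied ratings one would use the rank-based Pearson variant instead of this shortcut formula.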
Hajar Rabie, Moulay A. Akhloufi
Millions of people worldwide suffer from Parkinson’s disease (PD), a neurodegenerative disorder marked by motor symptoms such as tremors, bradykinesia, and stiffness. Accurate early diagnosis is crucial for effective management and treatment. This article presents a novel review of Machine Learning (ML) and Deep Learning (DL) techniques for PD detection and progression monitoring, offering new perspectives by integrating diverse data sources. We examine the public datasets recently used in studies, including audio recordings, gait analysis, and medical imaging. We discuss the preprocessing methods applied, the state-of-the-art models utilized, and their performance. Our evaluation covered algorithms such as support vector machines (SVM), random forests (RF), and convolutional neural networks (CNN). These algorithms have shown promising results in PD diagnosis, with accuracy rates exceeding 99% in some studies that combine data sources. Our analysis particularly showcases the effectiveness of audio analysis in early symptom detection and of gait analysis, including the Unified Parkinson’s Disease Rating Scale (UPDRS), in monitoring disease progression. Medical imaging, enhanced by DL techniques, has improved the identification of PD. The application of ML and DL in PD research offers significant potential for improving diagnostic accuracy. However, challenges remain, such as the need for large and diverse datasets, data privacy concerns, and data quality in healthcare. Additionally, developing explainable AI is crucial to ensure that clinicians can trust and understand ML and DL models. Our review highlights these key challenges, which must be addressed to enhance the robustness and applicability of AI models in PD diagnosis, setting the groundwork for future research to overcome these obstacles.
Edi Sutoyo, Muhammad Cekas Permana
Sentiment analysis, a field within natural language processing, text mining, and computational linguistics, evaluates user opinions and product ratings. This article describes a dataset of user reviews collected from telemedicine applications in Indonesia to understand sentiments related to service quality. The dataset comprises 255,679 textual reviews containing positive and negative feedback, offering valuable input for analyzing user experiences. Reviews were sourced from publicly available platforms, ensuring diversity in user perspectives. The dataset exhibits significant class imbalance, with negative reviews constituting a small proportion compared to positive reviews (ratio exceeding 1:14). To address this imbalance, advanced resampling techniques, including Easy Data Augmentation (EDA), were applied. The dataset underwent rigorous preprocessing to remove noise, standardize content, and tokenize reviews for compatibility with deep learning models. This dataset has been utilized with architectures such as SRNN, 1D-CNN, 1L-LSTM, and BiLSTM for sentiment classification. Generated word clouds highlight frequently mentioned terms, enabling exploratory analysis. The dataset is publicly available, providing a resource for benchmarking sentiment classification algorithms and studying the impact of imbalanced data handling on model performance. This work contributes to enhancing telemedicine service quality and advancing Indonesian natural language processing research.
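Easy Data Augmentation (EDA), used above to inflate the minority class, combines four simple operations; two of them, random swap and random deletion, can be sketched as follows. This is an illustrative sketch, not the dataset's exact pipeline, and the sample sentence is invented:

```python
import random

def random_swap(words, n_swaps, rng):
    """EDA random swap: swap two random positions, n_swaps times."""
    words = words[:]                       # do not mutate the caller's list
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """EDA random deletion: drop each word independently with probability p."""
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]     # never return an empty sentence

rng = random.Random(0)                     # seeded for reproducibility
sent = "layanan aplikasi ini sangat lambat".split()
print(random_swap(sent, 1, rng))
print(random_deletion(sent, 0.3, rng))
```

The other two EDA operations, synonym replacement and random insertion, additionally require a synonym lexicon for the target language, which is the harder part for Indonesian.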
Vincenzo Damiazzi
This study employs a perceptual and an acoustic analysis of a corpus of spontaneous utterances taken from German podcasts on sustainability. The objective is to identify and examine the differences in the perception of prominence between two groups of participants: a group of adult Italian learners of German as a Foreign Language (GFL) and a group of German native speakers. The results of the analysis lend support to the hypothesis that the differences in prominence perception can be attributed to perception patterns based on duration for Italian native speakers and on pitch variations for German native speakers. In light of the evidence presented, some considerations are made regarding the potential applications of acoustic analysis in GFL courses. The potential applications of this approach are investigated, based on a simplified visualisation of prosody, by means of F0 curves enriched with information on the periodic energy of the acoustic signal.
Charles M. Varmantchaonala, Niclas Götting, Nils-Erik Schütte et al.
Quantum Natural Language Processing (QNLP) offers a novel approach to encoding and understanding the complexity of natural languages through the power of quantum computation. This paper presents a pretrained quantum context-sensitive embedding model, called QCSE, that captures context-sensitive word embeddings, leveraging the unique properties of quantum systems to learn contextual relationships in languages. The model introduces quantum-native context learning, enabling the utilization of quantum computers for linguistic tasks. Central to the proposed approach are innovative context matrix computation methods, designed to create unique representations of words based on their surrounding linguistic context. Five distinct methods are proposed and tested for computing the context matrices, incorporating techniques such as exponential decay, sinusoidal modulation, phase shifts, and hash-based transformations. These methods ensure that the quantum embeddings retain context sensitivity, thereby making them suitable for downstream language tasks where the expressibility and properties of quantum systems are valuable resources. To evaluate the effectiveness of the model and the associated context matrix methods, evaluations are conducted on both a small corpus of Fulani, a low-resource African language, and a slightly larger English corpus. The results demonstrate that QCSE not only captures context sensitivity but also leverages the expressibility of quantum systems for representing rich, context-aware language information. The use of Fulani further highlights the potential of QNLP to mitigate the lack of data for this category of languages. This work underscores the power of quantum computation in natural language processing (NLP) and opens new avenues for applying QNLP to real-world linguistic challenges across various tasks and domains.
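The paper's exact context-matrix constructions are not reproduced here, but the exponential-decay idea can be illustrated classically: weight each context word by its distance from the target position, then normalize to unit length, mimicking the amplitude normalization of a quantum state. All names and parameters below are illustrative assumptions, not QCSE's actual definitions:

```python
import math

def context_vector(tokens, idx, vocab, decay=0.5):
    """Weight each vocabulary word by exp(-decay * distance) from position idx,
    then L2-normalize, as quantum amplitudes must sum in squares to 1."""
    v = [0.0] * len(vocab)
    for pos, tok in enumerate(tokens):
        if pos == idx:
            continue                       # the target word itself is excluded
        v[vocab[tok]] += math.exp(-decay * abs(pos - idx))
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

tokens = ["quantum", "models", "encode", "context"]
vocab = {t: i for i, t in enumerate(tokens)}
vec = context_vector(tokens, 2, vocab)     # context of "encode"
```

Nearby words receive larger amplitudes than distant ones, which is the context sensitivity the decay scheme is meant to provide; the paper's other four methods vary this weighting function.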
Zhixian Zhao, Xinfa Zhu, Xinsheng Wang et al.
Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis, and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.
Silvia Fernandez-Sabido, Laura Peniche-Sabido
In this paper, we offer an overview of indigenous languages, identifying the causes of their devaluation and the need for legislation on language rights. We review the technologies used to revitalize these languages, finding that when they come from outside, they often have the opposite effect to what they seek; however, when developed from within communities, they become powerful instruments of expression. We propose that the inclusion of Indigenous knowledge in large language models (LLMs) will enrich the technological landscape, but must be done in a participatory environment that encourages the exchange of knowledge.
Gary Lupyan
Acknowledging that large language models have learned to use language can open doors to breakthrough language science. Achieving these breakthroughs may require abandoning some long-held ideas about how language knowledge is evaluated and reckoning with the difficult fact that we have entered a post-Turing test era.
Wanxiang Che, Zhenghua Li, Ting Liu
Jinyi Zhang, Ke Su, Haowei Li et al.
Machine translation, the automatic transformation of one natural language (the source language) into another (the target language) through computational means, occupies a central role in computational linguistics and stands as a cornerstone of research within the field of Natural Language Processing (NLP). In recent years, the prominence of Neural Machine Translation (NMT) has grown rapidly, offering an advanced framework for machine translation research. It is noted for its superior translation performance, especially when tackling the challenges posed by low-resource language pairs with limited corpus resources. This article offers an exhaustive exploration of the historical trajectory and advancements of NMT, accompanied by an analysis of its foundational concepts. It then provides a concise demarcation of the characteristics of low-resource languages and a succinct review of pertinent translation models and their applications in low-resource settings. Moreover, the article delves into machine translation techniques tailored for Chinese-centric low-resource languages. Finally, it anticipates upcoming research directions in low-resource language translation.
Muhammad Ilham Ali
This study looks at how the novel The Andromeda Strain depicts a society that is both fascinated by and terrified of science, offering a perceptive viewpoint on pandemics and their social repercussions. The research analyzes the social conditions and circumstances during the pandemic in The Andromeda Strain, using a descriptive qualitative method with mimetic and sociological approaches. Based on the findings and analysis, the results show social changes due to the pandemic in the novel, reflected in the actions taken by the characters. These include cases of mass death caused by a bacterium called Andromeda and cases of suicide, but also prevention efforts such as lockdowns and quarantines, a research effort against the bacterium known as the Wildfire Project, and cooperation in dealing with the Andromeda outbreak. Keywords: Pandemic, Social Situation, The Andromeda Strain.
Kristen Hawley Turner
Page 21 of 1512