Results for "Computational linguistics. Natural language processing"

S2 Open Access 2019

CamemBERT: a Tasty French Language Model

Louis Martin, Benjamin Muller, Pedro Ortiz Suarez et al.

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages except English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

1073 citations en Computer Science

S2 Open Access 2002

NLTK: The Natural Language Toolkit

E. Loper, Steven Bird

The Natural Language Toolkit is a suite of program modules, data sets and tutorials supporting research and teaching in computational linguistics and natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past year the toolkit has been rewritten, simplifying many linguistic data structures and taking advantage of recent enhancements in the Python language. This paper reports on the simplified toolkit and explains how it is used in teaching NLP.

1445 citations en Computer Science

S2 Open Access 2021

Measuring and Improving Consistency in Pretrained Language Models

Yanai Elazar, Nora Kassner, Shauli Ravfogel et al.

Consistency of a model, that is, the invariance of its behavior under meaning-preserving alternations in its input, is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel🤘, a high-quality resource of cloze-style query English paraphrases. It contains a total of 328 paraphrases for 38 relations. Using ParaRel🤘, we show that the consistency of all PLMs we experiment with is poor, though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.

460 citations en Computer Science
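
The notion of consistency in this abstract (the same answer for every paraphrase of one cloze query) can be illustrated with a small sketch. This is only one plausible way to score it, not necessarily the metric used in the paper; the function name `pairwise_consistency` and the sample predictions are hypothetical.

```python
from itertools import combinations


def pairwise_consistency(predictions):
    """Return the fraction of paraphrase pairs whose predictions agree.

    `predictions` holds one model answer per paraphrase of the same
    cloze-style query (hypothetical input format).
    """
    pairs = list(combinations(predictions, 2))
    if not pairs:
        # Zero or one paraphrase: nothing can disagree.
        return 1.0
    agreeing = sum(1 for a, b in pairs if a == b)
    return agreeing / len(pairs)


# Made-up answers for four paraphrases of one factual query.
preds = ["Paris", "Paris", "Lyon", "Paris"]
score = pairwise_consistency(preds)  # 3 agreeing pairs out of 6
```

A score of 1.0 means the model answered every paraphrase identically; lower values indicate that rewording the query changed the answer.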

DOAJ Open Access 2026

Research on wideband acoustic imaging method based on sparse Bayesian learning

Zhaoyi Liao, Qiang Zeng, Lirong Liu et al.

In response to the low accuracy of acoustic imaging with small microphone arrays, an acoustic imaging approach based on Sparse Bayesian Learning is proposed. A two-layer conjugate prior structure is built to model the signal. The approach also leverages parameters shared across frequency bands to make joint use of multi-frequency information, thereby reducing the adverse impact of column vector correlations in the observation matrix. All variables are modeled using conjugate priors, ensuring that closed-form solutions exist for the updates of all model variables. The experimental results demonstrate that the proposed method effectively improves the accuracy of acoustic imaging.

Computational linguistics. Natural language processing, Electronic computers. Computer science

DOAJ Open Access 2026

A model for assessing college students’ entrepreneurial abilities based on deep learning and big data

Haina Guo

Entrepreneurship is widely acknowledged as a catalyst for innovation and economic growth; assessing entrepreneurial abilities in college students is a strategic imperative in higher education. Conventional evaluation approaches often suffer from limited scalability, subjectivity, and insufficient predictive power, particularly when applied to large and complex datasets. These challenges necessitate advanced, intelligent frameworks capable of extracting meaningful patterns from educational data. This research proposes a novel deep learning architecture named Battle Royale Optimized Self-Attention Bi-Directional Long Short-Term Memory (BRO-SA-Bi-LSTM) to assess entrepreneurial ability among college students by leveraging big data. The BRO algorithm enhances exploration and convergence during training, while the SA mechanism enables the model to assign dynamic importance to key features. The Bi-LSTM structure captures both past and future contextual dependencies, improving temporal learning and feature retention. The dataset, compiled from multiple academic institutions, includes variables such as academic performance in entrepreneurship courses, participation in innovation clubs, entrepreneurial experience, structured training involvement, and funding acquisition. Z-score normalization is applied during preprocessing to standardize data distributions. PCA is utilized for dimensionality reduction and feature extraction. The BRO-SA-Bi-LSTM model is implemented using Python and TensorFlow. Evaluation metrics include F1-score, accuracy, recall, and precision, ranging from 95 to 99%. Results show that the proposed model outperforms conventional architectures, delivering high predictive accuracy and superior generalization across test data. The proposed framework provides a scalable AI-based assessment strategy that can support curriculum design, personalized training, and large-scale educational evaluation.

Computational linguistics. Natural language processing, Electronic computers. Computer science
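
The Z-score normalization step named in this abstract can be sketched in a few lines. This is a minimal illustration using only the Python standard library, assuming the population standard deviation; the abstract does not say which variant the authors used, the PCA step is not shown, and the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up course scores: mean 5, population standard deviation 2.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = z_score(scores)
```

After scaling, each feature is expressed in standard deviations from its own mean, which keeps features with large raw ranges from dominating later steps such as PCA.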

DOAJ Open Access 2025

Namdeutsch und Namslang.

Silvia Verdiani

Namdeutsch and Namslang are two German-based non-standard varieties that contain loanwords from Afrikaans, English and indigenous languages. Their development began during German colonial rule in German South West Africa, now Namibia. Namdeutsch is a linguistic variant that has become established among German-speakers in Namibia in recent decades, while Namslang or NAM Släng has recently become widely used online by Namibian German-speaking youth communities. My paper focuses on differently marked multimodal online communication to identify and describe typical features of German diversity in Namibia. It will also identify the role of digital multimodality, based on the interaction of visual and linguistic material, in the constitution of meaning in this context. The aim is to examine how this comes to light through the online press and bilingual/trilingual conversations in social networks, and how speaker identities in social media are (co-)constructed through the varieties of Namdeutsch and Namslang. The research is based on the Facebook page and the NAM Släng dictionary of the musician EES (Eric Sell), the glosses of the online edition of the Allgemeine Zeitung and the Deutsch in Namibia (DNAM) corpus.

Computational linguistics. Natural language processing, Language. Linguistic theory. Comparative grammar

DOAJ Open Access 2025

PDNet: a lightweight attention-guided CNN for efficient pallet racking defect detection on edge devices

Rahima Khanam, Muhammad Hussain, Richard Hill

This study presents three algorithmic innovations aimed at optimizing Convolutional Neural Network architectures for automated defect detection in pallet rack inspection scenarios. First, a domain-specific Augmentation Algorithm is introduced to expand the training dataset, mitigate overfitting, and enhance the model’s generalization by emphasizing critical visual features associated with structural defects. Second, a guided CNN Development Mechanism facilitates architectural optimization by systematically refining filter sizes, neuron counts, and convolutional block configurations, enabling high performance with reduced parameter overhead. Third, the proposed PalletDetect Module (PD-M) enhances computational efficiency by adaptively refining feature representations at the input tensor level, reducing complexity while preserving discriminative capacity. These algorithms collectively produce PDNet, a compact CNN that enables real-time pallet racking inspection on resource-constrained edge devices. PDNet achieves an accuracy of 92.07%, with a computational complexity of only 32.31 million multiply–accumulate operations (MMAC) and a compact memory footprint of 31.36 MB. Compared to modern lightweight CNNs such as MobileNetV3 and ShuffleNetV2, PDNet offers a superior balance between accuracy, speed, and computational efficiency, demonstrating its potential for real-time industrial inspection applications.

Computational linguistics. Natural language processing, Electronic computers. Computer science

CrossRef Open Access 2024

An Experimental Analysis of Monosyllabic Tone and Disyllabic Tone in Wuming Mandarin

Yi Yuan, Hongli Deng

en

DOAJ Open Access 2024

Detection of tuberculosis using cough audio analysis: a deep learning approach with capsule networks

Sakthi Jaya Sundar Rajasekar, Anu Rithiga Balaraman, Deepa Varnika Balaraman et al.

Purpose: Tuberculosis (TB) is a widespread infectious disease that requires early detection for effective treatment and control. This study aims to improve TB detection using cough audio analysis, comparing the performance of capsule networks to other deep learning models. Methods: We used cough audio recordings from 1105 individuals with a new or worsening cough for at least two weeks, totaling 9772 recordings. These recordings were processed into spectral images, and HOG features were extracted. Various models, including Capsule Networks + FCNN, CNN, VGG16, and ResNet50, were trained and evaluated. Results: Capsule Networks + FCNN achieved the best performance with an accuracy of 0.97, sensitivity of 0.98, specificity of 0.96, F1 score of 0.97, and precision of 0.97, outperforming other models. This is attributed to the model’s ability to learn complex features from spectral images. Conclusions: This study concludes that Capsule Networks are more effective than typical CNN-based models in diagnosing TB from cough audio. This suggests that advanced deep learning frameworks could significantly enhance TB screening accuracy, especially in resource-limited areas.

Computational linguistics. Natural language processing, Electronic computers. Computer science
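
The five metrics reported in this abstract all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    Assumes every denominator is non-zero, i.e. there is at least one
    actual positive, one actual negative and one predicted positive.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall: share of true cases detected
    specificity = tn / (tn + fp)   # share of healthy cases correctly cleared
    precision = tp / (tp + fp)     # share of positive calls that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Made-up counts for illustration only.
m = binary_metrics(tp=90, fp=10, tn=80, fn=20)
```

For a screening setting, sensitivity and specificity are usually read together, since a high accuracy alone can hide a poor detection rate when positive cases are rare.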

DOAJ Open Access 2024

POS Tagging and Lemmatization of Historical Varieties of Languages. The Challenge of Old Italian

Manuel Favaro, Marco Biffi, Simonetta Montemagni

The paper discusses the challenges of POS tagging and lemmatization of historical varieties of Italian, and reports for both tasks the results of experiments carried out in a classical supervised domain adaptation scenario using the diachronic and typologically differentiated corpus built for the "Vocabolario Dinamico dell’Italiano Moderno" (VoDIM). For POS tagging, the effectiveness of retrained models is illustrated and substantiated with quantitative data, with a specific focus on the linguistic annotation results obtained for particular stages of language evolution, domains and textual genres. For lemmatization, different customized models have been developed, including lexicon-assisted ones and models retrained with historical annotated texts. In both cases, a detailed error analysis is provided.

Social Sciences, Computational linguistics. Natural language processing

DOAJ Open Access 2024

Traduction des locutions juridiques latines : une analyse lexicale et morphosyntaxique

Irène KEBIHENG A MABEN

Abstract: Legal language is a specialized variety with its own particularities. Latin, now regarded as a dead language in everyday exchange, lives on in certain spheres, including the legal world. A wide range of Latin expressions and phrases is used in the justice system, in pleadings and even in publications. On inspection, the French translations of these phrases show variations that may call into question the accuracy of the language of justice. The not-always-faithful character of these translations is all the more striking because this is the language of justice, which is expected to be as accurate as possible, in keeping with the setting it serves. We examine the accuracy of the translation of Latin phrases and observe their lexical and morphosyntactic variations. From these observations, we deduce the possible implications of the lack of accuracy of these translations from a pragmatic point of view. These questions are addressed through the lens of distributionalism as understood by Leonard Bloomfield, who holds that linguistic analyses take into account visible and observable elements (Bloomfield 1933: 69). To understand the impact of using these phrases, we also draw on linguistic pragmatics according to Oswald Ducrot, who conceives of what is said as extending beyond the visible element (Ducrot 1972: 44). For our methodology, we worked with an article from the online journal Iurisma.com, which offers a lexicon of Latin legal phrases and their translations. We analyze the Latin translations from a lexical and morphosyntactic point of view and bring out the elements that call into question the accuracy of these legal phrases. We also consider the impact that the use of these phrases may have in legal jargon.
The expected results rest first on identifying the lexical and morphosyntactic forms that vary across translations, and second on the impact of these variations on the reception of legal texts. Keywords: Latin phrases, legal texts, translation, lexicon, morphosyntax

Arts in general, Computational linguistics. Natural language processing

DOAJ Open Access 2023

Entonación y pragmática en conversaciones telefónicas: las preguntas absolutas en las variedades dominicana y puertorriqueña

Carolina Gomes da Silva, Mayra Suézia Oliveira dos Santos, Mikaellen Kawany do Nascimento et al.

This study investigates the correlation between prosodic form and pragmatic function in Caribbean absolute interrogative utterances from the Dominican Republic and Puerto Rico, produced in colloquial telephone conversations. The general objective is to describe the intonation of absolute interrogative utterances in the Dominican and Puerto Rican varieties. Methodologically, two telephone conversations were selected, one Dominican and one Puerto Rican, of 15 minutes each, yielding 53 utterances. Pragmatically, the analysis categories of Gabriel (2018) were used; acoustically, the speech analysis software Praat (Boersma and Weenink, 1992-2022) was used; and phonologically, the Sp_ToBI model (Prieto and Roseano, 2018) was used. In the results, the falling movement, H+L*L%, was predominant in the Dominican Republic, and the circumflex, L+H*L%, was more frequent in Puerto Rico, confirming the data described for other types of oral samples (Armstrong, 2010; Willis, 2010). The study concludes by noting the importance of carrying out intonational phonology studies for both varieties.

Philology. Linguistics, Computational linguistics. Natural language processing

CrossRef Open Access 2022

Novelty Detection: A Perspective from Natural Language Processing

Tirthankar Ghosal, Tanik Saikh, Tameesh Biswas et al.

The quest for new information is an inborn human trait and has always been quintessential for human survival and progress. Novelty drives curiosity, which in turn drives innovation. In Natural Language Processing (NLP), Novelty Detection refers to finding text that has some new information to offer with respect to whatever is earlier seen or known. With the exponential growth of information all across the Web, there is an accompanying menace of redundancy. A considerable portion of the Web contents are duplicates, and we need efficient mechanisms to retain new information and filter out redundant information. However, detecting redundancy at the semantic level and identifying novel text is not straightforward because the text may have less lexical overlap yet convey the same information. On top of that, non-novel/redundant information in a document may have been assimilated from multiple source documents, not just one. The problem compounds when the subject of the discourse is documents, and numerous prior documents need to be processed to ascertain the novelty/non-novelty of the current one in concern. In this work, we build upon our earlier investigations for document-level novelty detection and present a comprehensive account of our efforts toward the problem. We explore the role of pre-trained Textual Entailment (TE) models to deal with multiple source contexts and present the outcome of our current investigations. We argue that a multipremise entailment task is one close approximation toward identifying semantic-level non-novelty. Our recent approach either performs comparably or achieves significant improvement over the latest reported results on several datasets and across several related tasks (paraphrasing, plagiarism, rewrite). We critically analyze our performance with respect to the existing state of the art and show the superiority and promise of our approach for future investigations.
We also present our enhanced dataset TAP-DLND 2.0 and several baselines to the community for further research on document-level novelty detection.

19 citations en

DOAJ Open Access 2022

Conscious Suspension of ‘Stranger Danger’: Fetishism in Disgrace by J. M. Coetzee

Saud Hanif, Fatima Syeda

This paper explores the politics of fetishism in J. M. Coetzee’s Disgrace. It contends that the characters in Coetzee’s novel appear to consciously suspend the danger that is often associated with strangers. They willingly interact with those in their neighborhoods and consciously suppress the warnings of mishaps or misfortune that such strange encounters may entail. This study is a textual analysis of the selected text to examine all the encounters, specifically those between Lucy and the natives, as well as those of Bev Shaw with David Lurie. In the post-colonial period, such abnormal behavior notably reveals a distinct feature of fetishism among the colonizer and the colonized, as it does in Coetzee’s characters. In order to achieve this objective, Coetzee’s novel is the primary text, while Sara Ahmed’s concepts of “Stranger Fetishism” and “Encounters” serve as a theoretical framework for this investigation. The objective of this research is to analyze the ambiguities of characters in the novel, such as Lucy’s passive response to her rape and Bev’s ironic behavior with David. Eventually, the discussion leads to the unfolding of their personal intents, which are deliberately kept undercover to satisfy their personal interests. This research intervenes in the current scholarship on postcolonial studies by bringing fresh reading perspectives on Coetzee’s text.

Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing

DOAJ Open Access 2022

Comparative Study of Allama Iqbal’s Urdu and Persian Composition of ode

Dr. Satar Khan Khattak

Allama Muhammad Iqbal began his poetry with Urdu odes, or amatory verses, but very soon he began writing in Persian. Iqbal realized that the Urdu language was too narrow for his ideas and thoughts. He regarded Urdu as a young, inexperienced language that is spoken, written and read in a limited part of the subcontinent. Persian, on the other hand, is an old and experienced language that is spoken, written and read across a vast part of the Muslim world, and it holds the most valuable assets of poetry and prose. The Persian odes of Allama Iqbal are found in Piyam-e-mashriq and Zaboor-e-Ajam. Some Urdu odes of Iqbal are found in Bang-e-Dara and Zarb-e-kaleem, but the most important ones are found in his famous book of Urdu poetry, Bal-e-jibreel. This collection is very important because what is clearly stated in his Persian odes is what is indicated in Bal-e-Jibreel. The first part of Bal-e-Jibreel consists of ghazals. Essentially, these ghazals portray the same meaning that the Persian ghazals imply. However, the experimental writing of these ghazals and the sheer talent employed in this book mark a high point in terms of poetry.

Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing

DOAJ Open Access 2021

MasakhaNER: Named Entity Recognition for African Languages

David Ifeoluwa Adelani, Jade Abbott, Graham Neubig et al.

We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.

Computational linguistics. Natural language processing

S2 Open Access 2015

Reasoning about Quantities in Natural Language

Subhro Roy, Tim Vieira, Dan Roth

Little work from the Natural Language Processing community has targeted the role of quantities in Natural Language Understanding. This paper takes some key steps towards facilitating reasoning about quantities expressed in natural language. We investigate two different tasks of numerical reasoning. First, we consider Quantity Entailment, a new task formulated to understand the role of quantities in general textual inference tasks. Second, we consider the problem of automatically understanding and solving elementary school math word problems. In order to address these quantitative reasoning problems we first develop a computational approach which we show to successfully recognize and normalize textual expressions of quantities. We then use these capabilities to further develop algorithms to assist reasoning in the context of the aforementioned tasks.

169 citations en Computer Science

DOAJ Open Access 2020

Cучасний український драматургійний дискурс: питання внутрішньої і зовнішньої комунікації

Анатолій Мойсієнко

Review of: Korolova V. Комунікативно-прагматична організація сучасного українського драматургійного дискурсу [Communicative and pragmatic organization of contemporary Ukrainian dramaturgical discourse]: monograph. Dnipro: Lira, 2016. 382 pp. About the author: Anatolii Kyrylovych Moisiienko, Doctor of Philology, Professor, Head of the Department of Ukrainian Language and Applied Linguistics, Institute of Philology, Taras Shevchenko National University of Kyiv (Ukraine). Email: anmoj@ukr.net

Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing

DOAJ Open Access 2020

Learning synergies based in-hand manipulation with reward shaping

Zhen Deng, Jian Wei Zhang

In-hand manipulation is a fundamental ability for multi-fingered robotic hands that interact with their environments. Owing to the high dimensionality of robotic hands and intermittent contact dynamics, effectively programming a robotic hand for in-hand manipulations is still a challenging problem. To address this challenge, this work employs a deep reinforcement learning (DRL) algorithm to learn in-hand manipulations for multi-fingered robotic hands. A reward-shaping method is proposed to assist the learning of in-hand manipulation. The synergy of robotic hand postures is analysed to build a low-dimensional hand posture space. Two additional rewards are designed based on both the analysis of hand synergies and the learning history. These two rewards, together with an extrinsic reward, are used to assist the in-hand manipulation learning. Three value functions are trained jointly, each with respect to its own reward function, and they then cooperate to optimise a control policy for in-hand manipulation. The reward shaping not only improves the exploration efficiency of the DRL algorithm but also provides a way to incorporate domain knowledge. The performance of the proposed learning method is evaluated with object rotation tasks. Experimental results demonstrate that the proposed learning method enables multi-fingered robotic hands to learn in-hand manipulation effectively.

Computational linguistics. Natural language processing, Computer software

S2 Open Access 2015

Last Words: Computational Linguistics and Deep Learning

Christopher D. Manning

162 citations en Computer Science, History
