Deep learning in clinical natural language processing: a methodical review
Stephen T Wu, Kirk Roberts, Surabhi Datta
et al.
OBJECTIVE This article methodically reviews the literature on deep learning (DL) for natural language processing (NLP) in the clinical domain, providing quantitative analysis to answer 3 research questions concerning methods, scope, and context of current research. MATERIALS AND METHODS We searched MEDLINE, EMBASE, Scopus, the Association for Computing Machinery Digital Library, and the Association for Computational Linguistics Anthology for articles using DL-based approaches to NLP problems in electronic health records. After screening 1,737 articles, we collected data on 25 variables across 212 papers. RESULTS The number of DL-based clinical NLP publications more than doubled each year through 2018. Recurrent neural networks (60.8%) and word2vec embeddings (74.1%) were the most popular methods; the information extraction tasks of text classification, named entity recognition, and relation extraction were dominant (89.2%). However, there was a "long tail" of other methods and specific tasks. Most contributions were methodological variants or applications, but 20.8% were new methods of some kind. The earliest adopters were in the NLP community, but the medical informatics community was the most prolific. DISCUSSION Our analysis shows growing acceptance of deep learning as a baseline for NLP research, and of DL-based NLP in the medical community. A number of common associations were substantiated (eg, the preference of recurrent neural networks for sequence-labeling named entity recognition), while others were surprisingly nuanced (eg, the scarcity of French language clinical NLP with deep learning). CONCLUSION Deep learning has not yet fully penetrated clinical NLP but is growing rapidly. This review highlighted both the popular and unique trends in this active field.
411 citations
en
Computer Science, Medicine
The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing
Rotem Dror, G. Baumer, Segev Shlomov
et al.
Statistical significance testing is a standard statistical tool designed to ensure that experimental results are not coincidental. In this opinion/theoretical paper we discuss the role of statistical significance testing in Natural Language Processing (NLP) research. We establish the fundamental concepts of significance testing and discuss the specific aspects of NLP tasks, experimental setups, and evaluation measures that affect the choice of significance tests in NLP research. Based on this discussion we propose a simple practical protocol for statistical significance test selection in NLP setups and accompany this protocol with a brief survey of the most relevant tests. We then survey recent empirical papers published in ACL and TACL during 2017 and show that while our community assigns great value to experimental results, statistical significance testing is often ignored or misused. We conclude with a brief discussion of open issues that should be properly addressed so that this important tool can be applied in NLP research in a statistically sound manner.
434 citations
en
Computer Science
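One of the tests such a survey covers, the paired bootstrap, can be sketched in a few lines. The sketch below implements one common variant (resample the test set with replacement and report how often system A's advantage over system B vanishes); it is illustrative only, not the paper's recommended protocol, and the resample count is an arbitrary choice.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap: fraction of resamples where A's advantage over B
    disappears (delta <= 0). Small values suggest the advantage is stable.

    scores_a / scores_b are per-example metric values (e.g. 0/1 correctness)
    for the two systems on the same test set, aligned by example.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    losses = 0
    for _ in range(n_resamples):
        # Resample n examples with replacement and recompute the mean gap.
        delta = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if delta <= 0:
            losses += 1
    return losses / n_resamples
```

Pairing matters: because the two score lists are aligned by test example, per-example variance cancels, which is why this family of tests is preferred over unpaired comparisons on shared benchmarks.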
Continual Lifelong Learning in Natural Language Processing: A Survey
Magdalena Biesialska, Katarzyna Biesialska, M. Costa-jussà
Continual learning (CL) aims to enable information systems to learn from a continuous data stream across time. However, it is difficult for existing deep learning architectures to learn a new task without largely forgetting previously acquired knowledge. Furthermore, CL is particularly challenging for language learning, as natural language is ambiguous: it is discrete, compositional, and its meaning is context-dependent. In this work, we look at the problem of CL through the lens of various NLP tasks. Our survey discusses major challenges in CL and current methods applied in neural network models. We also provide a critical review of the existing CL evaluation methods and datasets in NLP. Finally, we present our outlook on future research directions.
260 citations
en
Computer Science
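One widely used family of CL methods discussed in such surveys is regularization-based, e.g. elastic weight consolidation (EWC), which penalizes drift on weights that were important for earlier tasks. A minimal sketch of the penalty term only, assuming per-parameter (diagonal) Fisher importance estimates are already available; all numbers in any usage are purely illustrative.

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """EWC-style regularizer: lam/2 * sum_i F_i * (theta_i - theta*_i)^2.

    params:     current parameter values (flat list of floats)
    old_params: parameter values after training the previous task
    fisher:     per-parameter importance (diagonal Fisher estimates)
    Parameters with high Fisher importance are discouraged from moving,
    which mitigates catastrophic forgetting of the previous task.
    """
    return 0.5 * lam * sum(
        f * (p - p_old) ** 2
        for f, p, p_old in zip(fisher, params, old_params)
    )
```

In training, this term is added to the new task's loss, so unimportant weights (low Fisher) remain free to adapt while important ones are anchored.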
Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview
Deven Santosh Shah, H. A. Schwartz, Dirk Hovy
An increasing number of natural language processing papers address the effect of bias on predictions, introducing mitigation techniques at different parts of the standard NLP pipeline (data and models). However, these works have been conducted individually, without a unifying framework to organize efforts within the field. This situation leads to repetitive approaches, and focuses overly on bias symptoms/effects, rather than on their origins, which could limit the development of effective countermeasures. In this paper, we propose a unifying predictive bias framework for NLP. We summarize the NLP literature and suggest general mathematical definitions of predictive bias. We differentiate two consequences of bias: outcome disparities and error disparities, as well as four potential origins of biases: label bias, selection bias, model overamplification, and semantic bias. Our framework serves as an overview of predictive bias in NLP, integrating existing work into a single structure, and providing a conceptual baseline for improved frameworks.
287 citations
en
Computer Science
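An outcome disparity of the kind the framework describes can be operationalized very simply, for example as the gap in positive-prediction rates across demographic groups. The following is a toy sketch under that assumption, not the paper's formal mathematical definition.

```python
def outcome_disparity(predictions, groups, positive=1):
    """Gap between the highest and lowest positive-prediction rates
    across groups (one simple operationalization of outcome disparity).

    predictions: model outputs, one per example
    groups:      group label for each example, aligned with predictions
    """
    rates = {}
    for g in set(groups):
        preds = [p for p, gg in zip(predictions, groups) if gg == g]
        rates[g] = sum(1 for p in preds if p == positive) / len(preds)
    values = sorted(rates.values())
    return values[-1] - values[0]  # 0.0 means equal rates across groups
```

An error disparity would be computed analogously, but over per-group error rates against gold labels rather than raw prediction rates.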
OPTIMISING THE DEVELOPMENT OF STUDENTS' ORAL SKILLS: A PEDAGOGICAL APPROACH CENTRED ON ACTIVE ENGAGEMENT IN LANGUAGE COURSES
DEMBRI Naima
Abstract: To take an interest in learner agency in pedagogy is to take an interest in learners' capacity to construct their own knowledge and to act on their own learning while interacting with the other actors involved, in a Vygotskian perspective. In this regard, we argue that any approach to the teaching/learning of a verbal and non-verbal competence must be grounded in "the state of students' psychological development" and, in particular, must articulate with learners' intellectual competences, capacities for action, and verbal capacities.
Keywords: learner agency, knowing how to act, verbal competence, behavioural competences, metacognition, interaction, motivation, being able to act
Arts in general, Computational linguistics. Natural language processing
DIGITAL TRANSFORMATION OF PUBLIC ADMINISTRATION IN CÔTE D'IVOIRE: THE STRATEGIC ROLE OF CLOUD COMPUTING IN MANAGING PUBLIC DATA
Jean-Michel Kouakou Kan N’GUESSAN
Abstract: This article analyses the strategic role of cloud computing in the digital transformation of Ivorian public administration. It highlights the challenges and prospects of public data management, exploring both the opportunities (efficiency, security, operational continuity) and the obstacles (limited infrastructure, insufficient skills, perceived costs). Through a qualitative approach combining interviews and document analysis, the study identifies enabling factors, notably technology partnerships, infrastructure modernisation, and staff training. Finally, strategic recommendations are proposed for the successful adoption of cloud computing in the Ivorian public sector.
Keywords: cloud computing, digital transformation, data management, technological infrastructure, cybersecurity
Arts in general, Computational linguistics. Natural language processing
Review of Asensio Pastor, Margarita Isabel (dir. and ed.). (2024). La investigación cualitativa en Didáctica de la Lengua y la Literatura. Editorial Comares
Isabel Cristina Alfonzo de Tovar
Computational linguistics. Natural language processing, Technology
Use of digital technologies in higher-education programmes in the Democratic Republic of the Congo: challenges and prospects for distance learning in the LMD system
Gabriel ALONGE ONOLOKE
Abstract: In recent years, education systems in most countries of the world have changed; more creative and more collaborative practices have kept emerging, and the connections between digital technologies and pedagogy have in turn progressed spectacularly. In this movement, the Congolese higher-education and university system has not remained on the sidelines, with the switch to the LMD system and the technological requirements it brings. In this light, and with reference to the existing legal and regulatory framework, this article offers an assessment of the hidden challenges of integrating distance learning into the LMD system in the higher institutes and universities of the Democratic Republic of the Congo, with the aim of proposing responsible paths to its success.
Keywords: digital technologies, higher education, distance learning, LMD system, challenges and prospects.
Arts in general, Computational linguistics. Natural language processing
Contrastive Data and Learning for Natural Language Processing
Rui Zhang, Yangfeng Ji, Yue Zhang
et al.
Current NLP models heavily rely on effective representation learning algorithms. Contrastive learning is one such technique: it learns an embedding space in which similar data sample pairs have close representations while dissimilar samples stay far apart. It can be used in supervised or unsupervised settings with different loss functions to produce task-specific or general-purpose representations. While it originally enabled success in vision tasks, recent years have seen a growing number of publications in contrastive NLP. This line of work not only delivers promising performance improvements in various NLP tasks, but also provides desirable characteristics such as task-agnostic sentence representations, faithful text generation, data-efficient learning in zero-shot and few-shot settings, and interpretability and explainability. In this tutorial, we aim to provide a gentle introduction to the fundamentals of contrastive learning approaches and the theory behind them. We then survey the benefits and best practices of contrastive learning for various downstream NLP applications, including text classification, question answering, summarization, text generation, interpretability and explainability, commonsense knowledge and reasoning, and vision-and-language. This tutorial is intended to help researchers in the NLP and computational linguistics community understand this emerging topic and promote future research directions of using contrastive learning for NLP applications.
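The contrastive objective described above can be sketched with an InfoNCE-style loss: pull an anchor toward its positive pair and push it away from negatives. This is a minimal illustrative version using cosine similarity; the temperature value and the toy vectors are arbitrary choices, not from the tutorial.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor:
    -log( exp(sim(a,p)/t) / (exp(sim(a,p)/t) + sum_n exp(sim(a,n)/t)) ).
    Lower loss means the positive is closer to the anchor than the negatives.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

In practice the anchor/positive pair comes from augmentation or supervision (e.g. two views of the same sentence) and the loss is averaged over a batch; this sketch shows only the per-anchor term.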
Consonant mutation in Nzema and Esahie
Samuel Amoh, John Nyame, Isaac Nyarko
et al.
This study examines consonant mutation in Esahie[1] and Nzema. The motivation for the study arises from the close relationship between the two languages; moreover, previous studies have treated this subject only superficially, and separately for each language. Although consonant mutation is common to both languages, the characteristics of the phenomenon are partly similar and partly different across them. This study therefore compares and contrasts consonant mutation in the two related languages to establish a correlation, discussing mutational patterns, directionality, and voicing. Data for this study were assembled from four native speakers of each language and from the literature on the respective languages, and are analysed within the purview of Distinctive Feature Theory. The study finds that the phonological environments in which the various realizations of mutation occur also give rise to a harmony system; however, the two languages differ in their mutational domains. Featural agreement normally holds between vowel and consonant. It is further observed in Nzema and Esahie that the common feature responsible for mutation in the alternant pairs [k/x] and [k/g] is [dorsal]; in [d/l] and [d/n], [coronal]; in [ʨ/ɕ], [+strident]; and in [b/m], [labial]. In both languages, harmonic assimilation is bidirectional.
[1] Esahie is also known as Sefwi, Sehwi, or Asahyue. In Ghana, the language is classified as part of the linguistic Akan group of the larger Akan dialects, even though it shares much closer intelligibility with Nzema (a solely ethnographic Akan) than with Twi and Mfantse (which are both linguistic and ethnographic Akan). Therefore, in this paper, we refer to both variants as separate languages.
Computational linguistics. Natural language processing, Language. Linguistic theory. Comparative grammar
The Operation Sequence Model—Combining N-Gram-Based and Phrase-Based Statistical Machine Translation
Nadir Durrani, Helmut Schmid, Alexander Fraser
et al.
Computational linguistics. Natural language processing
EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian
Valerio Basile, D. Croce, Maria Di Maro
et al.
The Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA) is the biennial initiative aimed at promoting the development of language and speech technologies for the Italian language. EVALITA is promoted by the Italian Association of Computational Linguistics (AILC) and endorsed by the Italian Association for Artificial Intelligence (AIxIA) and the Italian Association for Speech Sciences (AISV). EVALITA provides a shared framework where different systems and approaches can be scientifically evaluated and compared with each other across a large variety of tasks, suggested and organized by the Italian research community. The proposed tasks represent scientific challenges where methods, resources, and systems can be tested against shared benchmarks representing open linguistic issues or real-world applications, possibly in a multilingual and/or multimodal perspective. The collected datasets provide great opportunities for scientists to explore old and new problems in Italian NLP, to develop solutions, and to discuss NLP-related issues within the community. Some tasks are traditionally present in the evaluation campaign, while others are completely new. This paper introduces the tasks proposed at EVALITA 2020 and provides an overview of the participants and systems whose descriptions and results are reported in these Proceedings. The EVALITA 2020 edition, held online on December 17th due to the COVID-19 pandemic, counts 14 different tasks, grouped into five research areas (tracks) according to their objectives and characteristics: (i) Affect, Hate, and Stance; (ii) Creativity and Style; (iii) New Challenges in Long-standing Tasks; (iv) Semantics and Multimodality; (v) Time and Diachrony. Participation in this edition was high, with 51 groups whose members are affiliated with institutions in 14 countries.
Although EVALITA is generally promoted within and targeted at the Italian research community, this edition saw international participation, thanks in part to the fact that several Italian researchers working in different countries contributed to the organization of the tasks or participated in them as authors. This overview is organized as follows: Section 2 briefly describes the tasks in the various areas. Section 3 discusses participation in the workshop from several angles, from research area to author affiliation. Section 4 describes the criteria used to assign the best-system-across-tasks award, decided by an ad hoc committee starting from the suggestions of task organizers and reviewers. Finally, Section 5 reflects on the obtained results and on the future of the workshop.
99 citations
en
Computer Science
Natural language processing in low back pain and spine diseases: A systematic review
Luca Bacco, F. Russo, L. Ambrosio
et al.
Natural Language Processing (NLP) is a discipline at the intersection of Computer Science (CS), Artificial Intelligence (AI), and Linguistics that leverages unstructured, human-interpretable (natural) language text. In recent years, it has also gained momentum in health-related applications and research. Although preliminary, studies applying NLP methodologies to Low Back Pain (LBP) and other related spine disorders have been reported in the literature over the last few years. This motivated us to systematically review the literature indexed in two major public databases, PubMed and Scopus. To do so, we first formulated our research question following the PICO guidelines. Then, we followed a PRISMA-like protocol, performing a search query including terminologies of both the technical (e.g., natural language and computational linguistics) and clinical (e.g., lumbar and spine surgery) domains. We collected 221 non-duplicated studies, 16 of which were eligible for our analysis. In this work, we present these studies divided into sub-categories, from the points of view of both tasks and exploited models. Furthermore, we report a detailed description of the techniques used to extract and process textual features and of the several evaluation metrics used to assess the performance of the NLP models. What is clear from our analysis, however, is that additional studies on larger datasets are needed to better define the role of NLP in the care of patients with spinal disorders.
A comprehensive review on Arabic word sense disambiguation for natural language processing applications
S. Kaddoura, Rowanda D. Ahmed, J. D.
In communication, textual data are a vital attribute. In all languages, the meaning of ambiguous or polysemous words changes depending on the context in which they are used. Determining an ambiguous word's correct meaning is a well-known challenging task in natural language processing (NLP). Word sense disambiguation (WSD) is an NLP process to analyze and determine the correct meaning of polysemous words in a text. WSD is a computational linguistics task that automatically identifies a polysemous word's set of senses. Based on the context in which a word appears, WSD recognizes and tags the word with its correct, a priori known meaning. Semitic languages like Arabic pose even greater challenges than other languages, since Arabic lacks diacritics and standardization and suffers from a massive shortage of available resources. Recently, many approaches and techniques have been suggested to solve the word-ambiguity dilemma in many different ways and for several languages. In this review paper, an extensive survey of research works seeking to solve Arabic word sense disambiguation is presented, together with the existing AWSD datasets.
21 citations
en
Computer Science
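A classical baseline for WSD (not specific to the Arabic systems surveyed above) is the simplified Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the ambiguous word's context. A minimal sketch; the glosses and sense identifiers in any usage are made-up toy data, not entries from a real AWSD resource.

```python
def simplified_lesk(context_words, sense_glosses):
    """Return the sense id whose gloss overlaps most with the context.

    context_words: list of tokens surrounding the ambiguous word
    sense_glosses: dict mapping sense id -> gloss string
    """
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        # Overlap = number of shared (lowercased) words.
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best
```

Real systems replace raw word overlap with embeddings or supervised classifiers, but this overlap heuristic remains a standard point of comparison in WSD surveys.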
A Review of Recent Work in Transfer Learning and Domain Adaptation for Natural Language Processing of Electronic Health Records
Egoitz Laparra, Aurelie Mascio, S. Velupillai
et al.
Summary Objectives: We survey recent work in biomedical NLP on building more adaptable or generalizable models, with a focus on work dealing with electronic health record (EHR) texts, to better understand recent trends in this area and identify opportunities for future research. Methods: We searched PubMed, the Institute of Electrical and Electronics Engineers (IEEE), the Association for Computational Linguistics (ACL) anthology, the Association for the Advancement of Artificial Intelligence (AAAI) proceedings, and Google Scholar for the years 2018-2020. We reviewed abstracts to identify the most relevant and impactful work, and manually extracted data points from each of these papers to characterize the types of methods and tasks that were studied, in which clinical domains, and current state-of-the-art results. Results: The ubiquity of pre-trained transformers in clinical NLP research has contributed to an increase in domain adaptation and generalization-focused work that uses these models as the key component. Most recently, work has started to train biomedical transformers and to extend the fine-tuning process with additional domain adaptation techniques. We also highlight recent research in cross-lingual adaptation, as a special case of adaptation. Conclusions: While pre-trained transformer models have led to some large performance improvements, general domain pre-training does not always transfer adequately to the clinical domain due to its highly specialized language. There is also much work to be done in showing that the gains obtained by pre-trained transformers are beneficial in real world use cases. The amount of work in domain adaptation and transfer learning is limited by dataset availability and creating datasets for new domains is challenging. The growing body of research in languages other than English is encouraging, and more collaboration between researchers across the language divide would likely accelerate progress in non-English clinical NLP.
Embeddings in Natural Language Processing: Theory and Advances in Vector Representations of Meaning
José Camacho-Collados, Mohammad Taher Pilehvar
Word vector representations have a long tradition in several research fields, such as cognitive science or computational linguistics. They have been used to represent the meaning of various units of natural languages, including, among others, words, phrases, and sentences. Before the deep learning tsunami, count-based vector space models had been successfully used in computational linguistics to represent the semantics of natural languages. However, the rise of neural networks in NLP popularized the use of word embeddings, which are now applied as pre-trained vectors in most machine learning architectures. This book, written by Mohammad Taher Pilehvar and Jose Camacho-Collados, provides a comprehensive and easy-to-read review of the theory and advances in vector models for NLP, focusing especially on semantic representations and their applications. It is a great introduction to different types of embeddings and the background and motivations behind them. In this sense, the authors adequately present the most relevant concepts and approaches that have been used to build vector representations. They also keep track of the most recent advances in this vibrant and fast-evolving area of research, discussing cross-lingual representations and current language models based on the Transformer. Therefore, this is a useful book for researchers interested in computational methods for semantic representations and artificial intelligence. Although some basic knowledge of machine learning may be necessary to follow a few topics, the book includes clear illustrations and explanations, which make it accessible to a wide range of readers. Apart from the preface and the conclusions, the book is organized into eight chapters. In the first two, the authors introduce some of the core ideas of NLP and artificial neural networks, respectively, discussing several concepts that will be useful throughout the book.
Then, Chapters 3 to 6 present different types of vector representations at the lexical level (word embeddings, graph embeddings, sense embeddings, and contextualized embeddings), followed by a brief chapter (7) about sentence and document embeddings. For each specific topic, the book includes methods and data sets to assess the quality of the embeddings. Finally, Chapter 8 raises ethical issues involved
49 citations
en
Computer Science
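The count-based vector space models the book contrasts with neural embeddings can be illustrated in a few lines: represent a word by its co-occurrence counts within a fixed window, then compare words with cosine similarity. A toy sketch; the corpus in any usage is invented for illustration.

```python
import math
from collections import Counter

def cooccurrence_vector(target, corpus, window=2):
    """Count-based word vector: co-occurrence counts within +/-window
    tokens of every occurrence of `target` in the corpus."""
    counts = Counter()
    for sent in corpus:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                for j in range(lo, min(hi, len(toks))):
                    if j != i:
                        counts[toks[j]] += 1
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Words appearing in similar contexts ("cat" and "dog" next to "chased the mouse") end up with similar vectors, which is exactly the distributional hypothesis that both count-based models and neural embeddings exploit.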
Natural Language Processing for Cybersecurity: Automating Threat Report Analysis
Ehimah Obuse, Noah Ayanbode, Emmanuel Cadet
et al.
The rapid growth of cyber threats has led to an exponential increase in threat intelligence reports, incident logs, and security advisories, creating significant challenges for timely and effective analysis. Manual examination of these unstructured text sources is labor-intensive, error-prone, and often unable to keep pace with the speed of emerging threats. Natural Language Processing (NLP) offers a transformative approach to automating threat report analysis by leveraging advanced computational linguistics and machine learning techniques to extract, classify, and contextualize critical security information. This paper presents a comprehensive study of NLP-based methods for cybersecurity threat report analysis, emphasizing their capacity to enhance situational awareness, accelerate incident response, and support proactive defense strategies. We examine key NLP tasks applicable to cybersecurity, including named entity recognition for extracting indicators of compromise (IOCs), topic modeling for identifying threat themes, sentiment analysis for assessing attacker intent, and relation extraction for mapping threat actor behaviors. State-of-the-art models such as transformer-based architectures (e.g., BERT, RoBERTa, and domain-specific adaptations like CyberBERT) are evaluated for their performance in parsing and understanding complex, jargon-rich security texts. Empirical experiments on benchmark datasets, including threat intelligence feeds, MITRE ATT&CK descriptions, and open-source cyber incident reports, demonstrate that NLP-driven pipelines outperform traditional keyword-matching systems in accuracy, scalability, and adaptability to novel threats. We further discuss the integration of NLP systems with Security Information and Event Management (SIEM) platforms, enabling automated alert generation, correlation of threat indicators, and prioritization of remediation efforts.
Despite these advantages, challenges remain in handling data heterogeneity, preserving contextual accuracy, and mitigating model biases. We explore emerging research directions, including low-resource domain adaptation, explainable NLP for transparent decision-making, and multilingual processing to expand threat coverage across diverse linguistic sources. The findings underscore the strategic importance of NLP in modern cybersecurity operations, highlighting its role in transforming unstructured threat intelligence into actionable, real-time security insights that strengthen defensive postures against evolving cyber adversaries.
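The keyword/pattern-matching baseline that NLP pipelines are compared against above can be sketched with regular expressions over IOC types. The patterns below are deliberately naive (for instance, the IPv4 pattern accepts out-of-range octets), and all indicator values used with it are invented examples, not real threat data.

```python
import re

# Hypothetical minimal patterns; production pipelines use curated rule sets
# or trained NER models rather than these toy expressions.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),   # naive: allows 999.0.0.1
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),            # 32 hex chars
    "domain": re.compile(r"\b[a-z0-9-]+\.(?:com|net|org|io)\b"),
}

def extract_iocs(report_text):
    """Return a dict mapping indicator type -> list of matches in the text."""
    return {name: pat.findall(report_text) for name, pat in IOC_PATTERNS.items()}
```

The brittleness of such patterns (no context, fixed TLD list, false positives on version numbers) is precisely why the surveyed transformer-based extractors outperform keyword matching on jargon-rich reports.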
CLD² Language Documentation Meets Natural Language Processing for Revitalising Endangered Languages
R. Zariquiey, Arturo Oncevay, Javier Vera
Language revitalisation should not be understood as a direct outcome of language documentation, which is mainly focused on the creation of language repositories. Natural language processing (NLP) offers the potential to complement and exploit these repositories through the development of language technologies that may contribute to improving the vitality status of endangered languages. In this paper, we discuss the current state of the interaction between language documentation and computational linguistics, present a diagnosis of how the outputs of recent documentation projects for endangered languages are underutilised by the NLP community, and discuss how the situation could change from both the documentary linguistics and NLP perspectives. All this is introduced as a bridging paradigm dubbed Computational Language Documentation and Development (CLD²). CLD² calls for (1) the inclusion of NLP-friendly annotated data as a deliverable of future language documentation projects; and (2) the exploitation of language documentation databases by the NLP community to promote the computerization of endangered languages, as one way to contribute to their revitalization.
Towards Reproducible Machine Learning Research in Natural Language Processing
Ana Lucic, Maurits J. R. Bleeker, Samarth Bhargav
et al.
While recent progress in the field of ML has been significant, the reproducibility of these cutting-edge results is often lacking, with many submissions omitting the information necessary to reproduce them. Despite proposals such as the Reproducibility Checklist and reproducibility criteria at several major conferences, the reflex of carrying out research with reproducibility in mind is not yet widespread in the broader ML community. We propose this tutorial as a gentle introduction to ensuring reproducible research in ML, with a specific emphasis on computational linguistics and NLP. We also provide a framework for using reproducibility as a teaching tool in university-level computer science programs.
6 citations
en
Computer Science
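One of the most basic reproducibility practices such a tutorial covers is controlling randomness. A minimal sketch of seed management in plain Python; real ML stacks additionally need to seed numpy/torch and set framework-specific determinism flags, which this sketch deliberately omits.

```python
import os
import random

def set_seed(seed=42):
    """Minimal seeding routine for reproducible runs.
    Real experiments must also seed numpy, torch, etc., and record
    the seed alongside results so others can reproduce them."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

# Two runs with the same seed produce identical "random" draws.
set_seed(123)
first = [random.random() for _ in range(3)]
set_seed(123)
second = [random.random() for _ in range(3)]
```

Reporting results over multiple seeds (rather than one lucky seed) is the companion practice: the seed makes a run repeatable, the seed sweep makes the claim robust.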
Category Theory for Quantum Natural Language Processing
Alexis Toumi
This thesis introduces quantum natural language processing (QNLP) models based on a simple yet powerful analogy between computational linguistics and quantum mechanics: grammar as entanglement. The grammatical structure of text and sentences connects the meaning of words in the same way that entanglement structure connects the states of quantum systems. Category theory allows us to make this language-to-qubit analogy formal: it is a monoidal functor from grammar to vector spaces. We turn this abstract analogy into a concrete algorithm that translates the grammatical structure onto the architecture of parameterised quantum circuits. We then use a hybrid classical-quantum algorithm to train the model so that evaluating the circuits computes the meaning of sentences in data-driven tasks. The implementation of QNLP models motivated the development of DisCoPy (Distributional Compositional Python), the toolkit for applied category theory of which the first chapter gives a comprehensive overview. String diagrams are the core data structure of DisCoPy; they allow us to reason about computation at a high level of abstraction. We show how they can encode both grammatical structures and quantum circuits, but also logical formulae, neural networks, or arbitrary Python code. Monoidal functors allow us to translate these abstract diagrams into concrete computation, interfacing with optimised task-specific libraries. The second chapter uses DisCoPy to implement QNLP models as parameterised functors from grammar to quantum circuits. It gives a first proof of concept for the more general concept of functorial learning: generalising machine learning from functions to functors by learning from diagram-like data. In order to learn optimal functor parameters via gradient descent, we introduce the notion of diagrammatic differentiation: a graphical calculus for computing the gradients of parameterised diagrams.
4 citations
en
Computer Science, Mathematics
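The "functor from grammar to vector spaces" idea can be illustrated in miniature, without DisCoPy or quantum circuits: nouns map to vectors, function-type words (e.g. an intransitive verb) map to linear maps, and composing words along the grammar wire becomes tensor contraction (here a plain matrix-vector product). The 2-dimensional meanings below are purely illustrative toy values and not the thesis's actual construction.

```python
def contract(matrix, vector):
    """Contract a 2-wire word (a linear map) with a 1-wire word (a vector):
    a plain matrix-vector product standing in for the functor's action
    on a grammatical composition."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Toy semantics: a noun is a vector, an intransitive verb is a matrix,
# and the meaning of "noun verb" is the contraction verb @ noun.
alice = [1.0, 0.0]              # hypothetical noun meaning
runs = [[0.0, 1.0],
        [1.0, 0.0]]             # hypothetical verb meaning, for illustration
sentence_meaning = contract(runs, alice)
```

In the thesis, the same functorial picture is carried further: the matrix is replaced by a parameterised quantum circuit, and the contraction by circuit evaluation, so that the grammar literally wires up the entanglement structure.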