Over the last several years, the field of natural language processing has been propelled forward by an explosion in the use of deep learning models. This article provides a brief introduction to the field and a quick overview of deep learning architectures and methods. It then sifts through the plethora of recent studies and summarizes a large assortment of relevant contributions. Analyzed research areas include several core linguistic processing issues in addition to many applications of computational linguistics. A discussion of the current state of the art is then provided along with recommendations for future research in the field.
Amir Feder, Katherine A. Keith, Emaad A. Manzoor
et al.
Abstract A fundamental goal of scientific research is to learn about causal relationships. However, despite its critical role in the life and social sciences, causality has not had the same importance in Natural Language Processing (NLP), which has traditionally placed more emphasis on predictive tasks. This distinction is beginning to fade, with an emerging area of interdisciplinary research at the convergence of causal inference and language processing. Still, research on causality in NLP remains scattered across domains without unified definitions, benchmark datasets and clear articulations of the challenges and opportunities in the application of causal inference to the textual domain, with its unique properties. In this survey, we consolidate research across academic areas and situate it in the broader NLP landscape. We introduce the statistical challenge of estimating causal effects with text, encompassing settings where text is used as an outcome, treatment, or to address confounding. In addition, we explore potential uses of causal inference to improve the robustness, fairness, and interpretability of NLP models. We thus provide a unified overview of causal inference for the NLP community.
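One of the settings described above, text used to address confounding, can be sketched with a stylized stratification estimator. Everything below is synthetic and illustrative: the "text stratum" stands in for a confounder proxy derived from text (e.g., the output of a style classifier), not any specific method from the survey.

```python
# Stratify units by a text-derived confounder proxy, then average the
# within-stratum treatment/control outcome differences, weighting each
# stratum by its size. All numbers are synthetic.
def adjusted_effect(units):
    # units: iterable of (treated: bool, outcome: float, text_stratum: str)
    strata = {}
    for treated, outcome, stratum in units:
        strata.setdefault(stratum, {True: [], False: []})[treated].append(outcome)
    effects, weights = [], []
    for groups in strata.values():
        if groups[True] and groups[False]:  # need both arms in the stratum
            diff = (sum(groups[True]) / len(groups[True])
                    - sum(groups[False]) / len(groups[False]))
            effects.append(diff)
            weights.append(len(groups[True]) + len(groups[False]))
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)

units = [
    (True, 3.0, "formal"), (False, 1.0, "formal"),
    (True, 5.0, "casual"), (False, 4.0, "casual"),
]
effect = adjusted_effect(units)
```

Without the stratification, a naive difference of means would mix the two strata; here each stratum contributes its own contrast before averaging.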
Marcos Vinícius Treviso, Tianchu Ji, Ji-Ung Lee
et al.
Abstract Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim both to provide guidance for conducting NLP under limited resources and to point toward promising research directions for developing more efficient methods.
Minh Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh
et al.
We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines over sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing while maintaining competitive performance for tokenization, multi-word token expansion, and lemmatization over 90 Universal Dependencies treebanks. Despite the use of a large pretrained transformer, our toolkit is still efficient in memory usage and speed. This is achieved by our novel plug-and-play mechanism with Adapters, where a multilingual pretrained transformer is shared across pipelines for different languages. Our toolkit, along with pretrained models and code, is publicly available at: https://github.com/nlp-uoregon/trankit. A demo website for our toolkit is also available at: http://nlp.uoregon.edu/trankit. Finally, a demo video for Trankit is available at: https://youtu.be/q0KGP3zGjGc.
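The plug-and-play sharing idea can be illustrated with a toy sketch. This is not Trankit's actual API: the classes below are invented to show how one heavyweight encoder object can be shared by every pipeline while only a small per-language adapter differs.

```python
# Illustrative sketch of adapter-based weight sharing; not Trankit's API.
class SharedEncoder:
    def encode(self, tokens):
        # Stand-in for a multilingual pretrained transformer.
        return [hash(t) % 100 / 100.0 for t in tokens]

class Adapter:
    def __init__(self, shift):
        self.shift = shift  # tiny language-specific parameters

    def apply(self, features):
        return [f + self.shift for f in features]

class Pipeline:
    # One encoder instance is shared by every pipeline; only the adapter
    # differs per language, keeping memory use roughly constant.
    encoder = SharedEncoder()

    def __init__(self, lang, shift):
        self.lang = lang  # in a real system this would select the adapter
        self.adapter = Adapter(shift)

    def __call__(self, tokens):
        return self.adapter.apply(Pipeline.encoder.encode(tokens))

en = Pipeline("english", 0.1)
vi = Pipeline("vietnamese", 0.2)
assert en.encoder is vi.encoder  # the heavy component is shared
```

Adding a language here means adding only a small adapter, which is the intuition behind keeping memory usage low as the number of supported languages grows.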
Multi-task learning (MTL) has become increasingly popular in natural language processing (NLP) because it improves the performance of related tasks by exploiting their commonalities and differences. Nevertheless, how multi-task learning should be implemented based on the relatedness of the training tasks is still not well understood. In this survey, we review recent advances in multi-task learning methods in NLP, with the aim of summarizing them into two general multi-task training methods based on their task relatedness: (i) joint training and (ii) multi-step training. We present examples in various NLP downstream applications, summarize the task relationships, and discuss future directions of this promising topic.
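The distinction between the two regimes can be sketched on a toy one-parameter "model" with synthetic quadratic task losses; nothing here corresponds to a specific method from the survey.

```python
# Toy contrast between joint and multi-step multi-task training.
# The "model" is one scalar w; task i has loss (w - t_i)^2.
def grad_step(w, tasks, weights, lr=0.1):
    # One gradient update on the weighted sum of task losses.
    grad = sum(a * 2 * (w - t) for a, t in zip(weights, tasks))
    return w - lr * grad

def joint_training(w, tasks, weights, steps=100):
    # Joint training: optimize all task losses simultaneously.
    for _ in range(steps):
        w = grad_step(w, tasks, weights)
    return w

def multi_step_training(w, tasks, steps=100):
    # Multi-step training: optimize each task in sequence.
    for t in tasks:
        for _ in range(steps):
            w = grad_step(w, [t], [1.0])
    return w

w_joint = joint_training(0.0, tasks=[1.0, 3.0], weights=[0.5, 0.5])
w_seq = multi_step_training(0.0, tasks=[1.0, 3.0])
```

In this toy run, joint training settles between the two task optima (near 2.0), while the sequential run ends at the last task's optimum (near 3.0), illustrating why task ordering matters in multi-step training.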
David Fraile Navarro, K. Ijaz, Dana Rezazadegan
et al.
BACKGROUND Natural Language Processing (NLP) applications have developed over the past years in various fields, including applications to clinical free text for named entity recognition and relation extraction. However, development has been so rapid in the last few years that no current overview of the field exists. Moreover, it is unclear how these models and tools have been translated into clinical practice. We aim to synthesize and review these developments. METHODS We reviewed literature from 2010 to date, searching PubMed, Scopus, the Association for Computational Linguistics (ACL), and Association for Computing Machinery (ACM) libraries for studies of NLP systems performing general-purpose (i.e., not disease- or treatment-specific) information extraction and relation extraction tasks in unstructured clinical text (e.g., discharge summaries). RESULTS We included 94 studies in the review, 30 of which were published in the last three years. Machine learning methods were used in 68 studies, rule-based methods in 5 studies, and both in 22 studies. 63 studies focused on Named Entity Recognition, 13 on Relation Extraction, and 18 performed both. The most frequently extracted entities were "problem", "test", and "treatment". 72 studies used public datasets and 22 used proprietary datasets alone. Only 14 studies clearly defined a clinical or information task to be addressed by the system, and just three reported use of their system outside the experimental setting. Only 7 studies shared a pre-trained model and only 8 an available software tool. DISCUSSION Machine learning-based methods have dominated the NLP field on information extraction tasks. More recently, Transformer-based language models are taking the lead and showing the strongest performance. However, these developments are mostly based on a few datasets and generic annotations, with very few real-world use cases.
This raises questions about the generalizability of the findings and their translation into practice, and highlights the need for robust clinical evaluation.
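The dominant entity types reported above can be illustrated with a toy dictionary-based tagger; real systems in the review use trained models, and the lexicon and example sentence below are invented.

```python
# Toy dictionary lookup for the "problem"/"test"/"treatment" entity types
# most frequently extracted by the reviewed systems. Illustrative only.
LEXICON = {
    "hypertension": "problem",
    "chest x-ray": "test",
    "metformin": "treatment",
}

def tag_entities(text):
    # Return (offset, surface form, entity type), ordered by position.
    text_l = text.lower()
    return sorted(
        (text_l.index(term), term, label)
        for term, label in LEXICON.items()
        if term in text_l
    )

ents = tag_entities("Patient with hypertension; chest X-ray ordered, started on metformin.")
```

A lookup like this is roughly what the small rule-based minority of studies does; the machine-learning majority replaces the lexicon with a learned sequence labeler.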
ABSTRACT The challenge of optimising multimodal functions within high‐dimensional domains constitutes a notable difficulty in evolutionary computation research. Addressing this issue, this study introduces the Deep Backtracking Bare‐Bones Particle Swarm Optimisation (DBPSO) algorithm, an innovative approach built upon the integration of the Deep Memory Storage Mechanism (DMSM) and the Dynamic Memory Activation Strategy (DMAS). The DMSM enhances the memory retention for the globally optimal particle, promoting interaction between standard particles and their historically optimal counterparts. In parallel, DMAS assures the updated position of the globally optimal particle is appropriately aligned with the deep memory repository. The efficacy of DBPSO was rigorously assessed through a series of simulations employing the CEC2017 benchmark suite. A comparative analysis juxtaposed DBPSO's performance against five contemporary evolutionary algorithms across two experimental conditions: Dimension‐50 and Dimension‐100. In the 50D trials, DBPSO attained an average ranking of 2.03, whereas in the 100D scenarios, it improved to an average ranking of 1.9. Further examination utilising the CEC2019 benchmark functions revealed DBPSO's robustness, securing four first‐place finishes, three second‐place standings, and three third‐place positions, culminating in an unmatched average ranking of 1.9 across all algorithms. These empirical results corroborate DBPSO's proficiency in delivering precise solutions for complex, high‐dimensional optimisation challenges.
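A minimal sketch of bare-bones PSO extended with a history of global optima conveys the flavor of the memory mechanisms described above, though it is not the authors' DBPSO implementation: the recall probability and all constants are invented, and the memory strategy is simplified.

```python
import random

# Bare-bones PSO with a "deep memory" of past global optima that is
# occasionally recalled as the social attractor. Illustrative sketch only.
def bbpso(f, dim=2, swarm=20, iters=200, p_memory=0.1, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm)]
    pbest = [p[:] for p in pos]           # personal bests
    pfit = [f(p) for p in pos]
    g = min(range(swarm), key=lambda i: pfit[i])
    gbest, gfit = pbest[g][:], pfit[g]    # global best
    memory = [gbest[:]]                   # history of global optima
    for _ in range(iters):
        for i in range(swarm):
            # With small probability recall an older optimum, loosely in
            # the spirit of a dynamic memory activation strategy.
            attractor = rng.choice(memory) if rng.random() < p_memory else gbest
            for d in range(dim):
                mu = (pbest[i][d] + attractor[d]) / 2
                sigma = abs(pbest[i][d] - attractor[d]) + 1e-12
                pos[i][d] = rng.gauss(mu, sigma)  # bare-bones sampling
            fit = f(pos[i])
            if fit < pfit[i]:
                pbest[i], pfit[i] = pos[i][:], fit
                if fit < gfit:
                    gbest, gfit = pos[i][:], fit
                    memory.append(gbest[:])
    return gbest, gfit

best, val = bbpso(lambda x: sum(v * v for v in x))  # sphere function
```

In bare-bones PSO there are no velocities: each coordinate is resampled from a Gaussian centered between the personal best and the social attractor, so the memory list changes behavior only through which attractor is recalled.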
Abstract Large Language Models have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case around the acronym RELIES, which encapsulates six major facets where linguistics contributes to NLP: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and the Study of language. This list is not exhaustive, nor is linguistics the main point of reference for every effort under these themes; but at a macro level, these facets highlight the enduring importance of studying machine systems vis-à-vis systems of human language.
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. This adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capabilities like instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making it resource-intensive. To address these issues, previous works have leveraged the backbone LLMs to synthesize general-purpose, caption-style alignment data. In this paper, we propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds. We further extend our approach to multi-audio scenarios, enabling the model to either explain differences between audio inputs or produce unified captions that describe all inputs, thereby enhancing audio-language alignment. We refer to the entire ALLM training framework as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance on audio understanding and reasoning benchmarks, as well as instruction-following skills. Moreover, incorporating multi-audio training further enhances the model's comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to developing ALLMs.
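The present-versus-absent contrast can be sketched as a small data-generation routine. The sound inventory, question templates, and sampling ratio below are all invented for illustration and are not BALSa's actual recipe.

```python
import random

# For each caption's sound events, generate "Yes" questions about present
# sounds and an equal number of "No" questions about absent ones, so a
# model trained on the pairs must learn to deny sounds it cannot hear.
SOUND_INVENTORY = ["dog barking", "rain", "car horn", "doorbell", "applause"]

def make_contrastive_pairs(present_events, rng):
    pairs = []
    for ev in present_events:
        pairs.append((f"Is there {ev} in the audio?", "Yes."))
    absent = [e for e in SOUND_INVENTORY if e not in present_events]
    for ev in rng.sample(absent, k=min(len(present_events), len(absent))):
        pairs.append((f"Is there {ev} in the audio?", "No."))
    return pairs

pairs = make_contrastive_pairs(["rain", "car horn"], random.Random(0))
```

The multi-audio extension described above would apply the same idea across inputs, asking which differences or shared events hold between two clips rather than within one.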
Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our development of CUPE - a lightweight model that captures key phoneme features in just 120 milliseconds, about one phoneme's length. CUPE processes short, fixed-width windows independently and, despite fewer parameters than current approaches, achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages. Our extensive evaluation through supervised and self-supervised training on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus, demonstrates strong cross-lingual generalization and reveals that effective universal speech processing is possible through modeling basic acoustic patterns within phoneme-length windows.
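The fixed-width windowing described above can be sketched directly. The 16 kHz sample rate and the non-overlapping hop are assumptions for illustration, not values taken from the paper.

```python
# Split a waveform into independent ~120 ms frames, each processed on its
# own, which is the core of the phoneme-length windowing idea.
def frame_signal(samples, sample_rate=16000, window_ms=120):
    width = int(sample_rate * window_ms / 1000)  # 1920 samples at 16 kHz
    return [samples[i:i + width]
            for i in range(0, len(samples) - width + 1, width)]

one_second = [0.0] * 16000  # a dummy 1 s waveform
frames = frame_signal(one_second)
```

Because each frame is roughly one phoneme long and frames are processed independently, no representation can leak context from neighboring phonemes, which is the property the model is after.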
Large language models (LLMs) are the foundation of many AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our results provide insights to inform future advancements in LLM design and development.
Summary Objectives: Large language models (LLMs) are revolutionizing the natural language processing (NLP) landscape within healthcare, prompting the need to synthesize the latest advancements and their diverse medical applications. We attempt to summarize the current state of research in this rapidly evolving space. Methods: We conducted a review of the most recent studies on biomedical NLP facilitated by LLMs, sourcing literature from PubMed, the Association for Computational Linguistics Anthology, IEEE Xplore, and Google Scholar (the latter particularly for preprints). Given the ongoing exponential growth in LLM-related publications, our survey was inherently selective. We attempted to abstract key findings in terms of (i) LLMs customized for medical texts, and (ii) the type of medical text being leveraged by LLMs, namely medical literature, electronic health records (EHRs), and social media. In addition to technical details, we touch upon topics such as privacy, bias, interpretability, and equitability. Results: We observed that while general-purpose LLMs (e.g., GPT-4) are most popular, there is a growing trend in training or customizing open-source LLMs for specific biomedical texts and tasks. Several promising open-source LLMs are currently available, and applications involving EHRs and biomedical literature are more prominent relative to noisier data sources such as social media. For supervised classification and named entity recognition tasks, traditional (encoder-only) transformer-based models still outperform new-age LLMs, and the latter are typically suited for few-shot settings and generative tasks such as summarization. There is still a paucity of research on evaluation, bias, privacy, reproducibility, and equitability of LLMs. Conclusions: LLMs have the potential to transform NLP tasks within the broader medical domain.
While technical progress continues, biomedical application-focused research must prioritize aspects not necessarily related to performance, such as task-oriented evaluation, bias, and equitable use.
Mohamed Abdalla, Jan Philip Wahle, Terry Ruas
et al.
Recent advances in deep learning methods for natural language processing (NLP) have created new business opportunities and made NLP research critical for industry development. Since industry is one of the big players in NLP, alongside governments and universities, it is important to track its influence on research. In this study, we seek to quantify and characterize industry presence in the NLP community over time. Using a corpus with comprehensive metadata of 78,187 NLP publications and 701 resumes of NLP publication authors, we explore industry presence in the field since the early 90s. We find that industry presence among NLP authors was steady before a steep increase over the past five years (180% growth from 2017 to 2022). A few companies account for most of the publications and provide funding to academic researchers through grants and internships. Our study shows that the presence and impact of industry on natural language processing research are significant and fast-growing. This work calls for increased transparency of industry influence in the field.
Abstract: Our study concerns the scientific periodical Annales Africaines de Médecine. Through its abundant scientific output, this periodical is a true vehicle for the dissemination of scientific information and a tool for the valorization of knowledge, building a bridge between research and application. The periodical carried the title Annales de la Faculté de Médecine from its founding until its disappearance, before being relaunched in 2007 as Annales Africaines de Médecine. The Annales Africaines de Médecine has become a showcase for various groups of researchers and a means of communication for university institutions, which highlight their investigative work through the publication of their articles. The intellectual production of an article results from an effort of reflection and creativity that is the original work of the author, the scientific researcher, in short of all those who think, reflect, and imagine in order to create. The material production of the periodical is the work of the publisher and the printer, whose rotary presses produce and print the publication. While it is true that milk, flour, sugar, and the like are essential goods for satisfying people's material and vital needs, it is no less true that the periodical, whether scientific, technical, or religious, is another good of considerable importance for a nation's development. It is nourishment for the mind, indispensable to the development of mental and intellectual structures.
Keywords: Periodical, Serial publications, Library science, University libraries, Faculty of medicine
This position paper investigates the potential of integrating insights from language impairment research and its clinical treatment to develop human-inspired learning strategies and evaluation frameworks for language models (LMs). We inspect the theoretical underpinnings of some influential linguistically motivated training approaches derived from neurolinguistics and, particularly, aphasiology, aimed at enhancing the recovery and generalization of linguistic skills in aphasia treatment, with a primary focus on those targeting the syntactic domain. We highlight how these insights can inform the design of rigorous assessments for LMs, specifically in their handling of complex syntactic phenomena, as well as their implications for developing human-like learning strategies, aligning with efforts to create more sustainable and cognitively plausible natural language processing (NLP) models.
Leonidas Gee, Andrea Zugarini, Leonardo Rigutini
et al.
Real-world business applications require a trade-off between language model performance and size. We propose a new method for model compression that relies on vocabulary transfer. We evaluate the method on various vertical domains and downstream tasks. Our results indicate that vocabulary transfer can be effectively used in combination with other compression techniques, yielding a significant reduction in model size and inference time while marginally compromising on performance.
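One common form of vocabulary transfer initializes embeddings for a new, smaller in-domain vocabulary from the original model's embeddings; a frequent heuristic is averaging the old subtoken vectors of each new token. The sketch below assumes that heuristic (the paper's exact procedure may differ), with a toy tokenizer and toy vectors.

```python
# Initialize embeddings for a new vocabulary by averaging each new
# token's old-vocabulary subtoken vectors. Illustrative heuristic only.
def transfer_embeddings(old_emb, tokenize_with_old_vocab, new_vocab):
    new_emb = {}
    for token in new_vocab:
        pieces = tokenize_with_old_vocab(token)
        vecs = [old_emb[p] for p in pieces if p in old_emb]
        if vecs:
            dim = len(vecs[0])
            new_emb[token] = [sum(v[d] for v in vecs) / len(vecs)
                              for d in range(dim)]
    return new_emb

# Toy example: "biomedical" is one token in the new in-domain vocabulary,
# but two subtokens under the original tokenizer.
old_emb = {"bio": [1.0, 0.0], "##medical": [0.0, 1.0]}
new_emb = transfer_embeddings(old_emb, lambda t: ["bio", "##medical"],
                              ["biomedical"])
```

A smaller vocabulary shortens tokenized sequences in the target domain, which is one route to the size and inference-time reductions the abstract reports.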
A myriad of different Large Language Models (LLMs) face a common challenge in contextually analyzing table question-answering tasks. These challenges arise from (1) finite context windows for large tables, (2) multi-faceted discrepancies amongst tokenization patterns against cell boundaries, and (3) various limitations stemming from data confidentiality in the process of using external models such as gpt-3.5-turbo. We propose a cooperative game dubbed "HiddenTables" as a potential resolution to this challenge. In essence, "HiddenTables" is played between the code-generating LLM "Solver" and the "Oracle", which evaluates the ability of the LLM agents to solve Table QA tasks. This game is based on natural language schemas and, importantly, ensures the security of the underlying data. We provide evidential experiments on a diverse set of tables that demonstrate an LLM's collective inability to generalize and perform on complex queries, handle compositional dependencies, and align natural language to programmatic commands when concrete table schemas are provided. Unlike encoder-based models, we have pushed the boundaries of "HiddenTables" so that it is not limited by the number of rows; we therefore exhibit improved efficiency in prompt and completion tokens. Our infrastructure has spawned a new dataset, "PyQTax", that spans 116,671 question-table-answer triplets and provides additional fine-grained breakdowns and labels for varying question taxonomies. Therefore, in tandem with our academic contributions regarding LLMs' deficiency in TableQA tasks, "HiddenTables" is a tactile manifestation of how LLMs can interact with massive datasets while ensuring data security and minimizing generation costs.
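The division of labor the game enforces can be sketched with a toy Oracle that exposes only a schema and execution results, never the rows themselves; the class names and the example schema below are invented for illustration.

```python
# Toy Solver/Oracle split: the Oracle holds the data privately and only
# reveals (a) the schema and (b) the result of running the Solver's
# program, so the underlying rows never leave the Oracle.
class Oracle:
    def __init__(self, rows, schema):
        self._rows = rows      # private: never shown to the Solver
        self.schema = schema   # public: column names only

    def execute(self, program):
        # Run the Solver's program against the hidden table.
        return program(self._rows)

def solver(schema):
    # A code-generating Solver would emit this program from the question
    # plus the schema alone; here the "generated" program is hard-coded.
    assert "price" in schema
    return lambda rows: sum(r["price"] for r in rows)

oracle = Oracle([{"price": 3}, {"price": 4}], schema=["item", "price"])
answer = oracle.execute(solver(oracle.schema))
```

Because the Solver writes code against the schema rather than reading cells, the scheme sidesteps both the context-window limit on large tables and the confidentiality concern of sending rows to an external model.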
Natural language processing (NLP) is a branch of artificial intelligence, which combines computational linguistics, machine learning, and deep learning models to process human language. Although NLP usage has surged across various industries in recent years, NLP has not been widely evaluated and utilized to support drug development. To demonstrate how advanced NLP can expedite the extraction and analyses of information to help address clinical pharmacology questions, inform clinical trial designs, and support drug development, three use cases are described in this article: (1) dose optimization strategy in oncology, (2) common covariates on pharmacokinetic (PK) parameters in oncology, and (3) physiologically‐based PK (PBPK) analyses for regulatory review and product label. The NLP workflow includes (1) preparation of source files, (2) NLP model building, and (3) automation of data extraction. The Clinical Pharmacology and Biopharmaceutics Summary Basis of Approval (SBA) documents, US package inserts (USPI), and approval letters from the US Food and Drug Administration (FDA) were used as our source data. As demonstrated in the three example use cases, advanced NLP can expedite the extraction and analyses of large amounts of information from regulatory review documents to help address important clinical pharmacology questions. Although this has not been adopted widely, integrating advanced NLP into the clinical pharmacology workflow can increase efficiency in extracting impactful information to advance drug development.
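A small rule-based step of the kind such a workflow might automate, pulling dose strings out of free text, can be sketched as follows; the pattern and the example sentence are illustrative, not the article's actual pipeline.

```python
import re

# Extract (value, unit) dose mentions from free text, e.g. from review
# documents or package inserts. Pattern and sentence are illustrative.
DOSE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|g)\b", re.IGNORECASE)

def extract_doses(text):
    return [(float(value), unit.lower())
            for value, unit in DOSE_PATTERN.findall(text)]

doses = extract_doses(
    "The recommended dose is 200 mg once daily, "
    "reduced to 100 mg for hepatic impairment."
)
```

In a fuller pipeline, a step like this would feed the "automation of data extraction" stage, with the extracted values collated across SBA documents for downstream analysis.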