Results for "cs.CL" - JURNALIN

CrossRef Open Access 2020

Room Temperature Synthesis of Stable, Printable Cs3Cu2X5 (X = I, Br/I, Br, Br/Cl, Cl) Colloidal Nanocrystals with Near-Unity Quantum Yield Green Emitters (X = Cl)

Yanyan Li, Parth Vashishtha, Zhicong Zhou et al.

191 citations en

CrossRef Open Access 2025

Cs4CuSb2Cl12 Nanostructured Layered Films for Humidity and Temperature Sensing Applications

Julfikar Ali Sarkar, Muhammed Junais Pulikkathumbayil, Veena V.P. et al.

1 citation en

CrossRef Open Access 2025

Solvent-free mechanochemical access to phase-pure Cs–Co–Cl halometalates with tuneable electronic properties for energy applications

Pablo Garrido, Darío Espinoza, Karem Gallardo et al.

Solvent-free mechanochemistry enables the synthesis of CsCoCl3, Cs2CoCl4, and Cs3CoCl5 with tunable dimensionality and optoelectronic properties dictated by the Co2+ coordination environment.

en

arXiv Open Access 2020

WNUT-2020 Task 1 Overview: Extracting Entities and Relations from Wet Lab Protocols

Jeniya Tabassum, Sydney Lee, Wei Xu et al.

This paper presents the results of the wet lab information extraction task at WNUT 2020. This task consisted of two subtasks: (1) a Named Entity Recognition (NER) task with 13 participants and (2) a Relation Extraction (RE) task with 2 participants. We outline the task, data annotation process, and corpus statistics, and provide a high-level overview of the participating systems for each subtask.

en cs.CL

arXiv Open Access 2020

DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis

Hu Xu, Bing Liu, Lei Shu et al.

This paper focuses on learning domain-oriented language models driven by end tasks, aiming to combine the worlds of both general-purpose language models (such as ELMo and BERT) and domain-specific language understanding. We propose DomBERT, an extension of BERT to learn from both in-domain corpus and relevant domain corpora. This helps in learning domain language models with low resources. Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis, demonstrating promising results.

en cs.CL

arXiv Open Access 2020

Overview of CAPITEL Shared Tasks at IberLEF 2020: Named Entity Recognition and Universal Dependencies Parsing

Jordi Porta-Zamorano, Luis Espinosa-Anke

We present the results of the CAPITEL-EVAL shared task, held in the context of the IberLEF 2020 competition series. CAPITEL-EVAL consisted of two subtasks: (1) Named Entity Recognition and Classification and (2) Universal Dependency parsing. For both, the source data was a newly annotated corpus, CAPITEL, a collection of Spanish articles in the newswire domain. A total of seven teams participated in CAPITEL-EVAL, with a total of 13 runs submitted across all subtasks. Data, results and further information about this task can be found at sites.google.com/view/capitel2020.

en cs.CL

arXiv Open Access 2020

UniCase -- Rethinking Casing in Language Models

Rafal Powalski, Tomasz Stanislawek

In this paper, we introduce a new approach to dealing with the problem of case-sensitiveness in Language Modelling (LM). We propose a simple architecture modification to the RoBERTa language model, accompanied by a new tokenization strategy, which we named Unified Case LM (UniCase). We tested our solution on the GLUE benchmark, which led to increased performance by 0.42 points. Moreover, we prove that the UniCase model works much better when we have to deal with text data where all tokens are uppercased (+5.88 points).

en cs.CL, cs.AI

arXiv Open Access 2020

Predicting metrical patterns in Spanish poetry with language models

Javier de la Rosa, Salvador Ros, Elena González-Blanco

In this paper, we compare automated metrical pattern identification systems available for Spanish against extensive experiments done by fine-tuning language models trained on the same task. Despite being initially conceived as a model suitable for semantic tasks, our results suggest that BERT-based models retain enough structural information to perform reasonably well for Spanish scansion.

en cs.CL

arXiv Open Access 2020

The workweek is the best time to start a family -- A Study of GPT-2 Based Claim Generation

Shai Gretz, Yonatan Bilu, Edo Cohen-Karlik et al.

Argument generation is a challenging task whose research is timely considering its potential impact on social media and the dissemination of information. Here we suggest a pipeline based on GPT-2 for generating coherent claims, and explore the types of claims that it produces, and their veracity, using an array of manual and automatic assessments. In addition, we explore the interplay between this task and the task of Claim Retrieval, showing how they can complement one another.

en cs.CL

arXiv Open Access 2020

Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation

Mitchell A. Gordon, Kevin Duh

We explore best practices for training small, memory-efficient machine translation models with sequence-level knowledge distillation in the domain adaptation setting. While both domain adaptation and knowledge distillation are widely used, their interaction remains little understood. Our large-scale empirical results in machine translation (on three language pairs with three domains each) suggest distilling twice for best performance: once using general-domain data and again using in-domain data with an adapted teacher.

en cs.CL

arXiv Open Access 2020

Shared task: Lexical semantic change detection in German (Student Project Report)

Adnan Ahmad, Kiflom Desta, Fabian Lang et al.

Recent NLP architectures have illustrated in various ways how semantic change can be captured across time and domains. However, in terms of evaluation there is a lack of benchmarks to compare the performance of these systems against each other. We present the results of the first shared task on unsupervised lexical semantic change detection (LSCD) in German based on the evaluation framework proposed by Schlechtweg et al. (2019).

en cs.CL

arXiv Open Access 2020

UKARA 1.0 Challenge Track 1: Automatic Short-Answer Scoring in Bahasa Indonesia

Ali Akbar Septiandri, Yosef Ardhito Winatmoko

We describe our third-place solution to the UKARA 1.0 challenge on automated essay scoring. The task consists of a binary classification problem on two datasets: answers from two different questions. We ended up using two different models for the two datasets. For task A, we applied a random forest algorithm on features extracted using unigram with latent semantic analysis (LSA). On the other hand, for task B, we only used logistic regression on TF-IDF features. Our model results in an F1 score of 0.812.

en cs.CL

arXiv Open Access 2020

Parsing Early Modern English for Linguistic Search

Seth Kulick, Neville Ryant

We investigate the question of whether advances in NLP over the last few years make it possible to vastly increase the size of data usable for research in historical syntax. This brings together many of the usual tools in NLP - word embeddings, tagging, and parsing - in the service of linguistic queries over automatically annotated corpora. We train a part-of-speech (POS) tagger and parser on a corpus of historical English, using ELMo embeddings trained over a billion words of similar text. The evaluation is based on the standard metrics, as well as on the accuracy of the query searches using the parsed data.

en cs.CL

arXiv Open Access 2020

autoNLP: NLP Feature Recommendations for Text Analytics Applications

Janardan Misra

While designing machine learning based text analytics applications, NLP data scientists often manually determine which NLP features to use based upon their knowledge and experience with related problems. This results in increased effort during the feature engineering process and renders automated reuse of features across semantically related applications inherently difficult. In this paper, we argue for standardization in feature specification by outlining the structure of a language for specifying NLP features and present an approach for their reuse across applications to increase the likelihood of identifying optimal features.

en cs.CL, cs.AI

arXiv Open Access 2020

Federated pretraining and fine tuning of BERT using clinical notes from multiple silos

Dianbo Liu, Tim Miller

Large scale contextual representation models, such as BERT, have significantly advanced natural language processing (NLP) in recent years. However, in certain areas like healthcare, accessing diverse large scale text data from multiple institutions is extremely challenging due to privacy and regulatory reasons. In this article, we show that it is possible to both pretrain and fine tune BERT models in a federated manner using clinical texts from different silos without moving the data.

en cs.CL, cs.LG

arXiv Open Access 2020

Boosting Naturalness of Language in Task-oriented Dialogues via Adversarial Training

Chenguang Zhu

The natural language generation (NLG) module in a task-oriented dialogue system produces user-facing utterances conveying required information. Thus, it is critical for the generated response to be natural and fluent. We propose to integrate adversarial training to produce more human-like responses. The model uses the Straight-Through Gumbel-Softmax estimator for gradient computation. We also propose a two-stage training scheme to boost performance. Empirical results show that the adversarial training can effectively improve the quality of language generation in both automatic and human evaluations. For example, in the RNN-LG Restaurant dataset, our model AdvNLG outperforms the previous state-of-the-art result by 3.6% in BLEU.

en cs.CL

arXiv Open Access 2020

Designing Precise and Robust Dialogue Response Evaluators

Tianyu Zhao, Divesh Lala, Tatsuya Kawahara

Automatic dialogue response evaluators have been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement and generalizes robustly to diverse responses and corpora. We open-source the code and data in https://github.com/ZHAOTING/dialog-processing.

en cs.CL

arXiv Open Access 2020

Combination of abstractive and extractive approaches for summarization of long scientific texts

Vladislav Tretyak, Denis Stepanov

In this research work, we present a method to generate summaries of long scientific documents that uses the advantages of both extractive and abstractive approaches. Before producing a summary in an abstractive manner, we perform the extractive step, which is then used for conditioning the abstractor module. We used pre-trained transformer-based language models for both extractor and abstractor. Our experiments showed that using extractive and abstractive models jointly significantly improves summarization results and ROUGE scores.

en cs.CL

arXiv Open Access 2020

Augmenting Data for Sarcasm Detection with Unlabeled Conversation Context

Hankyol Lee, Youngjae Yu, Gunhee Kim

We present a novel data augmentation technique, CRA (Contextual Response Augmentation), which utilizes conversational context to generate meaningful samples for training. We also mitigate the issues regarding unbalanced context lengths by changing the input-output format of the model such that it can deal with varying context lengths effectively. Specifically, our proposed model, trained with the proposed data augmentation technique, participated in the sarcasm detection task of FigLang2020 and won, achieving the best performance on both the Reddit and Twitter datasets.

en cs.CL

arXiv Open Access 2020

4chan & 8chan embeddings

Pierre Voué, Tom De Smedt, Guy De Pauw

We have collected over 30M messages from the publicly available /pol/ message boards on 4chan and 8chan, and compiled them into a model of toxic language use. The trained word embeddings (0.4GB) are released for free and may be useful for further study on toxic discourse or to boost hate speech detection systems: https://textgain.com/8chan.

en cs.CL
