This paper presents the details of the authors' submission to the proceedings of SemEval 2024's Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection, Subtasks A (monolingual) and B. With the advent of large language models (LLMs), detecting machine-generated text is becoming an increasingly important task. In this paper, we lay out how using weighted averages of RoBERTa layers lets us capture information about text that is relevant to machine-generated text detection.
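The layer-weighting idea can be illustrated with a minimal PyTorch sketch (our own illustration, not the submitted system): one learnable logit per layer is softmax-normalized and used to mix the per-layer hidden states. The tensor shapes and the random stand-in hidden states are assumptions; in practice the states would come from RoBERTa called with output_hidden_states=True.

```python
import torch
import torch.nn as nn

class LayerWeightedAverage(nn.Module):
    """Learn a softmax-normalized weight per transformer layer and
    combine the per-layer hidden states into a single representation."""

    def __init__(self, num_layers: int):
        super().__init__()
        # one scalar logit per layer, softmaxed at forward time
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [num_layers, batch, seq_len, hidden_dim]
        weights = torch.softmax(self.layer_logits, dim=0)
        # weighted sum over the layer axis
        return torch.einsum("l,lbsh->bsh", weights, hidden_states)

# Stand-in for RoBERTa's hidden states (13 for roberta-base, counting
# the embedding layer); real use would feed model(...).hidden_states.
layers, batch, seq, dim = 13, 2, 8, 768
fake_hidden = torch.randn(layers, batch, seq, dim)
pooler = LayerWeightedAverage(layers)
mixed = pooler(fake_hidden)
print(mixed.shape)  # torch.Size([2, 8, 768])
```

At initialization the logits are all zero, so the module starts as a plain mean over layers and learns during fine-tuning which layers matter for the detection task.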
Equipping AI systems with efficient communication skills for effective human assistance requires the system to take the initiative: to discern the specific circumstances and interact appropriately. This research focuses on a collaborative building task in the Minecraft dataset, employing language modeling to enhance task understanding through state-of-the-art methods. These models target grounded multi-modal understanding and task-oriented dialogue comprehension, providing insights into their interpretative and responsive capabilities. Our experimental results show a substantial improvement over existing methods, indicating a promising direction for future research in this domain.
With the widespread adoption of Large Language Models (LLMs), we investigate in this paper the multilingual capability of these models. Our preliminary results show that translating the native-language context, question, and answer into a high-resource language produces the best results.
In this paper, we introduce the Behavior Structformer, a method for modeling user behavior using structured tokenization within a Transformer-based architecture. By converting tracking events into dense tokens, this approach enhances model training efficiency and effectiveness. We demonstrate its superior performance through ablation studies and benchmarking against traditional tabular and semi-structured baselines. The results indicate that structured tokenization with sequential processing significantly improves behavior modeling.
The Universal Dependencies (UD) project aims to create cross-linguistically consistent dependency annotation for multiple languages, to facilitate multilingual NLP. It currently supports 114 languages. Dravidian languages are spoken by over 200 million people across the world, and yet only two languages from this family are in UD. This paper examines some of the morphological and syntactic features of Dravidian languages and explores how they can be annotated in the UD framework.
This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2024) for Irish-to-English speech translation. We built end-to-end systems based on Whisper, and employed a number of data augmentation techniques, such as speech back-translation and noise augmentation. We investigate the effect of using synthetic audio data and discuss several methods for enriching signal diversity.
This paper describes our solution for the Diarization of Speaker and Language in Conversational Environments Challenge (DISPLACE 2023). We used a combination of voice activity detection (VAD) to find segments containing speech, a ResNet-based CNN to extract features from these segments, and spectral clustering to cluster those features. Even though it was not trained on Hindi, the described algorithm achieves a DER of 27.1% and 27.4% on the development and phase-1 evaluation parts of the dataset, respectively.
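The clustering stage of such a pipeline can be sketched as follows (a simplified stand-in in plain NumPy, not the challenge submission): cosine affinities between segment embeddings feed a normalized-Laplacian spectral embedding, which is then clustered with a naive k-means. The synthetic "speaker" blobs at the end are assumptions for illustration.

```python
import numpy as np

def spectral_cluster(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    """Basic spectral clustering: cosine affinity -> symmetric normalized
    Laplacian -> k smallest eigenvectors -> naive k-means on the rows."""
    # cosine affinity, negative similarities clipped to zero
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(norm @ norm.T, 0.0, None)
    # L = I - D^{-1/2} A D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the k smallest
    _, vecs = np.linalg.eigh(laplacian)
    spectral = vecs[:, :n_clusters]
    # deterministic farthest-point initialization, then a few k-means steps
    idx = [0]
    while len(idx) < n_clusters:
        dists = np.min(((spectral[:, None] - spectral[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(dists)))
    centers = spectral[idx]
    for _ in range(20):
        labels = np.argmin(((spectral[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([spectral[labels == k].mean(axis=0) if (labels == k).any()
                            else centers[k] for k in range(n_clusters)])
    return labels

# Two synthetic "speakers": tight blobs around orthogonal directions.
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(0, 0.05, (5, 16)) + np.eye(16)[0],
                 rng.normal(0, 0.05, (5, 16)) + np.eye(16)[1]])
labels = spectral_cluster(emb, 2)
print(labels)
```

In a real diarization system the rows of `emb` would be the ResNet speaker embeddings of the VAD segments, and the number of clusters would be estimated rather than fixed.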
We perform a hitherto missing, reproducible evaluation of all publicly available GPT-4 family models in the field of Document Understanding, where comprehending the spatial arrangement of text and visual clues is frequently required in addition to textual semantics. Benchmark results indicate that, although it is hard to achieve satisfactory results with text-only models, GPT-4 Vision Turbo performs well when provided with both text recognized by an external OCR engine and document images as input. The evaluation is followed by analyses that suggest possible contamination of the textual GPT-4 models and indicate a significant performance drop on lengthy documents.
Extractive summarization is the task of highlighting the most important parts of a text. We introduce a new approach to extractive summarization that uses the hidden clustering structure of the text. Experimental results on CNN/DailyMail demonstrate that our approach generates more accurate summaries than both extractive and abstractive methods, achieving state-of-the-art results on the ROUGE-2 metric and exceeding previous approaches by 10%. Additionally, we show that the hidden structure of the text can be interpreted as aspects.
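As a rough illustration of clustering-based extraction (our own toy sketch, not the proposed method), one can embed sentences as bag-of-words vectors, cluster them, and extract the sentence nearest each cluster centroid; the example sentences are invented:

```python
import re
from collections import Counter

import numpy as np

def extract_summary(sentences, n_clusters=2):
    """Toy clustering-based extractive summarizer: bag-of-words vectors,
    naive k-means, then the sentence closest to each centroid."""
    vocab = sorted({w for s in sentences for w in re.findall(r"\w+", s.lower())})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(sentences), len(vocab)))
    for row, s in enumerate(sentences):
        for w, c in Counter(re.findall(r"\w+", s.lower())).items():
            X[row, index[w]] = c
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    # deterministic farthest-point initialization, then a few k-means steps
    idx = [0]
    while len(idx) < n_clusters:
        d = np.min(((X[:, None] - X[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(d)))
    centers = X[idx]
    for _ in range(10):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == k].mean(0) if (labels == k).any()
                            else centers[k] for k in range(n_clusters)])
    # one representative sentence per cluster, in document order
    picks = []
    for k in range(n_clusters):
        members = np.where(labels == k)[0]
        picks.append(int(members[np.argmin(((X[members] - centers[k]) ** 2).sum(-1))]))
    return [sentences[i] for i in sorted(picks)]

sents = ["the cat sat on the mat",
         "a cat chased a mouse",
         "stocks rose sharply on monday",
         "markets rallied as stocks climbed"]
summary = extract_summary(sents, n_clusters=2)
print(summary)
```

The clusters here play the role of the "aspects" mentioned above: each extracted sentence represents one group of topically similar sentences.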
We demonstrate how to parse Geach's Donkey sentences in a compositional distributional model of meaning. We build on previous work on the DisCoCat (Distributional Compositional Categorical) framework, including extensions that model discourse, determiners, and relative pronouns. We present a type-logical syntax for parsing donkey sentences, for which we define both relational and vector space semantics.
Nasser Almousa, Nasser Alzamil, Abdullah Alshehri
et al.
This project introduces an advanced English-to-Arabic translator that surpasses conventional tools. Leveraging the Helsinki transformer (MarianMT), our approach involves fine-tuning on a self-scraped, purely literary Arabic dataset. Evaluations against Google Translate show that our system consistently outperforms it in qualitative assessments, notably excelling in cultural sensitivity and contextual accuracy. This research underscores the Helsinki transformer's suitability for English-to-Arabic translation using a Fusha dataset.
Formamidinium lead iodide perovskite (FAPbI3) is often used as the light-absorbing layer in solar cells because of its good photovoltaic properties, helping to alleviate the energy crisis. However, its lack of stability remains an obstacle to its development. It has been found that doping different elements at different sites can enhance its stability and improve the photoelectric conversion efficiency of solar cells. In this study, the geometry, electronic structure, and optical properties of Cs- and Cl-co-doped FAPbI3, FA1−xCsxPbI3−yCly, were calculated using first principles. The analysis revealed that the Goldschmidt factors of the doped systems were between 0.962 and 0.974, indicating that they could maintain a stable perovskite structure, and that the doped systems had lower energy and a more stable structure. From the calculated energy bands, it was found that the doped ions increase the dispersion at the bottom of the conduction band more markedly than they decrease the dispersion at the top of the valence band, and the resulting reduction of the effective carrier mass is favorable for transport. As for the optical properties, a moderate amount of doping improves light absorption, whereas excess doping shortens the absorption range and weakens the absorption effect; FA0.875Cs0.125PbI2.958Cl0.125 has the largest light absorption coefficient. These results show that the photoelectric properties of FAPbI3 can be effectively modulated by co-doping with Cs and Cl, providing a theoretical reference for the precise preparation of more efficient solar cells.
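The quoted Goldschmidt factors follow the tolerance-factor formula t = (r_A + r_X) / (√2 (r_B + r_X)), here with composition-weighted A- and X-site radii for the mixed composition. A back-of-the-envelope check (the ionic radii below are assumed literature values, not taken from this study, so the result only roughly matches the reported range):

```python
import math

# Assumed effective ionic radii in angstroms (Shannon-type values;
# the FA+ radius is a commonly used effective estimate).
R = {"FA": 2.53, "Cs": 1.88, "Pb": 1.19, "I": 2.20, "Cl": 1.81}

def goldschmidt(x_cs: float, y_cl: float) -> float:
    """Tolerance factor t = (r_A + r_X) / (sqrt(2) * (r_B + r_X)) for
    FA(1-x)Cs(x)PbI(3-y)Cl(y), with composition-weighted radii."""
    r_a = (1 - x_cs) * R["FA"] + x_cs * R["Cs"]
    r_x = ((3 - y_cl) * R["I"] + y_cl * R["Cl"]) / 3
    return (r_a + r_x) / (math.sqrt(2) * (R["Pb"] + r_x))

# An illustrative doping level, x = y = 0.125
t = goldschmidt(0.125, 0.125)
print(round(t, 3))
```

Values of t between roughly 0.8 and 1.0 are the usual empirical window for a stable perovskite structure, which is why the reported 0.962-0.974 range supports the stability claim.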
Benjamin Scheibe, Antti J. Karttunen, Ulrich Müller
et al.
Reaction of CsF with ClF3 leads to Cs[Cl3F10]. It contains a molecular, propeller‐shaped [Cl3F10]− anion with a central μ3‐F atom and three T‐shaped ClF3 molecules coordinated to it. This anion represents the first example of a heteropolyhalide anion of higher ClF3 content than [ClF4]− and is the first Cl‐containing interhalogen species with a μ‐bridging F atom. The chemical bonds to the central μ3‐F atom are highly ionic and quite weak as the bond lengths within the coordinating XF3 units (X = Cl, and also calculated for Br, I) are almost unchanged in comparison to free XF3 molecules. Cs[Cl3F10] crystallizes in a very rarely observed A[5]B[5] structure type, where cations and anions are each pseudohexagonally close packed, and reside, each with coordination number five, in the trigonal bipyramidal voids of the other.
Leslie Mareike Schoop, Roland Eger, Jürgen Nuss
et al.
We report the first examples of quinary rare earth thiophosphates with a fully ordered cation and anion distribution, Cs5Ln3X3(P2S6)2(PS4), (Ln = La, Ce and X = Br, Cl) as well as the quasi‐quaternary Cs10Y4Cl10(P2S6)3. These four new compounds crystallize in three different, previously unknown structure types. The yellowish, transparent, brittle Cs5Ce3Br3(P2S6)2(PS4) crystallizes in the orthorhombic space group Pnma (no. 62) with a = 13.276(3), b = 14.891(3), c = 19.593(4) Å, and V = 3873(1) Å3 in a novel structure type. Colorless crystals of Cs5La3Br3(P2S6)2(PS4) and Cs5La3Cl3(P2S6)2(PS4) are isotypic and were obtained in the monoclinic space group P21/m (no. 11) with a = 9.715(2), b = 14.310(3), c = 13.685(3) Å, β = 100.16(3)° and V = 1873(1) Å3 and a = 9.513(2), b = 14.182(3), c = 13.699(3) Å, β = 99.39(3)° and V = 1823(1) Å3, respectively. Both structures contain isolated hexathiohypodiphosphate(IV) [P2S6]4– and thiophosphate [PS4]3– units that are arranged alternately in layers. Cs10Y4Cl10(P2S6)3 crystallizes in colorless transparent platelets in the orthorhombic space group Pnnm (no. 58) with a = 13.153(3), b = 28.964(6), c = 7.780(2) Å, and V = 2964(1) Å3. The structure is composed of isolated [P4/2S6]4– octahedra containing four half occupied P positions surrounded octahedrally by sulfur. We show with Raman scattering that this disordered thiophosphate anion exhibits a spectrum that is distinct from those published for other thiophosphate anions known from the literature.
This paper explores an incremental training strategy for the skip-gram model with negative sampling (SGNS) from both empirical and theoretical perspectives. Existing methods of neural word embeddings, including SGNS, are multi-pass algorithms and thus cannot perform incremental model update. To address this problem, we present a simple incremental extension of SGNS and provide a thorough theoretical analysis to demonstrate its validity. Empirical experiments demonstrated the correctness of the theoretical analysis as well as the practical usefulness of the incremental algorithm.
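The per-pair update that such an incremental (single-pass) variant applies can be sketched as follows (a generic online SGNS gradient step in NumPy; the paper's vocabulary handling, learning-rate schedule, and unigram-table maintenance are omitted, and the toy indices are invented):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(W, C, word, context, negatives, lr=0.025):
    """One online SGNS step: push the word vector toward its observed
    context vector and away from sampled negative contexts."""
    w = W[word]
    grad_w = np.zeros_like(w)
    for ctx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sigmoid(w @ C[ctx])
        g = lr * (label - score)
        grad_w += g * C[ctx]   # accumulate the word-vector gradient
        C[ctx] += g * w        # update the context vector in place
    W[word] += grad_w

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (10, 8))  # input (word) vectors
C = rng.normal(0, 0.1, (10, 8))  # output (context) vectors
before = float(W[3] @ C[4])
for _ in range(50):
    sgns_update(W, C, word=3, context=4, negatives=[7, 8])
after = float(W[3] @ C[4])
print(before < after)  # repeated positive updates raise the score
```

Because each update touches only the vectors of the observed pair and its negatives, the model can be refreshed as data streams in, without the multiple corpus passes that batch SGNS requires.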
We present in this paper a novel framework for morpheme segmentation which uses the morpho-syntactic regularities preserved by word representations, in addition to orthographic features, to segment words into morphemes. This framework is the first to consider vocabulary-wide syntactico-semantic information for this task. We also analyze the deficiencies of available benchmarking datasets and introduce our own dataset that was created on the basis of compositionality. We validate our algorithm across datasets and present state-of-the-art results.
In this work, we present our findings and experiments for stock-market prediction using various textual sentiment analysis tools, such as mood analysis and event extraction, as well as prediction models, such as LSTMs and specific convolutional architectures.
Orna Almogi, Lena Dankin, Nachum Dershowitz
et al.
We describe the course of a hackathon dedicated to the development of linguistic tools for Tibetan Buddhist studies. Over a period of five days, a group of seventeen scholars, scientists, and students developed and compared algorithms for intertextual alignment and text classification, along with some basic language tools, including a stemmer and word segmenter.
End-to-end attention-based models have been shown to be competitive alternatives to conventional DNN-HMM models in speech recognition systems. In this paper, we extend existing end-to-end attention-based models so that they can be applied to the Distant Speech Recognition (DSR) task. Specifically, we propose an end-to-end attention-based speech recognizer with multichannel input that performs sequence prediction directly at the character level. To gain better performance, we also incorporate a Highway long short-term memory (HLSTM) network, which outperforms previous models on the AMI distant speech recognition task.