Ngozichukwuka Onah, Nadine Steinmetz, Hani Al-Sayeh, et al.
The amount of text generated daily on social media is enormous, and analyzing this text is useful for many purposes. To understand what lies beneath such a huge amount of text, we need dependable and effective computational techniques, such as topic models. Nevertheless, thorough quantitative comparisons between these models remain relatively scarce. In this study, we compare these models and propose an assessment metric that captures how topics change over time.
This paper details an empirical investigation into using Graph Contrastive Learning (GCL) to generate mathematical equation representations, a critical aspect of Mathematical Information Retrieval (MIR). Our findings reveal that this simple approach consistently exceeds the performance of the current leading formula retrieval model, TangentCFT. To support ongoing research and development in this field, we have made our source code accessible to the public at https://github.com/WangPeiSyuan/GCL-Formula-Retrieval/.
This paper introduces the concept of accessibility from the field of transportation planning and adopts it within the context of Information Retrieval (IR). An analogy is drawn between the fields, which motivates the development of document accessibility measures for IR systems. Considering the accessibility of documents within a collection given an IR System provides a different perspective on the analysis and evaluation of such systems which could be used to inform the design, tuning and management of current and future IR systems.
This paper analyses LightGCN in the context of graph recommendation algorithms. Although Graph Convolutional Networks were initially designed for graph classification, their non-linear operations are not always essential. LightGCN enables linear propagation of embeddings, enhancing performance. We reproduce the original findings, assess LightGCN's robustness on diverse datasets and metrics, and explore Graph Diffusion as an augmentation of signal propagation in LightGCN.
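The linear propagation that LightGCN is built on can be illustrated with a minimal sketch: embeddings are repeatedly multiplied by the symmetrically normalized adjacency matrix (no non-linearities, no feature transforms) and the layer outputs are averaged. The toy bipartite graph and function name below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def lightgcn_propagate(adj, emb, num_layers=3):
    """Layer-averaged embeddings after purely linear propagation."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    a_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    layers = [emb]
    for _ in range(num_layers):
        emb = a_norm @ emb          # linear step: no activation, no weights
        layers.append(emb)
    return np.mean(layers, axis=0)  # average over all layer outputs

# Toy user-item bipartite graph: 2 users (nodes 0-1), 2 items (nodes 2-3).
adj = np.array([[0, 0, 1, 1],
                [0, 0, 1, 0],
                [1, 1, 0, 0],
                [1, 0, 0, 0]], dtype=float)
emb = np.random.default_rng(0).normal(size=(4, 8))
out = lightgcn_propagate(adj, emb)
print(out.shape)  # → (4, 8)
```

Dropping the per-layer weight matrices and activations is exactly what makes the propagation a fixed linear operator, which is why it can be computed cheaply.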
Baadr Suleman M Alwheepy, Leandros Maglaras, Nick Ayres
Due to the large number of users on social media and the massive volume of requests issued every second to share a new video, picture, or message, social platforms struggle to manage the enormous amount of data that endlessly streams in. HFTCT relies on wordlists to classify opinions. It carries out its tasks reasonably well; however, the wordlists themselves are sometimes unreliable, as they are a limited source of positive and negative words.
In this study, we investigate interaction-based neural matching models for ad-hoc cross-lingual information retrieval (CLIR) using cross-lingual word embeddings (CLWEs). With experiments conducted on the CLEF collection over four language pairs, we evaluate and provide insight into different neural model architectures, different ways to represent query-document interactions and word-pair similarity distributions in CLIR. This study paves the way for learning an end-to-end CLIR system using CLWEs.
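The query-document interactions mentioned above can be pictured as a word-pair similarity matrix built from cross-lingual word embeddings: each cell is the cosine similarity between a query term and a document term in the shared space. This is a hedged sketch; the embeddings below are random stand-ins for trained CLWEs, and the vocabulary is invented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "cross-lingual" vocabulary: real CLWEs would place
# translation pairs (e.g. dog/hund) close together in this space.
vocab = {w: rng.normal(size=32) for w in
         ["dog", "hund", "house", "haus", "cat"]}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query, doc = ["dog", "house"], ["hund", "haus", "cat"]
interaction = np.array([[cos(vocab[q], vocab[d]) for d in doc]
                        for q in query])
# An interaction-based neural matcher consumes this |q| x |d| matrix.
print(interaction.shape)  # → (2, 3)
```

The neural architectures compared in such studies differ mainly in how they aggregate this matrix (or the distribution of its values) into a relevance score.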
Recent innovations in Transformer-based ranking models have advanced the state-of-the-art in information retrieval. However, these Transformers are computationally expensive, and their opaque hidden states make it hard to understand the ranking process. In this work, we modularize the Transformer ranker into separate modules for text representation and interaction. We show how this design enables substantially faster ranking using offline pre-computed representations and light-weight online interactions. The modular design is also easier to interpret and sheds light on the ranking process in Transformer rankers.
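The representation/interaction split described above can be sketched in miniature: an (assumed) expensive representation module is run offline over all documents, so only a cheap interaction, here a dot product, remains at query time. The hashing encoder below is a deterministic toy stand-in for a Transformer; all names are illustrative, not the paper's API.

```python
import numpy as np

def represent(text, dim=16):
    """Toy stand-in for a Transformer encoder: hash tokens into a vector."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

docs = ["neural ranking models", "transformers for retrieval",
        "cooking pasta at home"]
# Offline phase: pre-compute and store all document representations once.
doc_matrix = np.stack([represent(d) for d in docs])

def rank(query):
    """Online phase: light-weight interaction (dot product) only."""
    scores = doc_matrix @ represent(query)
    return np.argsort(-scores)

print(rank("neural ranking"))  # → [0 1 2]
```

Because the interaction module is this small, the online cost no longer scales with the Transformer's depth, which is the source of the speedup the abstract refers to.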
We propose a sentiment classification method with a general machine learning framework. For feature representation, n-gram IDF is used to extract software-engineering-related, dataset-specific, positive, neutral, and negative n-gram expressions. For classifiers, an automated machine learning tool is used. In the comparison using publicly available datasets, our method achieved the highest F1 values in positive and negative sentences on all datasets.
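The n-gram IDF weighting underlying the feature representation can be sketched as follows: each n-gram is scored by log(N / df), so that dataset-specific expressions outrank ubiquitous ones. This is a simplified toy, with invented documents, and is only one plausible reading of the n-gram IDF scheme the abstract assumes.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = ["this build fails again", "build fails on windows",
        "great patch thanks", "this patch looks great"]
N = len(docs)

# Document frequency of every unigram and bigram (set per document,
# so repeats within one document count once).
df = Counter()
for d in docs:
    toks = d.split()
    df.update(set(ngrams(toks, 1) + ngrams(toks, 2)))

idf = {g: math.log(N / c) for g, c in df.items()}
# "build fails" occurs in 2 of 4 documents: idf = log(4/2) = log 2.
print(sorted(idf, key=idf.get, reverse=True)[:3])
```

Expressions like "build fails" surface as coherent multi-word sentiment cues, which single-word lexicons would miss.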
We demonstrate that a graph-based search algorithm, relying on the construction of an approximate neighborhood graph, can directly work with challenging non-metric and/or non-symmetric distances without resorting to metric-space mapping and/or distance symmetrization, which, in turn, lead to substantial performance degradation. Although straightforward metrization and symmetrization are usually ineffective, we find that constructing an index using a modified, e.g., symmetrized, distance can improve performance. This observation paves the way for a new line of research on designing index-specific graph-construction distance functions.
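A minimal sketch of the idea, under assumed simplifications: build a neighborhood graph (here exact k-NN rather than approximate) with a symmetrized KL divergence, then greedily hill-climb on that graph while scoring candidates with the original asymmetric KL. All parameters and names are illustrative.

```python
import numpy as np

def kl(p, q):
    """Asymmetric, non-metric distance between two distributions."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
data = rng.dirichlet(np.ones(5), size=50)  # points on the simplex

# Index construction uses a *symmetrized* distance, as the abstract
# suggests can help; k chosen arbitrarily for the toy example.
k = 5
sym = np.array([[kl(p, q) + kl(q, p) for q in data] for p in data])
graph = {i: list(np.argsort(sym[i])[1:k + 1]) for i in range(len(data))}

def greedy_search(query, start=0):
    """Hill-climb on the graph, scoring with the original asymmetric kl."""
    cur = start
    while True:
        best = min(graph[cur] + [cur], key=lambda i: kl(query, data[i]))
        if best == cur:
            return cur
        cur = best

q = rng.dirichlet(np.ones(5))
found = greedy_search(q)
exact = int(np.argmin([kl(q, p) for p in data]))
print(found, exact)
```

The point of the construction is that the graph's edges and the search-time distance need not agree; the abstract's proposal is precisely to tune the former independently of the latter.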
There has been a recent trend to migrate IT infrastructure into the cloud. In this paper, we discuss the impact of this trend on searching for textual and other data, i.e. the distributed indexing and retrieval of information, from an organizational context. Keywords: information retrieval (IR); federated search; cloud search.
This paper highlights our ongoing efforts to create effective information curator recommendation models that can be personalized for individual users, while maintaining important fairness properties. Concretely, we introduce the problem of information curator recommendation, provide a high-level overview of a fairness-aware recommender, and introduce some preliminary experimental evidence over a real-world Twitter dataset. We conclude with some thoughts on future directions.
The emergence of "Fake News" and misinformation via online news and social media has spurred an interest in computational tools to combat this phenomenon. In this paper we present a new "Related Fact Checks" service, which can help a reader critically evaluate an article and make a judgment on its veracity by bringing up fact checks that are relevant to the article. We describe the core technical problems that need to be solved in building a "Related Fact Checks" service, and present results from an evaluation of an implementation.
In this paper, we present a novel structure, Semi-AutoEncoder, based on AutoEncoder. We generalize it into a hybrid collaborative filtering model for rating prediction as well as personalized top-n recommendations. Experimental results on two real-world datasets demonstrate its state-of-the-art performance.
Memory-based Collaborative Filtering is a widely used approach to providing recommendations. It exploits similarities between ratings across a population of users by forming a weighted vote to predict unobserved ratings. Bespoke solutions are frequently adopted to deal with the problem of producing high-quality recommendations on large data sets. A disadvantage of this approach, however, is the loss of the generality and flexibility of general collaborative filtering systems. In this paper, we develop a methodology that allows one to build a scalable and effective collaborative filtering system on top of a conventional full-text search engine such as Apache Lucene.
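The weighted vote the abstract refers to is the classic memory-based prediction rule: a user's mean rating plus a similarity-weighted sum of neighbors' mean-centered ratings. A minimal sketch on a toy rating matrix (0 marks unobserved; Pearson similarity as a common, assumed choice):

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 4, 4]], dtype=float)

def predict(u, i):
    """Weighted vote: mean(u) + sum sim(u,v)*(r(v,i)-mean(v)) / sum |sim|."""
    mean_u = R[u][R[u] > 0].mean()
    num = den = 0.0
    for v in range(R.shape[0]):
        if v == u or R[v, i] == 0:
            continue
        common = (R[u] > 0) & (R[v] > 0)
        if common.sum() < 2:
            continue  # need at least 2 co-rated items for Pearson
        sim = np.corrcoef(R[u][common], R[v][common])[0, 1]
        if np.isnan(sim):
            continue
        mean_v = R[v][R[v] > 0].mean()
        num += sim * (R[v, i] - mean_v)
        den += abs(sim)
    return mean_u + (num / den if den else 0.0)

print(round(predict(0, 2), 2))  # → 2.0
```

Mapping this onto a full-text engine amounts to treating a user's ratings as a "document" and letting the engine's retrieval machinery find and score the neighbors, which is the scalability trick the abstract builds on.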
A click on an item is arguably the most widely used feature in recommender systems. However, a click is only one out of 174 events a browser can trigger. This paper presents a framework to effectively collect and store data from event streams. A set of mining methods is provided to extract user-engagement features such as attention span, scrolling depth, and visible impressions. In this work, we present an experiment in which recommendations based on attention span drove a 340% higher click-through rate than recommendations based on clicks.
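One plausible way to mine an "attention span" feature from such an event stream is sketched below: accumulate time only while the page is visible and the user has interacted recently. The event names, idle cutoff, and exact accounting rule are assumptions for illustration, not the paper's definitions.

```python
events = [  # (timestamp_seconds, event_type)
    (0.0, "pageview"), (1.0, "scroll"), (3.0, "mousemove"),
    (10.0, "visibilitychange:hidden"), (20.0, "visibilitychange:visible"),
    (21.0, "scroll"), (25.0, "pagehide"),
]

IDLE_CUTOFF = 5.0  # assumed: credit at most 5 s between interactions

def attention_span(events):
    total, last_active, visible = 0.0, None, True
    for ts, etype in events:
        if etype == "visibilitychange:hidden":
            if last_active is not None:
                total += min(ts - last_active, IDLE_CUTOFF)
            visible, last_active = False, None  # tab hidden: stop the clock
        elif etype == "visibilitychange:visible":
            visible = True                      # resume on next interaction
        elif visible:
            if last_active is not None:
                total += min(ts - last_active, IDLE_CUTOFF)
            last_active = ts
    return total

print(attention_span(events))  # → 12.0
```

Here the hidden interval (seconds 10-20) contributes nothing, and the long gap before it is capped at the idle cutoff, so the derived attention span is far shorter than the raw 25-second session.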
This article presents the main results of a pilot study of approaches to subject information search based on automated semantic processing of mass scientific and technical data. The authors focus on the technology of building and qualifying search queries, followed by the filtering and ranking of search results. The software architecture, specific features of subject search, and the application of the research results are considered.
In this paper, we present a software package for mining Twitter microblogs for the purpose of stock market analysis. The package is written in the R language using appropriate R packages. A model of tweets is considered. We have also compared stock market charts with frequent sets of keywords in Twitter microblog messages.
Despite having a large number of speakers, the Kurdish language is among the less-resourced languages. In this work we highlight the challenges and problems in providing the required tools and techniques for processing texts written in Kurdish. From a high-level perspective, the main challenges are: the inherent diversity of the language, standardization and segmentation issues, and the lack of language resources.