Learning Decomposed Contextual Token Representations from Pretrained and Collaborative Signals for Generative Recommendation
Yifan Liu, Yaokun Liu, Zelin Li
et al.
Recent advances in generative recommenders adopt a two-stage paradigm: items are first tokenized into semantic IDs using a pretrained tokenizer, and large language models (LLMs) are then trained to generate the next item via sequence-to-sequence modeling. However, the two stages are optimized for different objectives: semantic reconstruction during tokenizer pretraining versus user interaction modeling during recommender training. This objective misalignment leads to two key limitations: (i) suboptimal static tokenization, where fixed token assignments fail to reflect diverse usage contexts; and (ii) discarded pretrained semantics, where pretrained knowledge, typically from language model embeddings, is overwritten during recommender training on user interactions. To address these limitations, we propose to learn DEcomposed COntextual Token Representations (DECOR), a unified framework that preserves pretrained semantics while enhancing the adaptability of token embeddings. DECOR introduces contextualized token composition to refine token embeddings based on user interaction context, and decomposed embedding fusion to integrate pretrained codebook embeddings with newly learned collaborative embeddings. Experiments on three real-world datasets demonstrate that DECOR consistently outperforms state-of-the-art baselines in recommendation performance. Our code will be made available upon publication.
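A minimal sketch of what such a decomposed fusion could look like, assuming a per-dimension gate that blends a frozen pretrained codebook embedding with a learned collaborative embedding (the function, gate mechanism, and variable names are illustrative assumptions, not the paper's method):

```python
# Hypothetical sketch of decomposed embedding fusion: a frozen pretrained
# codebook embedding is blended with a learned collaborative embedding
# through a per-dimension gate. Names and the gating scheme are illustrative.

def fuse(pretrained, collaborative, gate):
    # gate[i] in [0, 1]: closer to 1 keeps pretrained semantics,
    # closer to 0 favors the collaborative signal
    return [g * p + (1 - g) * c
            for p, c, g in zip(pretrained, collaborative, gate)]

token_pre = [0.2, -0.5, 0.7]   # frozen language-model codebook embedding
token_col = [0.9, 0.1, -0.3]   # embedding learned from user interactions
fused = fuse(token_pre, token_col, gate=[0.5, 0.5, 0.5])
```

In a trained model the gate would itself be learned, so each token can decide per dimension how much pretrained semantics to retain.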
Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets
Timur Galimzyanov, Olga Kolomyttseva, Egor Bogomolov
We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (the Voyager-3 family) consistently beat sparse retrievers, albeit at roughly 100x higher latency. (3) Optimal chunk size scales with the available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16,000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 with word splitting offers the best quality-latency trade-off. We thus provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.
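To make the sparse side of this comparison concrete, here is a minimal BM25 scorer with word-level splitting in pure Python (the k1 and b values and the whitespace tokenizer are common defaults, not the paper's exact configuration):

```python
# Minimal BM25 with word-level splitting. Illustrative sketch only:
# k1/b are standard defaults, not tuned values from the paper.
import math

def bm25_scores(query, docs, k1=1.2, b=0.75):
    tokenized = [d.lower().split() for d in docs]   # word-level splitting
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = ["sparse retrieval with bm25",
        "dense retrieval with encoders",
        "chunking strategies for code"]
scores = bm25_scores("bm25 retrieval", docs)
```

A real deployment would add stemming, an inverted index, and chunk-level candidates, but the scoring core is this small, which is part of why sparse retrieval is so much faster than dense encoding.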
Where did you get that? Towards Summarization Attribution for Analysts
Violet B, John M. Conroy, Sean Lynch
et al.
Analysts require attribution, as nothing can be reported without knowing the source of the information. In this paper, we focus on automatic methods for attribution, linking each sentence in the summary to a portion of the source text, which may span one or more documents. We explore using hybrid summarization, i.e., an automatic paraphrase of an extractive summary, to ease attribution. We also use a custom typology to identify the proportion of different categories of attribution-related errors.
A comment to "A General Theory of IR Evaluation Measures"
Fernando Giner
The paper "A General Theory of IR Evaluation Measures" develops a formal framework to determine whether IR evaluation measures are interval scales. This comment points out some limitations of its conclusions.
Unlocking Insights into Business Trajectories with Transformer-based Spatio-temporal Data Analysis
Muhammad Arslan, Christophe Cruz
The world of business is constantly evolving, and staying ahead of the curve requires a deep understanding of market trends and performance. This article addresses this requirement by modeling business trajectories using news article data.
High-performance automatic categorization and attribution of inventory catalogs
Anton Kolonin
Machine learning techniques for automatic text categorization are applied and adapted to the problem of inventory catalog data attribution; different approaches are explored, and an optimal solution addressing the tradeoff between accuracy and performance is selected.
Experimenting with Selected Automated Approaches for Bias Analysis
Gizem Gezici
This work first presents our attempts to establish an automated model using state-of-the-art approaches for analysing bias in search results of Bing and Google. Experimental results indicate that the current class-wise F1-scores of our best model are not sufficient to establish an automated model for bias analysis. Thus, we decided not to continue with this approach.
The Users Aren't Alright: Dangerous Mental Illness Behaviors and Recommendations
Ashlee Milton, Stevie Chancellor
In this paper, we argue that recommendation systems are in a unique position to propagate dangerous and cruel behaviors to people with mental illnesses.
A Survey on Adversarial Information Retrieval on the Web
Saad Farooq
This survey paper discusses the different forms of malicious techniques that can affect how an information retrieval model retrieves documents for a query, along with their remedies.
Proceedings of FACTS-IR 2019
Alexandra Olteanu, Jean Garcia-Gathright, Maarten de Rijke
et al.
The proceedings list for the program of FACTS-IR 2019, the Workshop on Fairness, Accountability, Confidentiality, Transparency, and Safety in Information Retrieval held at SIGIR 2019.
Applying ontologies to data integration systems for bank credit risk management
Jalil Elhassouni, Mehdi Bazzi, Abderrahim Qadi
et al.
Special issue: VSST 2016
This paper proposes an ontological integration model for credit risk management. It is based on three ontologies: a global ontology describing the credit risk management process, and two local ontologies, of which the first describes the credit granting process and the second presents the concepts necessary for monitoring the credit system. The paper also presents the technique used for matching the global ontology with the local ontologies.
History of scholarship and learning. The humanities, Bibliography. Library science. Information resources
Visual Display and Retrieval of Music Information
Rafael Valle
This paper describes computational methods for the visual display and analysis of music information. We provide a concise description of software, music descriptors and data visualization techniques commonly used in music information retrieval. Finally, we provide use cases where the described software, descriptors and visualizations are showcased.
A Simple Derivation of Heaps' Law from the Generalized Zipf's Law
Leonid Boytsov
I reproduce a rather simple formal derivation of Heaps' law from the generalized Zipf's law, which I previously published in Russian.
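The general shape of such a derivation can be sketched as follows (a reconstruction of the standard argument, not a quotation of the note):

```latex
% Generalized Zipf's law: the r-th most frequent word occurs with probability
p_r = C r^{-\alpha}, \qquad \alpha > 1.
% Expected vocabulary size after drawing n tokens:
V(n) = \sum_{r \ge 1} \left( 1 - (1 - p_r)^n \right).
% A rank-r word is observed with high probability once n p_r \gtrsim 1,
% i.e. for ranks r \lesssim (C n)^{1/\alpha}, hence
V(n) \sim K n^{\beta}, \qquad \beta = \tfrac{1}{\alpha},
% which is Heaps' law with exponent beta = 1/alpha.
```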
Label Visualization and Exploration in IR
Omar Alonso
There is a renaissance in visual analytics systems for data analysis and sharing, in particular, in the current wave of big data applications. We introduce RAVE, a prototype that automates the generation of an interface that uses facets and visualization techniques for exploring and analyzing relevance assessments data sets collected via crowdsourcing. We present a technical description of the main components and demonstrate its use.
Using Neuro-Indexes
Valerii Garnaga
The article describes a new data structure called the neuro-index, an alternative to well-known file indexes. The neuro-index is fundamentally different because it stores weight coefficients in a neural network rather than references of the "keyword-position in a file" type.
Evolution of the User's Content: An Overview of the State of the Art
Djallel Bouneffouf
The evolution of the user's content remains a problem for accurate recommendation. This is why current research aims to design Recommender Systems (RS) able to continually adapt the information they deliver to the user's interests. This paper explains this problem by outlining the proposals that have been made in the literature, along with their advantages and disadvantages.
Management Of Volatile Information In Incremental Web Crawler
Ravita Chahar, Komal Hooda, Annu Dhankhar
Paper has been withdrawn.
The Haar Wavelet Transform of a Dendrogram: Additional Notes
Fionn Murtagh
We consider the wavelet transform of a finite, rooted, node-ranked, $p$-way tree, focusing on the case of binary ($p = 2$) trees. We study a Haar wavelet transform on this tree. Wavelet transforms allow for multiresolution analysis through translation and dilation of a wavelet function. We explore how this works in our tree context.
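To make the multiresolution idea concrete, here is a minimal one-dimensional Haar decomposition over the leaf values of a balanced binary (p = 2) tree, where each level stores pairwise averages (smooths) and halved differences (details). This is an illustrative sketch of the Haar transform itself, not the paper's dendrogram construction:

```python
# Illustrative Haar decomposition on 2^k leaf values of a balanced binary
# tree. Each step replaces adjacent pairs by their average (smooth) and half
# their difference (detail), yielding one detail band per tree level.

def haar_step(values):
    smooth = [(values[2 * i] + values[2 * i + 1]) / 2
              for i in range(len(values) // 2)]
    detail = [(values[2 * i] - values[2 * i + 1]) / 2
              for i in range(len(values) // 2)]
    return smooth, detail

def haar(values):
    details = []
    while len(values) > 1:
        values, d = haar_step(values)
        details.append(d)
    return values[0], details  # overall mean plus details per level

mean, details = haar([4.0, 2.0, 5.0, 5.0])
# mean == 4.0, details == [[1.0, 0.0], [-1.0]]
```

The original values are exactly recoverable by reversing each step (parent smooth plus/minus the detail), which is what makes the transform useful for multiresolution analysis on the tree.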
Stanford Matrix Considered Harmful
Sebastiano Vigna
This note argues about the validity of web-graph data used in the literature.