Learning Decomposed Contextual Token Representations from Pretrained and Collaborative Signals for Generative Recommendation
Yifan Liu, Yaokun Liu, Zelin Li
et al.
Recent advances in generative recommenders adopt a two-stage paradigm: items are first tokenized into semantic IDs using a pretrained tokenizer, and large language models (LLMs) are then trained to generate the next item via sequence-to-sequence modeling. However, the two stages are optimized for different objectives: semantic reconstruction during tokenizer pretraining versus user interaction modeling during recommender training. This objective misalignment leads to two key limitations: (i) suboptimal static tokenization, where fixed token assignments fail to reflect diverse usage contexts; and (ii) discarded pretrained semantics, where pretrained knowledge, typically from language model embeddings, is overwritten during recommender training on user interactions. To address these limitations, we propose to learn DEcomposed COntextual Token Representations (DECOR), a unified framework that preserves pretrained semantics while enhancing the adaptability of token embeddings. DECOR introduces contextualized token composition to refine token embeddings based on user interaction context, and decomposed embedding fusion to integrate pretrained codebook embeddings with newly learned collaborative embeddings. Experiments on three real-world datasets demonstrate that DECOR consistently outperforms state-of-the-art baselines in recommendation performance. Our code will be made available upon publication.
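A minimal sketch of what such a decomposed fusion could look like, assuming a per-dimension gate that blends a frozen pretrained codebook embedding with a learned collaborative embedding (the function, gate mechanism, and variable names are illustrative assumptions, not the paper's method):

```python
# Hypothetical sketch of decomposed embedding fusion: a frozen pretrained
# codebook embedding is blended with a learned collaborative embedding
# through a per-dimension gate. Names and the gating scheme are illustrative.

def fuse(pretrained, collaborative, gate):
    # gate[i] in [0, 1]: closer to 1 keeps pretrained semantics,
    # closer to 0 favors the collaborative signal
    return [g * p + (1 - g) * c
            for p, c, g in zip(pretrained, collaborative, gate)]

token_pre = [0.2, -0.5, 0.7]   # frozen language-model codebook embedding
token_col = [0.9, 0.1, -0.3]   # embedding learned from user interactions
fused = fuse(token_pre, token_col, gate=[0.5, 0.5, 0.5])
```

In a trained model the gate would itself be learned, so each token can decide per dimension how much pretrained semantics to retain.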
Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets
Timur Galimzyanov, Olga Kolomyttseva, Egor Bogomolov
We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (the Voyager-3 family) consistently beat sparse retrievers, albeit at roughly 100x higher latency. (3) Optimal chunk size scales with the available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16,000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 with word splitting offers the best quality-latency trade-off. We thus provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.
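To make the sparse side of this comparison concrete, here is a minimal BM25 scorer with word-level splitting in pure Python (the k1 and b values and the whitespace tokenizer are common defaults, not the paper's exact configuration):

```python
# Minimal BM25 with word-level splitting. Illustrative sketch only:
# k1/b are standard defaults, not tuned values from the paper.
import math

def bm25_scores(query, docs, k1=1.2, b=0.75):
    tokenized = [d.lower().split() for d in docs]   # word-level splitting
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = ["sparse retrieval with bm25",
        "dense retrieval with encoders",
        "chunking strategies for code"]
scores = bm25_scores("bm25 retrieval", docs)
```

A real deployment would add stemming, an inverted index, and chunk-level candidates, but the scoring core is this small, which is part of why sparse retrieval is so much faster than dense encoding.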
Where did you get that? Towards Summarization Attribution for Analysts
Violet B, John M. Conroy, Sean Lynch
et al.
Analysts require attribution, as nothing can be reported without knowing the source of the information. In this paper, we focus on automatic methods for attribution, linking each sentence in the summary to a portion of the source text, which may span one or more documents. We explore using hybrid summarization, i.e., an automatic paraphrase of an extractive summary, to ease attribution. We also use a custom typology to identify the proportion of different categories of attribution-related errors.
A comment to "A General Theory of IR Evaluation Measures"
Fernando Giner
The paper "A General Theory of IR Evaluation Measures" develops a formal framework to determine whether IR evaluation measures are interval scales. This comment points out some limitations of its conclusions.
Unlocking Insights into Business Trajectories with Transformer-based Spatio-temporal Data Analysis
Muhammad Arslan, Christophe Cruz
The world of business is constantly evolving, and staying ahead of the curve requires a deep understanding of market trends and performance. This article addresses this requirement by modeling business trajectories using news article data.
High-performance automatic categorization and attribution of inventory catalogs
Anton Kolonin
Machine learning techniques for automatic text categorization are applied and adapted to the problem of inventory catalog data attribution; different approaches are explored, and an optimal solution addressing the tradeoff between accuracy and performance is selected.
Experimenting with Selected Automated Approaches for Bias Analysis
Gizem Gezici
This work first presents our attempts to establish an automated model using state-of-the-art approaches for analysing bias in search results of Bing and Google. Experimental results indicate that the current class-wise F1-scores of our best model are not sufficient to establish an automated model for bias analysis. Thus, we decided not to continue with this approach.
The Users Aren't Alright: Dangerous Mental Illness Behaviors and Recommendations
Ashlee Milton, Stevie Chancellor
In this paper, we argue that recommendation systems are in a unique position to propagate dangerous and cruel behaviors to people with mental illnesses.
A Survey on Adversarial Information Retrieval on the Web
Saad Farooq
This survey paper discusses the different forms of malicious techniques that can affect how an information retrieval model retrieves documents for a query, along with their remedies.
Proceedings of FACTS-IR 2019
Alexandra Olteanu, Jean Garcia-Gathright, Maarten de Rijke
et al.
The proceedings list for the program of FACTS-IR 2019, the Workshop on Fairness, Accountability, Confidentiality, Transparency, and Safety in Information Retrieval held at SIGIR 2019.
Applying ontologies to data integration systems for bank credit risk management
Jalil Elhassouni, Mehdi Bazzi, Abderrahim Qadi
et al.
Special issue: VSST 2016
This paper proposes an ontological integration model for credit risk management. It is based on three ontologies: a global ontology describing the credit risk management process, and two local ontologies, of which the first describes the credit granting process and the second presents the concepts necessary for monitoring the credit system. The paper also presents the technique used for matching the global ontology with the local ontologies.
History of scholarship and learning. The humanities, Bibliography. Library science. Information resources
Visual Display and Retrieval of Music Information
Rafael Valle
This paper describes computational methods for the visual display and analysis of music information. We provide a concise description of software, music descriptors and data visualization techniques commonly used in music information retrieval. Finally, we provide use cases where the described software, descriptors and visualizations are showcased.
A Simple Derivation of Heaps' Law from the Generalized Zipf's Law
Leonid Boytsov
I reproduce a rather simple formal derivation of Heaps' law from the generalized Zipf's law, which I previously published in Russian.
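The general shape of such a derivation can be sketched as follows (a reconstruction of the standard argument, not a quotation of the note):

```latex
% Generalized Zipf's law: the r-th most frequent word occurs with probability
p_r = C r^{-\alpha}, \qquad \alpha > 1.
% Expected vocabulary size after drawing n tokens:
V(n) = \sum_{r \ge 1} \left( 1 - (1 - p_r)^n \right).
% A rank-r word is observed with high probability once n p_r \gtrsim 1,
% i.e. for ranks r \lesssim (C n)^{1/\alpha}, hence
V(n) \sim K n^{\beta}, \qquad \beta = \tfrac{1}{\alpha},
% which is Heaps' law with exponent beta = 1/alpha.
```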
Label Visualization and Exploration in IR
Omar Alonso
There is a renaissance in visual analytics systems for data analysis and sharing, in particular, in the current wave of big data applications. We introduce RAVE, a prototype that automates the generation of an interface that uses facets and visualization techniques for exploring and analyzing relevance assessments data sets collected via crowdsourcing. We present a technical description of the main components and demonstrate its use.
Using Neuro-Indexes
Valerii Garnaga
The article describes a new data structure called the neuro-index, an alternative to well-known file indexes. The neuro-index is fundamentally different because it stores weight coefficients in a neural network rather than references of the "keyword-position in a file" type.
Evolution of the User's Content: An Overview of the State of the Art
Djallel Bouneffouf
The evolution of the user's content remains a problem for accurate recommendation. This is why current research aims to design Recommender Systems (RS) able to continually adapt the information they deliver to the user's interests. This paper explains this problem by outlining the proposals that have been made in the literature, along with their advantages and disadvantages.
Management Of Volatile Information In Incremental Web Crawler
Ravita Chahar, Komal Hooda, Annu Dhankhar
Paper has been withdrawn.
The Haar Wavelet Transform of a Dendrogram: Additional Notes
Fionn Murtagh
We consider the wavelet transform of a finite, rooted, node-ranked, $p$-way tree, focusing on the case of binary ($p = 2$) trees. We study a Haar wavelet transform on this tree. Wavelet transforms allow for multiresolution analysis through translation and dilation of a wavelet function. We explore how this works in our tree context.
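To make the multiresolution idea concrete, here is a minimal one-dimensional Haar decomposition over the leaf values of a balanced binary (p = 2) tree, where each level stores pairwise averages (smooths) and halved differences (details). This is an illustrative sketch of the Haar transform itself, not the paper's dendrogram construction:

```python
# Illustrative Haar decomposition on 2^k leaf values of a balanced binary
# tree. Each step replaces adjacent pairs by their average (smooth) and half
# their difference (detail), yielding one detail band per tree level.

def haar_step(values):
    smooth = [(values[2 * i] + values[2 * i + 1]) / 2
              for i in range(len(values) // 2)]
    detail = [(values[2 * i] - values[2 * i + 1]) / 2
              for i in range(len(values) // 2)]
    return smooth, detail

def haar(values):
    details = []
    while len(values) > 1:
        values, d = haar_step(values)
        details.append(d)
    return values[0], details  # overall mean plus details per level

mean, details = haar([4.0, 2.0, 5.0, 5.0])
# mean == 4.0, details == [[1.0, 0.0], [-1.0]]
```

The original values are exactly recoverable by reversing each step (parent smooth plus/minus the detail), which is what makes the transform useful for multiresolution analysis on the tree.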
Stanford Matrix Considered Harmful
Sebastiano Vigna
This note argues about the validity of web-graph data used in the literature.