DOAJ Open Access 2025

Transforming Product Discovery and Interpretation Using Vision–Language Models

Simona-Vasilica Oprea, Adela Bâra

Abstract

In this work, the utility of multimodal vision–language models (VLMs) for visual product understanding in e-commerce is investigated, focusing on two complementary models: ColQwen2 (vidore/colqwen2-v1.0) and ColPali (vidore/colpali-v1.2-hf). These models are integrated into two architectures and evaluated across various product interpretation tasks, including image-grounded question answering, brand recognition and visual retrieval based on natural language prompts. ColQwen2, built on the Qwen2-VL backbone with LoRA-based adapter hot-swapping, demonstrates strong performance, allowing end-to-end image querying and text response synthesis. It excels at identifying attributes such as brand, color or usage based solely on product images and responds fluently to user questions. In contrast, ColPali, which utilizes the PaliGemma backbone, is optimized for explainability. It delivers detailed visual-token alignment maps that reveal how specific regions of an image contribute to retrieval decisions, offering transparency ideal for diagnostics or educational applications. Through comparative experiments using footwear imagery, it is demonstrated that ColQwen2 is highly effective in generating accurate responses to product-related questions, while ColPali provides fine-grained visual explanations that reinforce trust and model accountability.
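The abstract mentions visual retrieval from natural language prompts and visual-token alignment maps but does not state the scoring rule. ColPali-family retrievers (including ColQwen2) are publicly described as ColBERT-style late-interaction models: the query becomes several token embeddings, the image becomes many patch embeddings, and relevance is the sum over query tokens of each token's best patch similarity ("MaxSim"). The sketch below illustrates that rule, plus the per-token similarity grid from which alignment heatmaps are typically drawn. It is an assumption-based illustration, not the authors' code: the 3-dimensional embeddings are hand-made toy values, the function names are hypothetical, and no real model or library API (such as colpali_engine) is called.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    # L2-normalize so that dot products become cosine similarities
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def similarity_map(query: List[Vector], patches: List[Vector]) -> List[List[float]]:
    # Rows = query tokens, columns = image patches. Reshaping one row onto the
    # image's patch grid gives a per-token heatmap over the image.
    return [[dot(q, p) for p in patches] for q in query]


def maxsim_score(query: List[Vector], patches: List[Vector]) -> float:
    # Late interaction (ColBERT-style MaxSim): each query token keeps only its
    # best-matching patch, and those per-token maxima are summed.
    return sum(max(row) for row in similarity_map(query, patches))


def rank_images(query: List[Vector], images: List[List[Vector]]) -> List[int]:
    # Return image indices ordered from highest to lowest MaxSim score.
    scores = [maxsim_score(query, patches) for patches in images]
    return sorted(range(len(images)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy embeddings standing in for real model outputs.
query = [normalize([1, 0, 0]), normalize([0, 1, 0])]
img_a = [normalize([1, 0, 0]), normalize([0, 1, 0]), normalize([0, 0, 1])]
img_b = [normalize([0, 0, 1]), normalize([1, 1, 0])]

score_a = maxsim_score(query, img_a)
score_b = maxsim_score(query, img_b)
ranking = rank_images(query, [img_a, img_b])
heatmap_a = similarity_map(query, img_a)
```

In this toy case img_a scores 2.0 because each query token finds an exactly matching patch, while img_b's best patch only partially aligns with each token (about 0.707 per token), so img_a ranks first. Keeping per-token maxima instead of pooling everything into one vector is what lets individual rows of the similarity map point to the image regions that drove a match.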


Citation

Oprea, S., Bâra, A. (2025). Transforming Product Discovery and Interpretation Using Vision–Language Models. https://doi.org/10.3390/jtaer20030191

Journal Information
Publication year: 2025
Database: DOAJ
DOI: 10.3390/jtaer20030191
Access: Open Access