arXiv Open Access 2025

Optimizing Product Deduplication in E-Commerce with Multimodal Embeddings

Aysenur Kulunk, Berk Taskin, M. Furkan Eseoglu, H. Bahadir Sahin

Abstract

In large-scale e-commerce marketplaces, duplicate product listings frequently cause consumer confusion and operational inefficiencies, degrading trust in the platform and increasing costs. Traditional keyword-based search methods struggle to identify duplicates accurately because they rely on exact textual matches and neglect the semantic similarities inherent in product titles. To address these challenges, we introduce a scalable, multimodal product deduplication system designed specifically for the e-commerce domain. Our approach employs a domain-specific text model based on the BERT architecture in conjunction with Masked Autoencoders for image representations. Both architectures are augmented with dimensionality reduction techniques to produce compact 128-dimensional embeddings without significant information loss. Complementing this, we also developed a novel decider model that leverages both text and image vectors. By integrating these feature extraction mechanisms with Milvus, an optimized vector database, our system supports efficient, high-precision similarity search across product catalogs exceeding 200 million items while consuming only 100 GB of system RAM. Empirical evaluations demonstrate that our matching system achieves a macro-average F1 score of 0.90, outperforming third-party solutions, which attain an F1 score of 0.83. Our findings show the potential of combining domain-specific adaptations with state-of-the-art machine learning techniques to mitigate duplicate listings in large-scale e-commerce environments.
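To illustrate the retrieve-then-decide flow the abstract describes, here is a minimal Python sketch. It is not the authors' implementation. It assumes 128-dimensional text and image embeddings are already available as NumPy arrays. A brute-force cosine search stands in for the Milvus query. The `is_duplicate` rule (a weighted mean compared against a threshold) is a hypothetical placeholder for the paper's learned decider model, and its weight and threshold values are arbitrary. All function names are illustrative. The `macro_f1` helper shows how a macro-average F1 score is computed in general: per-class F1 values are averaged without weighting by class size.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def top_k_neighbors(query: np.ndarray, catalog: np.ndarray, k: int):
    """Brute-force cosine top-k over unit-length rows (stand-in for a vector-database query)."""
    sims = catalog @ query              # shape: (n_items,)
    order = np.argsort(-sims)[:k]       # highest similarity first
    return order, sims[order]


def is_duplicate(text_sim: float, image_sim: float,
                 w_text: float = 0.5, threshold: float = 0.9) -> bool:
    """Hypothetical decider: weighted mean of both similarities against a threshold."""
    return w_text * text_sim + (1.0 - w_text) * image_sim >= threshold


def find_duplicates(item_id: int, text_emb: np.ndarray, image_emb: np.ndarray, k: int = 10):
    """Retrieve candidates by text similarity, then confirm each one using both modalities."""
    # Normalizing on every call keeps the sketch short; a real system would do this once.
    text_emb, image_emb = l2_normalize(text_emb), l2_normalize(image_emb)
    candidates, text_sims = top_k_neighbors(text_emb[item_id], text_emb, k + 1)
    duplicates = []
    for cand, t_sim in zip(candidates, text_sims):
        if cand == item_id:             # skip the query item itself
            continue
        i_sim = float(image_emb[cand] @ image_emb[item_id])
        if is_duplicate(float(t_sim), i_sim):
            duplicates.append(int(cand))
    return duplicates


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 scores, with F1 = 2TP / (2TP + FP + FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for label in np.union1d(y_true, y_pred):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```

Calling `find_duplicates(item_id, text_emb, image_emb)` returns the catalog indices judged to be duplicates of `item_id`. Retrieving candidates in one modality and confirming with both is only one possible design; the abstract does not specify how the system combines the two searches.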

Authors (4)

Aysenur Kulunk
Berk Taskin
M. Furkan Eseoglu
H. Bahadir Sahin

Citation Format

Kulunk, A., Taskin, B., Eseoglu, M.F., Sahin, H.B. (2025). Optimizing Product Deduplication in E-Commerce with Multimodal Embeddings. https://arxiv.org/abs/2509.15858

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access