arXiv Open Access 2025

SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance

Pengkun Jiao, Yiming Jin, Jianhui Yang, Chenhe Dong, Zerui Huang, +4 others

Abstract

Query-product relevance prediction is vital for AI-driven e-commerce, yet current LLM-based approaches face a dilemma: SFT and DPO struggle with long-tail generalization due to coarse supervision, while traditional RLVR suffers from sparse feedback that fails to correct intermediate reasoning errors. We propose Stepwise Hybrid Examination (SHE), an RL framework that ensures logical consistency through Stepwise Reward Policy Optimization (SRPO). SRPO utilizes a hybrid reward mechanism, combining generative reward models with human-annotated verifiers, to provide fine-grained, step-level signals. To further enhance stability, SHE incorporates diversified data filtering to maintain policy entropy and a multi-stage curriculum learning protocol for progressive skill acquisition. Extensive experiments on real-world search benchmarks show that SHE improves both reasoning quality and relevance-prediction accuracy in large-scale e-commerce settings, outperforming SFT, DPO, GRPO, and other baselines, while also enhancing interpretability and robustness.
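
As a rough illustration of what a step-level hybrid reward could look like, the sketch below scores each reasoning step with a verifier when a human annotation exists for that step and falls back to a generative reward model score otherwise. This is not the paper's SRPO implementation: the `Step` class, the `hybrid_step_rewards` function, the `grm_score` callable, and the rule for choosing between the two reward sources are all hypothetical, introduced only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One reasoning step of a relevance judgment (hypothetical structure)."""
    text: str
    predicted_label: Optional[str] = None  # label the policy produced at this step
    gold_label: Optional[str] = None       # human annotation, if one exists


def hybrid_step_rewards(
    steps: List[Step],
    grm_score: Callable[[str], float],
) -> List[float]:
    """Return one reward per step instead of a single sparse outcome reward.

    Assumption: steps with a human annotation are checked by exact match
    (the "verifier"); all other steps are scored by a generative reward
    model, represented here by the `grm_score` callable.
    """
    rewards = []
    for step in steps:
        if step.gold_label is not None:
            # Verifiable step: compare against the human annotation.
            rewards.append(1.0 if step.predicted_label == step.gold_label else 0.0)
        else:
            # Free-form step: defer to the (stubbed) generative reward model.
            rewards.append(grm_score(step.text))
    return rewards
```

With a stub reward model that always returns 0.5, a trajectory with a correct annotated step, an unannotated step, and an incorrect annotated final step would receive the rewards `[1.0, 0.5, 0.0]`, so the error in the final step is penalized without discarding credit for the earlier steps.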

Authors (9)

Pengkun Jiao
Yiming Jin
Jianhui Yang
Chenhe Dong
Zerui Huang
Shaowei Yao
Xiaojiang Zhou
Dan Ou
Haihong Tang

Citation Format

Jiao, P., Jin, Y., Yang, J., Dong, C., Huang, Z., Yao, S. et al. (2025). SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance. https://arxiv.org/abs/2510.07972

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access