arXiv Open Access 2023

Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition

Zixiao Wang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Boqiang Zhang +1 more

Abstract

In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP) model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to leverage both visual and linguistic knowledge in CLIP. Different from previous CLIP-based methods mainly considering feature generalization on visual encoding, we propose a symmetrical distillation strategy (SDS) that further captures the linguistic knowledge in the CLIP text encoder. By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow that covers not only visual but also linguistic information for distillation. Benefiting from the natural alignment in CLIP, such guidance flow provides a progressive optimization objective from vision to language, which can supervise the STR feature forwarding process layer-by-layer. Besides, a new Linguistic Consistency Loss (LCL) is proposed to enhance the linguistic capability by considering second-order statistics during the optimization. Overall, CLIP-OCR is the first to design a smooth transition between image and text for the STR task. Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks. Code will be available at https://github.com/wzx99/CLIPOCR.
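The abstract mentions that the Linguistic Consistency Loss considers second-order statistics of the distilled features. As a rough illustration only (the exact formulation is defined in the paper itself), a minimal NumPy sketch of a distillation loss combining a first-order feature-matching term with a second-order Gram-matrix consistency term might look like this; the function name, the `alpha` weighting, and the Gram normalization are all hypothetical:

```python
import numpy as np

def linguistic_consistency_loss(student, teacher, alpha=1.0):
    """Hedged sketch of a distillation loss with a second-order term.

    student, teacher: (T, D) arrays of features at matched layers
    (T positions, D channels). `alpha` is a hypothetical weight.
    """
    # First-order term: match the features directly (plain MSE).
    first = np.mean((student - teacher) ** 2)
    # Second-order term: match position-by-position Gram matrices,
    # which capture pairwise relations between sequence positions
    # (one common way to encode "second-order statistics").
    gram_s = student @ student.T / student.shape[1]
    gram_t = teacher @ teacher.T / teacher.shape[1]
    second = np.mean((gram_s - gram_t) ** 2)
    return first + alpha * second
```

The loss is zero when student and teacher features coincide and grows with both element-wise and relational mismatch.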

Topics & Keywords

Authors (6)

Zixiao Wang

Hongtao Xie

Yuxin Wang

Jianjun Xu

Boqiang Zhang

Yongdong Zhang

Citation Format

Wang, Z., Xie, H., Wang, Y., Xu, J., Zhang, B., & Zhang, Y. (2023). Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition. https://arxiv.org/abs/2310.04999

Quick Access

View at Source
Journal Information
Publication Year
2023
Language
en
Source Database
arXiv
Access
Open Access ✓