TY - JOUR
T1 - Scene Text Recognition With Dual Encoders
AU - Wang, Yao
AU - Ha, Jong Eun
N1 - Publisher Copyright:
© ICROS 2023.
PY - 2023
Y1 - 2023
N2 - Despite significant advances in scene text recognition, current models still face substantial challenges, particularly with irregular text images featuring complex backgrounds, curved text, diverse fonts, and distortions. While convolutional neural network (CNN)-based text recognition networks have demonstrated commendable performance, they struggle with these challenges. Recently, transformer-based feature extractors have shown advantages in extracting global features from images, especially from irregular text images. Through self-attention, transformers establish information connections between different parts of an image, mitigating the impact of uneven character distribution. This study proposes multi-encoder scene text recognition (MESTR), a hybrid approach that combines a CNN-based and a transformer-based feature extractor. MESTR extracts local and global features from text images simultaneously and integrates both types of features to enhance performance. During training, a guiding connectionist temporal classification (CTC) decoder [6] is employed as a compensatory training strategy for the attentional decoder. Experiments across seven benchmarks demonstrate the efficacy and robustness of MESTR, and ablation experiments validate the effectiveness of the proposed algorithm for scene text recognition.
AB - Despite significant advances in scene text recognition, current models still face substantial challenges, particularly with irregular text images featuring complex backgrounds, curved text, diverse fonts, and distortions. While convolutional neural network (CNN)-based text recognition networks have demonstrated commendable performance, they struggle with these challenges. Recently, transformer-based feature extractors have shown advantages in extracting global features from images, especially from irregular text images. Through self-attention, transformers establish information connections between different parts of an image, mitigating the impact of uneven character distribution. This study proposes multi-encoder scene text recognition (MESTR), a hybrid approach that combines a CNN-based and a transformer-based feature extractor. MESTR extracts local and global features from text images simultaneously and integrates both types of features to enhance performance. During training, a guiding connectionist temporal classification (CTC) decoder [6] is employed as a compensatory training strategy for the attentional decoder. Experiments across seven benchmarks demonstrate the efficacy and robustness of MESTR, and ablation experiments validate the effectiveness of the proposed algorithm for scene text recognition.
KW - convolutional neural network
KW - deep learning
KW - scene text recognition
KW - transformer
UR - https://www.scopus.com/pages/publications/85180468479
U2 - 10.5302/J.ICROS.2023.23.0146
DO - 10.5302/J.ICROS.2023.23.0146
M3 - Article
AN - SCOPUS:85180468479
SN - 1976-5622
VL - 29
SP - 973
EP - 979
JO - Journal of Institute of Control, Robotics and Systems
JF - Journal of Institute of Control, Robotics and Systems
IS - 12
ER -