TY - GEN
T1 - Scene Text Recognition with Multi-Encoders
AU - Wang, Yao
AU - Ha, Jong Eun
N1 - Publisher Copyright: © 2022 ICROS.
PY - 2022
Y1 - 2022
AB - Although text recognition has advanced significantly over the years, current models still face major challenges, especially on irregular text images with complex backgrounds, curved text, diverse fonts, or distortions. CNN-based text recognition networks perform well but remain affected by these challenges. Recently, transformer-based feature extractors have shown clear advantages for global feature extraction on images. For irregular text images in particular, self-attention can establish connections between all parts of the image, which reduces the influence of the irregular distribution of characters. Therefore, this paper proposes MESTR (Multi-Encoders Scene Text Recognition), which combines a CNN-based [1] [2] [6] feature extractor with a transformer-based feature extractor. MESTR extracts local and global features of text images simultaneously and then integrates the global features into the local features. During training, CTC [6] is used as a guide in the decoder part, serving as a compensating training strategy for the attentional decoder. Experimental results show that the proposed MESTR achieves competitive results on all seven benchmarks. Ablation experiments are also provided to show the effectiveness of the improved parts of the text recognition model.
KW - Convolutional neural network
KW - Deep learning
KW - Scene text recognition
KW - Transformer
UR - https://www.scopus.com/pages/publications/85146577185
U2 - 10.23919/ICCAS55662.2022.10003838
DO - 10.23919/ICCAS55662.2022.10003838
M3 - Conference contribution
AN - SCOPUS:85146577185
T3 - International Conference on Control, Automation and Systems
SP - 1615
EP - 1620
BT - 2022 22nd International Conference on Control, Automation and Systems, ICCAS 2022
PB - IEEE Computer Society
T2 - 22nd International Conference on Control, Automation and Systems, ICCAS 2022
Y2 - 27 November 2022 through 1 December 2022
ER -