Scene Text Recognition With Dual Encoders

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Despite significant advances in scene text recognition, current models still struggle with irregular text images that feature complex backgrounds, curved text, diverse fonts, and distortions. Although convolutional neural network (CNN)-based text recognition networks have demonstrated commendable performance, they remain susceptible to these challenges. Recently, transformer-based feature extractors have shown advantages in extracting global features from images, especially irregular text images. Through self-attention, these transformers relate different parts of the image to one another, thereby mitigating the impact of uneven character distribution. This study proposes multi-encoder scene text recognition (MESTR), a hybrid approach that combines a CNN-based and a transformer-based feature extractor. MESTR extracts local and global features from text images simultaneously and integrates both types of features to enhance performance. During training, we employ a guiding connectionist temporal classification (CTC) decoder [6] as a compensatory training strategy for the attentional decoder. Experiments on seven benchmarks demonstrate the efficacy and robust performance of MESTR. In addition, ablation experiments validate the effectiveness of the proposed algorithm for scene text recognition.
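
For readers unfamiliar with CTC, the following minimal sketch illustrates the standard rule by which a CTC decoder's per-frame predictions are read out as text: take the best class at each time step, merge consecutive repeats, then drop blanks. It is not the MESTR implementation. The function name, the toy alphabet, the scores, and the choice of index 0 as the blank symbol are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumption: class index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy CTC read-out: argmax per frame, merge repeats, drop blanks.

    frame_scores: list of per-frame score lists, one score per class
                  (class 0 is blank, class i maps to alphabet[i - 1]).
    """
    # 1) Pick the best-scoring class index at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2) Merge consecutive repeats, then 3) remove blank symbols.
    collapsed = [idx for idx, _ in groupby(best_path) if idx != BLANK]
    return "".join(alphabet[idx - 1] for idx in collapsed)


# Toy example (hypothetical scores): frames read a, a, blank, a, b.
alphabet = "ab"
frame_scores = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank (separates the two a's)
    [0.20, 0.70, 0.10],  # a
    [0.10, 0.20, 0.70],  # b
]
decoded = ctc_greedy_decode(frame_scores, alphabet)
print(decoded)  # aab
```

The blank between the second and third frames is what allows the doubled "a" to survive the merge step. Because this read-out produces one prediction per frame, a CTC branch can supply a frame-level training signal alongside an attentional decoder.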

Original language: English
Pages (from-to): 973-979
Number of pages: 7
Journal: Journal of Institute of Control, Robotics and Systems
Volume: 29
Issue number: 12
State: Published - 2023

Keywords

  • convolutional neural network
  • deep learning
  • scene text recognition
  • transformer
