TY - JOUR
T1 - Exploring the effectiveness of data-centric AI approaches to developing a prescription recognition system
AU - Kim, Jihyo
AU - Mun, Daejeong
AU - Hwang, Jaemoon
AU - Hwang, Sangheum
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2025.
PY - 2025
Y1 - 2025
AB - Optical character recognition (OCR) has been in high demand across a wide range of fields and has been rapidly evolving since deep learning was introduced. The mainstream of OCR research focuses on model-centric approaches that improve performance by designing novel model architectures or learning algorithms. However, for industrial practitioners, such model-based approaches are not particularly useful for constructing application-specific OCR systems. In this study, we investigate the effectiveness of a data-centric approach to developing a Korean prescription recognition system. The proposed data-centric approach utilizes domain-specific synthetic data that reflect the visual properties and contextual priors of the target domain, allowing a model to learn domain-specific knowledge from training data. Specifically, the proposed data-centric approach constructs a domain-specific word dictionary for domain priors, and generates training synthetic images containing the visual properties of the target domain. For text recognition in prescription documents, where specialized knowledge is required, we demonstrate that the proposed data-centric approach is much more effective than model-centric approaches. Training using domain-specific synthetic data generated by the proposed data-centric approach facilitates precise predictions for texts requiring domain-specific knowledge.
KW - Data-centric AI
KW - Optical character recognition
KW - Scene text recognition
KW - Synthetic data generation
UR - https://www.scopus.com/pages/publications/105005976765
U2 - 10.1007/s10032-025-00525-x
DO - 10.1007/s10032-025-00525-x
M3 - Article
AN - SCOPUS:105005976765
SN - 1433-2833
JO - International Journal on Document Analysis and Recognition
JF - International Journal on Document Analysis and Recognition
M1 - 103544
ER -