TY - JOUR
T1 - Real-world sentence boundary detection using multitask learning
T2 - A case study on French
AU - Lim, Kyungtae T.
AU - Park, Jungyeul
N1 - Publisher Copyright: © The Author(s), 2022. Published by Cambridge University Press.
PY - 2024/1/6
Y1 - 2024/1/6
N2 - We propose a novel approach for sentence boundary detection in text datasets in which boundaries are not evident (e.g., sentence fragments). Although detecting sentence boundaries without punctuation marks has rarely been explored in written text, current real-world textual data suffer from widespread lack of proper start/stop signaling. Herein, we annotate a dataset with linguistic information, such as parts of speech and named entity labels, to boost the sentence boundary detection task. Via experiments, we obtained F1 scores up to 98.07% using the proposed multitask neural model, including a score of 89.41% for sentences completely lacking punctuation marks. We also present an ablation study and provide a detailed analysis to demonstrate the effectiveness of the proposed multitask learning method.
AB - We propose a novel approach for sentence boundary detection in text datasets in which boundaries are not evident (e.g., sentence fragments). Although detecting sentence boundaries without punctuation marks has rarely been explored in written text, current real-world textual data suffer from widespread lack of proper start/stop signaling. Herein, we annotate a dataset with linguistic information, such as parts of speech and named entity labels, to boost the sentence boundary detection task. Via experiments, we obtained F1 scores up to 98.07% using the proposed multitask neural model, including a score of 89.41% for sentences completely lacking punctuation marks. We also present an ablation study and provide a detailed analysis to demonstrate the effectiveness of the proposed multitask learning method.
KW - Corpus creation
KW - French
KW - Multitask learning
KW - Sentence boundary detection
UR - https://www.scopus.com/pages/publications/85128531911
U2 - 10.1017/S1351324922000134
DO - 10.1017/S1351324922000134
M3 - Article
AN - SCOPUS:85128531911
SN - 1351-3249
VL - 30
SP - 150
EP - 170
JO - Natural Language Engineering
JF - Natural Language Engineering
IS - 1
ER -