Abstract
We propose a novel approach for sentence boundary detection in text datasets in which boundaries are not evident (e.g., sentence fragments). Although detecting sentence boundaries without punctuation marks has rarely been explored in written text, current real-world textual data suffer from a widespread lack of proper start/stop signaling. Herein, we annotate a dataset with linguistic information, such as parts of speech and named entity labels, to boost the sentence boundary detection task. In experiments, the proposed multitask neural model obtained F1 scores of up to 98.07%, including a score of 89.41% for sentences completely lacking punctuation marks. We also present an ablation study and a detailed analysis to demonstrate the effectiveness of the proposed multitask learning method.
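For readers unfamiliar with the metric, the F1 scores reported above can be read as the harmonic mean of precision and recall over predicted sentence boundaries. The following is a minimal sketch only: the function name `boundary_f1`, the per-token 0/1 labeling, and the toy data are assumptions made for illustration, and the evaluation protocol actually used in the paper is not described on this page.

```python
def boundary_f1(gold, pred):
    """F1 over positions labeled 1 (assumed to mean 'a sentence ends here').

    gold, pred: equal-length sequences of 0/1 labels, one per token.
    This labeling scheme is an illustrative assumption, not the paper's.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        # No correctly predicted boundary: F1 is 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: three gold boundaries, two found, one spurious prediction.
gold = [0, 0, 1, 0, 0, 1, 0, 1]
pred = [0, 0, 1, 0, 1, 1, 0, 0]
score = boundary_f1(gold, pred)
```

In this toy case precision and recall are both 2/3, so the F1 score is also 2/3.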
| Original language | English |
|---|---|
| Pages (from-to) | 150-170 |
| Number of pages | 21 |
| Journal | Natural Language Engineering |
| Volume | 30 |
| Issue number | 1 |
| State | Published - 6 Jan 2024 |
Keywords
- Corpus creation
- French
- Multitask learning
- Sentence boundary detection
Published as 'Real-world sentence boundary detection using multitask learning: A case study on French'.