TY - JOUR
T1 - Self-Supervised Representation Learning for Basecalling Nanopore Sequencing Data
AU - Vintimilla, Carlos
AU - Hwang, Sangheum
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Basecalling is a complex task that involves translating noisy raw electrical signals into their corresponding DNA sequences. Several deep learning architectures have improved basecalling accuracy, but all of them rely on a supervised training scheme and require large annotated datasets to achieve high accuracy. However, obtaining labeled data for some species can be extremely challenging, making it difficult to generate a large amount of ground-truth labels for training basecalling models. Self-supervised representation learning (SSL) has been shown to alleviate the need for large annotated datasets and, in some cases, to enhance model performance. In this work, we investigate the effectiveness of self-supervised representation learning frameworks on the basecalling task. We consider basecallers built on two well-known SSL frameworks, SimCLR and wav2vec2.0, and show that a basecaller pretrained with self-supervision outperforms its supervised counterparts in both low- and high-data regimes, with up to a 3% increase in performance when trained on only 1% of the total labeled data. Our results suggest that learning strong representations from unlabeled data can improve basecalling accuracy over state-of-the-art models across different architectures. Furthermore, we provide insights into representation learning for the basecalling task and discuss the role of continuous representations during SSL pretraining. Our code is publicly available at https://github.com/carlosvint/SSLBasecalling.
AB - Basecalling is a complex task that involves translating noisy raw electrical signals into their corresponding DNA sequences. Several deep learning architectures have improved basecalling accuracy, but all of them rely on a supervised training scheme and require large annotated datasets to achieve high accuracy. However, obtaining labeled data for some species can be extremely challenging, making it difficult to generate a large amount of ground-truth labels for training basecalling models. Self-supervised representation learning (SSL) has been shown to alleviate the need for large annotated datasets and, in some cases, to enhance model performance. In this work, we investigate the effectiveness of self-supervised representation learning frameworks on the basecalling task. We consider basecallers built on two well-known SSL frameworks, SimCLR and wav2vec2.0, and show that a basecaller pretrained with self-supervision outperforms its supervised counterparts in both low- and high-data regimes, with up to a 3% increase in performance when trained on only 1% of the total labeled data. Our results suggest that learning strong representations from unlabeled data can improve basecalling accuracy over state-of-the-art models across different architectures. Furthermore, we provide insights into representation learning for the basecalling task and discuss the role of continuous representations during SSL pretraining. Our code is publicly available at https://github.com/carlosvint/SSLBasecalling.
KW - Basecalling
KW - nanopore sequencing
KW - representation learning
KW - self-supervised learning
KW - wav2vec2.0
UR - http://www.scopus.com/inward/record.url?scp=85200820606&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3440882
DO - 10.1109/ACCESS.2024.3440882
M3 - Article
AN - SCOPUS:85200820606
SN - 2169-3536
VL - 12
SP - 109355
EP - 109366
JO - IEEE Access
JF - IEEE Access
ER -