TY - JOUR
T1 - Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information
AU - Min, Seonwoo
AU - Park, Seunghyun
AU - Kim, Siwon
AU - Choi, Hyun Soo
AU - Lee, Byunghan
AU - Yoon, Sungroh
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2021
Y1 - 2021
AB - To bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences, several studies have adopted semi-supervised learning for protein sequence modeling. In these studies, models are pre-trained on a substantial amount of unlabeled data, and the learned representations are transferred to various downstream tasks. Most pre-training methods rely solely on language modeling and often exhibit limited performance. In this paper, we introduce a novel pre-training scheme called PLUS, which stands for Protein sequence representations Learned Using Structural information. PLUS consists of masked language modeling and a complementary, protein-specific pre-training task, namely same-family prediction. PLUS can be used to pre-train various model architectures. In this work, we use PLUS to pre-train a bidirectional recurrent neural network and refer to the resulting model as PLUS-RNN. Our experimental results demonstrate that PLUS-RNN outperforms other models of similar size pre-trained solely with language modeling in six out of seven widely used protein biology tasks. Furthermore, we present results from qualitative interpretation analyses to illustrate the strengths of PLUS-RNN. PLUS provides a novel way to exploit evolutionary relationships among unlabeled proteins and is broadly applicable across a variety of protein biology tasks. We expect that the gap between the numbers of unlabeled and labeled proteins will continue to grow exponentially and that the proposed pre-training method will play a larger role. All data and code used in this study are available at https://github.com/mswzeus/PLUS.
KW - Protein sequence
KW - protein structure
KW - representation learning
KW - semi-supervised learning
UR - https://www.scopus.com/pages/publications/85114716058
U2 - 10.1109/ACCESS.2021.3110269
DO - 10.1109/ACCESS.2021.3110269
M3 - Article
AN - SCOPUS:85114716058
SN - 2169-3536
VL - 9
SP - 123912
EP - 123926
JO - IEEE Access
JF - IEEE Access
ER -