TY - JOUR
T1 - IMPROVING THE PERFORMANCE OF KNOWLEDGE DISTILLATION IN NATURAL LANGUAGE PROCESSING USING A POOR TEACHER AND TRIPLET LOSS
AU - Park, Hyorim
AU - Cho, Nam Wook
N1 - Publisher Copyright:
© 2025, ICIC International. All rights reserved.
PY - 2025/3
Y1 - 2025/3
N2 - Knowledge distillation transfers knowledge from a complex model to a simpler one in order to train a smaller, more efficient model. The performance of the distilled model, a small model that learns to imitate a larger one, depends on how well it can mimic that larger model. However, existing methods rely on a structure that uses only one teacher model as the imitation target during training. In existing knowledge distillation for natural language processing, the student model is trained on the outputs of the teacher model's Multi-Head Attention Layer and Hidden State, and a single teacher is commonly used. This paper proposes a methodology for improving the performance of knowledge distillation classification models in natural language processing by introducing a Poor Teacher as an auxiliary teacher and by utilizing Triplet Loss. Triplet Loss is a loss function that trains positive examples to be close to the anchor and negative examples to be far from it. While the teacher model serves as the positive example, the Poor Teacher, obtained by reducing the layers of the imitation-target teacher model to one, is designed to show relatively low performance and serves as the negative example, containing information that is inappropriate for the student model to learn. During training of the student model, Triplet Loss pulls the student closer to the teacher model to be imitated while enforcing a greater separation from the Poor Teacher, resulting in performance improvements on benchmark tasks of the GLUE dataset compared to the baseline using the conventional method.
AB - Knowledge distillation transfers knowledge from a complex model to a simpler one in order to train a smaller, more efficient model. The performance of the distilled model, a small model that learns to imitate a larger one, depends on how well it can mimic that larger model. However, existing methods rely on a structure that uses only one teacher model as the imitation target during training. In existing knowledge distillation for natural language processing, the student model is trained on the outputs of the teacher model's Multi-Head Attention Layer and Hidden State, and a single teacher is commonly used. This paper proposes a methodology for improving the performance of knowledge distillation classification models in natural language processing by introducing a Poor Teacher as an auxiliary teacher and by utilizing Triplet Loss. Triplet Loss is a loss function that trains positive examples to be close to the anchor and negative examples to be far from it. While the teacher model serves as the positive example, the Poor Teacher, obtained by reducing the layers of the imitation-target teacher model to one, is designed to show relatively low performance and serves as the negative example, containing information that is inappropriate for the student model to learn. During training of the student model, Triplet Loss pulls the student closer to the teacher model to be imitated while enforcing a greater separation from the Poor Teacher, resulting in performance improvements on benchmark tasks of the GLUE dataset compared to the baseline using the conventional method.
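N1 - The triplet objective described in the abstract can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the tensor names, the pooling of hidden states into fixed-size vectors, the hidden dimension, and the margin value are all assumptions made for demonstration; only torch.nn.TripletMarginLoss is a standard library call.

# Minimal sketch of the triplet objective from the abstract (illustrative only,
# not the paper's implementation). Assumptions: hidden states are already pooled
# to fixed-size vectors, and random tensors stand in for real model outputs.
import torch

batch_size, hidden_dim = 8, 312  # hypothetical shapes

# Stand-ins for pooled hidden states of the three models.
student_hidden = torch.randn(batch_size, hidden_dim, requires_grad=True)  # anchor
teacher_hidden = torch.randn(batch_size, hidden_dim)                      # positive (teacher)
poor_teacher_hidden = torch.randn(batch_size, hidden_dim)                 # negative (Poor Teacher)

# Triplet loss: pull the student (anchor) toward the teacher (positive)
# and push it away from the Poor Teacher (negative) by at least `margin`.
triplet = torch.nn.TripletMarginLoss(margin=1.0, p=2)
loss = triplet(student_hidden, teacher_hidden, poor_teacher_hidden)

loss.backward()  # gradients flow only into the student's representation
print(loss.item())

In the paper's setting this term would be combined with the conventional distillation losses on the attention and hidden-state outputs; the combination weights are not specified in the abstract.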
KW - Knowledge distillation
KW - Natural language processing
KW - Student model
KW - Teacher model
KW - Triplet Loss
UR - http://www.scopus.com/inward/record.url?scp=85217202906&partnerID=8YFLogxK
U2 - 10.24507/icicelb.16.03.279
DO - 10.24507/icicelb.16.03.279
M3 - Article
AN - SCOPUS:85217202906
SN - 2185-2766
VL - 16
SP - 279
EP - 284
JO - ICIC Express Letters, Part B: Applications
JF - ICIC Express Letters, Part B: Applications
IS - 3
ER -