IMPROVING THE PERFORMANCE OF KNOWLEDGE DISTILLATION IN NATURAL LANGUAGE PROCESSING USING A POOR TEACHER AND TRIPLET LOSS

Hyorim Park, Nam Wook Cho

Research output: Contribution to journal › Article › peer-review

Abstract

Knowledge distillation transfers knowledge from a complex model to a smaller, more efficient one. The performance of a knowledge distillation model, i.e., a small student model trained to imitate a larger model, depends on how well it can mimic the larger model. However, existing methods rely on a structure that uses only one teacher model during training. In existing knowledge distillation for natural language processing, the student model is trained on the outputs of the teacher model's Multi-Head Attention layers and Hidden States, and a single teacher is commonly used. This paper proposes a methodology for improving the performance of knowledge-distilled classification models for natural language processing by introducing a Poor Teacher as an auxiliary teacher and by utilizing Triplet Loss. Triplet Loss is a loss function that trains positive examples to be close to the anchor and negative examples to be far from it. While the teacher model serves as the positive example, the Poor Teacher, obtained by reducing the layers of the imitated teacher model to a single layer, is designed to show relatively low performance and serves as the negative example, containing information that is inappropriate for the student model to learn. During training, Triplet Loss pulls the student model closer to the teacher model it should learn from while enforcing a greater separation from the Poor Teacher, resulting in performance improvements on benchmark tasks from the GLUE dataset compared to a baseline trained with the conventional method.
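The following is a minimal sketch of the triplet-loss objective described in the abstract, assuming a PyTorch implementation: the student's hidden state acts as the anchor, the full teacher's hidden state as the positive example, and the one-layer Poor Teacher's hidden state as the negative example. The function name, tensor shapes, and margin value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def poor_teacher_triplet_loss(student_hidden: torch.Tensor,
                              teacher_hidden: torch.Tensor,
                              poor_teacher_hidden: torch.Tensor,
                              margin: float = 1.0) -> torch.Tensor:
    """Triplet loss with the student output as the anchor, the full teacher
    as the positive example, and the one-layer Poor Teacher as the negative
    example, so the student is pulled toward the teacher and pushed away
    from the Poor Teacher."""
    # PyTorch's built-in triplet loss: max(d(a, p) - d(a, n) + margin, 0)
    return F.triplet_margin_loss(
        student_hidden,        # anchor
        teacher_hidden,        # positive
        poor_teacher_hidden,   # negative
        margin=margin,
    )

# Illustrative usage: pooled hidden states of shape (batch_size, hidden_dim)
# taken from the student, teacher, and Poor Teacher models.
student_h = torch.randn(8, 768, requires_grad=True)
teacher_h = torch.randn(8, 768)
poor_h = torch.randn(8, 768)

loss = poor_teacher_triplet_loss(student_h, teacher_h, poor_h)
loss.backward()
```

In practice this term would be combined with the conventional distillation losses on the Multi-Head Attention outputs and Hidden States mentioned above; the sketch only shows the added triplet component.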

Original language: English
Pages (from-to): 279-284
Number of pages: 6
Journal: ICIC Express Letters, Part B: Applications
Volume: 16
Issue number: 3
DOIs
State: Published - Mar 2025

Keywords

  • Knowledge distillation
  • Natural language processing
  • Student model
  • Teacher model
  • Triplet Loss
