Affective audio feature-based multimodal learning model for extracting video highlights

Yujung Hwang, Chanhyeok Kim, Gangmin Park, Hyuk Yoon Kwon

Research output: Contribution to journal › Article › peer-review

Abstract

This study proposes a novel multimodal learning model for video highlight detection that uniquely integrates visual features from video frames with affective audio features, including emotion labels, arousal, and valence. To our knowledge, this is the first approach to leverage all three affective components of audio for highlight detection. The model employs a Long Short-Term Memory (LSTM) network to fuse visual features extracted by a Vision Transformer (ViT) with audio features extracted by Wav2Vec 2.0. For evaluation, we constructed a combined dataset of YouTube clips and KBS public broadcast videos. Experimental results show that our model significantly outperforms audio emotion-based and video-only baselines, achieving F1 score improvements of approximately 27.35% and 66.15%, respectively. An ablation study further validates the contribution of fusing affective audio and visual features.
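To make the described pipeline concrete, the following is a minimal PyTorch sketch of the kind of fusion the abstract outlines: per-segment ViT frame embeddings and Wav2Vec 2.0 embeddings augmented with affective values (emotion, arousal, valence) are concatenated and passed through an LSTM that scores each segment as a highlight. The feature dimensions, concatenation-based fusion, and sigmoid scoring head are assumptions for illustration; the paper's actual architecture and hyperparameters may differ.

```python
# Hypothetical sketch of ViT + affective-audio fusion with an LSTM.
# Dimensions and the fusion strategy are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class HighlightFusionLSTM(nn.Module):
    def __init__(self, vis_dim=768, aud_dim=768 + 3, hidden_dim=256):
        super().__init__()
        # vis_dim: per-frame ViT embedding size (assumed 768)
        # aud_dim: Wav2Vec 2.0 embedding plus 3 affective values
        #          (emotion score, arousal, valence) -- assumed layout
        self.lstm = nn.LSTM(vis_dim + aud_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # per-segment highlight score

    def forward(self, vis_seq, aud_seq):
        # vis_seq: (batch, time, vis_dim) ViT frame features
        # aud_seq: (batch, time, aud_dim) affective audio features
        fused = torch.cat([vis_seq, aud_seq], dim=-1)   # simple late fusion
        out, _ = self.lstm(fused)
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (batch, time)

# Example: score a clip divided into 10 segments
model = HighlightFusionLSTM()
scores = model(torch.randn(1, 10, 768), torch.randn(1, 10, 771))
```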

Original language: English
Article number: 113600
Journal: Applied Soft Computing
Volume: 183
DOIs
State: Published - Nov 2025

Keywords

  • Audio affective features
  • Feature extraction
  • Multimodal learning
  • Video highlight detection
