TY - JOUR
T1 - Affective audio feature-based multimodal learning model for extracting video highlights
AU - Hwang, Yujung
AU - Kim, Chanhyeok
AU - Park, Gangmin
AU - Kwon, Hyuk Yoon
N1 - Publisher Copyright: © 2025 Elsevier B.V.
PY - 2025/11
Y1 - 2025/11
N2 - This study proposes a novel multimodal learning model for video highlight detection that uniquely integrates visual features from video frames with affective audio features, including emotion labels, arousal, and valence. To our knowledge, this is the first approach to leverage all three affective components in audio for highlight detection. The model employs a Long Short-Term Memory (LSTM) network to fuse features extracted from Vision Transformer (ViT) for video and Wav2Vec 2.0 for audio. For the evaluation, we constructed a combined dataset with YouTube clips and KBS public broadcast videos. Experimental results show that our model significantly outperforms audio emotion-based and video-only baselines, achieving F1 Score improvements of approximately 27.35% and 66.15%, respectively. An ablation study further validates the contribution of affective audio and visual fusion.
KW - Audio affective features
KW - Feature extraction
KW - Multimodal learning
KW - Video highlight detection
UR - https://www.scopus.com/pages/publications/105012376870
U2 - 10.1016/j.asoc.2025.113600
DO - 10.1016/j.asoc.2025.113600
M3 - Article
AN - SCOPUS:105012376870
SN - 1568-4946
VL - 183
JO - Applied Soft Computing
JF - Applied Soft Computing
M1 - 113600
ER -