Abstract
This study proposes a novel multimodal learning model for video highlight detection that integrates visual features from video frames with affective audio features, namely emotion labels, arousal, and valence. To our knowledge, this is the first approach to leverage all three of these affective components of audio for highlight detection. The model employs a Long Short-Term Memory (LSTM) network to fuse features extracted by a Vision Transformer (ViT) for video and by Wav2Vec 2.0 for audio. For evaluation, we constructed a combined dataset of YouTube clips and KBS public broadcast videos. Experimental results show that our model significantly outperforms audio emotion-based and video-only baselines, achieving F1-score improvements of approximately 27.35% and 66.15%, respectively. An ablation study further validates the contributions of the affective audio features and of the audio-visual fusion.
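The abstract describes an early-fusion pipeline: per-timestep visual embeddings (ViT), audio embeddings (Wav2Vec 2.0), and three affective values (emotion, arousal, valence) are combined and passed through an LSTM that scores each timestep for "highlight-ness." The sketch below illustrates this idea only; it is not the authors' implementation. The feature dimensions, the single-layer NumPy LSTM, the random (untrained) weights, and the function name `lstm_fusion_highlight_scores` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_fusion_highlight_scores(visual, audio, affective, hidden=32, seed=0):
    """Score each timestep of a clip for highlight-ness (illustrative sketch).

    visual:    (T, Dv) per-frame embeddings, e.g. from a ViT     (assumption)
    audio:     (T, Da) per-frame embeddings, e.g. from Wav2Vec 2.0 (assumption)
    affective: (T, 3)  emotion score, arousal, valence            (assumption)
    Returns an array of T per-timestep highlight probabilities.
    """
    rng = np.random.default_rng(seed)
    # Early fusion: concatenate all modalities at each timestep.
    x = np.concatenate([visual, audio, affective], axis=1)
    T, D = x.shape

    # Randomly initialised single-layer LSTM (untrained, for illustration only).
    W = rng.normal(0.0, 0.1, (4 * hidden, D + hidden))  # gate weights
    b = np.zeros(4 * hidden)
    w_out = rng.normal(0.0, 0.1, hidden)                # readout to a scalar

    h = np.zeros(hidden)
    c = np.zeros(hidden)
    scores = []
    for t in range(T):
        z = W @ np.concatenate([x[t], h]) + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)          # cell state update
        h = o * np.tanh(c)                  # hidden state
        scores.append(sigmoid(w_out @ h))   # per-timestep highlight probability
    return np.array(scores)
```

In practice the LSTM and readout would be trained end-to-end (the paper reports F1 against audio-only and video-only baselines), and timesteps whose scores exceed a threshold would be selected as highlight segments.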
| | |
|---|---|
| Original language | English |
| Article number | 113600 |
| Journal | Applied Soft Computing |
| Volume | 183 |
| DOIs | |
| State | Published - Nov 2025 |
Keywords
- Audio affective features
- Feature extraction
- Multimodal learning
- Video highlight detection
Fingerprint
Dive into the research topics of 'Affective audio feature-based multimodal learning model for extracting video highlights'. Together they form a unique fingerprint.