TY - GEN
T1 - Describing Environmental Information in Videos Using Machine Learning
AU - Jeong, Yoon Jin
AU - Htun, Soe Sandi
AU - Han, Ji Hyeong
N1 - Publisher Copyright:
© 2021 ICROS.
PY - 2021
Y1 - 2021
N2 - Previous research on the video captioning task has focused on human actions or objects in videos; however, environmental information such as place, time, and weather is also important for understanding videos. Therefore, in this paper, we create a new dataset that adds environmental information labels to the MSVD dataset and train a machine learning model to analyze environmental information from videos. We apply R(2+1)D, a 3D CNN model, to extract video features, and S2VT, an RNN model, to encode the video features and decode the environmental information. We define the problem as a sequence-to-sequence problem rather than multilabel classification because the input is a video, i.e., a sequence of frames, and the output labels are also related to each other; for example, if the place label is outside, then the next label would be weather. We analyze the experimental results using BLEU, METEOR, ROUGE-L, and CIDEr, and our model shows competitive results compared to the state-of-the-art video captioning model.
AB - Previous research on the video captioning task has focused on human actions or objects in videos; however, environmental information such as place, time, and weather is also important for understanding videos. Therefore, in this paper, we create a new dataset that adds environmental information labels to the MSVD dataset and train a machine learning model to analyze environmental information from videos. We apply R(2+1)D, a 3D CNN model, to extract video features, and S2VT, an RNN model, to encode the video features and decode the environmental information. We define the problem as a sequence-to-sequence problem rather than multilabel classification because the input is a video, i.e., a sequence of frames, and the output labels are also related to each other; for example, if the place label is outside, then the next label would be weather. We analyze the experimental results using BLEU, METEOR, ROUGE-L, and CIDEr, and our model shows competitive results compared to the state-of-the-art video captioning model.
KW - 3D CNN
KW - Machine Vision
KW - RNN
KW - Video Captioning
KW - Visual Recognition
UR - https://www.scopus.com/pages/publications/85124188939
U2 - 10.23919/ICCAS52745.2021.9649840
DO - 10.23919/ICCAS52745.2021.9649840
M3 - Conference contribution
AN - SCOPUS:85124188939
T3 - International Conference on Control, Automation and Systems
SP - 2247
EP - 2249
BT - 2021 21st International Conference on Control, Automation and Systems, ICCAS 2021
PB - IEEE Computer Society
T2 - 21st International Conference on Control, Automation and Systems, ICCAS 2021
Y2 - 12 October 2021 through 15 October 2021
ER -