Abstract
This research enhances video understanding by leveraging Transformer-based models such as BERT for feature representation in two tasks: video question answering and humor prediction. For video QA, using BERT to represent both visual and subtitle semantics improved accuracy on the TVQA and Pororo datasets. A comparative study of Transformer models linked their performance differences to their pre-training methods. For humor prediction, a novel multimodal method that combines pose, face, and subtitle features within a sliding window outperformed previous approaches on a new comedy dataset. The work highlights the importance of selecting appropriate features and models for deeper video analysis.
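As a rough, illustrative sketch of the feature-representation idea described above (not the speaker's actual pipeline), the Python snippet below embeds subtitle text with a pre-trained BERT encoder and then stacks per-timestep pose, face, and subtitle features into overlapping sliding windows for a humor classifier. The use of the Hugging Face transformers library, the window size, and all function and variable names are assumptions made for illustration only.

    import numpy as np
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encoder = BertModel.from_pretrained("bert-base-uncased")
    encoder.eval()

    def bert_embed(text):
        # Return the 768-d [CLS] embedding for a subtitle line (or for a
        # string of visual concept labels, so that visual and subtitle
        # semantics share one representation space).
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=64)
        with torch.no_grad():
            return encoder(**inputs).last_hidden_state[0, 0].numpy()

    def sliding_windows(pose, face, subtitles, window=8):
        # pose: (T, Dp) and face: (T, Df) arrays of per-timestep features;
        # subtitles: list of T strings aligned to the same timesteps.
        text = np.stack([bert_embed(s) for s in subtitles])   # (T, 768)
        feats = np.concatenate([pose, face, text], axis=1)    # (T, Dp+Df+768)
        # Overlapping windows give the classifier short-range temporal context.
        return np.stack([feats[t:t + window].reshape(-1)
                         for t in range(len(feats) - window + 1)])

Each flattened window could then be fed to a binary classifier that predicts whether the segment precedes a humorous moment, e.g., audience laughter.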
Biography
Prof. YANG Zekun is an Assistant Professor at Tokyo University of Science. He graduated from Osaka University in 2021 and previously worked at Nagoya University. His research interests are machine learning and language processing.