The results of the 2022 AVA Active Speaker task, part of the ActivityNet AVA-Kinetics & Active Speaker Spatio-temporal Action Localization Challenge at CVPR 2022, were released on June 19, 2022. Repeating last year’s success, VIPL again won 1st place in the Active Speaker task.
The ActivityNet Challenge, held annually in conjunction with the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) since 2016, is a major event in video understanding and action recognition. It hosts a diverse range of tasks, including action classification, temporal and spatio-temporal action localization, and event understanding. Among them, the Active Speaker detection task, first introduced by the AVA team at Google Research in 2019, evaluates algorithms that densely detect active speakers in a video sequence, reporting the timestamps of the speech segments along with the positions of the associated speakers. The task is based on YouTube movie clips and is highly challenging due to variations in language, head pose, and video resolution. The team from the Audio-Visual Speech Understanding Group at VIPL (members: Yuanhang Zhang, Master’s student; Susan Liang, undergraduate intern; Dr. Shuang Yang, Associate Professor; Dr. Shiguang Shan, Professor) participated in this year’s task and proposed a novel approach that implicitly models relationships between potential active speakers in different scenes to boost detection performance on long videos. The team achieved 94.47% mAP on the test set, winning the task for the second year in a row.
Figure 1. The competition organizers announcing live that VIPL’s submission has won this year’s task
Figure 2. The top-3 teams in the task, as reported by the challenge organizers (VIPL placed 1st)