Audio-Visual Speech Understanding Group----Visual Information Processing and Learning (VIPL)

Audio-Visual Speech Understanding Group

Leader： Shuang Yang / Shiguang Shan (Professor)

Email： shuang.yang [at] ict dot ac dot cn; sgshan [at] ict dot ac dot cn

Introduction of research group

* The Audio-Visual Speech Understanding Group was founded in 2017. Lip Reading is currently its core task, supported by auxiliary tasks such as Speech Emotion Analysis, Visual Voice Activity Detection, and Visual Keyword Spotting. The related technologies can assist speech recognition to achieve more intelligent and robust human-computer interaction, or can be used independently in fields such as intelligent teaching, security verification, and public security.

For the team's code and papers, please refer to: https://github.com/yshnny/Collections-of-The-Lip-Reading-Work-of-VIPL-LR

* News:

2020.3: Four papers were accepted by IEEE FG 2020. Our results on both LRW and LRW-1000 achieved the state of the art.

2019.8: Our lip reading system was awarded the “Innovation Star of Artificial Intelligence” in the multimedia information technology competition at the Chinese Congress on Artificial Intelligence.

2019.6: We won 2nd place in the AVA Challenge Task 2 (Active Speaker) of the ActivityNet Challenge 2019, which is regarded as the “ImageNet” competition of video-based action recognition. The results were released at CVPR 2019.

2019.4: The ACM ICMI 2019-MAVSR competition starts! The competition was jointly organized by researchers from the Institute of Computing Technology (Chinese Academy of Sciences), Imperial College London, the University of Oxford, and Samsung American Research Institute. For more details about the competition, please refer to MAVSR2019.

2018.10: The LR Group released LRW-1000, a large-scale, naturally distributed lip reading dataset. It is currently the largest word-level lip reading dataset and the only public Mandarin lip reading dataset. For more details, please refer to the data paper.

2018.4~2018.10: The LR Group was invited by CCTV-1 to demonstrate its lip reading technology and system to the channel's television audience.

Research

* Research Topics：

1. Visual Speech Recognition (VSR) | Lip Reading (LR)

This topic focuses on how to use visual information, and in particular visual information alone, to infer what a speaker is saying in a video (with or without sound). It can help hearing-impaired people and can also play an important role in audio-based speech recognition systems, especially in noisy environments.

2. Talking Face Generation

This task aims to make given static face images “talk” the given words, i.e., to generate a video based only on a clip of speech and the given face images of the target identity.

3. Visual Voice Activity Detection (VVAD)

This topic focuses on how to use visual information for speech activity detection, which is important for many practical speech recognition systems.

4. Multi-modal VSR / KWS / VVAD

* Related Applications:

※ Lip code/password, liveness detection, command statement recognition, pronunciation correction in intelligent education systems, and so on.

Papers

Journal Papers

Conference Papers

Yuanhang Zhang, Shuang Yang, Shiguang Shan, Xilin Chen. ES³: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 27069-27079, Seattle WA, USA, Jun. 17-21, 2024.
Feixiang Wang, Shuang Yang, Shiguang Shan, Xilin Chen. Cooperative Dual Attention for Audio-Visual Speech Enhancement with Visual Cues. British Machine Vision Conference (BMVC), Aberdeen, UK, Nov. 20-24, 2023.
Bingquan Xia, Shuang Yang, Shiguang Shan, Xilin Chen. UniLip: Learning Visual-Textual Mapping with Uni-Modal Data for Lip Reading. British Machine Vision Conference, Aberdeen, UK, Nov. 20-24, 2023.
Songtao Luo, Shuang Yang, Shiguang Shan, Xilin Chen. Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading. British Machine Vision Conference, Aberdeen, UK, Nov. 20-24, 2023.
Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen. Audio-Driven Deformation Flow for Effective Lip Reading. 26th International Conference on Pattern Recognition (ICPR), pp. 274-280, Aug. 21-25, 2022, Montréal Québec / Cyberspace.
Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan and Xilin Chen. UniCon: Unified Context Network for Robust Active Speaker Detection. ACM International Conference on Multimedia (ACM Multimedia), pp. 3964-3972, Chengdu, China, Oct. 20-24, 2021.
Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun Xiao, Keyu Long, Shiguang Shan, Xilin Chen, "LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild," 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019), pp. 1-8, Lille, France, May 14-18, 2019. (Oral)
Mingshuang Luo, Shuang Yang, Xilin Chen, Zitao Liu, Shiguang Shan, "Synchronous Bidirectional Learning for Multilingual Lip Reading," British Machine Vision Conference (BMVC), 2020.
Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen, "Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition," IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 356-363, 2020.
Jingyun Xiao, Shuang Yang, Yuanhang Zhang, Shiguang Shan, Xilin Chen, "Deformation Flow Based Two-Stream Network for Lip Reading," IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 364-370, 2020.
Mingshuang Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading," IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 273-280, 2020.
Xing Zhao, Shuang Yang, Shiguang Shan, Xilin Chen, "Mutual Information Maximization for Effective Lip Reading," IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 420-427, 2020.
Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu and Shiguang Shan. ICTCAS-UCAS-TAL Submission to the AVA-ActiveSpeaker Task (The 1st Place). IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshop of the International Challenge on Activity Recognition (ActivityNet), 2021.
Dalu Feng, Shuang Yang and Shiguang Shan. An Efficient Software for Building Lip Reading Models Without Pains. IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1-2, Virtual Event, Jul. 5-9, 2021.