Visual Information Processing and Learning

Audio-Visual Speech Understanding Group

Leader: Shuang Yang / Shiguang Shan (Professor)

Email: shuang.yang [at] ict dot ac dot cn; sgshan [at] ict dot ac dot cn

* The Audio-Visual Speech Understanding Group was founded in 2017. Lip reading is currently its core task, supported by auxiliary tasks such as speech emotion analysis, visual voice activity detection, and visual keyword spotting. The related technologies can assist speech recognition to achieve more intelligent and robust human-computer interaction, or can be used independently in fields such as intelligent teaching, security verification, and military and public security.

For the team's code and papers, please refer to:

* News:

2020.3: Four papers were accepted by IEEE FG 2020. The performance on both LRW and LRW-1000 reached the state of the art.

2019.8: Our lip reading system was awarded the “Innovation Star of Artificial Intelligence” in the multimedia information technology competition at the Chinese Congress on Artificial Intelligence.

2019.6: We won 2nd place in the AVA Challenge: Task-2 (Active Speaker) of the ActivityNet Challenge 2019, which is regarded as the “ImageNet” competition in the domain of video-based action recognition. The results were released at CVPR 2019.

2019.4: The ACM ICMI 2019-MAVSR competition starts! The competition is jointly organized by researchers from the Institute of Computing Technology (Chinese Academy of Sciences), Imperial College London, the University of Oxford, and Samsung American Research Institute. For more details about the competition, please refer to MAVSR2019!

2018.10: The LR Group released LRW-1000, a large-scale, naturally distributed lip reading dataset. It is currently the largest word-level lip reading dataset and the only public Mandarin lip reading dataset. For more details, please refer to the data paper.

2018.4~2018.10: The LR Group was invited by CCTV-1 to demonstrate its lip reading technology and system to television audiences. For more details, please click here.

* Research Topics:

1. Visual Speech Recognition (VSR) | Lip Reading (LR)

This topic focuses on how to use visual information, and in particular visual information alone, to infer what the speaker in a video is saying (with or without sound). It can help hearing-impaired people and can also play an important role in audio-based speech recognition systems, especially in noisy environments.

2. Talking Face Generation

This task aims to make given static face images “talk” the given words, i.e. to generate a video based only on a clip of speech and the given face images of the target identity.

3. Visual Voice Activity Detection (VVAD)

This topic focuses on how to use visual information for speech activity detection, which is important for many practical speech recognition systems.

4. Multi-modal VSR / KWS / VVAD

* Related Applications:

※ Lip code/password, liveness detection, command statement recognition, pronunciation correction in intelligent education systems, and so on.


Journal Papers

Conference Papers

1.    Mingshuang Luo, Shuang Yang, Xilin Chen, Zitao Liu, Shiguang Shan, "Synchronous Bidirectional Learning for Multilingual Lip Reading," British Machine Vision Conference (BMVC), 2020. 【pdf】

2.    Jingyun Xiao, Shuang Yang, Yuanhang Zhang, Shiguang Shan, Xilin Chen, "Deformation Flow Based Two-Stream Network for Lip Reading," IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 836-842, 2020. 【pdf】

3.    Mingshuang Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading," IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 69-76, 2020. 【pdf】

4.    Xing Zhao, Shuang Yang, Shiguang Shan, Xilin Chen, "Mutual Information Maximization for Effective Lip Reading," IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 843-850, 2020. 【pdf】

5.    Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen, "Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition," IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 851-858, 2020. 【pdf】

6.    Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun Xiao, Keyu Long, Shiguang Shan, Xilin Chen, "LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild," 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019), pp. 1-8, Lille, France, May 14-18, 2019. (Oral) 【pdf】

Visual Information Processing and Learning
  • Address: No. 6 Kexueyuan South Road
  • Zhongguancun, Haidian District
  • Beijing, China
  • Postcode: 100190
  • Tel: (8610) 62600514