Audio-Visual Speech Perception and Understanding Group
Group Leaders: Shuang Yang, Associate Professor; Shiguang Shan, Professor
Email: shuang dot yang [at] ict dot ac dot cn; sgshan [at] ict dot ac dot cn
About the Group

The Audio-Visual Speech Perception and Understanding Group was founded in 2017. Its core research task is lip reading, supported by auxiliary tasks such as speech emotion analysis, visual voice activity detection, and keyword/spoken-content retrieval. These technologies can assist speech recognition to enable smarter and more robust human-computer interaction, and can also be applied independently in areas such as computer-aided teaching, identity verification, and public safety.

For the group's papers, code, and related resources, see: https://github.com/yshnny/Collections-of-The-Lip-Reading-Work-of-VIPL-LR

* News:

2021.7: One paper by the group was accepted by the top international conference ACM MM 2021 and selected as an Oral.

2021.6: A joint team of the group and TAL won first place in the Active Speaker Detection task of the CVPR 2021 ActivityNet Challenge. See the link for details.

2021.4: The group's lip-reading results were integrated into Huawei's smart cockpit system and showcased at Auto Shanghai 2021.

2020.11: Building on the group's lip-reading research, a joint entry with other teams won first prize in a provincial/ministerial-level innovation competition.

2020.9: One lip-reading paper by the group was accepted by BMVC 2020.

2020.3: Four papers by the group were accepted by IEEE FG 2020, one of them as an Oral. These works also set new state-of-the-art results on LRW and LRW-1000 under comparable settings. Links to the papers appear at the bottom of this page.

2019.8: The lip-reading system developed by the group was named an "AI Innovation Star" at the China AI · Multimedia Information Recognition Technology Competition.

2019.6: Second place in the Active Speaker task of the AVA Challenge at ActivityNet 2019! The ActivityNet Challenge, often called the ImageNet competition of video activity understanding, announced its results at CVPR 2019.

2019.4: The ACM ICMI 2019 MAVSR challenge has launched! The challenge is jointly organized by researchers from the Institute of Computing Technology (Chinese Academy of Sciences), Imperial College London, the University of Oxford, and Samsung Research America. See the challenge homepage for details.

2019.4: The paper "LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild" was accepted by IEEE FG 2019 as an Oral!

2018.10: The group released the lip-reading dataset LRW-1000, which is both the largest publicly available word-level lip-reading dataset and, to date, the only publicly available large-scale word-level Mandarin lip-reading dataset. See the dataset homepage for details (paper and code included).

2018.4–2018.10: The group was invited to appear in Season 2 of the CCTV-1 program 《机智过人》, demonstrating lip-reading technology to a nationwide audience. Details here.


Students with a background in computer vision and deep learning are welcome to apply for visiting internships! Please send your resume to lipreading@vipl.ict.ac.cn

Research Topics

1.   Visual Speech Recognition (VSR) | Lip Reading (LR)

This topic focuses on recognizing what is being said using visual information, either alone or in combination with audio. It can support applications that assist speech recognition.

Demo clips


2.   Talking Face Generation

This topic focuses on generating a video of a target person speaking a given utterance, conditioned on the input speech and the target face.



3.   Visual Voice Activity Detection (VVAD)

This topic focuses on voice activity detection using visual information. In noisy or silent environments, it helps determine the speaker's location and the start and end times of speech, and can further support speaker identification and related scenarios.
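As a toy illustration of this idea, a minimal visual voice activity detector might threshold a per-frame mouth-openness signal computed from facial landmarks and report the frame spans where the mouth is actively moving. Everything below (function names, landmark format, thresholds) is an illustrative assumption, not the group's actual method.

```python
# Minimal VVAD sketch, assuming per-frame mouth landmarks are already
# available from any off-the-shelf face landmark detector.

def mouth_openness(top_lip, bottom_lip, left_corner, right_corner):
    """Ratio of vertical mouth opening to mouth width for one frame.
    Each argument is an (x, y) point."""
    vertical = abs(bottom_lip[1] - top_lip[1])
    horizontal = abs(right_corner[0] - left_corner[0])
    return vertical / max(horizontal, 1e-6)

def detect_speech_segments(openness_per_frame, threshold=0.25, min_len=3):
    """Return (start, end) frame-index pairs where the openness signal
    stays above `threshold` for at least `min_len` consecutive frames."""
    segments, start = [], None
    for i, o in enumerate(openness_per_frame):
        if o > threshold and start is None:
            start = i                      # segment begins
        elif o <= threshold and start is not None:
            if i - start >= min_len:       # keep only long-enough runs
                segments.append((start, i))
            start = None
    if start is not None and len(openness_per_frame) - start >= min_len:
        segments.append((start, len(openness_per_frame)))
    return segments

# Toy signal: silence, speech, silence.
signal = [0.1, 0.1, 0.4, 0.5, 0.45, 0.4, 0.1, 0.1]
print(detect_speech_segments(signal))  # → [(2, 6)]
```

A real system would replace the hand-set threshold with a learned temporal model, but the input/output contract (frames in, speech segments out) is the same.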



4.   Audio-Visual Speech Recognition/Retrieval and Speaking-State Detection | Multi-modal VSR / KWS / VVAD


* Applications:

※ Lip-based passwords, liveness detection, spoken-command recognition, pronunciation and mouth-shape scoring
Selected Publications

Journal Papers

  • Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen. Audio-guided self-supervised learning for disentangled visual speech representations. Frontiers of Computer Science (FCS), 18: 186353, 2024.

Conference Papers

  • Yuanhang Zhang, Shuang Yang, Shiguang Shan, Xilin Chen. ES³: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27059-27069, Seattle WA, USA, Jun. 17-21, 2024.
  • Feixiang Wang, Shuang Yang, Shiguang Shan, Xilin Chen. Cooperative Dual Attention for Audio-Visual Speech Enhancement with Visual Cues. British Machine Vision Conference (BMVC), Aberdeen, UK, Nov. 20-24, 2023.
  • Songtao Luo, Shuang Yang, Shiguang Shan, Xilin Chen. Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading. British Machine Vision Conference (BMVC), Aberdeen, UK, Nov. 20-24, 2023.
  • Bingquan Xia, Shuang Yang, Shiguang Shan, Xilin Chen. UniLip: Learning Visual-Textual Mapping with Uni-Modal Data for Lip Reading. British Machine Vision Conference (BMVC), Aberdeen, UK, Nov. 20-24, 2023.
  • Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen. Audio-Driven Deformation Flow for Effective Lip Reading. 26th International Conference on Pattern Recognition (ICPR), pp. 274-280, Aug. 21-25, 2022, Montréal Québec / Cyberspace.
  • Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu and Shiguang Shan. ICTCAS-UCAS-TAL Submission to the AVA-ActiveSpeaker Task (The 1st Place). IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshop of the International Challenge on Activity Recognition (ActivityNet), 2021.
  • Dalu Feng, Shuang Yang and Shiguang Shan. An Efficient Software for Building Lip Reading Models Without Pains. IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1-2, Virtual Event, Jul. 5-9, 2021.
  • Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan and Xilin Chen. UniCon: Unified Context Network for Robust Active Speaker Detection. ACM International Conference on Multimedia (ACM Multimedia), pp. 3964-3972, Chengdu, China, Oct. 20-24, 2021.
  • Mingshuang Luo, Shuang Yang, Xilin Chen, Zitao Liu, Shiguang Shan. Synchronous Bidirectional Learning for Multilingual Lip Reading. British Machine Vision Conference (BMVC), 2020.
  • Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen. Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition. IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 356-363, 2020.
  • Jingyun Xiao, Shuang Yang, Yuanhang Zhang, Shiguang Shan, Xilin Chen. Deformation Flow Based Two-Stream Network for Lip Reading. IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 364-370, 2020.
  • Xing Zhao, Shuang Yang, Shiguang Shan, Xilin Chen. Mutual Information Maximization for Effective Lip Reading. IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 420-427, 2020.
  • Mingshuang Luo, Shuang Yang, Shiguang Shan, Xilin Chen. Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading. IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 273-280, 2020.
  • Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun Xiao, Keyu Long, Shiguang Shan, Xilin Chen. LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild. 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019), pp. 1-8, Lille, France, May 14-18, 2019. (Oral)