The 1st Mandarin Audio-Visual Speech Recognition Challenge (MAVSR)

ACM International Conference on Multimodal Interaction 2019

This challenge aims to explore the complementarity between visual and acoustic information in real-world speech recognition systems, and will be held at the ACM International Conference on Multimodal Interaction (ICMI) 2019. Audio-based speech recognition has made great progress in the last decade, but still faces many challenges in noisy conditions. With the rapid development of computer vision technologies, audio-visual speech recognition has become an increasingly active topic. However, it is still unclear how much visual speech can complement acoustic speech.

In this challenge, we encourage not only contributions that achieve high recognition performance, but also contributions that bring bold new ideas to the topic. Contributions that do not participate in the challenge tasks but are relevant to the topic are also welcome.

Challenge Tasks

Sub-challenge 1: Closed-set word-level speech recognition

The task is to recognize the spoken word from the audio and video data by classification. All words in the test data have already appeared in the training set, but possibly under different speaking conditions.
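The official scoring is defined by the organizers; purely as an illustration, the sketch below shows how top-1 accuracy for this closed-set classification setting could be computed from hypothetical prediction and label lists.

```python
# Illustrative sketch (not the official scoring script): top-1 accuracy
# for closed-set word classification, assuming each test clip has one
# ground-truth word label and the model outputs one predicted word.

def top1_accuracy(predictions, labels):
    """predictions, labels: equal-length lists of word IDs or spellings."""
    assert len(predictions) == len(labels)
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical example with pinyin-style spellings as class labels.
preds = ["nihao", "zhongguo", "jintian"]
truth = ["nihao", "zhongguo", "mingtian"]
print(f"top-1 accuracy: {top1_accuracy(preds, truth):.2%}")  # 66.67%
```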

Sub-challenge 2: Open-set word-level speech recognition

This task tests whether the model has truly learned pronunciation rules. The goal is to recognize the spoken word, which may or may not appear in the training set. Ideally, the speech recognition model should learn the underlying pronunciation rules from the audio and visual data, so that a test word is spelled correctly regardless of whether it appeared in the training set.
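The official metric and submission format are set by the organizers; the following sketch is only an illustration of how open-set predictions could be scored as spelled-out words, with exact spelling match as the accuracy and edit distance as an auxiliary diagnostic for near-miss spellings (the data and helper names are hypothetical).

```python
# Illustrative sketch (not the official metric): scoring open-set
# predictions as spelled-out words. Exact spelling match mirrors the
# accuracy-style numbers in the rankings; edit distance is shown only
# as an extra diagnostic for analyzing near-miss spellings.

def edit_distance(a, b):
    """Levenshtein distance between two spellings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

# Hypothetical predictions and ground-truth spellings.
preds = ["mingtian", "xuexi", "shenme"]
truth = ["mingtian", "xuexiao", "shenme"]
exact = sum(p == t for p, t in zip(preds, truth)) / len(truth)
print(f"exact spelling accuracy: {exact:.2%}")
for p, t in zip(preds, truth):
    print(p, t, edit_distance(p, t))
```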

Sub-challenge 3: Visual keyword spotting

The task is to detect whether a keyword occurs in a test video sequence. The keywords may not appear in the training set. Both recognition-based and recognition-free methods are welcome for this sub-challenge. Keyword spotting is especially useful in practical systems, but current performance still leaves much room for improvement.
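The rankings below report mean average precision (mAP) for this sub-challenge. The official scoring script is provided by the organizers; the sketch below is only a minimal illustration of how mAP over keywords could be computed, assuming that for each keyword the system assigns a detection score to every test video.

```python
# Illustrative sketch (not the official scoring script): mean average
# precision (mAP) over keywords, assuming per-keyword detection scores
# for every test video and binary ground truth for keyword occurrence.

def average_precision(scores, labels):
    """scores: detection scores per video; labels: 1 if the keyword occurs."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, 1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(per_keyword):
    """per_keyword: dict mapping keyword -> (scores, labels)."""
    aps = [average_precision(s, l) for s, l in per_keyword.values()]
    return sum(aps) / len(aps)

# Hypothetical toy example with two keywords and four test videos.
results = {
    "nihao":    ([0.9, 0.6, 0.2, 0.1], [1, 0, 1, 0]),
    "zhongguo": ([0.3, 0.8, 0.4, 0.6], [0, 1, 0, 1]),
}
print(f"mAP: {mean_average_precision(results):.3f}")  # 0.917
```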

Registration

If you would like to participate in the above tasks, please register here.

Data

For training and early evaluation, we recommend using the recently released large-scale dataset LRW-1000, introduced in the paper “LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild”. It is a naturally-distributed large-scale benchmark for word-level lipreading in the wild, comprising 1,000 classes with 718,018 video samples from more than 2,000 individual speakers and more than 1,000,000 Chinese character instances in total. Each class corresponds to the syllables of a Mandarin word composed of one or several Chinese characters. The dataset shows natural variability in speech modes and imaging conditions, including the number of samples per class, video resolution, lighting conditions, and speaker attributes such as pose, age, gender, and make-up, as shown in Figure 1. Recognition models are expected to learn both the patterns of each word and the pronunciation rules from this data. Please note that each team can apply for this dataset and start training now!

For later validation and the final test, we will release a new word-level dataset that has the same format as LRW-1000 but contains more naturally noisy data. Audio-only speech recognition is therefore likely to perform poorly on this data, and audio-visual models are expected to outperform single-modality models.
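One common way to exploit this complementarity is late fusion of modality-specific encoders. The sketch below is a minimal PyTorch illustration under assumed input shapes (grayscale mouth crops and log-mel filterbank features); it is neither the released baseline nor a prescribed architecture.

```python
# Minimal PyTorch sketch of late audio-visual fusion (NOT the released
# baseline): each modality is encoded separately and the pooled features
# are concatenated before word classification. Input shapes are assumed:
# video - (batch, 1, frames, 88, 88) grayscale mouth crops
# audio - (batch, frames, 80)        log-mel filterbank features
import torch
import torch.nn as nn

class AVWordClassifier(nn.Module):
    def __init__(self, num_classes=1000, hidden=256):
        super().__init__()
        self.visual_frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time, pool space
        )
        self.visual_rnn = nn.GRU(32, hidden, batch_first=True, bidirectional=True)
        self.audio_rnn = nn.GRU(80, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(4 * hidden, num_classes)

    def forward(self, video, audio):
        v = self.visual_frontend(video)                # (B, 32, T, 1, 1)
        v = v.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 32)
        v, _ = self.visual_rnn(v)
        a, _ = self.audio_rnn(audio)
        fused = torch.cat([v.mean(dim=1), a.mean(dim=1)], dim=-1)  # late fusion
        return self.classifier(fused)

model = AVWordClassifier()
logits = model(torch.randn(2, 1, 30, 88, 88), torch.randn(2, 120, 80))
print(logits.shape)  # torch.Size([2, 1000])
```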

For the open-set sub-challenge, we will provide a candidate word list containing all possible words, not limited to the test words, to reduce the difficulty.

For all the data involved, we will provide the ground-truth labels as word spellings written in the English alphabet, so anyone familiar with English letters can participate in the challenge, whether or not they know Mandarin.

Baseline Model

For task 1, we have released a baseline model on GitHub, which achieves an accuracy of 34.76% on LRW-1000 as reported in this paper. Participants are welcome to use this model as a starting point if they cannot find a more suitable one.

...

Figure 1. Samples in LRW-1000

Submission

Results Submission

Participants are free to take part in one, two, or all three of the sub-challenges, but are encouraged to contribute to more than one.

Ranking on Validation set


Team | Affiliation | Task | Accuracy | mAP
Video Audio Merged Group (VAMG) | Northwestern Polytechnical University | 1: Closed-set word-level speech recognition (Audio-Visual) | 80.49% | -
XiTianQuJing | Australian National University | 1: Closed-set word-level speech recognition (Visual) | 37.18% | -
Zhao * | Zhejiang University of Technology | 1: Closed-set word-level speech recognition (Visual) | 32.80% | -
Video Audio Merged Group (VAMG) | Northwestern Polytechnical University | 2: Open-set word-level speech recognition (Audio) | 55.67% | -
XiTianQuJing | Australian National University | 3: Visual keyword spotting | - | 17.1%
Video Audio Merged Group (VAMG) | Northwestern Polytechnical University | 3: Visual keyword spotting | - | 4.5%

Ranking on Test set


Team | Affiliation | Task | Accuracy | mAP
Video Audio Merged Group (VAMG) | Northwestern Polytechnical University | 1: Closed-set word-level speech recognition (Audio-Visual) | 82.78% | -
Video Audio Merged Group (VAMG) | Northwestern Polytechnical University | 1: Closed-set word-level speech recognition (Audio) | 76.72% | -
XiTianQuJing | Australian National University | 1: Closed-set word-level speech recognition (Visual) | 37.51% | -
Video Audio Merged Group (VAMG) | Northwestern Polytechnical University | 1: Closed-set word-level speech recognition (Visual) | 37.05% | -
Zhao * | Zhejiang University of Technology | 1: Closed-set word-level speech recognition (Visual) | 34.59% | -
Video Audio Merged Group (VAMG) | Northwestern Polytechnical University | 2: Open-set word-level speech recognition (Audio) | 55.28% | -
XiTianQuJing | Australian National University | 3: Visual keyword spotting | - | 19.0%
Video Audio Merged Group (VAMG) | Northwestern Polytechnical University | 3: Visual keyword spotting | - | 6.6%

Paper submission

Important Dates

All deadlines are 23:59:59 PST


Date | Description | Note
April 11, 2019 | Challenge website |
May 19, 2019 | Release of the validation data | The validation data has been released! Please make sure your registration is successful before applying.
May 19, 2019 – June 30, 2019 | Results submission and ranking on the validation data | Finished.
July 1, 2019 | Release of the final test data | Finished.
July 7, 2019 | Final results submission deadline | Finished.
July 15, 2019 | Paper submission deadline | Finished.
July 30, 2019 | Paper decision notification | Finished.
August 10, 2019 | Camera ready | Finished.

News:

2019.5 The validation data has been released! Please note that any team that applies for the validation data should submit the corresponding results for the related task in the end. To apply, please provide the team name, the sub-challenge ID, and the previously signed LRW-1000 agreement, and we will send you the corresponding download link.

2019.7 The test data has been released and sent to the participating teams!

2019.7 [Call for Papers] Result submission on the test data has closed. We sincerely invite each team to submit a corresponding paper to ICMI. Anyone who has not submitted results is also welcome to share new ideas, new methods, or related progress in this domain. All submissions will be rigorously assessed through a double-blind peer review process. Please use the following link to submit your paper: https://new.precisionconference.com/icmi19a. If you have any questions, please feel free to contact lipreading@vipl.ict.ac.cn.

AWARDS

Considering the large gap between visual speech recognition and audio speech recognition, we set prizes for both visual-only results and audio-visual combined results. Details are shown as follows.


Prize | Sub-Challenge 1: Closed-set word-level speech recognition (Visual / Audio-Visual) | Sub-Challenge 2: Open-set word-level speech recognition (Visual / Audio-Visual) | Sub-Challenge 3: Visual keyword spotting (Visual)
First Place | 3000 / 3000 RMB | 4000 / 4000 RMB | 5000 RMB
Runner-up | 1000 / 1000 RMB | 2000 / 2000 RMB | 3000 RMB
Second Runner-up | - | - | 2000 RMB

The awards are sponsored by SeetaTech.

ORGANIZERS



FAQ

Q: How do we apply for the validation data?

A: You can apply by sending the team name and the ID of the task in which your team will participate to lipreading@vipl.ict.ac.cn.

Note that any team that applies for the data will be included in the ranking list, which will be posted on the website. Please therefore send us the program, the model, and the results on the validation set via email (to lipreading@vipl.ict.ac.cn) before the release of the test set; the program and model are required in order to prevent cheating in the reported results.

Q: How can we obtain the database release agreement file for LRW-1000?

A: You can get the agreement file here, and the agreement template is also available here. Read it carefully and complete it appropriately. Note that the agreement must be signed by a full-time staff member (students are not eligible to sign). Then, please scan the signed agreement and send it to dalu.feng@vipl.ict.ac.cn, with a cc to shuang.yang@ict.ac.cn. Once we receive your email, we will send you the download link.

Q: What are the queries, or keywords, defined in task 3?

A: The 1000 classes in the LRW-1000 dataset are all the potential keywords.

Q: Will the ranking list on validation-set be available on the website?

A: Yes. We will start releasing the ranking on the validation data later.

Q: Is there any limit to the submission of results for the validation set?

A: Yes. Each team has up to 5 submission attempts per week on the validation set for each sub-challenge.

Q: Can we use the full LRW-1000 dataset for training (including the validation and test sets), or are we supposed to use only the training set of the LRW-1000 dataset?

A: Yes, participants may use the validation and test sets of the LRW-1000 dataset. However, please state this clearly when your team submits its results, and also note it in the corresponding paper submission.