This challenge aims to explore the complementarity between visual and acoustic information in real-world speech recognition systems, and will be held at the ACM International Conference on Multimodal Interaction (ICMI) 2019. Audio-based speech recognition has made great progress in the last decade, but still faces many challenges in noisy conditions. With the rapid development of computer vision technologies, audio-visual speech recognition has become a hot topic, yet it is still not clear how much visual speech can complement acoustic speech.
In this challenge, we encourage not only contributions that achieve high recognition performance, but also those that bring bold new ideas to the topic. Contributions that do not participate in the challenge tasks but are relevant to the topic are also welcome.
The task here is to use the audio and image data to recognize the spoken word by classification (a generic model sketch is given after the three task descriptions below). All words in the test data have already appeared in the training set, though possibly under different speaking conditions.
This task tests whether the model has really learned the pronunciation rules: the goal is to recognize words that may or may not appear in the training set. Ideally, a speech recognition model should learn the true pronunciation rules from the audio and visual data, so that a test word is spelled correctly whether or not it has appeared in the training set.
This task is to identify whether a keyword occurs in a test video sequence. The keywords may not appear in the training set. Both recognition-based and recognition-free methods are welcome for this sub-challenge. This task is especially useful in practical systems, but current performance is rarely satisfactory.
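As a purely illustrative example (not the official baseline; all feature shapes, layer sizes, and names below are assumptions), the following PyTorch sketch shows one common way to build an audio-visual word classifier: an audio branch and a visual branch are encoded separately and fused late for classification over the word vocabulary.

```python
# Illustrative late-fusion audio-visual word classifier (not the challenge baseline).
import torch
import torch.nn as nn

class AudioVisualWordClassifier(nn.Module):
    def __init__(self, num_words=1000, audio_dim=40, hidden=256):
        super().__init__()
        # Audio branch: GRU over per-frame acoustic features (e.g. filterbanks).
        self.audio_rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        # Visual branch: a small CNN per lip-region frame, then a GRU over time.
        self.visual_cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.visual_rnn = nn.GRU(64, hidden, batch_first=True)
        # Late fusion by concatenation, then a linear classifier over the vocabulary.
        self.classifier = nn.Linear(2 * hidden, num_words)

    def forward(self, audio, frames):
        # audio: (batch, audio_steps, audio_dim); frames: (batch, time, 1, H, W)
        _, audio_state = self.audio_rnn(audio)
        b, t = frames.shape[:2]
        per_frame = self.visual_cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, visual_state = self.visual_rnn(per_frame)
        fused = torch.cat([audio_state[-1], visual_state[-1]], dim=-1)
        return self.classifier(fused)  # word logits

# Example forward pass with dummy tensors.
model = AudioVisualWordClassifier()
logits = model(torch.randn(2, 80, 40), torch.randn(2, 16, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 1000])
```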
If you would like to participate in the above tasks, please register here.
For training and early evaluation, we recommend using the recently released large-scale dataset LRW-1000, introduced in the paper “LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild”. It is a naturally distributed large-scale benchmark for word-level lipreading in the wild, containing 1,000 classes with about 718,018 video samples from more than 2,000 individual speakers, and more than 1,000,000 Chinese character instances in total. Each class corresponds to the syllables of a Mandarin word composed of one or several Chinese characters. The dataset shows natural variability over different speech modes and imaging conditions, including the number of samples per class, video resolution, lighting conditions, and speaker attributes such as pose, age, gender, and make-up, as shown in Figure 1. Recognition models are expected to learn the patterns of each word and the pronunciation rules from this data. Please note that each team can apply for this dataset and start training NOW!
For later validation and the final test, we will release a new word-level dataset with the same format as LRW-1000 but with more naturally noisy data. Audio-only speech recognition is therefore likely to perform poorly on this data, and audio-visual models are expected to outperform single-modality models.
For the open-set sub-challenge, we will provide a candidate word list that contains all the test words (and more) to reduce the difficulty.
For all the data involved, we will provide the ground-truth labels as word spellings composed of English letters, so anyone who can read English words can participate in the challenge, regardless of whether or not they know Mandarin.
For Task 1, we have released a baseline model on GitHub, which achieves an accuracy of 34.76% on LRW-1000 as reported in this paper. Participants are welcome to use this model as a starting point if they cannot find a more suitable one.
Figure 1. Samples in LRW-1000
Participants are free to take part in one, two, or all three of the sub-challenges, but are encouraged to contribute to more than one.
Ranking on the validation set: Each team may submit results once the validation set has been released. This ranking list will be updated until the release of the final test data.
Ranking on the test set: Each team has up to 5 submission attempts on the final test set for each sub-challenge. The results on the final test set will be used to decide the final prizes.
Tasks 1 and 2 will be ranked by accuracy, and Task 3 will be ranked by mean average precision (mAP).
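For reference, below is a minimal Python sketch of the two measures, assuming the usual definitions (top-1 accuracy for Tasks 1 and 2, and mean average precision over per-keyword ranked lists for Task 3); the official evaluation protocol may differ in detail.

```python
def accuracy(predicted_labels, true_labels):
    """Tasks 1 and 2: fraction of samples whose predicted word is correct."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

def average_precision(ranked_relevance):
    """AP for one keyword query: ranked_relevance is a list of 0/1 flags,
    ordered from the highest-scored sample to the lowest."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_query_relevance):
    """Task 3: mean of the per-keyword average precisions."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)

# Example: two word predictions, and two keyword queries over five samples each.
print(accuracy(["target", "hello"], ["target", "world"]))          # 0.5
print(mean_average_precision([[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]]))  # ~0.64
```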
A participating team should submit their working program together with the corresponding test results in a zip file by sending an email to email@example.com. The name of the zip file should combine the team name and the task name, such as “win_lr@task1”, where “win_lr” is the team name and “task1” is the submitted task. The test results should be submitted in the following format:
For Sub-Challenge 1: Each result line should contain three pieces of information: the sample index, the predicted word, and the predicted label index (a small writing sketch is given after these format descriptions). The sample index may be omitted if the results are in the same order as the provided validation annotation (for the validation data) or the order of the test data (for the test data). The predicted label index should follow the order of the words in the provided vocabulary. For example, “11_target_90” means that this result is for the 11th test sample and the predicted word is “target”, which is the 90th word in the provided vocabulary (counting from 0).
For Sub-Challenge 2: We will provide a candidate word list that includes all the test and training words (and possibly more). Participants should therefore submit results in the same format as described for Sub-Challenge 1, treating the provided word list as the vocabulary.
For Sub-Challenge 3: Binary results are sufficient to indicate whether a keyword occurs in a video. Each result line should identify the query (keyword) and the target samples in which it is detected. For example, “11_8_17_31_33_57” means that the 11th query (keyword) is detected in the 8th, 17th, 31st, 33rd, and 57th samples of the validation or test set.
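As an illustration of the result-line formats described above, here is a minimal Python sketch that writes Sub-Challenge 1 lines of the “11_target_90” form and Sub-Challenge 3 lines of the “11_8_17_31_33_57” form; the file names and the prediction data structures are assumptions for this sketch only.

```python
def write_task1_results(predictions, vocabulary, path="task1_results.txt"):
    """predictions: predicted words, one per test sample, in test order (1-based index).
    vocabulary: list of words defining the provided label indices (counting from 0)."""
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    with open(path, "w", encoding="utf-8") as f:
        for sample_index, word in enumerate(predictions, start=1):
            f.write(f"{sample_index}_{word}_{word_to_index[word]}\n")

def write_task3_results(detections, path="task3_results.txt"):
    """detections: dict mapping query (keyword) index -> sorted list of
    sample indices in which the keyword was detected."""
    with open(path, "w", encoding="utf-8") as f:
        for query_index in sorted(detections):
            samples = detections[query_index]
            f.write("_".join(str(i) for i in [query_index, *samples]) + "\n")

# Example usage with toy data.
write_task1_results(["target", "hello"], ["hello", "world", "target"])
write_task3_results({11: [8, 17, 31, 33, 57]})
```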
Please make sure your system can be run on Linux and reproduces the results submitted in the email. The input and output of the system are fixed as the dataset path and the predicted labels and words, respectively. An additional README file should describe the environment dependencies, the deep learning framework, the required libraries, and so on, so that the program can be executed.
|Team||Affiliation||Sub-Challenge||Accuracy||mAP|
|Video Audio Merged Group (VAMG)||Northwestern Polytechnical University||1: Close-set word-level speech recognition (Audio-Visual)||80.49%||–|
|XiTianQuJing||Australian National University||1: Close-set word-level speech recognition (Visual)||37.18%||–|
|Zhao *||Zhejiang University of Technology||1: Close-set word-level speech recognition (Visual)||32.80%||–|
|Video Audio Merged Group (VAMG)||Northwestern Polytechnical University||2: Open-set word-level speech recognition (Audio)||55.67%||–|
|XiTianQuJing||Australian National University||3: Visual Keyword Spotting||–||17.1%|
|Video Audio Merged Group (VAMG)||Northwestern Polytechnical University||3: Visual Keyword Spotting||–||4.5%|
|Team||Affiliation||Sub-Challenge||Accuracy||mAP|
|Video Audio Merged Group (VAMG)||Northwestern Polytechnical University||1: Close-set word-level speech recognition (Audio-Visual)||82.78%||–|
|Video Audio Merged Group (VAMG)||Northwestern Polytechnical University||1: Close-set word-level speech recognition (Audio)||76.72%||–|
|XiTianQuJing||Australian National University||1: Close-set word-level speech recognition (Visual)||37.51%||–|
|Video Audio Merged Group (VAMG)||Northwestern Polytechnical University||1: Close-set word-level speech recognition (Visual)||37.05%||–|
|Zhao *||Zhejiang University of Technology||1: Close-set word-level speech recognition (Visual)||34.59%||–|
|Video Audio Merged Group (VAMG)||Northwestern Polytechnical University||2: Open-set word-level speech recognition (Audio)||55.28%||–|
|XiTianQuJing||Australian National University||3: Visual Keyword Spotting||–||19.0%|
|Video Audio Merged Group (VAMG)||Northwestern Polytechnical University||3: Visual Keyword Spotting||–||6.6%|
A paper submission and at least one upload on the test set are mandatory for participation in the challenge. However, paper contributions within the scope are also welcome if the authors have not participated in the challenge tasks.
The Call for Papers is now open! Topics include, but are not limited to: lip reading, audio-visual speech processing, visual or audio-visual keyword spotting, talking face generation, and so on. Papers should be submitted in PDF and should be no more than 4 pages in the two-column ACM conference format (excluding references). All submissions will be rigorously assessed in a double-blind peer review process. Accepted papers will be included in the ICMI proceedings. Please use the following link to submit your paper: https://new.precisionconference.com/icmi19a.
|April 11, 2019||Challenge website||–|
|May 19, 2019||Release of the validation data||The validation data has been released! Please make sure your registration is successful before applying for the data.|
|May 19 – June 30, 2019||Results submission and ranking on the validation data||Finished.|
|July 1, 2019||Release of the final test data||Finished.|
|July 7, 2019||Final results submission deadline||Finished.|
|July 15, 2019||Paper submission deadline||Finished.|
|July 30, 2019||Paper decision notification||Finished.|
|August 10, 2019||Camera ready||Finished.|
2019.5 The validation data has been released! Please note that any team that applies for the data should submit the corresponding results for the related task in the end. To apply, please provide the team name, the sub-challenge ID, and the previously signed LRW-1000 agreement, and we will send you the corresponding download link.
2019.7 The test data has been released and sent to the participating teams!
2019.7 [Call for Papers] Submission on the test data has been closed. We sincerely invite each team to submit a corresponding paper to ICMI. Anyone who has not submitted results is also welcome to share new ideas, new methods, or related progress in the relevant domains. All submissions will be rigorously assessed in a double-blind peer review process. Please use the following link to submit your paper: https://new.precisionconference.com/icmi19a. If you have any questions, please feel free to contact firstname.lastname@example.org.
Considering the large gap between visual speech recognition and audio speech recognition, we set prizes for both visual-only results and audio-visual combined results.
Details are shown as follows.
Sub-Challenge 1: Close-set word-level speech recognition
Sub-Challenge 2: Open-set word-level speech recognition
Sub-Challenge 3: Visual keyword spotting
|Place||Sub-Challenge 1: Visual | Audio-Visual||Sub-Challenge 2: Visual | Audio-Visual||Sub-Challenge 3: Visual|
|The First Place||3000 | 3000 RMB||4000 | 4000 RMB||5000 RMB|
|The Runner-up||1000 | 1000 RMB||2000 | 2000 RMB||3000 RMB|
|The Second Runner-up||–||–||2000 RMB|
The awards are sponsored by SeetaTech.
Q: How do we apply for the validation data?
A: You can apply by sending the team name and the ID of the task in which your team will participate to email@example.com.
Note that any team that applies for the data will be included in the ranking list, which will be posted on the website. Therefore, please send us the program, the model, and the results on the validation set via email (to firstname.lastname@example.org) before the release of the test set; the program and model are required to prevent cheating in the results.
Q: How can we obtain the database release agreement file for LRW-1000?
A: You can get the agreement file here, and the agreement template is also available here. Read it carefully, and complete it appropriately. Note that the agreement should be signed by a full-time staff member (that is, students are not acceptable). Then, please scan the signed agreement and send it to email@example.com and cc to firstname.lastname@example.org. When we receive your reply, we will provide the download link to you.
Q: What are the queries, or keywords, defined in task 3?
A: The 1000 classes in the LRW-1000 dataset are all the potential keywords.
Q: Will the ranking list on validation-set be available on the website?
A: Yes. The ranking on the validation data will be released on the website later.
Q: Is there any limit to the submission of results for the validation set?
A: Yes. Each team has up to 5 submission attempts per week on the validation set for each sub-challenge.
Q: Can we use the full LRW-1000 dataset for training (including the validation and test sets), or are we supposed to use only the training set of the LRW-1000 dataset?
A: Yes, participants may use the validation and test sets of the LRW-1000 dataset, but please state this clearly when your team submits its results, and also in the corresponding paper submission.