Background

In daily conversation, people combine auditory cues, such as voice, with visual cues, such as lip movements, to comprehend spoken content. In poor acoustic conditions, the audio may be drowned out by noise, making the content difficult to recover. Lipreading, which infers spoken content from lip movements alone, lies at the intersection of computer vision and natural language processing. In public security, lipreading plays a crucial role in detecting facial forgery and verifying liveness. Moreover, advances in lipreading also drive the development of audio-visual speech recognition and enhancement, where video provides an effective complementary input in challenging acoustic environments.

Current lipreading research concentrates primarily on English, underscoring the need for greater attention to Chinese. Chinese lipreading is harder because of the large number of Chinese characters and their complex relationship with lip movements. In addition, the lack of large-scale Chinese datasets constrains research. Existing open-source Chinese lipreading datasets mostly come from controlled settings such as laboratory recordings; their content is carefully scripted, producing a presentational rather than colloquial speaking style. Figure 1 shows examples from current open-source lipreading datasets and from ours (MISP).

Fig. 1. Examples of several Chinese lipreading datasets

Lipreading tasks fall into two subtypes: word-level and sentence-level. Word-level lipreading has been studied extensively, while sentence-level Chinese lipreading remains far more challenging. Because lip shapes are inherently ambiguous, different words may exhibit similar lip movements (e.g., the bilabial consonants /b/, /p/, and /m/ are visually nearly indistinguishable), and variations in articulation habits may produce different lip movements for the same word. As a result, sentence-level recognition accuracy is still too low for practical use. We therefore narrow the focus from a general sentence-level lipreading system to a speaker-specific task.

With these considerations in mind, we organize the ChatCLR challenge in real home chatting environments to address wake word lipreading and target speaker lipreading. We aim to meet the needs of waking up smart home devices in household settings and of using video to support speech recognition on these devices.

Challenge Features

  • Real conversation: participants speak in a relaxed and unscripted fashion.
  • Diverse speaking styles: 200+ native Chinese speakers involved.
  • Complex backgrounds: 20+ real room backgrounds with varied lighting conditions.
  • Colloquial chat: continuous informal sentences with interjections and emotion.

Tasks

We organize the ChatCLR challenge to better reflect real-world scenarios and complexity; it consists of the following two tasks:

  • Task 1: Wake Word Lipreading
  • Task 2: Target Speaker Lipreading

The wake word lipreading task focuses on detecting, from lip movements alone, the wake word used to activate smart home devices during conversational interaction. The target speaker lipreading task requires participants to fine-tune their networks to recognize continuous, colloquial conversational speech from a given speaker.
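To make the task formats concrete, below is a minimal PyTorch sketch of one plausible shape for a Task 1 model: a 3D-convolutional frontend over cropped lip regions, a recurrent encoder, and a binary wake-word head. Everything here is an assumption for illustration (the name WakeWordLipNet, the 88x88 crop size, and the layer widths are ours), not the official ChatCLR baseline; a Task 2 system could reuse the same frontend but emit per-frame character distributions trained with a CTC loss instead of a single clip-level label.

    import torch
    import torch.nn as nn

    class WakeWordLipNet(nn.Module):
        # Hypothetical Task 1 sketch: grayscale lip clips -> wake-word logits.
        # Input shape (batch, 1, frames, 88, 88); the 88x88 crop is assumed,
        # not prescribed by the challenge.
        def __init__(self, hidden=256):
            super().__init__()
            # Spatiotemporal frontend: downsample space, preserve the time axis.
            self.frontend = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=(5, 7, 7),
                          stride=(1, 2, 2), padding=(2, 3, 3)),
                nn.BatchNorm3d(32),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 3, 3),
                             stride=(1, 2, 2), padding=(0, 1, 1)),
            )
            self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # squeeze space, keep frames
            self.gru = nn.GRU(32, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, 2)  # wake word vs. background

        def forward(self, clips):                    # clips: (B, 1, T, H, W)
            feats = self.pool(self.frontend(clips))  # (B, 32, T, 1, 1)
            feats = feats.squeeze(-1).squeeze(-1)    # (B, 32, T)
            seq, _ = self.gru(feats.transpose(1, 2)) # (B, T, 2*hidden)
            return self.head(seq[:, -1])             # one logit pair per clip

    # Usage: score two 64-frame lip clips.
    model = WakeWordLipNet()
    logits = model(torch.randn(2, 1, 64, 88, 88))
    print(logits.shape)  # torch.Size([2, 2])

A bidirectional GRU is used here only for brevity; transformer or conformer encoders are common alternatives in recent lipreading systems.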