Task 1: Wake Word Lipreading

We use the far-field videos released for MISP 2021 Task 1, collected in home TV scenarios. The wake-up word is “Xiao T Xiao T”. There are 338 speakers. The speakers' accent is Mandarin, and all data were collected in 33 real rooms. A sample is treated as positive if it contains the wake-up word and as negative otherwise. Each sample contains at most one wake-up word. The statistics of the dataset are shown in Table 1.

Table 1: Task 1 Dataset
Dataset	Train		Dev		Eval		Total
Dataset	Positive	Negative	Positive	Negative	Positive	Negative	Positive	Negative
Duration (h)	1.83	35.96	0.16	0.76	0.56	2.34	2.55	39.06
Session	88		8		27		148
Room	25		5		8		38
Participant	253		28		47		328
Male	79		9		27		115
Female	174		19		20		213
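
The class imbalance implied by Table 1 can be made explicit with a little arithmetic. The sketch below is illustrative only: it copies the durations from Table 1 and reports the positive share by duration, which is not necessarily the same as the share by sample count. The names `DURATION_H`, `positive_fraction` and `summarize` are hypothetical.

```python
# Durations in hours, copied from Table 1 as (positive, negative).
DURATION_H = {
    "Train": (1.83, 35.96),
    "Dev": (0.16, 0.76),
    "Eval": (0.56, 2.34),
}


def positive_fraction(pos_h, neg_h):
    """Share of the total duration that belongs to positive samples."""
    return pos_h / (pos_h + neg_h)


def summarize(durations):
    """Return {split: positive fraction}, plus a 'Total' entry over all splits."""
    out = {name: positive_fraction(p, n) for name, (p, n) in durations.items()}
    total_pos = sum(p for p, _ in durations.values())
    total_neg = sum(n for _, n in durations.values())
    out["Total"] = positive_fraction(total_pos, total_neg)
    return out


if __name__ == "__main__":
    for split, frac in summarize(DURATION_H).items():
        print(f"{split}: {frac:.1%} positive by duration")
```

Under these numbers, positives account for roughly 5% of the training duration, so a system trained on this data may need to handle the imbalance explicitly, for example through sampling or loss weighting.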

On the one hand, our wake word spotting dataset covers more than 30 rooms and 300 speakers, and this variety of rooms and speakers increases the complexity and diversity of the dataset. On the other hand, the dataset includes words whose lip shapes are similar to those of the wake-up word, which makes the task harder.

Task 2: Target Speaker Lipreading

We use the far-field videos from the training and development sets of the MISP2021 AVSR dataset to construct the training set for Task 2. The development and evaluation sets contain the same 6 male and 6 female speakers, whose videos are also included in the training set. Each speaker has approximately 30 minutes of data. Two-thirds of each speaker's data make up the development set, and the remainder makes up the evaluation set. The statistics of the dataset are shown in Table 2.

Table 2: Task 2 Dataset
Dataset	Train	Dev	Eval	Total
Duration (h)	110.95	4.50	2.41	117.86
Session	339	12	6	357
Participant	229	12	12	224
Male	90	6	6	91
Female	139	6	6	133
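
The per-speaker split described above (two-thirds of each speaker's data for development, the rest for evaluation) can be sketched as follows. This is a minimal illustration, not the organizers' procedure: the text does not say whether the split is by duration or by utterance count, or in which order the data are taken. The sketch assumes a contiguous split by cumulative duration in the given order, and the function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def split_dev_eval(utterances, dev_share=2 / 3):
    """Split each speaker's utterances into dev and eval sets.

    utterances: iterable of (speaker_id, utt_id, duration_seconds),
    listed in the order in which they should be consumed (assumption).
    Returns two lists of (speaker_id, utt_id).
    """
    by_speaker = defaultdict(list)
    for spk, utt, dur in utterances:
        by_speaker[spk].append((utt, dur))

    dev, eval_ = [], []
    for spk, items in by_speaker.items():
        budget = dev_share * sum(d for _, d in items)
        used = 0.0
        cut = len(items)
        for i, (_, dur) in enumerate(items):
            # Stop at the first utterance that would exceed the dev budget.
            if used + dur > budget + 1e-9:
                cut = i
                break
            used += dur
        dev.extend((spk, utt) for utt, _ in items[:cut])
        eval_.extend((spk, utt) for utt, _ in items[cut:])
    return dev, eval_
```

With six 10-second utterances from one speaker, this places the first four in the development set and the last two in the evaluation set.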

To cover the real scene comprehensively and evenly, we designed the recording configurations in Table 3 by controlling the following variables. ‘Time’ refers to the recording time, which is either day or night. ‘Content’ refers to the speaking content. We also recorded some data containing only wake-up/similar words to support the audio-visual wake-word spotting task. ‘Light’ refers to whether the light is turned on or off. ‘TV’ refers to whether the TV is turned on or off. ‘Group’ refers to how many groups the participants in a conversation are divided into.

Table 3: Configuration
Config ID	Time	Content	Light	TV	Group
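01	Day	Talk freely	off	on	1
02	Day	Talk freely	on	on	1
03	Day	Talk freely	off	off	2
04	Day	Talk freely	on	off	2
05	Day	Talk freely	off	on	2
06	Day	Talk freely	on	on	2
07	Day	Talk freely	on	off	1
08	Day	Talk freely	off	off	1
09	Night	Talk freely	on	on	1
10	Night	Talk freely	on	off	2
11	Night	Talk freely	on	on	2
12	Night	Talk freely	on	off	1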
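
For scripting, the configuration table can be held as a small lookup structure. The sketch below is only an illustration. It assumes that the blank ‘Time’ cells inherit the value of the nearest filled cell above them (Day for 01 to 08, Night for 09 to 12), and it omits ‘Content’ because the table lists only "Talk freely". The names `RecordingConfig`, `CONFIGS` and `select` are hypothetical.

```python
from typing import NamedTuple


class RecordingConfig(NamedTuple):
    time: str    # "Day" or "Night"
    light: str   # "on" or "off"
    tv: str      # "on" or "off"
    groups: int  # number of conversation groups


# Config ID -> conditions, transcribed from the configuration table.
CONFIGS = {
    "01": RecordingConfig("Day", "off", "on", 1),
    "02": RecordingConfig("Day", "on", "on", 1),
    "03": RecordingConfig("Day", "off", "off", 2),
    "04": RecordingConfig("Day", "on", "off", 2),
    "05": RecordingConfig("Day", "off", "on", 2),
    "06": RecordingConfig("Day", "on", "on", 2),
    "07": RecordingConfig("Day", "on", "off", 1),
    "08": RecordingConfig("Day", "off", "off", 1),
    "09": RecordingConfig("Night", "on", "on", 1),
    "10": RecordingConfig("Night", "on", "off", 2),
    "11": RecordingConfig("Night", "on", "on", 2),
    "12": RecordingConfig("Night", "on", "off", 1),
}


def select(**conditions):
    """Return the sorted config IDs whose fields match all given conditions."""
    return sorted(
        cid
        for cid, cfg in CONFIGS.items()
        if all(getattr(cfg, key) == value for key, value in conditions.items())
    )
```

For example, `select(light="off")` returns the IDs recorded with the light off, and `select(groups=2, tv="on")` returns the two-group recordings made with the TV on.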

By observing real conversations in a real living room, we found that participants tend to split into several groups to discuss different topics. Compared with all participants discussing the same topic, grouping results in higher overlap ratios. The average speech overlap ratios for Group = 1 and Group = 2 are 10%-20% and 50%-70%, respectively. Conversations with more than 3 groups are very rare when there are no more than 6 participants.
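
The text does not define the speech overlap ratio precisely. One common reading, assumed here, is the time during which at least two speakers are active divided by the time during which at least one speaker is active. Under that assumption, and given hypothetical (start, end) speech segments pooled from all speakers in a session, the ratio can be sketched with a sweep over segment boundaries:

```python
def overlap_ratio(segments):
    """Fraction of speech time in which two or more segments are active.

    segments: iterable of (start, end) times in seconds, pooled from all
    speakers. Segments of the same speaker are assumed not to overlap.
    """
    events = []
    for start, end in segments:
        if end > start:
            events.append((start, 1))   # a segment begins
            events.append((end, -1))    # a segment ends
    # Tuples sort by time first; at equal times, ends (-1) come before starts (+1).
    events.sort()

    active = 0       # number of segments active just before the current event
    prev = None      # time of the previous event
    speech = 0.0     # total time with at least one active segment
    overlap = 0.0    # total time with at least two active segments
    for t, delta in events:
        if prev is not None and active > 0:
            speech += t - prev
            if active > 1:
                overlap += t - prev
        active += delta
        prev = t
    return overlap / speech if speech else 0.0
```

For two segments (0, 10) and (5, 15), there are 15 seconds of speech, of which 5 seconds overlap, giving a ratio of one third.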

Scenario and Recording Setup

Figure 2 is a schematic diagram of the recording scene with six participants. According to their distance from the speaker, the recording devices were divided into 2 categories:

Far devices: A wide-angle camera (1080p, 25 fps, 2π/3), placed 3-5 m away from the speakers. All participants appear in its view, which provides speaker position information but reduces the resolution of the lip region of interest (ROI);
Middle devices: n high-definition cameras (720p, 25 fps, π/2), placed 0.8-1.5 m away from the speaker, where n is the number of participants in the conversation. Each camera captures only its corresponding speaker, so the lip ROI is recorded clearly.
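
The effect of distance and viewing angle on lip ROI resolution can be estimated with simple geometry. The sketch below is a rough illustration under several assumptions that are not stated in the text: the quoted angles are horizontal fields of view, the frames are 16:9 (1920 and 1280 pixels wide), an ideal pinhole model applies with no wide-angle distortion, and a mouth is about 5 cm wide.

```python
import math

MOUTH_WIDTH_M = 0.05  # assumed typical mouth width, not from the dataset


def pixels_per_metre(image_width_px, horizontal_fov_rad, distance_m):
    """Horizontal pixel density at a given distance under a pinhole model."""
    scene_width_m = 2.0 * distance_m * math.tan(horizontal_fov_rad / 2.0)
    return image_width_px / scene_width_m


def mouth_width_px(image_width_px, horizontal_fov_rad, distance_m):
    """Approximate mouth width in pixels for a speaker facing the camera."""
    return MOUTH_WIDTH_M * pixels_per_metre(
        image_width_px, horizontal_fov_rad, distance_m
    )


if __name__ == "__main__":
    for d in (3.0, 5.0):
        px = mouth_width_px(1920, 2 * math.pi / 3, d)
        print(f"far camera at {d} m: {px:.1f} px")
    for d in (0.8, 1.5):
        px = mouth_width_px(1280, math.pi / 2, d)
        print(f"middle camera at {d} m: {px:.1f} px")
```

Under these assumptions, a mouth spans roughly 5.5 to 9 pixels in the far camera and roughly 21 to 40 pixels in a middle camera, which illustrates why the far-field lip ROI is much harder to read.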

Using multiple devices results in inconsistent clocks. We address this in two ways: synchronization devices and manual post-processing.

Synchronization devices: The clocks of near high-fidelity microphones while Vicando software, running on the industrial PC (MIC-770), is used to synchronize the clocks of all cameras.

Fig. 2. Schematic diagram of the recording scene

In the far field, a wide-angle camera captures all speakers simultaneously, which mirrors the most common real-life scenario. All participants freely choose topics and talk without restriction in real home settings, resembling the most prevalent multi-speaker conversations in daily life. Conversations in a real living room still vary in several ways, for example in whether they take place during the day or at night. In particular, by observing real conversations in a real living room, we found that participants tend to split into several groups that discuss different topics. Compared with all participants discussing the same topic, grouping results in higher overlap ratios. We control these variables during recording to cover the real scene comprehensively and evenly.

Chat-scenario Chinese Lipreading (ChatCLR) Challenge
