Background & Task Overview

With the emergence of many speech-enabled applications, application scenarios (e.g., home and meeting) are becoming increasingly challenging due to adverse acoustic environments (far-field audio, background noise, and reverberation) and conversational multi-speaker interactions with a large proportion of overlapping speech. State-of-the-art speech processing techniques based on the audio modality alone encounter performance bottlenecks, e.g., a word error rate of about 40% in the CHiME-6 dinner-party scenario. Motivated by this, the MISP challenge aims to tackle these problems by introducing information from additional modalities (such as video or text), yielding better environmental and speaker robustness in realistic applications.

For the first MISP challenge, we target the home TV scenario, where several people chat in Chinese while watching TV in the living room and can interact with a smart speaker/TV. As new features, carefully selected far-field/mid-field/near-field microphone arrays and cameras are arranged to collect audio and video data, respectively, and the time synchronization among the different microphone arrays and cameras is carefully designed to support research on multi-modality fusion. The challenge considers the problem of distant multi-microphone conversational audio-visual wake-up and audio-visual speech recognition in everyday home environments. How to leverage both audio and video data to improve environmental robustness is an interesting open question. Researchers from both academia and industry are warmly welcome to work on our two audio-visual tasks (detailed below), promoting research on speech processing with multimodal information so that it can cross the practical threshold of realistic applications in challenging scenarios. All approaches are encouraged, whether emerging or established, and whether they rely on signal processing or machine learning.

Scenario

We consider the following scenario: several people are chatting while watching TV in the living room, and they can interact with a smart speaker/TV. With the multimodal data collected by microphones and cameras, we can conduct research on the following two speech processing tasks:

  1. Audio-Visual Wake Word Spotting
  2. Audio-Visual Speech Recognition with Oracle Speaker Diarization

Both tasks suffer from performance degradation caused by the adverse acoustic conditions of the home TV scene described above, which seriously distorts the final recognition results. We propose that the visual modality can serve as a powerful supplementary input for either system.

Fig. 1. Recording scene: (a) schematic diagram; (b) real shot

We aim to support research on audio-visual speech processing in the home TV scene by providing the first large-scale audio-visual corpus of multi-speaker conversational speech recorded via multi-microphone hardware in real living rooms.

An example of the recording scene is shown in Fig. 1. Subfigure (a) is a schematic diagram: six participants are chatting while multiple devices record audio and video in parallel. Subfigure (b) is a real shot of the recording scene.

Conversations taking place in a real living room involve several variables, for example whether the TV is on or off and whether the conversation happens during the day or at night. In particular, by observing real conversations in living rooms, we found that participants often split into several groups to discuss different topics. Compared with all participants discussing the same topic, such grouping results in higher speech overlap ratios. We control these variables during recording so as to cover the real scene comprehensively and evenly.

Recording Setup

Fig. 2. Recording devices: a real shot of recording in the home TV scenario

According to the distance between the device and the speaker, the recording devices are divided into three categories:

  1. Far devices: a linear microphone array (6 mics, 16 kHz, 16-bit) and a wide-angle camera (1080p, 25 fps, 2π/3 horizontal field of view), placed 3-5 m away from the speakers. Each microphone is omnidirectional, and the spacing between adjacent microphones is 35 mm. The detailed parameters of the wide-angle camera are listed in Tab. 1. The linear microphone array is fixed on top of the wide-angle camera, parallel to the x-axis of the camera coordinate system, with its midpoint coinciding with the origin of the camera coordinate system (see the sketch after this list). All participants appear in this camera's view, which provides speaker position information but reduces the resolution of the lip region of interest (ROI);
    Sensor type: 2-megapixel CMOS sensor
    Pixel size: 3 μm × 3 μm
    Optical format: 1/2.7″
    Focal length: 2.8 mm
    Aperture: F1.8
    Field of view: D = 141°, H = 120°, V = 63°
    Resolution: 1920 × 1080
    Frame rate: 25 fps
    Tab. 1. Parameters of the far wide-angle camera
  2. Middle devices: a linear microphone array (2 mics, 44.1 kHz, 16-bit) and n high-definition cameras (720p, 25 fps, π/2 horizontal field of view), placed 1-1.5 m away from the speakers, where n is the number of participants in the conversation. Both microphones are omnidirectional, and the spacing between them is 80 mm. The detailed parameters of the high-definition camera are listed in Tab. 2. There is no fixed geometric relationship between the linear microphone array and the high-definition cameras. Each camera captures only the corresponding speaker, so the lip ROI is recorded clearly;
    Sensor type: 2-megapixel CMOS sensor
    Pixel size: 3.75 μm × 3.75 μm
    Optical format: 1/3″
    Focal length: 3 mm
    Aperture: F1.5
    Field of view: D = 116°, H = 99°, V = 53.4°
    Resolution: 1280 × 720
    Frame rate: 25 fps
    Tab. 2. Parameters of the middle high-definition camera
  3. Near devices: n high-fidelity microphones (44.1 kHz, 16-bit), each attached to the middle of the corresponding speaker's chin. The collected audio signal is rarely interfered with by off-target sources, and the SNR is estimated to be greater than 15 dB, which guarantees high-quality manual transcription.
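To make the array geometries above concrete, the following minimal Python sketch computes nominal microphone coordinates from the stated spacings (35 mm for the far array, 80 mm for the middle array), with the array midpoint at the origin and the microphones along the x-axis, as described for the far camera coordinate system. The left-to-right microphone ordering and the helper name `linear_array_positions` are illustrative assumptions, not part of any released tooling.

```python
import numpy as np

def linear_array_positions(n_mics: int, spacing_m: float) -> np.ndarray:
    """Nominal positions of a uniform linear array centred on the origin.

    Microphones are placed along the x-axis; returns an (n_mics, 3) array
    of [x, y, z] coordinates in metres.
    """
    x = (np.arange(n_mics) - (n_mics - 1) / 2.0) * spacing_m
    return np.stack([x, np.zeros(n_mics), np.zeros(n_mics)], axis=1)

# Far linear array: 6 omnidirectional mics, 35 mm apart, midpoint at the
# origin of the far camera's coordinate system (as described above).
far_array = linear_array_positions(6, 0.035)

# Middle linear array: 2 omnidirectional mics, 80 mm apart. Its position
# relative to the middle cameras is not specified, so only the inter-mic
# geometry is meaningful here.
middle_array = linear_array_positions(2, 0.080)

print(far_array[:, 0])  # [-0.0875 -0.0525 -0.0175  0.0175  0.0525  0.0875]
```

Such coordinates are what array-processing front-ends (e.g., delay-and-sum beamforming on the far-field channels) typically require.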
Various devices result in inconsistent clocks. We address this from two aspects: synchronization devices and manual post-processing.

Synchronization devices: a sound card (ZOOM F8n) is used to synchronize the clock of the middle linear microphone array with the clocks of the near high-fidelity microphones, while the Vicando software, running on an industrial PC (MIC-770), is used to synchronize the clocks of all cameras.

Manual post-processing: even with the synchronization devices, there are still three different clocks, i.e., the clock of the sound card, the clock of the far linear microphone array, and the clock of the industrial PC. They are synchronized by manually finding a mark point. A specific action, knocking a cup, is performed when recording starts; the video frame in which the cup wall and cup cover come into contact and the waveform point corresponding to the impact sound are then aligned manually.
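As an illustration of this manual alignment, the short Python sketch below converts a mark point (the cup-impact video frame and the corresponding audio samples) into per-device offsets relative to the camera clock. The frame and sample indices are hypothetical placeholders, not values taken from the corpus.

```python
# Minimal sketch of mark-point alignment: the cup impact defines the same
# physical instant on every device, so its time on each device clock yields
# the offset needed to map that clock onto a common (camera) timeline.

VIDEO_FPS = 25          # all cameras record at 25 fps
FAR_ARRAY_SR = 16_000   # far linear microphone array sample rate
SOUND_CARD_SR = 44_100  # sound card (middle array + near mics) sample rate

# Hypothetical annotations of the cup impact (placeholders, not corpus values).
impact_frame = 137            # video frame where cup wall and cover touch
impact_sample_far = 88_640    # impact sample in the far-array recording
impact_sample_card = 243_712  # impact sample in the sound-card recording

# Time of the same physical event on each device clock (seconds).
t_video = impact_frame / VIDEO_FPS
t_far = impact_sample_far / FAR_ARRAY_SR
t_card = impact_sample_card / SOUND_CARD_SR

# Offsets that shift the far-array and sound-card timelines onto the camera timeline.
offset_far_to_video = t_video - t_far
offset_card_to_video = t_video - t_card

print(f"far array  -> camera offset: {offset_far_to_video:+.3f} s")
print(f"sound card -> camera offset: {offset_card_to_video:+.3f} s")
```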