Background

With the emergence of many speech-enabled applications, the target scenarios (e.g., home and meeting) are becoming increasingly challenging due to adverse acoustic environments (far-field audio, background noise, and reverberation) and conversational multi-speaker interactions with a large portion of overlapping speech. State-of-the-art speech processing techniques based on the audio modality alone encounter performance bottlenecks, e.g., yielding a word error rate of about 40% in the CHiME-6 dinner-party scenario.

Inspired by the finding that visual cues can help human speech perception, the Multimodal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology to specific scenarios by using both audio and video data. The MISP challenge targets the home TV scenario, where several people chat in Chinese while watching TV in the living room. As a distinctive feature, carefully selected far-field, mid-field, and near-field microphone arrays and cameras are arranged to collect audio and video data, respectively. In addition, time synchronization among the different microphone arrays and video cameras is carefully designed to support research on multi-modality fusion.

Following the success of the MISP 2021 challenge, several advanced Audio-Visual Speech Recognition (AVSR) systems have been proposed. However, these systems rely on oracle speaker diarization results, which greatly limits their scope in real-world applications. For the MISP 2022 challenge, we target the problems of Audio-Visual Speaker Diarization (AVSD) and Audio-Visual Diarization and Recognition (AVDR) in the home TV scenario. Specifically, AVDR extends AVSR by replacing the oracle speaker diarization results with AVSD results.
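To illustrate how the two tasks relate, below is a minimal sketch of an AVDR pipeline in Python. The functions run_avsd and run_avsr are hypothetical placeholders standing in for participants' own AVSD and AVSR systems (they are not part of any released baseline); the point is that AVSD's diarized segments drive segment-wise recognition instead of oracle segments.

    # Hypothetical AVDR pipeline: AVSD output replaces oracle diarization for AVSR.
    from typing import List, Tuple

    Segment = Tuple[str, float, float]  # (speaker_id, start_sec, end_sec)

    def run_avsd(audio_path: str, video_path: str) -> List[Segment]:
        """Audio-visual speaker diarization: estimate who spoke when."""
        raise NotImplementedError  # a participant's AVSD system goes here

    def run_avsr(audio_path: str, video_path: str, segment: Segment) -> str:
        """Audio-visual speech recognition on a single diarized segment."""
        raise NotImplementedError  # a participant's AVSR system goes here

    def run_avdr(audio_path: str, video_path: str) -> List[Tuple[Segment, str]]:
        # AVDR: recognize speech within the segments estimated by AVSD.
        segments = run_avsd(audio_path, video_path)
        return [(seg, run_avsr(audio_path, video_path, seg)) for seg in segments]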

How to leverage both audio and video data to improve environmental robustness is a central question in this challenge. Researchers from both academia and industry are warmly welcomed to work on our two audio-visual tracks (detailed below), promoting research on speech processing with multimodal information so that it can cross the practical threshold of realistic applications in challenging scenarios. All approaches are encouraged, whether emerging or established, and whether based on signal processing or machine learning.

The intellectual property (IP) is not transferred to the challenge organizers, i.e., if code is shared/submitted, the participants remain the owners of their code.

Scenario

Fig.1. Recording scene: (a) schematic diagram; (b) real shot

We consider the following scenario: several people are chatting while watching TV in the living room, and they can interact with a smart speaker/TV. With the multimodal data collected by microphones and cameras, we conduct research on the following two speech processing tracks:

  1. Audio-Visual Speaker Diarization
  2. Audio-Visual Diarization and Recognition

Recording Setup

Fig.2. Recording devices in the home TV scenario

According to the distance between the device and the speaker, the recording devices are divided into three categories:

  1. Far devices: a linear microphone array (6 mics, 16 kHz, 16-bit) and a wide-angle camera (1080p, 25 fps, 2π/3 field of view), placed 3-5 m away from the speaker. Each microphone is omnidirectional, and the distance between adjacent microphones is 35 mm. The detailed parameters of the wide-angle camera are shown in Tab 1. The linear microphone array is fixed on top of the wide-angle camera, parallel to the x-axis of the camera coordinate system, with its midpoint coinciding with the origin of the camera coordinate system (see the geometry sketch after this list). All participants appear in the camera view, which provides speaker position information but reduces the resolution of the lip region of interest (ROI);
    Sensor type: 2 MP CMOS sensor
    Pixel size: 3 µm × 3 µm
    Optical format: 1/2.7"
    Focal length: 2.8 mm
    Aperture: F1.8
    Field of view: D = 141°, H = 120°, V = 63°
    Resolution: 1920 × 1080
    Frame rate: 25 fps
    Tab 1. Parameters of the far wide-angle camera
  2. Middle devices: a linear microphone array (2 mics, 44.1 kHz, 16-bit) and n high-definition cameras (720p, 25 fps, π/2 field of view), placed 1-1.5 m away from the speaker, where n is the number of participants in the conversation. Both microphones are omnidirectional, and the distance between them is 80 mm. The detailed parameters of the high-definition camera are shown in Tab 2. There is no fixed positional relationship between the middle linear microphone array and the high-definition cameras. Each camera captures only the corresponding speaker, so the lip ROI is recorded clearly;
    Sensor type: 2 MP CMOS sensor
    Pixel size: 3.75 µm × 3.75 µm
    Optical format: 1/3"
    Focal length: 3 mm
    Aperture: F1.5
    Field of view: D = 116°, H = 99°, V = 53.4°
    Resolution: 1280 × 720
    Frame rate: 25 fps
    Tab 2. Parameters of the middle high-definition camera
  3. Near devices: n high-fidelity microphones (44.1 kHz, 16-bit), each attached to the middle of the corresponding speaker's chin. The collected audio signal suffers little interference from off-target sources, and the signal-to-noise ratio (SNR) is estimated to be greater than 15 dB, which guarantees high-quality manual transcription.
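To make the stated array geometries concrete, here is a small Python sketch that computes microphone coordinates for the two linear arrays, assuming only the spacings given above (35 mm for the 6-mic far array, 80 mm for the 2-mic middle array) and the far array's placement along the camera x-axis with its midpoint at the camera origin; the exact coordinate conventions of the released data may differ.

    import numpy as np

    def linear_array_positions(num_mics: int, spacing_m: float) -> np.ndarray:
        """(x, y, z) positions in meters of a linear array centered at the origin,
        laid out along the x-axis."""
        x = (np.arange(num_mics) - (num_mics - 1) / 2.0) * spacing_m
        return np.stack([x, np.zeros(num_mics), np.zeros(num_mics)], axis=1)

    # Far array: 6 omnidirectional mics, 35 mm spacing, midpoint at the camera origin.
    far_array = linear_array_positions(6, 0.035)
    # Middle array: 2 omnidirectional mics, 80 mm spacing.
    middle_array = linear_array_positions(2, 0.080)

    print(far_array[:, 0])     # [-0.0875 -0.0525 -0.0175  0.0175  0.0525  0.0875]
    print(middle_array[:, 0])  # [-0.04  0.04]

Coordinates of this form are what beamforming or direction-of-arrival front-ends typically expect as array geometry input.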
The use of multiple devices results in inconsistent clocks. We address this from two aspects: synchronization devices and manual post-processing.

  1. Synchronization devices: a sound card (ZOOM F8n) is used to synchronize the clock of the middle linear microphone array with the clocks of the near high-fidelity microphones, while Vicando software, running on an industrial PC (MIC-770), is used to synchronize the clocks of all cameras.

  2. Manual post-processing: even with synchronization devices, there are still three different clocks, i.e., the clock of the sound card, the clock of the far linear microphone array, and the clock of the industrial PC. These are synchronized by manually locating a mark point. A specific action, knocking a cup, is performed when the recording starts; the video frame in which the cup lid contacts the cup wall and the waveform point corresponding to the impact sound are aligned manually.
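As an illustration of this manual mark-point alignment, the following Python sketch trims an audio stream and a video stream so that both start at the cup-knock instant, assuming the knock's audio sample index and video frame index have already been located by hand; it is only a sketch of the idea, not the organizers' actual tooling.

    import numpy as np

    def align_by_mark_point(audio, audio_fs, knock_sample, video_frames, video_fps, knock_frame):
        """Trim both streams so that time 0 corresponds to the manually located cup knock."""
        # Clock offset between the two streams before alignment, in seconds.
        offset_s = knock_sample / audio_fs - knock_frame / video_fps
        audio_aligned = audio[knock_sample:]
        video_aligned = video_frames[knock_frame:]
        return audio_aligned, video_aligned, offset_s

    # Synthetic example: 10 s of 16 kHz audio and 25 fps video, knock at t = 2 s in both.
    audio = np.zeros(16000 * 10, dtype=np.float32)
    video = np.zeros((25 * 10, 8, 8), dtype=np.uint8)  # tiny dummy frames
    a, v, off = align_by_mark_point(audio, 16000, knock_sample=32000,
                                    video_frames=video, video_fps=25.0, knock_frame=50)
    print(len(a) / 16000, len(v) / 25.0, off)  # 8.0 s of audio, 8.0 s of video, 0.0 s offset

Since the video is sampled at 25 fps, frame-level alignment has a granularity of 40 ms, which is why the mark frame is chosen as the exact frame where the cup lid visibly contacts the cup wall.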