Background

Speech-enabled systems encounter performance degradation in real-world scenarios due to adverse acoustic conditions and interactions among multiple speakers. Effective front-end speech processing has been shown to play a significant role in improving the performance of the back-end system. In recent years, several challenges have been organized to explore the performance of front-end technologies. However, most existing challenges rely on the audio modality alone, an approach that is running into performance plateaus. Inspired by the finding that visual cues can help human speech perception, we hope to explore more effective methods that use audio and video information jointly to improve front-end performance. Therefore, the MISP 2023 Challenge focuses on audio-visual front-end technology.

For front-end speech processing, methods such as speaker diarization, blind source separation, and speech enhancement are commonly used to provide the subsequent back-end tasks with higher-quality audio. Target speaker extraction (TSE) is an alternative to blind source separation and speech enhancement: it selectively isolates the target speaker's voice from a mixture of speakers and background noise by leveraging auxiliary cues that help identify the target. Previous psychoacoustic studies have inspired the exploration of different cues, such as spatial cues for determining the direction of the target speaker, visual cues acquired from video of the speaker's face, and audio cues from prerecorded enrollment recordings of the speaker's voice. Addressing the TSE problem has practical implications for various applications, such as (1) developing robust voice user interfaces and voice-controlled smart devices that respond exclusively to a particular user, (2) removing interfering nearby speakers in teleconferencing systems, and (3) amplifying the voice of a preferred speaker with hearing aids/hearables. TSE has received much attention in recent years. For instance, it was included as a separate task in recent evaluation campaigns, such as the 4th and 5th Deep Noise Suppression (DNS) challenges. Previous challenges focused only on audio cues, mainly by providing prerecorded enrollment recordings of the speaker's voice. However, such cues are difficult to obtain under harsh real-life acoustic conditions. Conversely, video cues are relatively unaffected by acoustic interference. As a result, a growing body of research focuses on audio-visual target speaker extraction (AVTSE), which exploits video cues. Unfortunately, no publicly available benchmark currently exists for AVTSE.

How to better exploit audio and video information jointly is of great importance. In the previous MISP challenges, the effectiveness of combining audio and video was verified on many tasks. In the ICASSP 2022 SPGC, the first MISP challenge released a large distant multi-microphone conversational Chinese audio-visual corpus and focused on audio-visual wake word spotting and audio-visual speech recognition with oracle speaker diarization. Furthermore, in the ICASSP 2023 SPGC, the second MISP challenge explored the impact of audio-visual speaker diarization on speech recognition. Participants proposed many novel methods to improve the performance of audio-visual speech processing systems. In our exchanges with participants, they unanimously agreed that an excellent front-end system can greatly help back-end tasks. Therefore, this year, the MISP 2023 Challenge focuses on the problem of AVTSE, aiming to promote the development of audio-visual front-end technology and to explore the impact of front-end technology on the back-end system. The scientific question we hope to address is how well AVTSE performs in real-world scenarios with high noise and reverberation.

Fig.1. An overview of the AVTSE task in the MISP 2023 Challenge

As shown in Figure 1, in this challenge we need to handle mixed speech from complex home scenarios, which features strong background noise and high overlap ratios due to interfering speakers. Participants need to use the 6-channel far-field audio, the middle-field video, and the oracle speaker diarization result to extract the target speaker's speech with an AVTSE model, eliminating noise and interference from other speakers. Each speaker in a session is treated as the target speaker in turn.
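As a minimal illustration (not the official baseline), the sketch below shows how the oracle diarization timestamps could be used to cut per-speaker regions from the 6-channel far-field recording; the file name, the timestamp format, and the `segments` structure are assumptions made for the example.

```python
import soundfile as sf  # any multi-channel WAV reader works

# Hypothetical oracle diarization result: (speaker_id, start_s, end_s)
segments = [
    ("S01", 12.48, 15.90),
    ("S02", 14.20, 18.05),  # overlaps with S01
]

# 6-channel far-field recording (file name is assumed)
audio, sr = sf.read("far_field_6ch.wav")  # audio shape: (num_samples, 6)

def cut_segment(audio, sr, start_s, end_s):
    """Slice all 6 channels for one target-speaker region."""
    start, end = int(start_s * sr), int(end_s * sr)
    return audio[start:end, :]

# Each speaker is treated as the target in turn; the AVTSE model would
# consume each chunk together with the matching middle-field lip video.
target_chunks = {spk: cut_segment(audio, sr, s, e) for spk, s, e in segments}
```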

Specifically, we target the home TV scenario, where several people chat in Chinese while watching TV in the living room. Unlike most previous audio-visual challenges, which are still limited to datasets recorded under ideal conditions, the MISP challenge is based on real-world recordings. Strong background noise and reverberation, large amounts of overlapping speech, and possibly blurred video are the main difficulties of this audio-visual dataset.

This year, unlike the previous MISP datasets, we will perform data cleansing on the training set for front-end tasks. We use the deep noise suppression mean opinion score (DNSMOS) to screen near-field audio segments with high speech quality. These segments can be used by participants as clean reference speech for model training. To ensure complete synchronization between near-field and far-field audio, we will provide a data simulation script that lets participants easily use the screened near-field speech to simulate far-field scenarios. We will also add new sessions to the evaluation set that focus on female dialogue scenarios; our research has revealed that distinguishing speaker voices in these scenarios is more challenging, thus increasing the difficulty for participants. In addition, we will provide the timestamps of each speaker's speech, which participants can use to segment the audio.

Furthermore, a well-designed AVTSE baseline system will be provided, along with a pretrained automatic speech recognition (ASR) model as the back-end. Participants must develop an AVTSE system and feed the extracted speech into the given ASR model. Please note that the ASR system is fixed: its model structure and parameters cannot be changed. Nevertheless, jointly optimizing the front-end system with the back-end ASR system is allowed, provided the ASR model parameters remain unchanged. To explore the impact of front-end technology on the back-end system, we will evaluate performance based on the final character error rate (CER).
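The official far-field simulation script will be released by the organizers; the sketch below only illustrates the general idea of turning a DNSMOS-screened near-field segment into a multi-channel far-field mixture. The room impulse responses, noise array, and SNR value are placeholders, not the organizers' actual configuration.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(near_speech, rirs, noise, snr_db=5.0):
    """Convolve clean near-field speech with per-channel room impulse
    responses (RIRs) and add background noise at a target SNR.

    near_speech: (T,)    clean, DNSMOS-screened near-field waveform
    rirs:        (C, L)  one RIR per far-field channel (C = 6 here)
    noise:       (T2, C) recorded background noise (TV, air conditioning, ...)
    """
    C = rirs.shape[0]
    reverberant = np.stack(
        [fftconvolve(near_speech, rirs[c])[: len(near_speech)] for c in range(C)],
        axis=1,
    )  # shape (T, C)
    noise = noise[: len(reverberant)]
    # Scale the noise so that the speech-to-noise power ratio matches snr_db.
    speech_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + noise
```

In practice the RIRs could be measured or generated with a room simulator; the essential point is that the simulated far-field data stays sample-synchronized with the near-field reference speech.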

Through this challenge, we look forward to more academic and industrial researchers paying attention to audio-visual front-end technology, especially AVTSE, which offers a new and important route to mitigating performance degradation in complex acoustic scenes. We anticipate that this challenge will foster innovative approaches and advancements in multimodal speech processing.

The intellectual property (IP) is not transferred to the challenge organizers, i.e., if code is shared/submitted, the participants remain the owners of their code.

Challenge Features

  • The first challenge on AVTSE systems in real-world application scenarios
  • Exploring the impact of the AVTSE system on the back-end ASR system
  • Simultaneous recordings from multiple microphone arrays and video cameras
  • Real conversations, i.e., talkers speaking in a relaxed and unscripted fashion
  • 30+ real rooms with different acoustics and 250+ native Chinese speakers, speaking Mandarin without strong accents
  • High overlap ratios (20%-40%) in multi-talker conversations (one common way to compute this ratio is sketched after this list)
  • Real domestic background noise, e.g., TV, air conditioning, and movement
  • Strong reverberation that varies across rooms
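For reference, one common definition of the overlap ratio is the fraction of total speech time during which two or more speakers are active; the exact definition behind the 20%-40% figure above may differ. A minimal sketch under that assumption:

```python
import numpy as np

def overlap_ratio(segments, frame_s=0.01):
    """segments: list of (speaker_id, start_s, end_s) diarization entries.
    Returns overlapped-speech time divided by total speech time."""
    end = max(e for _, _, e in segments)
    counts = np.zeros(int(np.ceil(end / frame_s)), dtype=int)
    for _, s, e in segments:
        counts[int(s / frame_s):int(e / frame_s)] += 1
    speech_frames = np.count_nonzero(counts >= 1)
    overlap_frames = np.count_nonzero(counts >= 2)
    return overlap_frames / max(speech_frames, 1)

# Example: two utterances overlapping for 1 s out of 6 s of speech
print(overlap_ratio([("S01", 0.0, 4.0), ("S02", 3.0, 6.0)]))  # ~0.17
```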

Scenario

We consider the following scenario: 2-6 people converse with each other with TV noise in the background. Multimodal data were collected by microphones and cameras. We aim to support research on audio-visual speech processing in the home TV scenario by providing a large-scale audio-visual corpus of multi-speaker conversational speech recorded via multi-microphone hardware in real living rooms.

An example of the recording scene is shown in Fig.2. Subfigure (a) is a schematic diagram: six participants are chatting while multiple devices record audio and video in parallel. Subfigure (b) is a real shot of the recording scene.

(a) Schematic Diagram
(b) Real Shot
Fig.2. Recording Scene

There are still several variables in conversations taking place in a real living room, for example, whether the TV is turned on or off and whether the conversation happens during the day or at night. Moreover, by observing real conversations in real living rooms, we found that participants tend to split into several groups discussing different topics. Compared with all participants discussing the same topic, such grouping results in higher overlap ratios. We controlled these variables during recording to cover the real scene comprehensively.

Recording Setup

Fig.3. Recording Devices

According to the distance between the device and the speaker, multiple recording devices were divided into 3 categories:

  1. Far devices: a linear microphone array (6 mics, 16 kHz, 16-bit) and a wide-angle camera (1080p, 25 fps, 2pi/3), placed 3-5 m away from the speakers. All microphones are omnidirectional, and the distance between adjacent microphones is 35 mm. The detailed parameters of the wide-angle camera are shown in Tab 1. The linear microphone array is fixed on top of the wide-angle camera, parallel to the x-axis of the camera coordinate system, with its midpoint coinciding with the origin of the camera coordinate system (the resulting array geometry is sketched after this list). All participants appear in the camera's field of view, which provides speaker position information but reduces the resolution of the lip region of interest (ROI);
    Sensor type 200W CMOS Sensor
    Pixel size 3 um * 3 um
    Focus plane 1/2.7"
    Focal length 2.8 mm
    Aperture F1.8
    Field of view D=141° H=120° V=63°
    Resolution 1920*1080p
    Frame rate 25 fps
    Tab 1. Parameters of the far wide-angle camera
  2. Middle devices: a linear microphone array (2 mics, 44.1 kHz, 16-bit) and n high-definition cameras (720p, 25 fps, pi/2), placed 1-1.5 m away from the speakers, where n is the number of participants in the conversation. Both microphones are omnidirectional, and the distance between them is 80 mm. The detailed parameters of the high-definition camera are shown in Tab 2. There is no fixed geometric relationship between the linear microphone array and the high-definition cameras. Each camera captures only the corresponding speaker, so the lip ROI is recorded clearly;
    Sensor type 200W CMOS Sensor
    Pixel size 3.75 um * 3.75 um
    Focus plane 1/3"
    Focal length 3 mm
    Aperture F1.5
    Field of view D=116° H=99° V=53.4°
    Resolution 1280*720p
    Frame rate 25 fps
    Tab 2. Parameters of the middle high-definition camera
  3. Near devices: n high-fidelity microphones (44.1 kHz, 16-bit), each attached near the middle of the corresponding speaker's chin. The collected audio is rarely affected by off-target sources, and the SNR is estimated to be greater than 15 dB. This guarantees high-quality manual transcription.
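As a worked example of the far-device geometry in item 1 above, the six omnidirectional microphones lie along the camera's x-axis with 35 mm spacing and their midpoint at the camera-coordinate origin; the coordinates below follow directly from that description and may be useful to participants building beamforming front-ends.

```python
import numpy as np

N_MICS, SPACING = 6, 0.035  # 6 microphones, 35 mm between adjacent ones

# The array is parallel to the camera's x-axis and its midpoint coincides
# with the origin of the camera coordinate system, so the x-coordinates
# are symmetric about zero (y = z = 0 for every microphone).
mic_x = (np.arange(N_MICS) - (N_MICS - 1) / 2) * SPACING
mic_positions = np.stack([mic_x, np.zeros(N_MICS), np.zeros(N_MICS)], axis=1)
print(mic_positions[:, 0])  # [-0.0875 -0.0525 -0.0175  0.0175  0.0525  0.0875]
```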
The various devices have inconsistent clocks. We address this from two aspects: synchronization devices and manual post-processing. For synchronization, a sound card (ZOOM F8n) is used to synchronize the clock of the middle linear microphone array with the clocks of the near high-fidelity microphones, while Vicando software, running on an industrial PC (MIC-770), is used to synchronize the clocks of all cameras.

Even with synchronization devices, there are still three different clocks: the clock of the sound card, the clock of the far linear microphone array, and the clock of the industrial PC. These are synchronized by manually locating a mark point. A specific action, knocking a cup, is performed when the recording starts. The video frame in which the cup lid touches the cup wall and the waveform sample corresponding to the impact sound are then aligned manually.
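In the corpus this alignment is done manually; as a programmatic analogue (not the organizers' procedure), one could locate the cup-knock impulse in each audio stream by peak-picking and derive the offset between two clocks, with the video-side frame index still taken from visual inspection. The file names below are assumptions.

```python
import numpy as np
import soundfile as sf

def impulse_sample(wav_path, channel=0):
    """Return the sample index of the strongest transient, assumed to be
    the cup knock performed at the start of the recording."""
    audio, sr = sf.read(wav_path)
    if audio.ndim > 1:
        audio = audio[:, channel]
    return int(np.argmax(np.abs(audio))), sr

# Two of the three independent clocks (hypothetical file names).
far_idx, far_sr = impulse_sample("far_array.wav")
card_idx, card_sr = impulse_sample("sound_card.wav")

# Offset (seconds) to shift sound-card recordings onto the far-array clock.
offset_s = card_idx / card_sr - far_idx / far_sr
print(f"sound card leads far array by {offset_s:.3f} s")
```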