The MISP 2025 challenge has been accepted as an Interspeech 2025 Grand Challenge!
In recent years, the proliferation of speech-enabled applications has led to increasingly complex usage scenarios, such as home environments and meetings. The previous Multimodal Information based Speech Processing (MISP) challenges in 2021, 2022, and 2023 targeted the home scenario, where several people converse in Chinese while watching TV in a living room. A large-scale audio-visual Chinese home conversational corpus was released to support multiple audio-visual speech processing tasks, including wake word spotting, target speaker extraction, speaker diarization, and speech recognition. These MISP challenges attracted extensive participation: over 150 teams downloaded the dataset, more than 60 teams submitted results, and 15 research papers were presented at ICASSP 2022, 2023, and 2024.
Meetings are among the most valuable yet challenging contexts for speech applications due to their rich information exchange and decision-making processes. Accurate transcription and analysis are crucial for enhancing productivity and preserving insights, but the task is difficult because of varied speech styles and complex acoustic conditions. Current state-of-the-art audio-only techniques are hitting performance plateaus: on the AliMeeting corpus, for example, the best systems achieve a character error rate (CER) of approximately 20%, which is inadequate for many real-world applications. The McGurk effect and subsequent studies have shown that visual cues can improve speech perception in noisy environments. The MISP 2025 challenge therefore aims to advance meeting transcription by incorporating multimodal information, such as video. The specific tasks are as follows (a CER computation is sketched after the list):
1) Audio-Visual Speaker Diarization.
2) Audio-Visual Speech Recognition.
3) Audio-Visual Diarization and Recognition.
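To make the headline metric concrete, here is a minimal, illustrative CER computation in Python. CER is the edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the reference length. The function name and example strings below are ours for illustration only; the challenge's official scoring scripts, including any speaker-permutation handling for the joint diarization-and-recognition task, are not reproduced here.

```python
# Illustrative sketch: CER = (substitutions + deletions + insertions) / reference length.
# Not the official MISP 2025 scoring tool.

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate via Levenshtein edit distance."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Two characters inserted against an 8-character reference -> CER = 0.25
print(cer("今天开会讨论预算", "今天开会讨论预算方案"))
```

At a 20% CER, roughly one character in five is transcribed incorrectly, which is why the challenge looks to visual cues to push past the audio-only plateau.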
The following resources will be provided:
For additional information, please email us at mispchallenge@gmail.com.