The MISP 2025 challenge has been accepted as an Interspeech 2025 Grand Challenge!
In recent years, the proliferation of speech-enabled applications has led to increasingly complex usage scenarios, such as home environments and meetings. The previous Multimodal Information based Speech Processing (MISP) challenges in 2021, 2022, and 2023 targeted the home scenario, where several people converse in Chinese while watching TV in a living room. A large-scale audio-visual Chinese home conversational corpus was released to support multiple audio-visual speech processing tasks, including wake word spotting, target speaker extraction, speaker diarization, and speech recognition. These MISP challenges attracted extensive participation: over 150 teams downloaded the dataset, more than 60 teams submitted results, and 15 research papers were presented at ICASSP 2022, 2023, and 2024.
Meetings are among the most valuable yet challenging contexts for speech applications due to their rich information exchange and decision-making processes. Accurate transcription and analysis are crucial for enhancing productivity and preserving insights, but the task is difficult because of varied speech styles and complex acoustic conditions. Current state-of-the-art audio-only techniques are hitting performance plateaus: on the AliMeeting corpus, for example, the best systems achieve a character error rate (CER) of approximately 20%, which is inadequate for many real-world applications. The McGurk effect and subsequent studies have shown that visual cues can improve speech perception in noisy environments. The MISP 2025 challenge therefore aims to advance meeting transcription by incorporating multimodal information, such as video. The specific tasks are as follows (a CER computation is sketched after the list):
1) Audio-Visual Speaker Diarization.
2) Audio-Visual Speech Recognition.
3) Audio-Visual Diarization and Recognition.
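To make the headline metric concrete, here is a minimal, illustrative CER computation in Python. CER is the edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the reference length. The function name and example strings below are ours for illustration only; the challenge's official scoring scripts, including any speaker-permutation handling for the joint diarization-and-recognition task, are not reproduced here.

```python
# Illustrative sketch: CER = (substitutions + deletions + insertions) / reference length.
# Not the official MISP 2025 scoring tool.

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate via Levenshtein edit distance."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Two characters inserted against an 8-character reference -> CER = 0.25
print(cer("今天开会讨论预算", "今天开会讨论预算方案"))
```

At a 20% CER, roughly one character in five is transcribed incorrectly, which is why the challenge looks to visual cues to push past the audio-only plateau.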
The following resources will be provided:
For additional information, please email us at mispchallenge@gmail.com.