Speaker diarization aims to address the “who spoke when” problem by labeling speech timestamps according to the identity of the speakers. Different from previous challenges, Track 1 of the MISP2022 Challenge focuses on speaker diarization using both audio and video data, namely Audio-Visual Speaker Diarization (AVSD).
In the evaluation stage, we only use far-field data, including far-field video and 6-channel far-field audio. In addition, we will provide the oracle speech segmentation timestamps.
In this track, we adopt the diarization error rate (DER) as the official ranking metric. It is computed as:

$$\text{DER} = \frac{T_{\text{FA}} + T_{\text{MISS}} + T_{\text{SPKERR}}}{T_{\text{TOTAL}}} \times 100\%$$

where:

- $T_{\text{FA}}$ is the total duration of false alarm speech (system speech not present in the reference),
- $T_{\text{MISS}}$ is the total duration of missed speech (reference speech not detected by the system),
- $T_{\text{SPKERR}}$ is the total duration of speech attributed to the wrong speaker, and
- $T_{\text{TOTAL}}$ is the total duration of reference speech.
The lower the DER value (with 0 being a perfect score), the higher the ranking. It is worth noting that we do not apply a “no score” collar, and overlapping speech will be evaluated.
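Because no collar is applied and overlapping speech is scored, DER effectively reduces to a frame-level count over all reference speech. The following is a minimal Python sketch of such a computation — not the official scoring script; the frame representation, function name, and 10 ms resolution are our own assumptions:

```python
# Minimal frame-level DER sketch (NOT the official scoring script).
# Assumptions: `ref` and `hyp` are equal-length lists of per-frame speaker-label
# sets (e.g., one frame per 10 ms), overlap is represented by multi-label sets,
# and reference/hypothesis label namespaces are disjoint.
from itertools import product

import numpy as np
from scipy.optimize import linear_sum_assignment


def der(ref, hyp):
    ref_spks = sorted({s for frame in ref for s in frame})
    hyp_spks = sorted({s for frame in hyp for s in frame})

    # Count co-occurring frames, then find the optimal ref<->hyp speaker mapping.
    overlap = np.zeros((len(ref_spks), len(hyp_spks)))
    for r, h in zip(ref, hyp):
        for i, j in product(range(len(ref_spks)), range(len(hyp_spks))):
            overlap[i, j] += (ref_spks[i] in r) and (hyp_spks[j] in h)
    rows, cols = linear_sum_assignment(-overlap)  # maximize matched speech
    mapping = {hyp_spks[j]: ref_spks[i] for i, j in zip(rows, cols)}

    miss = fa = spkerr = total = 0
    for r, h in zip(ref, hyp):
        h_mapped = {mapping.get(s, s) for s in h}
        n_ref, n_hyp, n_ok = len(r), len(h_mapped), len(r & h_mapped)
        total += n_ref
        miss += max(n_ref - n_hyp, 0)        # reference speech left unexplained
        fa += max(n_hyp - n_ref, 0)          # system speech with no reference
        spkerr += min(n_ref, n_hyp) - n_ok   # speech given to the wrong speaker
    return (miss + fa + spkerr) / total
```

For official numbers, participants should rely on the released scoring script; this sketch only makes the no-collar, overlap-aware scoring explicit.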
For the training and development sets, we will prepare a scoring script, which will be released together with the baseline. For the evaluation set, participants should submit a Rich Transcription Time Marked (RTTM) file for each session to the CodaLab platform. DER will be calculated and updated on the leaderboard.
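For reference, an RTTM file lists one SPEAKER record per segment, giving the session ID, channel, onset, duration, and speaker label. Below is a minimal writer sketch; the session name and segment values are hypothetical:

```python
# Minimal RTTM writer sketch. Each line follows the standard layout:
# SPEAKER <session> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
def write_rttm(path, session, segments):
    """segments: iterable of (onset_sec, duration_sec, speaker_id) tuples."""
    with open(path, "w") as f:
        for onset, dur, spk in segments:
            f.write(f"SPEAKER {session} 1 {onset:.2f} {dur:.2f} "
                    f"<NA> <NA> {spk} <NA> <NA>\n")

# Hypothetical example: two speakers, overlapping around the 4-second mark.
write_rttm("session1.rttm", "session1", [
    (0.00, 4.25, "spk1"),
    (3.90, 7.10, "spk2"),
])
```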
In Track 1, external audio data can be used to train the AVSD model, such as VoxCeleb 1/2, CN-Celeb, and other public datasets. Additional video data is also allowed.
However, the use of external data is subject to the following conditions:
We hope participants will focus on technological innovation, especially audio-visual fusion technology, rather than relying on more data. This is not purely a competition, but a “scientific” challenge.
You can use the following annotations for training and development:
For training and development, you can use the full-length recordings from all recording devices. For evaluation, for a given utterance, you are only allowed to use the full-length far-field recordings (both the linear 6-microphone array and the wide-angle camera) from that session.
Manual modification of the data or the annotations (e.g., manual refinement of the utterance start and end times) is forbidden. All parameters should be tuned on the training set or the development set. Modifications of the development set are allowed, provided that its size remains unchanged and the modifications do not risk inadvertently biasing the development set toward the particular speakers or acoustic conditions of the evaluation set. For instance, enhancing the signals, applying “unbiased” transformations, or automatically refining the utterance start and end times is allowed. Augmenting the development set by generating simulated data, applying biased signal transformations (e.g., systematically increasing intensity/pitch), or selecting a subset of the development set is forbidden. In case of doubt, please ask us ahead of the submission deadline.
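As a concrete illustration of an allowed “unbiased” transformation, the sketch below applies the same peak normalization to every development recording; the file paths, naming scheme, and target level are hypothetical. Because every signal is treated identically, the transformation cannot selectively bias the set toward particular speakers or acoustic conditions:

```python
# Sketch of an "unbiased" transformation: identical peak normalization applied
# to every development-set waveform (paths and target level are hypothetical).
import glob

import numpy as np
import soundfile as sf

TARGET_PEAK = 0.9  # the same target for every file, so no selective bias

for path in glob.glob("dev/audio/*.wav"):
    audio, sr = sf.read(path)
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio * (TARGET_PEAK / peak)
    sf.write(path.replace(".wav", "_norm.wav"), audio, sr)
```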
There is no limitation on the AVSD model structure or the model training techniques used by participants. You are entirely free in the development of your system.
In particular, you can:
For every tested system, you should report the DER on the evaluation set.