Audio-visual diarization and recognition (AVDR) focuses on addressing “who spoke what and when” using both audio and video data. Different from the previous challenge, Track 2 of the MISP2022 Challenge extends the audio-visual speech recognition (AVSR) track of MISP2021 by replacing the oracle speaker diarization results with audio-visual speaker diarization (AVSD) results.
In this Challenge, we adopt the concatenated minimum permutation character error rate (cpCER) as the official ranking metric. The calculation of cpCER is divided into three steps. First, the recognition results and the reference transcriptions of all utterances from each speaker in a session are concatenated along the timeline. Second, the character error rate (CER) between the concatenated recognition result and the concatenated reference is computed for every possible speaker permutation as CER = (S + D + I) / N,
where S, D, and I are the numbers of character substitutions, deletions, and insertions, respectively, and N is the total number of characters in the reference transcription. Finally, the lowest CER over all speaker permutations is selected as the cpCER for the session.
Because of the speaker permutation ambiguity between system outputs and references (the same issue addressed by permutation invariant training, PIT) and the need to match recognized text to the annotated segments, we adopt cpCER as the final evaluation metric. The lower the cpCER (with 0 being a perfect score), the better the diarization and recognition performance.
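For concreteness, the following is a minimal sketch of how cpCER could be computed for a single session, assuming per-speaker transcripts are given as dictionaries mapping a speaker ID to a time-ordered list of utterance strings. The function names and data layout are illustrative assumptions; this is not the official scoring script.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]


def cpcer(ref_spk2utts: dict, hyp_spk2utts: dict) -> float:
    """cpCER for one session: concatenate each speaker's utterances along the
    timeline, score every speaker permutation, and keep the lowest CER."""
    refs = ["".join(utts) for utts in ref_spk2utts.values()]
    hyps = ["".join(utts) for utts in hyp_spk2utts.values()]
    # Pad with empty strings so both sides have the same number of speakers;
    # missing or extra speakers then count as deletions or insertions.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    n_ref_chars = max(sum(len(r) for r in refs), 1)
    # Brute-force search over permutations is acceptable for the small number
    # of speakers per session.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / n_ref_chars
```

For example, cpcer({"S1": ["你好"], "S2": ["谢谢"]}, {"A": ["谢谢"], "B": ["你好"]}) returns 0.0, because the permutation that swaps the two predicted speakers matches both references exactly.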
For the training and development sets, we will provide a scoring script, which will be released together with the baseline. For the evaluation set, participants should submit to the CodaLab platform one text file per session containing the recognition results; the cpCER will then be calculated and updated on the leaderboard.
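Before the official script is available, development-set results could be aggregated over sessions as sketched below, reusing the cpcer() helper above. Whether the official metric pools errors over sessions or averages per-session values is determined by the released scoring script, and load_session_transcripts is a hypothetical helper.

```python
def dataset_cpcer(session_ids, load_session_transcripts):
    """Pool errors across sessions: total minimum-permutation character errors
    divided by the total number of reference characters (an assumed aggregation;
    the released scoring script is authoritative)."""
    total_errors, total_chars = 0, 0
    for sid in session_ids:
        # load_session_transcripts is a hypothetical helper returning two dicts
        # of {speaker_id: [time-ordered utterance strings]} for one session.
        ref, hyp = load_session_transcripts(sid)
        n_ref = sum(len("".join(utts)) for utts in ref.values())
        total_errors += cpcer(ref, hyp) * n_ref  # recover the session error count
        total_chars += n_ref
    return total_errors / max(total_chars, 1)
```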
We restrict the usage of additional data: external audio data and video data are not allowed to be used. Notably, participants can utilize timestamps, speaker tags, and other information, but not the text contents.
However, the use of external silent video data is allowed for model pretraining under the following conditions:
We hope participants will focus on technological innovation, especially novel model architectures, rather than relying on additional data. This is not a pure competition but a “scientific” challenge activity. Similar rules were adopted in the CHiME-5/6 challenges.
You can use the following annotations for training, development, and evaluation:
For training and development, you can use the full-length recordings from all recording devices. For evaluation, you are allowed to use, for a given utterance, the full-length recordings of the far-field devices (the linear 6-microphone array and the wide-angle camera) for that session.
Manual modification of the data or the annotations (e.g., manual refinement of the utterance start and end times) is forbidden. All parameters should be tuned on the training set or the development set. Modifications of the development set are allowed, provided that its size remains unchanged and these modifications do not induce the risk of inadvertently biasing the development set toward the particular speakers or acoustic conditions in the evaluation set. For instance, enhancing the signals, applying “unbiased” transformations or automatically refining the utterance start and end times is allowed. Augmenting the development set by generating simulated data, applying biased signal transformations (e.g., systematically increasing intensity/pitch), or selecting a subset of the development set is forbidden. In case of doubt, please ask us ahead of the submission deadline.
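As an illustration of a transformation applied uniformly, and therefore unlikely to bias the development set toward particular speakers or conditions, one could normalize every development recording to the same target level. The file layout and target value below are assumptions made only for this sketch.

```python
import glob

import numpy as np
import soundfile as sf

TARGET_RMS = 0.05  # hypothetical fixed target, applied identically to every session

# Hypothetical layout of the development-set far-field audio.
for path in glob.glob("dev/far/*.wav"):
    audio, sr = sf.read(path)  # shape: (samples,) or (samples, channels)
    gain = TARGET_RMS / (np.sqrt(np.mean(audio ** 2)) + 1e-8)
    sf.write(path.replace(".wav", "_norm.wav"), audio * gain, sr)
```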
Again, you are entirely free in the development of your system.
In particular, you can:
For every tested system, you should report the cpCER on the evaluation set.