Track 2 Instructions

Audio-visual diarization and recognition (AVDR) focuses on addressing “who spoke what and when” using audio and video data. Different from previous challenges, Track 2 of the MISP2022 Challenge extends the audio-visual speech recognition (AVSR) track of MISP2021 by replacing the oracle speaker diarization results with audio-visual speaker diarization (AVSD) results.

Evaluation

In this Challenge, we adopt the concatenated minimum permutation character error rate (cpCER) as the official metric for ranking. The calculation of cpCER is divided into three steps:

  1. Recognition results and reference transcriptions belonging to the same speaker are concatenated along the timeline within a session.
  2. The character error rate (CER) is calculated for every possible permutation of speakers between the recognition results and the references:

     CER = (S + D + I) / N

     where:

    • S = Number of Substitutions
    • D = Number of Deletions
    • I = Number of Insertions
    • N = Number of characters in the ground truth
  3. The lowest CER over all permutations is selected as the cpCER of the session.

Because the correspondence between predicted speakers and the annotated segment transcriptions is unknown (a speaker permutation problem, as in permutation invariant training, PIT), we adopt cpCER as the final evaluation metric. The lower the cpCER value (0 being a perfect score), the better the diarization and recognition performance.
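
To make the metric concrete, the following minimal Python sketch computes the cpCER for one session, assuming the per-speaker concatenation of step 1 has already been performed and that the hypothesis and reference contain the same number of speakers; the function names and toy strings are purely illustrative and are not part of the official scoring script.

  from itertools import permutations

  def edit_distance(ref, hyp):
      """Character-level Levenshtein distance (substitutions + deletions + insertions)."""
      dp = list(range(len(hyp) + 1))
      for i in range(1, len(ref) + 1):
          prev, dp[0] = dp[0], i
          for j in range(1, len(hyp) + 1):
              cur = dp[j]
              dp[j] = min(dp[j] + 1,                          # deletion
                          dp[j - 1] + 1,                      # insertion
                          prev + (ref[i - 1] != hyp[j - 1]))  # substitution or match
              prev = cur
      return dp[-1]

  def cpcer(refs, hyps):
      """refs, hyps: per-speaker concatenated strings for one session."""
      total_ref_chars = sum(len(r) for r in refs)
      best = float("inf")
      # Try every assignment of hypothesis speakers to reference speakers
      # and keep the permutation with the lowest overall CER.
      for perm in permutations(hyps):
          errors = sum(edit_distance(r, h) for r, h in zip(refs, perm))
          best = min(best, errors / total_ref_chars)
      return best

  # Toy example: the hypothesis lists the two speakers in swapped order,
  # but the best permutation still yields a perfect score of 0.
  print(cpcer(["你好世界", "今天天气很好"], ["今天天气很好", "你好世界"]))

Enumerating all permutations is factorial in the number of speakers, which is acceptable when each session contains only a few speakers.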

For the training and development sets, we will provide a scoring script, which will be released together with the baseline. For the evaluation set, participants should submit to the CodaLab platform one text file per session containing the recognition results; the cpCER will then be calculated and updated on the leaderboard.

Can I use extra audio/video/text data?

We restrict the usage of additional data as follows. External audio data and video data are allowed. Notably, participants can utilize the timestamps, speaker tags, and other annotations of such data, but not their text content.

The use of external silent video data in pretrained model training is allowed under the following conditions:

  • The external resources used are clearly referenced and freely accessible to any other research group in the world. External data refers to public datasets or pretrained models. The data must be public and freely available before the 15th of January 2023.
  • The list of external data sources used in training must be clearly indicated in the technical report.
  • Participants must inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email to the track coordinators; we will update the list of external datasets on the web page accordingly. Once the evaluation set is published, the list of allowed external data resources is locked (no further external sources will be allowed).

We hope participants will pay more attention to technological innovation, especially novel model architectures, rather than relying on more data. This is not a pure competition, but a “scientific” challenge activity. Similar rules apply in the CHiME-5/6 challenges.

Which information can I use?

You can use the following annotations for training, development, and evaluation:

  • the corresponding room sizes
  • the corresponding configuration labels (as shown in Tab.1)
  • the corresponding speaker labels
  • the start and end times of all utterances

For training and development, you can use the full-length recordings of all recording devices. For evaluation, for a given utterance you are allowed to use the full-length recordings of the far-field devices (both the 6-microphone linear array and the wide-angle camera) from that session.

Which information shall I not use?

Manual modification of the data or the annotations (e.g., manual refinement of the utterance start and end times) is forbidden. All parameters should be tuned on the training set or the development set. Modifications of the development set are allowed, provided that its size remains unchanged and these modifications do not induce the risk of inadvertently biasing the development set toward the particular speakers or acoustic conditions in the evaluation set. For instance, enhancing the signals, applying “unbiased” transformations or automatically refining the utterance start and end times is allowed. Augmenting the development set by generating simulated data, applying biased signal transformations (e.g., systematically increasing intensity/pitch), or selecting a subset of the development set is forbidden. In case of doubt, please ask us ahead of the submission deadline.

Can I use an AVDR system different from the official baseline pipeline?

Again, you are entirely free in the development of your system.

In particular, you can:

  • include a single-channel or multi-channel audio enhancement front-end
  • include video pre-processing
  • use other acoustic features and visual features
  • modify the acoustic/visual/acoustic-visual model architecture or the training criterion
  • modify the lexicon and the language model
  • use any rescoring technique

Which results should I report?

For every tested system, you should report the cpCER on the evaluation set.