Track 1 Instructions

Speaker diarization aims to address the “who spoke when” problem by labeling speech timestamps according to the identity of speakers. Unlike previous challenges, Track 1 of the MISP2022 Challenge focuses on speaker diarization using both audio and video data, namely Audio-Visual Speaker Diarization (AVSD).

In the evaluation stage, we only use far-field data, including far-field video and 6-channel far-field audio. In addition, we will provide the oracle speech segmentation timestamps.

Evaluation

In this track, we adopt the diarization error rate (DER) as the official metric for our ranking. It is computed with the following formula:

    DER = (FA + MISS + SPKERR) / TOTAL × 100%

where:

  • FA is the duration of false alarm speech
  • MISS is the duration of missed speech
  • SPKERR is the duration of speech attributed to the wrong speaker
  • TOTAL is the total duration of all reference speakers’ speech

The lower the DER value (with 0 being a perfect score), the higher the ranking. Note that we do not apply a “no score” collar, and overlapping speech will be evaluated.
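
For reference, DER can be computed with the open-source pyannote.metrics package using the same settings as the challenge (no collar, overlapping speech scored). This is only an illustrative sketch, not the official scoring script; the segments and labels below are hypothetical:

    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    # Hypothetical reference: who actually spoke when (times in seconds).
    reference = Annotation()
    reference[Segment(0.0, 10.0)] = "spk1"
    reference[Segment(8.0, 15.0)] = "spk2"  # overlaps spk1 between 8 s and 10 s

    # Hypothetical system output to be scored.
    hypothesis = Annotation()
    hypothesis[Segment(0.0, 9.0)] = "A"
    hypothesis[Segment(9.0, 15.0)] = "B"

    # collar=0.0 and skip_overlap=False mirror the challenge settings.
    metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
    print(f"DER = {metric(reference, hypothesis):.2%}")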

For the training and development sets, we will prepare a scoring script, which will be released together with the baseline. For the evaluation set, participants should submit a Rich Transcription Time Marked (RTTM) file for each session to the CodaLab platform. DER will be calculated and the leaderboard updated.
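
An RTTM file is a plain-text file with one SPEAKER line per speech segment. The ten fields are: the type (“SPEAKER”), the session ID, the channel, the onset and duration in seconds, two unused <NA> fields, the speaker label, and two further <NA> fields. The session and speaker names below are hypothetical:

    SPEAKER session_001 1 0.00 9.00 <NA> <NA> spk1 <NA> <NA>
    SPEAKER session_001 1 9.00 6.00 <NA> <NA> spk2 <NA> <NA>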

Can I use extra audio/video/text data?

In Track 1, external audio data may be used to train the AVSD model, such as VoxCeleb 1/2, CN-Celeb, and other public datasets. Additional video data is also allowed.

However, the use of external data is subject to the following conditions:

  • Any external resource used must be clearly referenced and freely accessible to any other research group in the world. External data refers to public datasets or trained models. The data must be public and freely available before 15 January 2023.
  • The list of external data sources used in training must be clearly indicated in the technical report.
  • Participants must inform the organizers in advance about such data sources, so that all competitors know about them and have an equal opportunity to use them. Please send an email to the track coordinators; we will update the list of external datasets on the web page accordingly. Once the evaluation set is published, the list of allowed external data resources will be locked (no further external sources allowed).

We hope participants will focus on technological innovation, especially audio-visual fusion technology, rather than relying on more data. This is not purely a competition, but a scientific challenge.

Which information can I use?

You can use the following annotations for training and development:

  • the corresponding room sizes
  • the corresponding configuration labels (as shown in Table 1)
  • the corresponding speaker labels
  • the start and end times of all utterances

For training and development, you can use the full-length recordings of all recording devices. For evaluation, for a given utterance you may only use the full-length far-field recordings of that session (both the linear 6-microphone array and the wide-angle camera).
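
As an illustration, the far-field inputs for one session can be loaded as follows; the file names are hypothetical and only stand in for the actual corpus layout:

    import soundfile as sf
    import cv2

    # 6-channel far-field audio: `audio` has shape (num_samples, 6).
    audio, sample_rate = sf.read("session_001_far_6ch.wav")

    # Wide-angle far-field video, read frame by frame.
    capture = cv2.VideoCapture("session_001_far_wide_angle.mp4")
    while True:
        ok, frame = capture.read()  # `frame` is a BGR image (NumPy array)
        if not ok:
            break
        # ... pass `frame` to the visual front-end ...
    capture.release()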

Which information shall I not use?

Manual modification of the data or the annotations (e.g., manual refinement of the utterance start and end times) is forbidden. All parameters should be tuned on the training or development set. Modifications of the development set are allowed, provided that its size remains unchanged and the modifications do not risk inadvertently biasing the development set toward the particular speakers or acoustic conditions of the evaluation set. For instance, enhancing the signals, applying “unbiased” transformations, or automatically refining the utterance start and end times is allowed. Augmenting the development set by generating simulated data, applying biased signal transformations (e.g., systematically increasing intensity or pitch), or selecting a subset of the development set is forbidden. In case of doubt, please ask us ahead of the submission deadline.
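
As an example of an “unbiased” transformation in the sense above, the sketch below applies the same deterministic loudness normalization to every development utterance; because it is applied uniformly, it cannot selectively bias the set toward the evaluation conditions. The function name and target level are illustrative:

    import numpy as np

    def rms_normalize(waveform: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
        """Scale a waveform to a fixed RMS level (applied to every utterance)."""
        rms = np.sqrt(np.mean(waveform ** 2))
        return waveform * (target_rms / max(rms, 1e-8))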

Can I use an AVSD system different from the one in the official baseline pipeline?

There is no limitation on the AVSD model structure or the training techniques used by participants. You are entirely free in the development of your system.

In particular, you can:

  • use single-channel or multi-channel audio data (e.g., via beamforming; see the sketch after this list)
  • use other video pre-processing methods
  • use other audio features and visual features
  • use other post-processing methods
  • modify the audio/visual/audio-visual model architecture or the training criterion
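
As an example of multi-channel processing (see the first item above), a minimal delay-and-sum beamformer for the 6-channel array might look as follows. It assumes integer per-channel steering delays are already known, e.g., from the array geometry and an estimated source direction; all names are illustrative:

    import numpy as np

    def delay_and_sum(multichannel: np.ndarray, delays: list[int]) -> np.ndarray:
        """Align each channel by its integer delay (in samples) and average.

        multichannel: array of shape (num_samples, num_channels).
        delays: one non-negative integer delay per channel.
        """
        num_samples, _ = multichannel.shape
        aligned = np.zeros_like(multichannel)
        for ch, d in enumerate(delays):
            aligned[: num_samples - d, ch] = multichannel[d:, ch]
        return aligned.mean(axis=1)

    # Hypothetical usage with zero delays (broadside steering):
    # enhanced = delay_and_sum(audio, delays=[0, 0, 0, 0, 0, 0])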

Which results should I report?

For every tested system, you should report the DER on the evaluation set.