Instructions

Audio-visual speaker diarization addresses the ``who spoke when'' problem: given multi-speaker audio and video data, it labels speech timestamps with classes corresponding to speaker identity. The training and development sets provide all audio and video recordings together with the corresponding ground-truth segmentation timestamps. In contrast, the evaluation set will not include the near-field speech or transcriptions. Participants must determine who is speaking at each time point.

Evaluation

Diarization error rate (DER) is adopted as the evaluation criterion. The lower the DER (with 0 being a perfect score), the higher the ranking. DER is the summed duration of three types of error, namely speaker confusion (SC), false alarm (FA) and missed detection (MD), divided by the total duration:

\[ {\rm DER} = \frac{T_{\rm SC}+T_{\rm FA}+T_{\rm MD}}{T_{\rm total}} \]

where \(T_{\rm SC}\), \(T_{\rm FA}\) and \(T_{\rm MD}\) are the durations of the three errors, and \(T_{\rm total}\) is the total duration. Note that no ``no score'' collar is applied, and overlapping speech is evaluated.
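The following minimal Python sketch illustrates how the three error terms combine into a DER when overlapping speech is scored and no collar is applied. It is not the official scoring tool and rests on several assumptions that are not stated above: the recording is split into fixed-length frames (10 ms here), each frame carries a set of active speaker labels, the hypothesis labels have already been mapped to the reference speakers, and \(T_{\rm total}\) is taken as the summed reference speaker time, a common convention that may differ from the one used by the organizers. The function name `frame_der` is hypothetical.

```python
from typing import List, Set, Tuple


def frame_der(
    ref: List[Set[str]],
    hyp: List[Set[str]],
    frame_s: float = 0.01,  # assumed frame length in seconds
) -> Tuple[float, float, float, float]:
    """Return (DER, T_SC, T_FA, T_MD) from per-frame speaker sets.

    Assumes hypothesis labels are already mapped to reference labels
    and that T_total is the summed reference speaker time.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")

    # Errors are counted in frames and converted to seconds at the end.
    sc = fa = md = total = 0
    for r, h in zip(ref, hyp):
        n_ref, n_hyp = len(r), len(h)
        n_correct = len(r & h)
        md += max(0, n_ref - n_hyp)          # reference speakers with no hypothesis slot
        fa += max(0, n_hyp - n_ref)          # hypothesis speakers beyond the reference count
        sc += min(n_ref, n_hyp) - n_correct  # slots filled with the wrong speaker
        total += n_ref                       # overlap counts once per active speaker

    if total == 0:
        raise ValueError("reference contains no speech")

    der = (sc + fa + md) / total
    return der, sc * frame_s, fa * frame_s, md * frame_s


# Four frames: correct, confusion, missed overlap speaker, false alarm.
ref = [{"A"}, {"A"}, {"A", "B"}, set()]
hyp = [{"A"}, {"B"}, {"A"}, {"A"}]
der, t_sc, t_fa, t_md = frame_der(ref, hyp)
print(f"DER = {der:.2f}")
```

In this toy input each error type occurs in exactly one frame and the reference contains four speaker-frames, so the three errors are weighted equally. Because no collar is applied, frames near segment boundaries are scored like any other frame.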