Audio-visual speaker diarization aims to solve the ``who spoke when'' problem by labeling speech timestamps with classes corresponding to speaker identity, using multi-speaker audio and video data. The training and development sets provide all audio and video recordings together with the corresponding ground-truth segmentation timestamps. In contrast, the evaluation set does not include the near-field speech or transcriptions. Participants are required to determine the active speaker at each time point.
Diarization error rate (DER) is adopted as the evaluation criterion. The lower the DER value (with 0 being a perfect score), the higher the ranking. DER is calculated as the summed duration of three error types, speaker confusion (SC), false alarm (FA), and missed detection (MD), divided by the total duration, as shown below:
\begin{equation}
{\rm DER} = \frac{T_{\rm SC} + T_{\rm FA} + T_{\rm MD}}{T_{\rm total}},
\end{equation}
where \(T_{\rm SC}\), \(T_{\rm FA}\), and \(T_{\rm MD}\) are the durations of the three error types, and \(T_{\rm total}\) is the total duration. It is worth noting that no ``no score'' collar is applied, and overlapping speech is evaluated.
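To make the metric concrete, the sketch below computes a frame-level DER in Python. It is an illustrative assumption, not the official scoring tool: the helper name \texttt{frame\_der} and the per-frame speaker-set representation are hypothetical, and it assumes hypothesis speaker labels have already been mapped to reference labels (a real scorer finds the optimal mapping). Consistent with the rules above, no collar is applied and overlapping speech is scored.

\begin{verbatim}
# Minimal frame-level DER sketch (illustrative, not the official scorer).
# ref/hyp map a frame index to the set of speaker IDs active in that frame;
# hypothesis labels are assumed to be already mapped to reference labels.
def frame_der(ref, hyp, num_frames):
    t_total = t_sc = t_fa = t_md = 0
    for t in range(num_frames):
        ref_spk = ref.get(t, set())
        hyp_spk = hyp.get(t, set())
        n_ref, n_hyp = len(ref_spk), len(hyp_spk)
        t_total += n_ref                        # total reference speaker time
        t_fa += max(n_hyp - n_ref, 0)           # false alarm (FA)
        t_md += max(n_ref - n_hyp, 0)           # missed detection (MD)
        # speaker confusion (SC): matched speakers that carry the wrong label
        t_sc += min(n_ref, n_hyp) - len(ref_spk & hyp_spk)
    return (t_sc + t_fa + t_md) / t_total if t_total else 0.0

# Toy usage: 4 frames, reference speakers A/B, hypothesis speakers A/C.
ref = {0: {"A"}, 1: {"A", "B"}, 2: {"B"}}
hyp = {0: {"A"}, 1: {"A"}, 2: {"C"}, 3: {"C"}}
print(frame_der(ref, hyp, num_frames=4))  # (1 MD + 1 SC + 1 FA) / 4 = 0.75
\end{verbatim}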