Instructions

Task 3 aims to directly solve the "who spoke what when" problem and can be seen as a combination of Tasks 1 and 2. Of course, end-to-end systems are also important. The evaluation set is the same as in Task 2, but the use of ground-truth diarization information is prohibited.

Evaluation

Following the concatenated minimum-permutation word error rate (cpWER), we use the concatenated minimum-permutation character error rate (cpCER) as the evaluation criterion in Task 3. The cpCER of a session is calculated in three steps:

  • For each speaker, the recognition results and the reference transcriptions are each concatenated in chronological order within a session.
  • CERs between the reference and all possible speaker permutations of the hypothesis \(\{\bm{s}_i | i = 0, 1, \cdots, \mathrm{P}_{N_{\rm spk}}^{N_{\rm spk}}\}\) are calculated as in Eq. (CER), where \(N_{\rm spk}\) is the total number of speakers in the session.
  • The lowest CER is taken as the cpCER:
    \[ {\rm cpCER} = \min_{\{\bm{s}_i | i = 0, 1, \cdots, \mathrm{P}_{N_{\rm spk}}^{N_{\rm spk}}\}}{\rm CER}_i \]
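The three steps above can be sketched in Python. This is a minimal illustration, not the official scoring script. The function names `edit_distance` and `cp_cer` are hypothetical. The sketch assumes that each speaker's utterances are already given in chronological order, that the hypothesis and the reference contain the same number of speakers, and that the CER of a permutation is the total character edit distance over all speakers divided by the total number of reference characters.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cp_cer(ref_by_spk: list[list[str]], hyp_by_spk: list[list[str]]) -> float:
    """cpCER for one session (sketch under the assumptions stated above)."""
    # Step 1: concatenate each speaker's utterances (assumed time-ordered).
    refs = ["".join(utts) for utts in ref_by_spk]
    hyps = ["".join(utts) for utts in hyp_by_spk]
    if len(refs) != len(hyps):
        raise ValueError("this sketch assumes equal speaker counts")
    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference must contain at least one character")

    # Step 2: total edit distance for every speaker permutation of the hypothesis.
    errors = (
        sum(edit_distance(r, hyps[k]) for r, k in zip(refs, perm))
        for perm in permutations(range(len(hyps)))
    )

    # Step 3: the lowest error count gives the cpCER.
    return min(errors) / total_ref_chars
```

For example, `cp_cer([["abc", "de"], ["xyz"]], [["xyz"], ["abd", "de"]])` evaluates both speaker assignments and returns the CER of the better one. The exhaustive search grows factorially with the number of speakers, so it is only practical for sessions with a small number of speakers.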