Instructions
Task 3 directly addresses the ``who spoke what when'' problem and can be seen as a combination of Tasks 1 and 2. End-to-end systems are, of course, also important. The evaluation set is the same as in Task 2, but using ground-truth diarization information is prohibited.
Evaluation
Following the concatenated minimum-permutation word error rate (cpWER), we use the concatenated minimum-permutation character error rate (cpCER) as the evaluation criterion in Task 3. The cpCER of a session is computed in three steps:
- Within a session, the recognition results and the reference transcriptions belonging to the same speaker are each concatenated in chronological order.
- The CER between the reference and each possible speaker permutation of the hypothesis \(\{\bm{s}_i | i = 0, 1, \cdots, \mathrm{P}_{N_{\rm spk}}^{N_{\rm spk}}\}\) is calculated according to the CER definition, where \(N_{\rm spk}\) is the total number of speakers in the session.
- The lowest CER among all permutations is taken as the cpCER:
\[
{\rm cpCER} = \min_{\{\bm{s}_i | i = 0, 1, \cdots, \mathrm{P}_{N_{\rm spk}}^{N_{\rm spk}}\}}{\rm CER}_i
\]
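The three steps can be illustrated with a minimal Python sketch. It is not the official scoring tool. It assumes that each speaker's utterances are already sorted by time, that the CER of one permutation is the total character edit distance divided by the total number of reference characters, and that a missing speaker on either side is padded with an empty transcript. The names `edit_distance` and `cp_cer` are hypothetical and chosen only for this example.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def cp_cer(ref_by_spk: dict, hyp_by_spk: dict) -> float:
    """cpCER of one session under the assumptions stated above.

    Both arguments map a speaker label to that speaker's list of
    utterances, already in chronological order.
    """
    # Step 1: concatenate each speaker's utterances along the timeline.
    refs = ["".join(utts) for utts in ref_by_spk.values()]
    hyps = ["".join(utts) for utts in hyp_by_spk.values()]

    # Assumption: pad with empty transcripts if the speaker counts differ.
    n_spk = max(len(refs), len(hyps))
    refs += [""] * (n_spk - len(refs))
    hyps += [""] * (n_spk - len(hyps))

    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference transcription is empty")

    # Steps 2 and 3: score every speaker permutation and keep the lowest.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars


if __name__ == "__main__":
    reference = {"A": ["abc", "de"], "B": ["xyz"]}
    hypothesis = {"spk1": ["xyz"], "spk2": ["abc", "dd"]}
    print(cp_cer(reference, hypothesis))  # 1 error over 8 reference characters
```

Because the search enumerates all \(N_{\rm spk}!\) permutations, this sketch is only practical for sessions with a small number of speakers.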