Instructions

The challenge of AVSR is to handle overlapped speech segments and to recognize the content of multiple speakers. The evaluation set is the same as in Task 1, and the ground-truth diarization information is available. Participants are required to transcribe each speaker.

Evaluation

The performance is measured by character error rate (CER). For a given hypothesis output, the CER compares the minimum number of character insertions (Ins), substitutions (Subs), and deletions (Del) required to obtain the reference transcript with the total number of characters, including spaces. Specifically, CER is calculated by:

\[ {\rm CER} = \frac{N_{\rm Subs} + N_{\rm Del} + N_{\rm Ins}}{N_{\rm total}} \times 100 \]

where \(N_{\rm Subs}\), \(N_{\rm Del}\), and \(N_{\rm Ins}\) are the numbers of substituted, deleted, and inserted characters, respectively, and \(N_{\rm total}\) is the total number of characters. The lower the CER (with 0 being a perfect score), the better the recognition performance. For overlapped speech segments, we calculate all errors from the recognition results and the ground truth of each speaker, using the oracle speaker diarization results.
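The CER formula can be illustrated with a minimal Python sketch. It is not the official scoring tool. The function names `edit_counts` and `cer` are hypothetical, and two points are assumptions rather than statements from the task description: \(N_{\rm total}\) is taken to be the number of characters in the reference transcript (the usual convention), and errors are pooled over all per-speaker (reference, hypothesis) pairs before dividing.

```python
from typing import Iterable, Tuple


def edit_counts(ref: str, hyp: str) -> Tuple[int, int, int]:
    """Return (subs, dels, ins) for one minimum-cost alignment of hyp to ref."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] versus hyp[:j]
    dp = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = (i, 0, i, 0)  # every reference character deleted
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis character inserted
    for i in range(1, rows):
        for j in range(1, cols):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no extra cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            sub = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            dele = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            ins = (c + 1, s, d, n + 1)
            # Ties are broken in the order substitution, deletion, insertion.
            dp[i][j] = min(sub, dele, ins, key=lambda t: t[0])
    return dp[-1][-1][1:]


def cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """CER in percent over (reference, hypothesis) pairs, e.g. one pair per speaker.

    Assumption: N_total is the summed length of the references, spaces included.
    """
    errors = 0
    total = 0
    for ref, hyp in pairs:
        errors += sum(edit_counts(ref, hyp))
        total += len(ref)
    if total == 0:
        raise ValueError("reference transcripts are empty")
    return 100.0 * errors / total
```

For example, `cer([("abcd", "abxd")])` counts one substitution against four reference characters and yields 25.0. The split between substitutions, deletions, and insertions can differ between equally cheap alignments, but their sum, and therefore the CER, does not.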